Status

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

There have been many discussions [1][2][3] about (API) compatibility issues. The feedback from the Flink users' point of view is very valuable. In these discussions, many explanations talked about backward and forward compatibility and broke the topic down into API and ABI compatibility. Each time a developer referred to compatibility, there was an implicit context behind it, which could cause confusion: e.g. the forward compatibility mentioned in the API compatibility discussion thread [1] was actually the Flink ABI backward compatibility mentioned in FLIP-196, while the original requirement posted in that thread [1] was, as far as I know, Flink ABI forward compatibility. I will explain this below. All of these statements were correct because they spoke from different perspectives, but it was hard for the audience to follow, and hard for Flink developers to capture users' requirements and provide an appropriate solution.

Compatibility Requirement

To make sure everyone is on the same page, I would suggest first clarifying the definitions of API and ABI compatibility as well as backward and forward compatibility, understanding what users really need, and then proposing changes based on each user's requirement.

Software Backward Compatibility

A commonly cited definition of software backward compatibility is: backward compatibility refers to the ability of a software compiler for one version of a language to accept programs or data that worked with the previous version [5]. Examples from the game industry are easy to understand, and one maps well onto Flink's situation: Xbox backward compatibility means that older games built with/for the Xbox or Xbox 360 can run on the Xbox One. This is almost exactly the case Flink has.

Flink Backward Compatibility

From the Flink perspective, backward compatibility means that Flink jobs or ecosystem components like external connectors/formats built with an older Flink version X (e.g. 1.14) can run on a newer Flink version Y (e.g. 1.15) without issues. This is the same as the forward compatibility from the Flink job perspective mentioned in [1], which means Flink backward compatibility equals Flink job/ecosystem forward compatibility. This is one source of confusion when we communicate about compatibility issues.

Flink Forward Compatibility

Based on the previous clarification, Flink forward compatibility means that Flink jobs or ecosystem components like external connectors/formats built with a newer Flink version Y (e.g. 1.15) can run on an older Flink version X (e.g. 1.14) without issues. With this definition, we can see that the original requirement in the thread [1] was asking for Flink forward compatibility, i.e. the iceberg-flink module compiled with Flink 1.13 was expected to run on a Flink 1.12 cluster. From the Flink user perspective, this is a backward compatibility issue. It is the same relationship: Flink forward compatibility equals Flink job/ecosystem backward compatibility.

API and ABI compatibility

Most information about API compatibility found on the internet concerns RESTful APIs, where it means that a client (a program written to consume the API) that works with one version of the API will work the same way with future versions of the API [6]. FLIP-196 makes it clear with a focus on Flink's programmatic APIs: a program written against a public API will compile without errors when upgrading Flink [4]. Based on FLIP-196, Flink provides API backward compatibility, except for some corner cases with the RESTful API. IMHO, this is great but does not fulfill users' requirements, because users want their Flink jobs to run without errors, not just to compile. It turns out that what users actually require is ABI compatibility [7], e.g. connectors built with Flink 1.13 should run on a Flink 1.12 cluster without issues. That is why I mentioned at the beginning that the original request in the API compatibility thread [1] was actually asking for Flink ABI forward compatibility, which Flink does not provide.

The major issues that make upgrading Flink so difficult for users discussed in [2] are:

1. Downstream vendors like Iceberg want to upgrade Flink while some of their users are still using an older Flink.

2. Big companies who maintain an internal Flink fork have to migrate their internal features for each Flink upgrade.

3. The migration phase of tens of thousands of Flink jobs requires "shuffled" running combinations to validate the results, i.e. jobs built with the older Flink running on the new Flink, jobs built with the new Flink running on the old Flink, etc.

The first one requires Flink ABI forward compatibility. The second one requires at least Flink ABI backward compatibility; depending on the effort of the internal feature development, there might be a further requirement to develop a feature with the new Flink but use it on the current old Flink cluster during the migration phase, which turns out to be a Flink ABI forward compatibility requirement. The third one requires both Flink ABI backward compatibility and Flink ABI forward compatibility.

Summary

Before we move to the proposed changes section, let's summarize briefly. To simplify communication, I would suggest using compatibility as a synonym for ABI compatibility and referring to API compatibility only when we want to emphasize that context.

  • Compatibility implicitly means ABI compatibility.
  • Backward/forward compatibility implicitly means Flink backward/forward compatibility. We as Flink developers will talk about compatibility from the Flink perspective, not from the user perspective.
  • Flink users' program backward compatibility equals Flink forward compatibility, and vice versa.
  • [What Flink has] Flink provides API compatibility, with some corner-case exceptions in the RESTful API.
  • [What users need] Common Flink end users require Flink backward compatibility.
  • [What users need] The Flink ecosystem, downstream forks, and critical jobs require both Flink forward and backward compatibility.

Proposed Changes

From the above explanation, we can see the gap between what Flink currently provides and what users really need. There is already some progress on this front, like FLIP-196 [4], FLIP-197 [8], and the external connector repository [9]. Many thanks for these efforts.

As we all know, making software backward compatible is difficult. Making it forward compatible is even harder and very expensive, because we don't have a crystal ball to know the future of Flink while we build the present one. Fully forward compatible software will also heavily slow down the release of new features, which risks the software becoming obsolete. We should find a good trade-off between the two extremes: 100% compatibility (and becoming obsolete) on one side, and breaking things (and taking risks) on the other.

Since issues about Flink backward compatibility and forward compatibility have different business scenarios and different user groups, it would be better to split them into two parts and make it comfortable for users to follow.

Flink Backward Compatibility

FLIP-196 [4] and FLIP-197 [8] have done a great job of defining the rules for how APIs can change over time, letting new APIs grow into @Public asap but in a safe way, and making any missed graduation transparent. As far as I understand, these focus on Flink backward compatibility - making sure Flink jobs, connectors, and internal features of downstream forks can migrate to a new Flink version without much effort and, in the ideal case, with no effort. Changes like CatalogTable to ResolvedCatalogTable in the @PublicEvolving interface mentioned in the thread [1] should not be allowed as a replacement but only as an addition, i.e. @Public and @PublicEvolving interfaces should ideally provide a backward compatibility guarantee as far as possible.

If, for any reason, we have to break backward compatibility for a new release, migration tools should be provided as a compromise, e.g. as a Flink sub-module of the next release, to help users migrate their features, jobs, savepoints, config properties, etc. to the new Flink version without much effort - ideally with no more effort than running the tools and waiting for the result. This could help users quicken their upgrade pace. An example of such a sub-module could look like:


Sub module structure
- flink-migration
  - flink-migration-savepoint
  - flink-migration-checkpoint
  - flink-migration-metrics
  - flink-migration-connectors


Flink Forward Compatibility

This is the hard part. Very little software can provide this, and normally, even when it can, only for a short period of time. Back to the Xbox example: it would mean that games built with all the new features of the Xbox One need to be playable on the old Xbox.


This discussion is limited to the compatibility between different minor versions. Further discussions are required to check whether compatibility between different major versions has true business value.

Annotation extension

One simple solution, which is also the prerequisite for the new capability mentioned below in the upcoming section, is to make it clear enough in the code base for users (downstream developers) to know which new features/APIs are provided or changed in a new Flink version. Users are then responsible for, e.g., not using any new features if they need Flink forward compatibility, i.e. if they compile their program with the new Flink but want to run it on an old Flink cluster.

A simple idea is to extend the @PublicEvolving interface (maybe @Public too) with additional fields. Similar to what FLIP-197 suggested for its own purpose.

Backward compatibility Annotation

import java.lang.annotation.*;

// Runtime retention is assumed here so that tooling can read the metadata reflectively.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
public @interface PublicEvolving {
   // Oldest Flink version this API is binary compatible with
   FlinkVersion backwardCompatible();
   // Flink version in which this API last changed
   FlinkVersion lastChange();
}

// Usage
@PublicEvolving(backwardCompatible = FlinkVersion.V1_12_0, lastChange = FlinkVersion.V1_13_0)
public class Foo {}

@PublicEvolving(backwardCompatible = FlinkVersion.V1_13_0, lastChange = FlinkVersion.V1_13_0)
public class Bar {}

In this example, thanks to the extension field backwardCompatible, a program that uses the class Foo can be compiled and run with any Flink version greater than or equal to 1.12, i.e. users can use Foo while keeping the backward compatibility (which is Flink forward compatibility) of their own program down to 1.12. Using Bar, however, will break it.

The lastChange field shows the API stability and could provide a hint for when the API should be graduated. Depending on the graduation rule, we would have a clear picture of the current status of all @PublicEvolving interfaces. For example, if the rule is to graduate a @PublicEvolving interface after two unchanged versions, the status could look like this:

| Class | backwardCompatible   | lastChange           | last_release         | Status |
| Foo   | FlinkVersion.V1_12_0 | FlinkVersion.V1_13_0 | FlinkVersion.V1_15_0 | go public for 1.16 release |
| Bar   | FlinkVersion.V1_13_0 | FlinkVersion.V1_13_0 | FlinkVersion.V1_15_0 | go public for 1.16 release |
| Baz   | FlinkVersion.V1_12_0 | FlinkVersion.V1_14_0 | FlinkVersion.V1_15_0 | mature |
| Qux   | FlinkVersion.V1_12_0 | FlinkVersion.V1_15_0 | FlinkVersion.V1_15_0 | under development |
| Quux  | FlinkVersion.V1_15_0 | FlinkVersion.V1_15_0 | FlinkVersion.V1_15_0 | unstable |
| Corge | FlinkVersion.V1_12_0 | FlinkVersion.V1_12_0 | FlinkVersion.V1_15_0 | warning, there must be some specific issue postponing the graduation |
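The graduation rule behind this status column can be sketched in code. This is a minimal illustration only, assuming consecutive minor releases map to consecutive enum values; the FlinkVersion enum and the method names here are local stand-ins, not Flink's actual classes:

```java
// Stand-in for Flink's version enum, covering only the versions used in the table above.
enum FlinkVersion { V1_12_0, V1_13_0, V1_14_0, V1_15_0 }

public class GraduationCheck {

    // Number of releases since the API last changed, measured against the latest release.
    static int unchangedReleases(FlinkVersion lastChange, FlinkVersion lastRelease) {
        return lastRelease.ordinal() - lastChange.ordinal();
    }

    // Status under the example rule: graduate after two unchanged versions.
    static String status(FlinkVersion lastChange, FlinkVersion lastRelease) {
        int unchanged = unchangedReleases(lastChange, lastRelease);
        if (unchanged == 0) return "under development / unstable";
        if (unchanged == 1) return "mature";
        return "candidate to go public in the next release";
    }

    public static void main(String[] args) {
        // Foo: lastChange = 1.13, last release = 1.15 -> two unchanged releases
        System.out.println(status(FlinkVersion.V1_13_0, FlinkVersion.V1_15_0));
        // -> candidate to go public in the next release
        // Qux: lastChange = 1.15 -> changed in the latest release
        System.out.println(status(FlinkVersion.V1_15_0, FlinkVersion.V1_15_0));
        // -> under development / unstable
    }
}
```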


If we want fine-grained control, we could also consider adding these fields at the method level.

One step further, tools could be provided for users to analyze the backward compatibility of their programs for any specific Flink version, i.e. the largest backwardCompatible version found is the oldest Flink version their program is backward compatible with.
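The core of such an analysis tool could look as follows. This is a hedged sketch under the assumption that the proposed annotation exists with runtime retention; the FlinkVersion enum, the annotation, and the sample classes are all redeclared locally for illustration:

```java
import java.lang.annotation.*;

// Local stand-ins for the proposed annotation and version enum (illustration only).
enum FlinkVersion { V1_12_0, V1_13_0, V1_14_0, V1_15_0 }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PublicEvolving {
    FlinkVersion backwardCompatible();
    FlinkVersion lastChange();
}

@PublicEvolving(backwardCompatible = FlinkVersion.V1_12_0, lastChange = FlinkVersion.V1_13_0)
class Foo {}

@PublicEvolving(backwardCompatible = FlinkVersion.V1_13_0, lastChange = FlinkVersion.V1_13_0)
class Bar {}

public class CompatibilityAnalyzer {

    // The oldest Flink version a program can run on is the maximum
    // backwardCompatible value across all annotated classes it uses.
    static FlinkVersion oldestCompatibleVersion(Class<?>... usedClasses) {
        FlinkVersion oldest = FlinkVersion.values()[0];
        for (Class<?> c : usedClasses) {
            PublicEvolving ann = c.getAnnotation(PublicEvolving.class);
            if (ann != null && ann.backwardCompatible().compareTo(oldest) > 0) {
                oldest = ann.backwardCompatible();
            }
        }
        return oldest;
    }

    public static void main(String[] args) {
        // A program using only Foo can run back to 1.12; adding Bar raises the floor to 1.13.
        System.out.println(oldestCompatibleVersion(Foo.class));            // -> V1_12_0
        System.out.println(oldestCompatibleVersion(Foo.class, Bar.class)); // -> V1_13_0
    }
}
```

A real tool would of course scan the user's bytecode for referenced Flink classes instead of taking them as explicit arguments.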

Compatibility control

Forward compatibility requires, technically speaking, that while running on an old Flink, new features built into the program based on newer Flink features are ignored or downgraded. This requires Flink to have the ability to recognise features beyond its present version and then either isolate them from any consumption or provide downgrade alternatives where available. At this level, users are free to use any new features; Flink takes responsibility for forward compatibility while the program is running. If we go for this part, further discussions are needed.

Compatibility, Deprecation, and Migration Plan

  • We would have to extend the existing stability annotations with new fields.
  • We would have to upgrade all usages of the annotations in the code base.
  • We would have to provide compatibility analysis tools so that users benefit from the upgrade effort.
  • We would have to provide a tool to report the compatibility matrix as a big picture.
  • We would consider providing tool support to reduce the maintenance effort. A change to one class's backwardCompatible field must be propagated so that all depending classes get the same change. This gets more complicated if fine-grained control at the method level is required. But this should be done with care, because developers should be aware of any breaking changes and fully understand their impact in every corner; tool automation might make some important information go unnoticed.
  • As a bootstrap, for the next version that supports this feature, these two fields should be added to all used @PublicEvolving annotations with the next release version as the default value.

Test Plan

  • We would have to add tests to ensure there is no conflict between the backwardCompatible and lastChange fields of each class.
  • We would ideally need tests to ensure no violation of the versions in the dependency matrix.
  • We would build a pre-production environment and some sample cases to test the backward/forward compatibility, e.g. a dummy connector or Flink job built with a new Flink should run on an older Flink cluster whose version is greater than or equal to the backwardCompatible version.
  • If the @Public interface supports backwardCompatible, we would have to make sure the backwardCompatible version is greater than or equal to the oldest version we support. Some rules might also be considered for the @PublicEvolving interface.
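One candidate invariant for the first test bullet is backwardCompatible <= lastChange: an API cannot claim a compatibility floor newer than the version in which it last changed. This is an inferred rule, not one stated in the proposal; the annotation and FlinkVersion below are redeclared locally as a minimal sketch:

```java
import java.lang.annotation.*;

// Local stand-ins for the proposed annotation and version enum (illustration only).
enum FlinkVersion { V1_12_0, V1_13_0, V1_14_0, V1_15_0 }

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface PublicEvolving {
    FlinkVersion backwardCompatible();
    FlinkVersion lastChange();
}

@PublicEvolving(backwardCompatible = FlinkVersion.V1_12_0, lastChange = FlinkVersion.V1_13_0)
class Foo {}

public class AnnotationConsistencyTest {

    // The compatibility floor must not be newer than the version of the last change.
    // Unannotated classes are trivially consistent.
    static boolean isConsistent(Class<?> clazz) {
        PublicEvolving ann = clazz.getAnnotation(PublicEvolving.class);
        return ann == null || ann.backwardCompatible().compareTo(ann.lastChange()) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(Foo.class)); // -> true
    }
}
```

In practice this check would run over all annotated classes in the code base, e.g. as part of the build.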

Follow ups

  • We would have to further discuss the Flink forward compatibility control.

References

[1] https://lists.apache.org/thread/kzhfc3t6omzo2kyo8zj9yxoh8twq5fzr

[2] https://lists.apache.org/thread/5osq7loyx5cstsdflw6smtx2x1lw7dk7

[3] https://lists.apache.org/thread/gkczh583ovlo1fpj7l61cnr2zl695xkp

[4] https://cwiki.apache.org/confluence/display/FLINK/FLIP-196%3A+Source+API+stability+guarantees

[5] https://eandt.theiet.org/content/articles/2011/09/software-backwards-compatibility/

[6] https://www.infoworld.com/article/3401920/how-to-make-your-rest-apis-backward-compatible.html#:~:text=API compatibility example,being consumed by different clients.&text=An API is backward compatible,future versions of the API.

[7] https://www.bensnider.com/abi-compatibility-whoopdty-do-what-does-it-all-mean.html

[8] https://cwiki.apache.org/confluence/display/FLINK/FLIP-197%3A+API+stability+graduation+process

[9] https://lists.apache.org/thread/frlkh0vftfzox95zdwtk116vplo3xmg9