...
Current state: Under Discussion
Discussion thread: here
JIRA: KAFKA-10627
Proposed Pull Request: #9492#11523
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
...
The TimestampConverter transform can convert only one field per use of the transform (via the field configuration parameter). In a real environment, however, an event often carries multiple timestamps (such as Created On, Last Updated On, Approved On, etc.), and if one of them needs to be converted using TimestampConverter then probably more than one (if not all of them) need to be transformed. For large messages which may already be going through multiple other transforms, performance degrades noticeably when more than a few TimestampConverter transforms are chained just to cover all of the different fields.
...
- Change the configuration parameter field to be called fields, since it will now support multiple comma-separated field names (while remaining backward compatible for some time).
- Add new configuration parameters: format.input, to allow for a pattern format which supports multiple variations when parsing a string, and format.output, to specify the exact string format to output when converting from a Date/Time to a string.
- The configuration parameter format could possibly be removed at a later date (but remains for now for backward compatibility), or could also be used to specify both format.input and format.output at the same time for simpler scenarios (assuming you just have a single string input format).
- As general housekeeping, the TimestampConverter class should also be updated at the same time to include public ConfigName and ConfigDefault interfaces instead of various public string class attributes for the configuration properties, similar to what has been done in several of the other SMTs (like ReplaceField, for example).
Proposed Changes
Supporting Multiple Fields
To support multiple fields, change the field property from a single string to type ConfigDef.Type.LIST.
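As a rough illustration of what a list-typed property means for the transform, the sketch below splits a comma-separated value into field names and falls back to the legacy single field value when fields is not set. It deliberately avoids the Kafka classes so that it stands alone; the class name FieldListSketch, both method names, and the precedence of fields over field are assumptions made for illustration, not part of the proposal.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of TimestampConverter.
class FieldListSketch {

    // Splits a comma-separated value into trimmed, non-empty names.
    static List<String> parseList(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> names = new ArrayList<>();
        for (String part : raw.trim().split("\\s*,\\s*")) {
            if (!part.isEmpty()) {
                names.add(part);
            }
        }
        return names;
    }

    // Assumption: the new "fields" value wins, and the legacy "field" value
    // is used only when "fields" is absent or empty.
    static List<String> resolveFields(String fields, String legacyField) {
        List<String> resolved = parseList(fields);
        return resolved.isEmpty() ? parseList(legacyField) : resolved;
    }
}
```

With this sketch, an existing configuration that only sets field keeps working, while a new configuration can list several names under fields.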
Then, to actually perform the transformation on multiple fields, instead of checking field.name().equals(config.field) as seen in applyValueWithSchema:
```java
private Struct applyValueWithSchema(Struct value, Schema updatedSchema) {
    //...
    for (Field field : value.schema().fields()) {
        final Object updatedFieldValue;
        if (field.name().equals(config.field)) {
            //...
```
... it should check whether the new fields configuration parameter contains the field name. The rest of this logic can stay the same.
In applySchemaless, it should loop over the entries of the Map instead of doing a single put based on the old single field name.
Instead of this:
Something like this:
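A minimal sketch of the multi-field schemaless path follows. It is not the code from the pull request: the class name MultiFieldSchemalessSketch and the convert parameter (a stand-in for the transform's existing timestamp conversion) are hypothetical, and the Kafka types are left out so that the example is self-contained.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical stand-alone illustration of the for-each approach.
class MultiFieldSchemalessSketch {

    // Returns a copy of the record value in which every entry whose key is
    // listed in "fields" has been passed through "convert".
    static Map<String, Object> applySchemaless(Map<String, Object> value,
                                               List<String> fields,
                                               UnaryOperator<Object> convert) {
        Map<String, Object> updated = new HashMap<>(value);
        for (Map.Entry<String, Object> entry : value.entrySet()) {
            if (fields.contains(entry.getKey())) {
                updated.put(entry.getKey(), convert.apply(entry.getValue()));
            }
        }
        return updated;
    }
}
```

Entries that are not named in fields are copied through unchanged, which mirrors the "rest of this logic can be the same" intent for the schema-based path.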
...
For clarity and consistency, the property should also be renamed to fields. Then, when performing the translation, the code can apply the transformation to all fields in the list instead of just the one field specified in the old field property.
Supporting Multiple String Input Formats
When a Date/Time field is output to a string, the format must continue to be given as a single specific format, not as some kind of pattern. Because of this, we need to separate the format configuration parameter into two: one parameter for output to strings with an exact format, and one parameter for the input format of strings to be parsed into the target.type, which can support a pattern covering different variations of string-based dates or timestamps.
To support this, two new parameters will be added: format.input and format.output.
The existing format parameter can also remain in place, to allow a single configuration that provides both the input and output formats at the same time and works exactly as it did before this change. In this scenario it would not support multiple different input formats (again, the same as before). However, setting a mix of the old and the new format parameters should not be allowed.
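The rule about not mixing old and new parameters could be enforced when the configuration is read. The following is only a sketch under assumptions: the class FormatConfigSketch and its resolve method are hypothetical, and IllegalArgumentException is used to keep the example free of Kafka dependencies (a real transform would presumably raise Kafka's own configuration exception).

```java
// Hypothetical holder for the resolved input and output patterns.
class FormatConfigSketch {
    final String inputPattern;
    final String outputPattern;

    private FormatConfigSketch(String inputPattern, String outputPattern) {
        this.inputPattern = inputPattern;
        this.outputPattern = outputPattern;
    }

    private static boolean isSet(String value) {
        return value != null && !value.isEmpty();
    }

    // Legacy "format" supplies both patterns; otherwise the new
    // "format.input" / "format.output" values are used as given.
    static FormatConfigSketch resolve(String format, String formatInput, String formatOutput) {
        boolean legacy = isSet(format);
        boolean split = isSet(formatInput) || isSet(formatOutput);
        if (legacy && split) {
            throw new IllegalArgumentException(
                "Use either format, or format.input/format.output, but not both");
        }
        if (legacy) {
            return new FormatConfigSketch(format, format);
        }
        return new FormatConfigSketch(formatInput, formatOutput);
    }
}
```

Treating format as shorthand for both patterns keeps existing connector configurations valid without any change.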
To support multiple input patterns, the suggestion is to make the parsing of strings into the target type use features of java.time such as DateTimeFormatter, instead of relying on the much older and more limited java.text.SimpleDateFormat. java.time was added in Java 8, so introducing its usage in Kafka should not add any new dependency.
The new format.input property will require a pattern string that is compatible with the JDK's DateTimeFormatter.ofPattern() method. The pattern is regular expression-like in that square brackets mark optional sections. For example, patterns like this would be supported: "[yyyy-MM-dd[['T'][ ]HH:mm:ss[.SSSSSSSz][.SSS[XXX][X]]]]"
This also means that within the Config instance there should be two separate formatters for input and output (instead of just the one called Config.format today).
The output formatter can work exactly the same as today, basically...
```java
SimpleDateFormat outputFormat = new SimpleDateFormat(outputFormatPattern);
outputFormat.setTimeZone(UTC);
```
But the input formatter will come from DateTimeFormatter instead. Something like this:
```java
DateTimeFormatter inputFormat = DateTimeFormatter.ofPattern(inputFormatPattern).withZone(ZoneOffset.UTC);
```
Then there will need to be changes in the TRANSLATORS Map, which actually performs the conversion (via a call to convertTimestamp()). Namely:
...
The example pattern given earlier would be able to successfully parse multiple different formats appearing in the same field, including:
- 2021-11-22
- 2021-11-22 11:19:45
- 2021-11-22T11:19:45
- 2021-11-22T11:19:45.000Z
- and more...
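To illustrate how one formatter can cover the variations listed above and still hand back a java.util.Date, here is a small stand-alone sketch. The class name MultiFormatParseSketch and the choice of parseBest, with date-only input treated as midnight UTC, are assumptions made for the example rather than the behaviour defined by this proposal.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;
import java.util.Date;

// Hypothetical illustration; not the TimestampConverter implementation.
class MultiFormatParseSketch {

    // The example pattern from this page; square brackets mark optional sections.
    static final String INPUT_PATTERN =
        "[yyyy-MM-dd[['T'][ ]HH:mm:ss[.SSSSSSSz][.SSS[XXX][X]]]]";

    static final DateTimeFormatter INPUT_FORMAT =
        DateTimeFormatter.ofPattern(INPUT_PATTERN).withZone(ZoneOffset.UTC);

    // Parses any supported variation and returns a java.util.Date.
    // Assumption: input without a time of day means midnight UTC.
    static Date parse(String text) {
        TemporalAccessor parsed =
            INPUT_FORMAT.parseBest(text, ZonedDateTime::from, LocalDate::from);
        Instant instant = (parsed instanceof ZonedDateTime)
            ? ((ZonedDateTime) parsed).toInstant()
            : ((LocalDate) parsed).atStartOfDay(ZoneOffset.UTC).toInstant();
        return Date.from(instant);
    }
}
```

Because the result is still a Date, the rest of Connect would not need to know which input variation was matched.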
Compatibility, Deprecation, and Migration Plan
- What impact (if any) will there be on existing users?
The transform configuration parameter field will be renamed to fields, but this should be done in a way that makes adoption voluntary until a major version deprecation can occur (i.e. field still works and backward compatibility is maintained).
The transform configuration parameter format will continue to function as before (as a SimpleDateFormat pattern for both input and output strings). Users who wish to use the DateTimeFormatter input format will need to use the new input- and output-specific parameters format.input and format.output instead.
- If we are changing behavior how will we phase out the older behavior?
Existing configuration parameters and public class strings will be left as they are and will continue to function as they do today, but they will be marked and described as deprecated and can be fully removed if and when it is appropriate in a future major release.
- If we need special migration tools, describe them here.
No migration tool should be necessary; users who wish to begin using the new features will just need to update their connector configurations.
- When will we remove the existing behavior?
The deprecated configuration parameter field and the public configuration-related class strings will be removed after the next major Kafka release, based on the standard deprecation practice for the Kafka project.
Rejected Alternatives
One initial thought was to change the entire transform from using java.util.Date to using java.time classes instead. However, since Kafka and Connect have a huge list of dependencies on dates and times being a java.util.Date, a bit of investigation made it apparent that the easiest thing to do would be to focus on the core problem: parsing strings into a Date in a smarter way with the help of something like DateTimeFormatter, and then continuing to return a Date for use by the rest of Connect.
...