Single Message Transforms (SMTs), introduced by KIP-66, have greatly improved the usability of connectors by enabling input/output data to be processed without additional streaming applications.

These benefits have been limited, however, because most SMTs only operate on fields available on the root structure.

This KIP aims to add support for nested structures to the existing SMTs, where this makes sense, and to include the abstractions needed to reuse this support in future SMTs.

...

Of the existing SMTs, the following are impacted by this change:

New configuration flags

...

Name | Type | Default | Importance | Documentation
transforms.field.style | STRING | plain | HIGH | Permitted values: plain, nested. Defines how to traverse a record structure to apply a transformation. If set to "plain", then the transformations will only apply to the elements located at the root of the message. If set to "nested", then nested elements (accessed by "field.separator") will be affected by the transformations as well.
transforms.field.separator | STRING | . | LOW | Permitted values: ., /. When defining the path to a field, this separator determines how the path is divided into parent and child elements. If set to ".", then a path "parent.complex.element" will access the "parent" struct at the root, then the "complex" struct, and apply the transformation to "element". If the default value collides with the element names used in the record, then it can be changed to one of the other 3 alternative values ...
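
To make the two flags concrete, here is a minimal Python sketch of how a field path could be resolved against a schemaless record. It is an illustration only and not the Connect implementation: the function name resolve_field and the use of plain dicts are assumptions made for this example.

```python
def resolve_field(record, path, style="plain", separator="."):
    """Look up `path` in a schemaless record represented as nested dicts.

    Hypothetical helper: mirrors the documented meaning of
    field.style and field.separator, nothing more.
    """
    if style == "plain":
        # Current behavior: the path is a single root-level field name,
        # so any separator characters in it are taken literally.
        return record[path]
    if style != "nested":
        raise ValueError(f"unsupported field.style: {style!r}")
    current = record
    # Walk from parent to child, one path step at a time.
    for step in path.split(separator):
        current = current[step]
    return current


record = {
    "parent": {"complex": {"element": 42}},
    "a.b": {"c": 1},  # a field name that contains the default separator
}

print(resolve_field(record, "parent.complex.element", style="nested"))
print(resolve_field(record, "a.b", style="plain"))
print(resolve_field(record, "a.b/c", style="nested", separator="/"))
```

The last call shows why a second separator is useful: with "/" as the separator, the dot in "a.b" is treated as part of the field name.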

SMTs affected

The following SMTs extend their field configurations to support nested notation:

  • Cast: extend spec to support nested notation.
  • ExtractField: extend the field config to support nested notation.
  • HeaderFrom: extend the fields list to support nested notation.
  • MaskField: extend the fields list to support nested notation.
  • ReplaceField: extend the include and exclude lists to support nested notation.
  • TimestampConverter: extend the field config to support nested notation.
  • ValueToKey: extend the fields list to support nested notation.
  • InsertField: extend the field configs to support nested notation.

Will require additional configurations:

  • HoistField: add a hoisted config to point to a specific path to hoist.  

    Name | Type | Default | Importance | Documentation
    hoisted | STRING | <empty> | MEDIUM | Path to the element to be hoisted. If empty, the root struct is hoisted.


    • For example:

      Code Block
         hoisted = nested.val
         field = line
      
         value (before):
         {
           "nested": {
             "val": 42,
             "other val": 96
           }
         }
      
         value (after):
         {
           "nested": {
             "line": {
               "val": 42
             },
             "other val": 96
           }
         }
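
The HoistField example above can be sketched in Python as follows. This is a hedged illustration of the before/after values shown, assuming schemaless records as dicts; the function name hoist and its signature are hypothetical and not part of the proposal.

```python
import copy


def hoist(value, hoisted, field, separator="."):
    """Wrap the element at path `hoisted` under a new `field` entry.

    Hypothetical helper reproducing the example above. With an empty
    `hoisted` path, the whole root value is wrapped.
    """
    result = copy.deepcopy(value)  # leave the input record untouched
    if not hoisted:
        return {field: result}
    *parents, leaf = hoisted.split(separator)
    parent = result
    # Walk down to the struct that directly contains the hoisted element.
    for step in parents:
        parent = parent[step]
    # Move the element under the new field, keeping its siblings in place.
    parent[field] = {leaf: parent.pop(leaf)}
    return result


before = {"nested": {"val": 42, "other val": 96}}
after = hoist(before, hoisted="nested.val", field="line")
print(after)
```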


These SMTs do not require nested structure support:

  • DropHeaders: Drops one or multiple headers.
  • Filter: Drops the whole message based on a predicate.
  • InsertHeader: Insert a specific message to the header.
  • RegexRouter: Acts on the topic name.
  • SetSchemaMetadata: Acts on root schema.
  • TimestampRouter: Acts on timestamp.
  • Flatten: Acts on the whole key or message. 

Proposed Changes

Nested notation

Using dots tends to be the most intuitive way to access nested record structures (e.g. jq tooling already uses it: https://stedolan.github.io/jq/manual/#Basicfilters), and it will cover most scenarios.

Dots are already allowed as part of element names on JSON (i.e. schemaless) records (e.g. {'nested.key': {'val':42}}). Instead of escaping them with backslashes, which leads to unfriendly JSON configurations, this KIP proposes a configuration to switch to another separator.

As scenarios where dots are used in JSON field names should be rare, this KIP proposes to keep dots as the default separator.

If users find that their field names include dots or other separators, they can define another separator to simplify their configuration.

Compatibility, Deprecation, and Migration Plan

Existing SMT configurations will not be affected by these changes, as the default field.style is plain, which represents the current behavior.

Rejected Alternatives

Keep ExtractField as it is and use it multiple times until reaching nested fields

This KIP proposes to simplify this configuration by replacing multiple invocations with only one nested one.

Use dots as the only separator and escape with backslashes when they collide

To keep a single separator, one alternative is to always separate with dots and, when a dot collides with an existing field name, use a backslash ("\") to mark dots that are part of the name: e.g. "this.field" would refer to the nested field "field" under the top-level "this" field, and "this\.field" would refer to the field named "this.field".

However, backslashes are also the escape character in JSON. This could lead to unfriendly configurations like "this\\\\.is\\\\.not\\\\.very\\\\.readable"
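
The readability concern can be reproduced with Python's standard json module. The sketch below is only an illustration: the config key transforms.extract.field is a hypothetical name chosen for this example. Each layer of JSON encoding doubles the backslashes in the escaped path.

```python
import json

# A path that uses a backslash to escape a dot in the field name.
escaped_path = r"this\.field"

# Hypothetical connector config; the key name is for illustration only.
config = {"transforms.extract.field": escaped_path}

# One level of JSON encoding turns each backslash into two.
encoded_once = json.dumps(config)
print(encoded_once)

# Embedding that JSON document as a string inside another JSON or shell
# layer doubles the backslashes again, giving four per escaped dot.
encoded_twice = json.dumps(encoded_once)
print(encoded_twice)
```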

Use repeated separator to escape

Using double dots to escape separators is another alternative to try sticking to using only dots as a field separator.

Comparing:

With double dots:

Code Block
{
  "transforms.field.style": "nested",
  "transforms": "cast",
  "transforms.cast.type": "...",
  "transforms.cast.spec": "address..personal.country:string"
}

With separator:

Code Block
{
  "transforms.field.style": "nested",
  "transforms.field.separator": "/",
  "transforms": "cast",
  "transforms.cast.type": "...",
  "transforms.cast.spec": "address.personal/country:string"
}


Even though changing the separator represents yet another property to configure, it will be used in a minority of cases, and it could be easier to understand compared to escaping by repeating dots.

It also resembles the "delimiter" configuration in the Flatten SMT, which could make it more familiar to Connect users.
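
To compare how the two options would be parsed, the following Python sketch splits the field paths from the two configurations above into steps. Both function names are hypothetical, and the double-dot rule shown is only one possible reading of that rejected alternative.

```python
import re


def split_double_dot(path):
    """Split on single dots; a doubled dot stands for a literal dot.

    Assumed rule for illustration: a dot is a separator only when it has
    no dot directly before or after it.
    """
    steps = re.split(r"(?<!\.)\.(?!\.)", path)
    return [step.replace("..", ".") for step in steps]


def split_with_separator(path, separator="."):
    """Split on the configured separator; other characters are literal."""
    return path.split(separator)


print(split_double_dot("address..personal.country"))
print(split_with_separator("address.personal/country", separator="/"))
```

Both calls yield the same two steps, a field named "address.personal" followed by "country", but the separator variant needs no escaping rule and has no ambiguity around runs of three or more dots.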

Potential KIPs

Future KIPs could extend this support for:

  • Recursive notation: name a field and apply the transformation to all fields across the schema matching that name, as proposed by KAFKA-10640.

  • Access to arrays: add a notation for arrays (e.g. [] notation) to represent access to arrays and apply SMTs to fields within an array.