Status
Current state: Under Discussion
Discussion thread: here
JIRA: here
Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).
Motivation
Single Message Transforms (SMTs), introduced in KIP-66, have greatly improved the usability of Kafka Connect by allowing input/output data to be processed without additional streaming applications.
However, these benefits are limited because SMTs can only operate on fields available on the root structure.
This KIP aims to add support for nested structures to the existing SMTs, where this makes sense, and to include the abstractions needed to reuse this support in future SMTs.
Public Interfaces
Of the existing SMTs, the following are impacted by this change:
New configuration flags
Name | Type | Default | Importance | Documentation |
---|---|---|---|---|
transforms.<name>.field.style | STRING | plain | | Permitted values: plain, nested |
transforms.<name>.field.separator | STRING | . | LOW | Permitted values: " |
Example:
{ "transforms": "cast", "transforms.cast.field.style": "nested", "transforms.cast.type": "...", "transforms.cast.spec": "address.personal:string" }
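To illustrate how field.style could change the way a Cast spec is read, here is a minimal Python sketch. It is not the Connect implementation (the SMTs are written in Java); parse_cast_spec is a hypothetical helper, and the sketch assumes that spec is a comma-separated list of field:type pairs and that nested paths are split on the configured separator.

```python
def parse_cast_spec(spec, style="plain", separator="."):
    """Return a list of (path, target_type) pairs for a Cast-like spec.

    Hypothetical helper for illustration only. Whole-value casts
    (a spec with no field name) are not handled here.
    """
    parsed = []
    for entry in spec.split(","):
        # Split on the last colon so the type is always the final token.
        field, _, target_type = entry.strip().rpartition(":")
        if style == "nested":
            # nested: each separator marks one level of nesting
            path = tuple(field.split(separator))
        else:
            # plain: the whole string is one root-level field name
            path = (field,)
        parsed.append((path, target_type))
    return parsed


print(parse_cast_spec("address.personal:string"))
print(parse_cast_spec("address.personal:string", style="nested"))
```

Under the plain style the path has a single element, address.personal, so existing configurations keep their meaning; under the nested style it has two elements, address and personal.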
SMTs affected
The following SMTs extend their field configuration to support dot-separated nested notation:
- Cast: extend spec to support nested notation.
- ExtractField: extend the field to support nested notation.
- HeaderFrom: extend the fields list to support nested notation.
- MaskField: extend the fields list to support nested notation.
- ReplaceField: extend the include and exclude lists to support nested notation.
- TimestampConverter: extend the field to support nested notation.
- ValueToKey: extend the fields list to support nested notation.
- InsertField: Extend field configs to support nested notation.
The following SMT requires an additional configuration:
- HoistField: add a hoisted config to point to a specific path to hoist.

Name | Type | Default | Importance | Documentation |
---|---|---|---|---|
hoisted | STRING | <empty> | MEDIUM | Path to the element to be hoisted. If empty, the root struct/map is hoisted. |

For example:
hoisted = nested.val
field = line
value (before): { "nested": { "val": 42, "other val": 96 } }
value (after): { "nested": { "line": { "val": 42 }, "other val": 96 } }
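The HoistField example can be reproduced with a short Python sketch. This is one reading of the before/after values as written, not the proposed Java implementation: the hoist function and its parameter names are hypothetical, records are modelled as plain dicts, and the dot is assumed to be the separator.

```python
import copy


def hoist(value, field, hoisted="", separator="."):
    """Wrap the element at path `hoisted` under a new field named `field`.

    Hypothetical helper that mirrors the example: the hoisted element is
    moved, under its own name, into a new map stored as `field` in its parent.
    """
    if not hoisted:
        # Empty path: the root struct/map itself is hoisted.
        return {field: value}
    *parents, leaf = hoisted.split(separator)
    result = copy.deepcopy(value)  # leave the input record untouched
    parent = result
    for name in parents:
        parent = parent[name]
    parent[field] = {leaf: parent.pop(leaf)}
    return result


before = {"nested": {"val": 42, "other val": 96}}
after = hoist(before, field="line", hoisted="nested.val")
print(after)
```

With an empty hoisted path the sketch falls back to wrapping the whole record, which corresponds to the default described in the table.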
These SMTs do not require nested structure support:
- DropHeaders: Drop one or multiple headers.
- Filter: Drops the whole message based on a predicate.
- InsertHeader: Insert a specific message to the header.
- RegexRouter: Acts on the topic name.
- SetSchemaMetadata: Acts on root schema.
- TimestampRouter: Acts on timestamp.
- Flatten: Acts on the whole key or message.
Proposed Changes
Nested notation
Dots tend to be the most intuitive way to access nested record structures; for example, jq already uses them (https://stedolan.github.io/jq/manual/#Basicfilters). Dot notation will cover most scenarios.
Dots are already allowed as part of element names in JSON (i.e. schemaless) records (e.g. {'nested.key': {'val':42}}). Instead of escaping them with backslashes, which leads to unfriendly JSON configurations, this KIP proposes a configuration to switch to another separator.
Users who know that their field names include dots (or whichever separator is in use) can define a different separator to simplify their configuration.
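The effect of switching the separator can be sketched in a few lines of Python. The get_nested helper is hypothetical and records are modelled as plain dicts; the country value is made up for illustration.

```python
def get_nested(record, path, separator="."):
    """Walk `record` one level per path segment and return the value found."""
    current = record
    for name in path.split(separator):
        current = current[name]  # raises KeyError if a segment is missing
    return current


record = {
    "address": {"personal": {"country": "NL"}},  # illustrative data
    "nested.key": {"val": 42},                   # field name containing a dot
}

# Default separator: every dot is one level of nesting.
country = get_nested(record, "address.personal.country")

# A field name with a dot is only reachable once the separator is changed.
val = get_nested(record, "nested.key/val", separator="/")
print(country, val)
```

With the default separator, the path nested.key.val would be split into three segments and fail on the first one, which is the collision that the separator configuration is meant to avoid.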
Compatibility, Deprecation, and Migration Plan
Existing SMT configurations will not be affected by these changes, as the default field.style is plain, which represents the current behavior.
Rejected Alternatives
Keep ExtractField as it is and use it multiple times until reaching nested fields
This KIP proposes to simplify such configurations by replacing multiple chained invocations with a single nested one.
Use dots as the only separator and escape with backslashes when they collide
To keep a single separator, one alternative is to separate with dots and, where a dot collides with an existing field name, use a backslash ("\") to mark dots that are part of the name: "this.field" would refer to the nested field "field" under the top-level "this" field, and "this\.field" would refer to the field named "this.field".
However, the backslash is also the escape character in JSON. This could lead to unfriendly configurations like "this\\\\.is\\\\.not\\\\.very\\\\.readable"
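A small Python sketch of this rejected alternative shows both the parsing rule and the escaping overhead. The split_escaped helper is hypothetical; the JSON encoding uses the standard json module, which doubles every backslash in a string.

```python
import json


def split_escaped(path, separator=".", escape="\\"):
    """Split `path` on separators, treating escape+separator as a literal."""
    parts, current, i = [], [], 0
    while i < len(path):
        ch = path[i]
        if ch == escape and i + 1 < len(path) and path[i + 1] == separator:
            current.append(separator)  # escaped dot stays in the field name
            i += 2
        elif ch == separator:
            parts.append("".join(current))  # unescaped dot ends a segment
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


nested = split_escaped("this.field")     # two segments
literal = split_escaped(r"this\.field")  # one segment containing a dot

# In a JSON connector config the single backslash must itself be escaped.
encoded = json.dumps(r"this\.field")
print(nested, literal, encoded)
```

Each escaped dot already costs two backslashes in JSON, and any further layer of escaping doubles that again.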
Use repeated separator to escape
Another alternative that keeps dots as the only field separator is to escape a literal dot by doubling it.
Comparing:
With double dots | With separator |
---|---|
{ "transforms": "cast", "transforms.cast.field.style": "nested", "transforms.cast.type": "...", "transforms.cast.spec": "address..personal.country:string" } | { "transforms": "cast", "transforms.cast.field.style": "nested", "transforms.cast.field.separator": "/", "transforms.cast.type": "...", "transforms.cast.spec": "address.personal/country:string" } |
Even though changing the separator represents yet another property to configure, it will be used in a minority of cases, and it could be easier to understand compared to escaping by repeating dots.
It is also similar to the "delimiter" configuration in the Flatten SMT, which could make it more familiar to Connect users.
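The two columns of the comparison can be checked against each other with a Python sketch. The split_doubled helper is hypothetical and implements one possible reading of the double-dot rule (a doubled separator is a literal, scanned left to right); the proposal itself does not define this parsing.

```python
def split_doubled(path, separator="."):
    """Split `path` on single separators; a doubled separator is a literal."""
    parts, current, i = [], [], 0
    while i < len(path):
        if path.startswith(separator * 2, i):
            current.append(separator)  # ".." stands for one literal dot
            i += 2 * len(separator)
        elif path.startswith(separator, i):
            parts.append("".join(current))  # single dot ends a segment
            current = []
            i += len(separator)
        else:
            current.append(path[i])
            i += 1
    parts.append("".join(current))
    return parts


with_double_dots = split_doubled("address..personal.country")
with_separator = "address.personal/country".split("/")
print(with_double_dots, with_separator)
```

Both forms yield the same two segments, address.personal and country, but the custom separator needs no special parsing. Under this reading, a run of three dots such as a...b is resolved as a literal dot followed by a separator, which is the kind of subtlety that makes the doubled form harder to read.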
Potential KIPs
Future KIPs could extend this support for:
- Recursive notation: name a field and apply it to all fields across the schema matching that name, as proposed by
- Access to arrays: Adding notation for arrays (e.g. []) to represent access to arrays and applying SMTs to fields within an array.