Goals

As of version 2.0.0 M2, the capabilities of org.apache.nifi.serialization.record.RecordSchema are limited to declaring the expected data type of record fields. This is a good foundation for data validation, but there are many possible restrictions the current implementation does not support. This proposal aims to extend these capabilities by adding the ability to describe restrictions similar to those in the JSON Schema specification.

Background and Strategic Fit

In some real-life cases the existing capabilities seem insufficient, leading either to workarounds with additional steps (which can degrade performance because of the additional, possibly misused processors) or to incorrect results (because of loosely validated data). To improve the user experience and ease the setup of flows with fully validated datasets, extending the features of RecordSchema seems beneficial.

Multiple kinds of “schema declarations”, such as Avro and JSON Schema, can be turned into a RecordSchema, each with different restrictive capabilities. Of these, JSON Schema might be the most widely used and flexible. For this reason, JSON Schema might be a good model for the implementation: it provides both a guideline for which restrictions can be supported and a sufficiently expressive format for users to describe their schemas.

This document therefore proposes to extend RecordField with the option to add FieldLevelRestrictions and RecordSchema with the option to add RecordLevelRestrictions (for details, please see the Technical Details section). As part of this, a number of restrictions must be implemented. To make this accessible to users, a “JSON Schema to Record Schema” converter is necessary, together with a SchemaRegistry implementation which can be used to define the schemas.

Requirements

#	Title	Importance
1	Without extra restrictions, the current behaviour must not change	MUST
2	Provide field and structure level restriction capabilities based on the offerings of JSON Schema	MUST
3	Provide a simple JSON Schema based Schema Registry Controller Service including a “JSON Schema to Record Schema” converter.	MUST

Technical Details

Types

The basic type conversion from JSON format to RecordFieldType is the following:

JSON Schema Type	Allowed RecordFieldType
null	any
boolean	Boolean
object	Record
array	Array
number	Byte, Short, Int, Long, Bigint, Float, Double, Decimal
string	Timestamp, Date, Time, UUID, Char, String
integer	Byte, Short, Int, Long, Bigint

Note: JSON Schema also allows the type keyword to hold a list of allowed types.
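
The table above can be read as a simple lookup. The following is a minimal sketch in Python, not NiFi code: the names ALLOWED_RECORD_FIELD_TYPES, ANY and allowed_field_types are hypothetical, and the handling of a type list (a plain union of the candidates, with "null" widening to any) is an assumption that follows the table literally.

```python
# Hypothetical lookup mirroring the conversion table; not part of the NiFi API.
ANY = "any"  # sentinel for "any RecordFieldType is allowed"

ALLOWED_RECORD_FIELD_TYPES = {
    "null": ANY,
    "boolean": ["Boolean"],
    "object": ["Record"],
    "array": ["Array"],
    "number": ["Byte", "Short", "Int", "Long", "Bigint", "Float", "Double", "Decimal"],
    "string": ["Timestamp", "Date", "Time", "UUID", "Char", "String"],
    "integer": ["Byte", "Short", "Int", "Long", "Bigint"],
}


def allowed_field_types(json_type):
    """Return the allowed RecordFieldType names for a JSON Schema "type" value.

    json_type is either a single type name or a list of type names.
    Raises KeyError for a type name that is not in the table.
    """
    names = [json_type] if isinstance(json_type, str) else list(json_type)
    allowed = []
    for name in names:
        candidates = ALLOWED_RECORD_FIELD_TYPES[name]
        if candidates == ANY:
            # Assumption: "null" widens the result to any type, as in the table.
            return ANY
        for candidate in candidates:
            if candidate not in allowed:
                allowed.append(candidate)
    return allowed
```

How "null" inside a type list should interact with the existing nullable flag is left open here; treating it as a nullability marker instead of widening the result would be an equally plausible design.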

Support Matrix

The following RecordFieldTypes should support the given restrictions. Based on https://json-schema.org/draft/2020-12/json-schema-validation#name-a-vocabulary-for-structural

RecordFieldType	6.1.2 enum	6.1.3 const	6.2.1 multipleOf	6.2.2 maximum	6.2.3 exclusiveMaximum	6.2.4 minimum	6.2.5 exclusiveMinimum	6.3.1 maxLength	6.3.2 minLength	6.3.3 pattern	6.4.1 maxItems	6.4.2 minItems	6.4.3 uniqueItems	6.4.4 maxContains	6.4.5 minContains	6.5.1 maxProperties	6.5.2 minProperties	6.5.3 required	6.5.4 dependentRequired
Boolean	yes	yes
Byte	yes	yes	yes	yes	yes	yes	yes
Short	yes	yes	yes	yes	yes	yes	yes
Int	yes	yes	yes	yes	yes	yes	yes
Long	yes	yes	yes	yes	yes	yes	yes
Bigint	yes	yes	yes	yes	yes	yes	yes
Float	yes	yes		yes	yes	yes	yes
Double	yes	yes		yes	yes	yes	yes
Decimal	yes	yes		yes	yes	yes	yes
Timestamp
Date
Time
UUID
Char
Enum
String								yes	yes	yes
Record																yes	yes	yes	yes
Choice	? (tbd)	? (tbd)
Array	yes	yes									yes	yes	yes	yes	yes
Map
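
A converter would need to consult this matrix to decide whether a keyword may be attached to a field of a given type. The sketch below is one possible encoding in Python, not NiFi code: SUPPORTED_RESTRICTIONS and is_supported are hypothetical names, the cells are transcribed from the matrix as read here, and the undecided Choice row is treated as unsupported purely as a placeholder.

```python
# Hypothetical encoding of the support matrix; names are illustrative only.
INTEGRAL = frozenset({
    "enum", "const", "multipleOf", "maximum",
    "exclusiveMaximum", "minimum", "exclusiveMinimum",
})
# Float, Double and Decimal have no multipleOf entry in the matrix.
FRACTIONAL = INTEGRAL - {"multipleOf"}

SUPPORTED_RESTRICTIONS = {
    "Boolean": frozenset({"enum", "const"}),
    "Byte": INTEGRAL,
    "Short": INTEGRAL,
    "Int": INTEGRAL,
    "Long": INTEGRAL,
    "Bigint": INTEGRAL,
    "Float": FRACTIONAL,
    "Double": FRACTIONAL,
    "Decimal": FRACTIONAL,
    "String": frozenset({"maxLength", "minLength", "pattern"}),
    "Record": frozenset({
        "maxProperties", "minProperties", "required", "dependentRequired",
    }),
    "Array": frozenset({
        "enum", "const", "maxItems", "minItems",
        "uniqueItems", "maxContains", "minContains",
    }),
    # Timestamp, Date, Time, UUID, Char, Enum and Map have no entries.
    # Choice is marked "tbd" in the matrix and is left out here (placeholder).
}


def is_supported(field_type, keyword):
    """Return True if the matrix marks keyword as supported for field_type."""
    return keyword in SUPPORTED_RESTRICTIONS.get(field_type, frozenset())
```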

We distinguish between FieldLevelRestrictions and RecordLevelRestrictions. The former category contains every restriction that applies to a RecordField. The latter is the set of restrictions for the Record itself.

Note: this is based on JSON Schema. If the source format differs, the restrictions might apply in other ways (for example, Map could be restricted by maxItems). For now these applications are ignored, because with a JSON Schema source there is no way to generate a RecordSchema that includes applications other than those noted in the matrix.
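
To illustrate the difference between the two categories, here is a minimal sketch in Python rather than the eventual Java implementation. All function names are hypothetical. A field-level restriction is modelled as a predicate on a single value, a record-level restriction as a predicate on the whole record (a dict here).

```python
import re


# Field-level restrictions: each factory returns a predicate on one value.
def minimum(limit):
    return lambda value: value >= limit


def multiple_of(divisor):
    # Only meaningful for integral types, in line with the matrix.
    return lambda value: value % divisor == 0


def max_length(limit):
    return lambda value: len(value) <= limit


def pattern(regex):
    compiled = re.compile(regex)
    # JSON Schema patterns are not implicitly anchored, hence search().
    return lambda value: compiled.search(value) is not None


# Record-level restrictions: each factory returns a predicate on the record.
def required(names):
    return lambda record: all(name in record for name in names)


def dependent_required(dependencies):
    # Maps a property name to the names that must be present whenever it is.
    return lambda record: all(
        all(dep in record for dep in deps)
        for name, deps in dependencies.items()
        if name in record
    )


def validate(record, field_restrictions, record_restrictions):
    """Return True if the record satisfies every given restriction."""
    for name, checks in field_restrictions.items():
        if name in record and not all(check(record[name]) for check in checks):
            return False
    return all(check(record) for check in record_restrictions)
```

With no restrictions configured, validate accepts every record, which mirrors the requirement that the current behaviour must not change without extra restrictions.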

String format

JSON Schema might, however, provide additional context for a given field using the format attribute. Some of the predefined formats might be used to achieve a more accurate type correspondence. These are:

date
date-time

This information might narrow down the valid types for a string type element.
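
As a sketch of how this narrowing could work, the Python snippet below picks a single type when a known format is present and falls back to the full list of string candidates otherwise. The names are hypothetical, and the specific pairing (date to Date, date-time to Timestamp) is an assumption; the proposal only states that these formats might narrow the valid types.

```python
# Candidate types for a JSON Schema string, taken from the conversion table.
STRING_FIELD_TYPES = ["Timestamp", "Date", "Time", "UUID", "Char", "String"]

# Assumed pairing of format values to a single RecordFieldType name.
FORMAT_TO_FIELD_TYPE = {
    "date": "Date",
    "date-time": "Timestamp",
}


def string_field_types(schema):
    """Return candidate type names for a JSON Schema node of type string."""
    if schema.get("type") != "string":
        raise ValueError("expected a schema node with type 'string'")
    narrowed = FORMAT_TO_FIELD_TYPE.get(schema.get("format"))
    if narrowed is not None:
        return [narrowed]
    # Unknown or missing format: no narrowing is possible.
    return list(STRING_FIELD_TYPES)
```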

Schema Registry

The AvroSchemaRegistry can be a good template for adding a simple-to-use schema registry backed by the newly added “JSON Schema to Record Schema” converter. This approach does not bring in additional dependencies and complexity and provides a familiar user experience.

Alternatives and open questions

The functionality of the new schema registry would overlap heavily with the AvroSchemaRegistry. It is worth evaluating the benefits of sharing the code base between the two.
Later on there might be other source formats (XSD?), which raises the question of whether we prefer a separate controller service for every format or a single one where the format can be picked as a strategy.

Assumptions

The implementation must be open for adding further restrictions
The “JSON Schema to Record Schema” converter in ConfluentSchemaRegistry will not be altered, in order to avoid any possible regression. In a follow-up effort, it might be merged with the converter added by this change.

Questions

Question	Outcome
Currently there is a “nullable” flag in the RecordField. This overlaps with the “6.5.3. required” restriction. These might be merged.
The enum specification of JSON Schema allows multiple data types. The RecordField does not seem to support this, except maybe through the Choice data type. It is suggested to allow only one type. See: https://json-schema.org/learn/miscellaneous-examples#enumerated-values
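
If the suggestion of allowing only one type per enum is adopted, the converter would need to reject mixed enums. A minimal Python sketch of such a check follows. The function names are hypothetical, and grouping integers and fractional numbers under "number" is an assumption made for brevity.

```python
def json_type_of(value):
    """Return the JSON type name of an already parsed JSON value."""
    if value is None:
        return "null"
    # bool must be tested before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def single_enum_type(values):
    """Return the common JSON type of the enum values, or raise ValueError."""
    types = {json_type_of(value) for value in values}
    if len(types) != 1:
        raise ValueError(f"enum must contain exactly one type, found {sorted(types)}")
    return next(iter(types))
```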

Not Doing

Schema inference will not be extended with any new restriction. Deriving restrictions from sampled data is considered misleading.
The JSON capabilities of the current ConfluentSchemaRegistry will not be updated within this effort.
Schema referencing and composition features (like oneOf, $anchor, $ref) are not supported in the initial version.
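
Since these keywords are out of scope, a converter might report them up front instead of silently ignoring them. The Python sketch below is one naive way to do that and is not part of the proposal: UNSUPPORTED_KEYWORDS and find_unsupported are hypothetical names, and the keyword set covers only the three examples named in this section.

```python
# Keywords named as out of scope for the initial version (examples only).
UNSUPPORTED_KEYWORDS = {"oneOf", "$anchor", "$ref"}


def find_unsupported(schema, path="#"):
    """Return the paths of unsupported keywords found in a parsed schema.

    This is a naive walk over every dict key. A property that happens to be
    named like a keyword (for example a field called "oneOf") is reported too.
    """
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            child_path = f"{path}/{key}"
            if key in UNSUPPORTED_KEYWORDS:
                found.append(child_path)
            found.extend(find_unsupported(value, child_path))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            found.extend(find_unsupported(item, f"{path}/{index}"))
    return found
```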
