Goals

As of version 2.0.0 M2, the capabilities of org.apache.nifi.serialization.record.RecordSchema are limited to declaring the expected data type of record fields. This is a good foundation for data validation, but there are many possible restrictions the current implementation does not support. This proposal aims to extend these capabilities by making it possible to describe restrictions similar to those of the JSON Schema specification.

Background and Strategic Fit

In some real-life cases the existing capabilities seem insufficient, leading either to workarounds with additional steps (which can degrade performance because of the additional, possibly misused processors) or to incorrect results (because of loosely validated data). To improve the user experience and ease the setup of flows with a fully validated dataset, improving the features of RecordSchema seems beneficial.

Multiple kinds of “schema declarations”, such as Avro and JSON Schema, can be turned into a RecordSchema, each with different restrictive capabilities. Of these, JSON Schema might be the most widely used and flexible. For this reason JSON Schema might be a good model for the implementation: it provides both a guideline for which restrictions can be supported and a sufficiently expressive way for users to describe their schema.

This document therefore proposes to extend RecordField with the option to add different FieldLevelRestrictions, and RecordSchema with the option to add RecordLevelRestrictions (for details, please see the Technical Details section). As part of this, a number of restrictions must be implemented. To make this accessible to users, a “JSON Schema to Record Schema” converter is necessary, together with a SchemaRegistry implementation which can be used to define the schemas.

Requirements

| # | Title | Importance |
|---|---|---|
| 1 | Without extra restrictions, the current behaviour must not change | MUST |
| 2 | Provide field and structure level restriction capabilities based on the offerings of JSON Schema | MUST |
| 3 | Provide a simple JSON Schema based Schema Registry Controller Service including a “JSON Schema to Record Schema” converter. | MUST |

Technical Details

Types

The basic type conversion from JSON Schema types to RecordFieldType is as follows:

| JSON Schema Type | Allowed RecordFieldType |
|---|---|
| null | any |
| boolean | Boolean |
| object | Record |
| array | Array |
| number | Byte, Short, Int, Long, Bigint, Float, Double, Decimal |
| string | Timestamp, Date, Time, UUID, Char, String |
| integer | Byte, Short, Int, Long, Bigint |

Note: JSON Schema allows a list of allowed types!
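
As a rough illustration of the mapping above, including the case where a schema declares a list of types, a minimal Java sketch could look like the following. The class JsonTypeMappingSketch, its FieldType enum (a local stand-in for RecordFieldType) and the method allowedFieldTypes are hypothetical names chosen for this example and are not part of the NiFi API.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only; none of these names exist in NiFi.
class JsonTypeMappingSketch {

    // Local stand-in for the RecordFieldType values listed in the table.
    enum FieldType {
        BOOLEAN, BYTE, SHORT, INT, LONG, BIGINT, FLOAT, DOUBLE, DECIMAL,
        TIMESTAMP, DATE, TIME, UUID, CHAR, STRING, RECORD, ARRAY
    }

    // JSON Schema type name -> field types allowed by the table.
    // "null" maps to "any", modelled here as every value of the stand-in enum.
    private static final Map<String, Set<FieldType>> ALLOWED = Map.<String, Set<FieldType>>of(
            "null", EnumSet.allOf(FieldType.class),
            "boolean", EnumSet.of(FieldType.BOOLEAN),
            "object", EnumSet.of(FieldType.RECORD),
            "array", EnumSet.of(FieldType.ARRAY),
            "number", EnumSet.of(FieldType.BYTE, FieldType.SHORT, FieldType.INT, FieldType.LONG,
                    FieldType.BIGINT, FieldType.FLOAT, FieldType.DOUBLE, FieldType.DECIMAL),
            "string", EnumSet.of(FieldType.TIMESTAMP, FieldType.DATE, FieldType.TIME,
                    FieldType.UUID, FieldType.CHAR, FieldType.STRING),
            "integer", EnumSet.of(FieldType.BYTE, FieldType.SHORT, FieldType.INT,
                    FieldType.LONG, FieldType.BIGINT));

    // Returns the union of the field types allowed for every declared JSON Schema type.
    static Set<FieldType> allowedFieldTypes(List<String> jsonTypes) {
        Set<FieldType> result = EnumSet.noneOf(FieldType.class);
        for (String jsonType : jsonTypes) {
            Set<FieldType> allowed = ALLOWED.get(jsonType);
            if (allowed == null) {
                throw new IllegalArgumentException("Unknown JSON Schema type: " + jsonType);
            }
            result.addAll(allowed);
        }
        return result;
    }

    public static void main(String[] args) {
        // A schema may declare a list of types, e.g. "type": ["integer", "boolean"].
        System.out.println(allowedFieldTypes(List.of("integer", "boolean")));
    }
}
```

Taking the union is only one possible reading of a type list; a converter might instead produce a Choice of the individual types.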

Support Matrix

The matrix below shows which restrictions each RecordFieldType should support. It is based on https://json-schema.org/draft/2020-12/json-schema-validation#name-a-vocabulary-for-structural


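| RecordFieldType | 6.1.2 enum | 6.1.3 const | 6.2.1 multipleOf | 6.2.2 maximum | 6.2.3 exclusiveMaximum | 6.2.4 minimum | 6.2.5 exclusiveMinimum | 6.3.1 maxLength | 6.3.2 minLength | 6.3.3 pattern | 6.4.1 maxItems | 6.4.2 minItems | 6.4.3 uniqueItems | 6.4.4 maxContains | 6.4.5 minContains | 6.5.1 maxProperties | 6.5.2 minProperties | 6.5.3 required | 6.5.4 dependentRequired |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Boolean | yes | yes | | | | | | | | | | | | | | | | | |
| Byte | yes | yes | yes | yes | yes | yes | yes | | | | | | | | | | | | |
| Short | yes | yes | yes | yes | yes | yes | yes | | | | | | | | | | | | |
| Int | yes | yes | yes | yes | yes | yes | yes | | | | | | | | | | | | |
| Long | yes | yes | yes | yes | yes | yes | yes | | | | | | | | | | | | |
| Bigint | yes | yes | yes | yes | yes | yes | yes | | | | | | | | | | | | |
| Float | yes | yes | | yes | yes | yes | yes | | | | | | | | | | | | |
| Double | yes | yes | | yes | yes | yes | yes | | | | | | | | | | | | |
| Decimal | yes | yes | | yes | yes | yes | yes | | | | | | | | | | | | |
| Timestamp | | | | | | | | | | | | | | | | | | | |
| Date | | | | | | | | | | | | | | | | | | | |
| Time | | | | | | | | | | | | | | | | | | | |
| UUID | | | | | | | | | | | | | | | | | | | |
| Char | | | | | | | | | | | | | | | | | | | |
| Enum | | | | | | | | | | | | | | | | | | | |
| String | | | | | | | | yes | yes | yes | | | | | | | | | |
| Record | | | | | | | | | | | | | | | | yes | yes | yes | yes |
| Choice | ? (tbd) | ? (tbd) | | | | | | | | | | | | | | | | | |
| Array | yes | yes | | | | | | | | | yes | yes | yes | yes | yes | | | | |
| Map | | | | | | | | | | | | | | | | | | | |
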
We distinguish between FieldLevelRestrictions and RecordLevelRestrictions. The former category contains every restriction that applies to a RecordField. The latter is the set of restrictions for the Record itself.

Note: this is based on JSON Schema. If the original format differs, there might be other applications of the restrictions (for example, Map could be restricted by maxItems). For now these applications are ignored, as with a JSON Schema source there is no way to generate a RecordSchema that includes applications other than those noted in the matrix.
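
To make the record-level category more concrete, the following Java sketch shows how required, dependentRequired and maxProperties could be expressed as restrictions on a record as a whole. The interface RecordLevelRestriction and the factory methods are hypothetical names for this illustration, the record is modelled as a plain Map rather than a NiFi Record, and "present" is assumed to mean that the key exists in the map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; none of these names exist in NiFi.
class RecordLevelRestrictionSketch {

    // Hypothetical contract: returns one message per violation, empty if the record is valid.
    interface RecordLevelRestriction {
        List<String> validate(Map<String, Object> values);
    }

    // "required": every listed field must be present in the record.
    static RecordLevelRestriction required(List<String> fieldNames) {
        return values -> {
            List<String> violations = new ArrayList<>();
            for (String name : fieldNames) {
                if (!values.containsKey(name)) {
                    violations.add("required field is missing: " + name);
                }
            }
            return violations;
        };
    }

    // "dependentRequired": the dependents are required only if the trigger field is present.
    static RecordLevelRestriction dependentRequired(String trigger, List<String> dependents) {
        return values -> {
            if (!values.containsKey(trigger)) {
                return new ArrayList<>();
            }
            return required(dependents).validate(values);
        };
    }

    // "maxProperties": the record must not contain more than max fields.
    static RecordLevelRestriction maxProperties(int max) {
        return values -> {
            List<String> violations = new ArrayList<>();
            if (values.size() > max) {
                violations.add("record has " + values.size() + " fields, at most " + max + " allowed");
            }
            return violations;
        };
    }

    public static void main(String[] args) {
        Map<String, Object> values = new LinkedHashMap<>();
        values.put("name", "Jane");
        values.put("creditCard", "1234");

        // creditCard is present, so billingAddress becomes required as well.
        RecordLevelRestriction restriction = dependentRequired("creditCard", List.of("billingAddress"));
        System.out.println(restriction.validate(values));
    }
}
```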

String format

JSON Schema might, however, provide additional context for a given field through the format attribute. Some of these predefined formats might be used to achieve a more accurate type correspondence. These are:

  • date

  • date-time

This information might narrow down the valid types for a string type element.
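
A small Java sketch of this narrowing is shown below. It assumes, purely for illustration, that date narrows to Date and date-time narrows to Timestamp; the proposal itself does not fix these targets. StringFormatSketch, its FieldType enum and narrowByFormat are hypothetical names.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative sketch only; none of these names exist in NiFi.
class StringFormatSketch {

    // Local stand-in for the RecordFieldType values allowed for a JSON Schema string.
    enum FieldType { TIMESTAMP, DATE, TIME, UUID, CHAR, STRING }

    // Narrows the candidate types of a string element using its "format" attribute.
    // Assumption: "date" -> DATE and "date-time" -> TIMESTAMP.
    static Set<FieldType> narrowByFormat(String format) {
        if (format == null) {
            return EnumSet.allOf(FieldType.class);
        }
        switch (format) {
            case "date":
                return EnumSet.of(FieldType.DATE);
            case "date-time":
                return EnumSet.of(FieldType.TIMESTAMP);
            default:
                // Unknown or unsupported formats do not narrow anything.
                return EnumSet.allOf(FieldType.class);
        }
    }

    public static void main(String[] args) {
        System.out.println(narrowByFormat("date-time"));
        System.out.println(narrowByFormat(null));
    }
}
```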

Schema Registry

The AvroSchemaRegistry can be a good template for adding a simple-to-use schema registry backed by the newly added “JSON Schema to Record Schema” converter. This approach does not bring in additional dependencies and complexity and provides a familiar user experience.

Alternatives and open questions

  • The functionality of the new schema registry would overlap heavily with the AvroSchemaRegistry. It is worth measuring the benefits of sharing the code base between the two.

  • Later on there might be other source formats (XSD?), which raises the question of whether we prefer a separate controller service for every format or rather a single one with the possibility to pick the format as a strategy.

Assumptions

  • The implementation must be open for adding further restrictions

  • The “JSON Schema to Record Schema” converter in ConfluentSchemaRegistry will not be altered, in order to avoid any possible regression. In a follow-up effort, it might be merged with the converter added by this change.

Questions

| Question | Outcome |
|---|---|
| Currently there is a “nullable” flag in the RecordField. This overlaps with the “6.5.3. required” restriction. These might be merged. | |
| The enum specification of JSON Schema allows multiple data types. The RecordField does not seem to support this, except maybe through the Choice data type. It is suggested to allow only one type. See: https://json-schema.org/learn/miscellaneous-examples#enumerated-values | |




Not Doing

  • Schema inference will not be extended with any new restriction. Sampling is considered too misleading a basis for generating further restrictions.

  • The JSON capabilities of the current ConfluentSchemaRegistry will not be updated within this effort.

  • Schema referencing functions (like oneOf, $anchor, $ref) are not supported in the initial version.


4 Comments

  1. Thanks for putting this together Simon Bence. I agree that it would be beneficial to support JSON Schemas, as it provides richer validation.

    However, an important distinction between something like JSON Schema and the Record Schema is that JSON Schema provides a very intentional mechanism for serializing and de-serializing the schema. It is intended as a mechanism to write and convey semantics to the application. The Record Schema, on the other hand, is intended as a more ephemeral representation: an in-memory representation of a schema that need not provide mechanisms for conveying restrictions directly.

    Given that, I do not think it is wise to introduce specific validation logic as a first class citizen of RecordSchema / RecordField. Rather, we should consider that a RecordField may have zero or more FieldValidators associated with it. I.e., RecordField would have a new List<FieldValidator>, where FieldValidator looks something like:

    public interface FieldValidator {
        String getDescription();
        boolean isValid(Object value);
    }

    In this way, we can implement any and all of the validations above, as well as any other validation logic that may be necessary.

    Additionally, this introduces rather little change to the API.
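
For illustration, two of the restrictions from the support matrix could be expressed against the proposed interface roughly as follows. This is only a sketch under assumptions: the FieldValidator interface is repeated from the comment above so the example is self-contained, while MaximumValidator, PatternValidator and the violations helper are hypothetical names that are not part of NiFi.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch only; apart from the proposed interface, these names are hypothetical.
class FieldValidatorSketch {

    // The interface as proposed in the comment.
    interface FieldValidator {
        String getDescription();
        boolean isValid(Object value);
    }

    // JSON Schema "maximum" expressed as a validator.
    static final class MaximumValidator implements FieldValidator {
        private final double maximum;

        MaximumValidator(double maximum) {
            this.maximum = maximum;
        }

        @Override
        public String getDescription() {
            return "value must be <= " + maximum;
        }

        @Override
        public boolean isValid(Object value) {
            // Compared as double for brevity; Bigint and Decimal would need
            // a comparison that does not lose precision.
            return value instanceof Number && ((Number) value).doubleValue() <= maximum;
        }
    }

    // JSON Schema "pattern" expressed as a validator.
    static final class PatternValidator implements FieldValidator {
        private final Pattern pattern;

        PatternValidator(String regex) {
            this.pattern = Pattern.compile(regex);
        }

        @Override
        public String getDescription() {
            return "value must match " + pattern.pattern();
        }

        @Override
        public boolean isValid(Object value) {
            // JSON Schema patterns are not implicitly anchored, hence find() rather than matches().
            return value instanceof String && pattern.matcher((String) value).find();
        }
    }

    // Collects the descriptions of all validators that reject the value.
    static List<String> violations(Object value, List<FieldValidator> validators) {
        List<String> failed = new ArrayList<>();
        for (FieldValidator validator : validators) {
            if (!validator.isValid(value)) {
                failed.add(validator.getDescription());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        List<FieldValidator> validators = List.of(new PatternValidator("^[a-z]+$"));
        System.out.println(violations("abc1", validators));
    }
}
```

Note that java.util.regex is not the same dialect as the ECMA-262 regular expressions that JSON Schema recommends, so a converter would have to decide how to treat patterns that behave differently in the two.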


    As for implementation of a new in-memory schema registry, I think this makes great sense as well. I do not think we should worry about any sort of code reuse with AvroSchemaRegistry as the code is quite trivial. In total it's about 162 lines, most of which is boilerplate (java imports, etc.), and documentation and Avro-specific validation that cannot be re-used. Additionally, it won't make sense to try to co-locate the two as the libraries that they depend on should be isolated, and the desire to use both is very unlikely so it makes sense to house them in different NARs.

  2. Thanks for sharing your thoughts, Mark Payne.

    I did not want to go into too much detail in this document, as I first wanted to clarify the approach, but my idea in the matter overlaps with your recommendation. I do not intend to add validation logic directly to the RecordSchema/Field but was thinking of something like what you call FieldValidator, so that the RecordSchema/Field can hold a reference to a set of validators. But again: first I want to agree on the goal and the details of the requirements.

  3. Simon Bence got it. Given the "support matrix" table and comments above, it seemed to imply explicit mapping based on JSON Schema and NiFi field types, like they would be coupled together in some way.

    That said, this proposal, then, feels like implementation details without the higher level design.

    What matters most is that we introduce a clean API that clearly allows for validation. The mapping of JSON Schema rules to that API is really just an implementation detail that can even evolve over time.

    So the concept of supporting JSON Schema with richer validation is clearly a win. But what really should be reviewed and discussed is what that mechanism for doing so looks like. I.e., how the Record API (and friends) should evolve. IMO, that's far more important and foundational than how JSON Schema maps into that API.

  4. With the support matrix and other details I aimed to achieve a clear description of what I intend to support from JSON Schema and what the connection points between the two structures are. This is merely a declaration of the goals and boundaries of the effort.

    If we can agree that this is a beneficial improvement (and based on your response it looks to me like we do), we can move on and I will work out the details of the design. I have some experiments on the matter already, but I did not want to finalize anything before agreeing on what we actually want to achieve.