
Authors

Forward Xu, Jark Wu

Motivation

Currently, Flink SQL does not support the SQL 2016 JSON functions, yet much of the data flowing through message middleware such as Kafka is in JSON format. Supporting JSON operations in Flink SQL makes it easier to parse such data. In addition, the data in some relational databases is also stored and ingested in JSON format.

SQL/JSON path language

The SQL/JSON path language is a query language used by certain SQL operators (JSON_VALUE, JSON_QUERY, JSON_TABLE and JSON_EXISTS, collectively known as the SQL/JSON query operators) to query JSON text. The SQL/JSON path language is not, strictly speaking, SQL, though it is embedded in these operators within SQL. Lexically and syntactically, the SQL/JSON path language adopts many features of [ECMAScript], though it is neither a subset nor a superset of [ECMAScript]. The semantics of the SQL/JSON path language are primarily SQL semantics.

The SQL/JSON path language is used by the SQL/JSON query operators in the architecture shown in this diagram:

The SQL/JSON path language architecture

The SQL/JSON query operators share the same first three lines in the diagram, which are expressed syntactically in the <JSON API common syntax> that is used by all SQL/JSON query operators. This framework provides the following inputs to an SQL/JSON query operator:

  1. A context item (the JSON text to be queried).
  2. A path specification (the query to perform on the context item; this query is expressed in the SQL/JSON path language specified in [ISO9075-2], Subclause 9.38, “SQL/JSON path language: lexical elements”, and Subclause 9.39, “SQL/JSON path language: syntax and semantics”).
  3. A PASSING clause (SQL values to be assigned to variables in the path specification, for example, as values used in predicates within the path specification).

The SQL/JSON operators effectively pass these inputs to a “path engine” that evaluates the path specification, using the context item and the PASSING clause to specify the values of variables in the path specification. The effective behavior of the path engine is specified in the General Rules of Subclause 9.39, “SQL/JSON path language: syntax and semantics”, in [ISO9075-2].

The result of evaluating a path specification on a context item and PASSING clause is a completion condition, and, if the completion condition is successful completion, an SQL/JSON sequence. The SQL/JSON query operators, in their General Rules, use the completion code and SQL/JSON sequence to complete the specific computation specified via the particular SQL/JSON query operator.

Errors can occur at the following junctures in this architecture:

1) An error can occur when converting an input. For example, if the context item does not parse as JSON text, then that is an input conversion error.

2) An error can occur while processing an SQL/JSON path expression. This category of errors is further subdivided as follows:

  1. A structural error occurs when an SQL/JSON path expression attempts to access a non-existent element of an SQL/JSON array or a non-existent member of a JSON object.
  2. A non-structural error is any other error during evaluation of an SQL/JSON path expression; for example, divide by zero.

3) An error can occur when converting an output.
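These junctures can be illustrated with a deliberately tiny path-engine sketch in Python. It handles only member-access steps; the PASSING clause and the rest of the path grammar are omitted, and all names are illustrative, not the standard's grammar:

```python
import json

def evaluate_path(context_item, path):
    """Toy evaluation of a member-access path such as '$.a.b'.

    Returns (completion, sequence): completion names the error category
    from the list above, and sequence is a list (the SQL/JSON sequence)
    on successful completion.
    """
    try:
        current = json.loads(context_item)          # input conversion step
    except ValueError:
        return ('input conversion error', None)
    for step in filter(None, path.strip().lstrip('$').split('.')):
        if isinstance(current, dict) and step in current:
            current = current[step]
        else:
            return ('structural error', None)       # non-existent member
    return ('success', [current])                   # an SQL/JSON sequence

evaluate_path('{"name": "Fred"}', '$.name')   # -> ('success', ['Fred'])
evaluate_path('{"name": "Fred"}', '$.phone')  # -> ('structural error', None)
evaluate_path('{oops', '$.name')              # -> ('input conversion error', None)
```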

The SQL operators JSON_VALUE, JSON_QUERY, JSON_TABLE, and JSON_EXISTS provide the following mechanisms to handle these errors:

  1. The SQL/JSON path language traps any errors that occur during the evaluation of a <JSON filter expression>. Depending on the precise <JSON path predicate> contained in the <JSON filter expression>, the result may be Unknown, True, or False, depending on the outcome of non-error tests evaluated in the <JSON path predicate>.
  2. The SQL/JSON path language has two modes, strict and lax, which govern structural errors, as follows:
     a. In lax mode:
        i. If an operation requires an SQL/JSON array but the operand is not an SQL/JSON array, then the operand is first wrapped in an SQL/JSON array prior to performing the operation.
        ii. If an operation requires something other than an SQL/JSON array, but the operand is an SQL/JSON array, then the operand is unwrapped by converting its elements into an SQL/JSON sequence prior to performing the operation.
        iii. After applying the preceding resolutions to structural errors, if there is still a structural error, the result is an empty SQL/JSON sequence.
     b. In strict mode, if the structural error occurs within a <JSON filter expression>, then the error handling of <JSON filter expression> applies. Otherwise, a structural error is an unhandled error.
  3. Non-structural errors outside of a <JSON path predicate> are always unhandled errors, resulting in an exception condition returned from the path engine to the SQL/JSON query operator.
  4. The SQL/JSON query operators provide an ON ERROR clause to specify the behavior in case of an input conversion error, an unhandled structural error, an unhandled non-structural error, or an output conversion error.

Modes

The path engine has two modes, strict and lax. The motivation for these modes is that strict mode will be used to examine data from a strict schema perspective, for example, to look for data that diverges from an expected schema. Therefore, strict mode raises an error if the data does not strictly adhere to the requirements of a path expression. Lax mode is intended to be more forgiving, so lax mode converts errors to empty SQL/JSON sequences.

In addition, lax mode adopts the convention that an array of size 1 is interchangeable with the singleton. This is supported by the following two conventions:

1) If an operation requires an array but the operand is not an array, then the operand is implicitly “wrapped” in an array.

2) If an operation requires a non-array but the operand is an array, then the operand is implicitly “unwrapped” into an SQL/JSON sequence.
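These two conventions can be sketched in Python (an illustration of the wrapping/unwrapping idea, not the standard's algorithm; the helper names are invented):

```python
def lax_subscript(operand, index):
    """Convention 1: a subscript applied to a non-array first wraps the
    operand in a singleton array."""
    if not isinstance(operand, list):
        operand = [operand]                          # automatic wrapping
    # A still-missing element is a structural error; lax mode elides it.
    return operand[index] if index < len(operand) else None

def lax_member(operand, key):
    """Convention 2: a member accessor applied to an array first unwraps
    the array and iterates over its elements."""
    items = operand if isinstance(operand, list) else [operand]
    return [item[key] for item in items
            if isinstance(item, dict) and key in item]  # missing members elided

lax_subscript({"a": 1}, 0)                       # -> {'a': 1} (wrapped, then indexed)
lax_member([{"a": 1}, {"a": 2}, {"b": 3}], "a")  # -> [1, 2]
```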

These modes govern three aspects of path evaluation, as shown in the following table:

Three aspects of path evaluation governed by modes:

1) Automatic unnesting of arrays

   lax: Certain path steps, such as the member accessor $.key, automatically iterate over SQL/JSON sequences. To make these iterative path steps friendlier for arrays, arrays are automatically unnested prior to performing the iterative path step. This means that the user does not need to use an explicit [*] to unnest an array prior to performing an iterative path step. This facilitates the use case where a field may be either an array or a scalar.

   strict: Arrays are not automatically unnested (the user can still write [*] to unnest an array explicitly).

2) Automatic wrapping within an array

   lax: Subscript path steps, such as $[0] or $[*], may be applied to a non-array. To do this, the non-array is implicitly wrapped in an array prior to applying the subscript operation. This also facilitates the use case where a field may be either an array or a scalar.

   strict: There is no automatic wrapping prior to subscript path steps.

3) Error handling

   lax: Many errors related to whether data is or is not an array or scalar are handled by the two preceding features. The remaining errors are classified as either structural or non-structural. An example of a structural error is $.name if $ has no member whose key is name. Structural errors are converted to empty SQL/JSON sequences. An example of a non-structural error is divide by zero; such errors are not elided.

   strict: Errors are strictly defined in all cases.


Note that the path language mode is orthogonal to the ON ERROR clause. There are numerous use cases for having any combination of ON ERROR clauses combined with either strict or lax modes.

Example of strict vs lax

Consider the following data, stored in a table called Data:

pk | col
---|----
1  | { name: "Fred", phonetype: "work", "phone#": "650-506-2051" }
2  | { name: "Molly", phones: [ { phonetype: "work", "phone#": "650-506-7000" }, { phonetype: "cell", "phone#": "650-555-5555" } ] }
3  | { name: "Afu", phones: [ { phonetype: "cell", "phone#": "88-888-8888" } ] }
4  | { name: "Justin" }
5  | { name: "U La La", phones: [] }

This data has been created with a sloppy schema. If a person has just one phone (row 1), then the phonetype and phone# are members of the JSON object. If a person has more than one phone (row 2), then there is a member called phones whose value is an array holding the phone information. But sometimes a person with just one phone still has a phones array (row 3). Also, some people have no phones, which can be indicated by an absence of the phonetype and phone# members (row 4), or by the presence of a phones array whose value is empty (row 5).

Now the question is how to use JSON_TABLE to display all the name and phone information. Suppose one wants to get a table with columns called name, phonetype, and phone#. If a person has multiple phones, the display should be denormalized, with the person’s name repeated in multiple rows, in order to display each phone number in a separate row. If a person has no phones, the person name should appear in a single row, with nulls for the phone information.

Processing this data would be very difficult using strict mode. This is why lax mode is provided: to make it easier to deal with sloppy schemas such as this.

The solution to this use case is the following query:

SELECT D.pk, JT.name,
       COALESCE (JT."phone#", JT."phones.phone#") AS "phone#",
       COALESCE (JT."phonetype", JT."phones.phonetype") AS "phonetype"
FROM Data AS D,
     JSON_TABLE (D.col, 'lax $'
         COLUMNS (
             name VARCHAR(30) PATH 'lax $.name',
             "phone#" VARCHAR(30) PATH 'lax $."phone#"',
             "phonetype" VARCHAR(30) PATH 'lax $.phonetype',
             NESTED COLUMNS PATH 'lax $.phones[*]' (
                 "phones.phone#" VARCHAR(30) PATH 'lax $."phone#"',
                 "phones.phonetype" VARCHAR(30) PATH 'lax $.phonetype'
             )
         )
     ) AS JT


Above, JSON_TABLE produces two output columns at the outer level ("phone#" and "phonetype") and two in the NESTED COLUMNS clause ("phones.phone#" and "phones.phonetype"). To understand this query, note the following:

  1. Row 1 has phone# and phonetype as “bare” members of the outermost object. These two members will be picked up by the outer columns called “phone#” and “phonetype”. The NESTED COLUMNS clause has a path that will find no rows. The default plan for NESTED COLUMNS is an outer join, so a dummy row is effectively created with null values for the nested columns. In the SELECT list, each COALESCE operator chooses the non-null value from an outer column and the corresponding nested column.
  2. Rows 2 and 3 do not have bare phone# and phonetype; instead they have an array called phones. In these rows, the outer columns have paths that will find empty sequences, defaulting to the null value. The NESTED COLUMNS clause is used to iterate over the phones array, producing values for the nested columns, and again, the COALESCE operators in the SELECT list retain the non-null values.
  3. Row 4 has no phone data at all. In this case, the outer columns have paths that will find nothing (defaulting to null values). The NESTED COLUMNS clause also has a path that finds an empty sequence. Using the default outer join logic, this means that the nested columns will also be null. The COALESCE operators must coalesce two null values, resulting in null.
  4. Row 5 has a phones array, but it is empty. This case is processed similarly to rows 2 and 3: the outer columns are null because their paths are empty. The NESTED COLUMNS clause is used, but the array is empty, so this is an outer join with an empty table. Thus, the nested columns also come up null, and the COALESCE operators combine these nulls to get null. The end result is the same as row 4.
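The effect of this query can be sketched in Python: a hand-written simulation of the lax-path lookups, the default outer-join plan for NESTED COLUMNS, and the COALESCE calls. This illustrates the semantics only; it is not how JSON_TABLE is implemented:

```python
data = [
    (1, {"name": "Fred", "phonetype": "work", "phone#": "650-506-2051"}),
    (2, {"name": "Molly", "phones": [
        {"phonetype": "work", "phone#": "650-506-7000"},
        {"phonetype": "cell", "phone#": "650-555-5555"}]}),
    (3, {"name": "Afu", "phones": [{"phonetype": "cell", "phone#": "88-888-8888"}]}),
    (4, {"name": "Justin"}),
    (5, {"name": "U La La", "phones": []}),
]

rows = []
for pk, col in data:
    # NESTED COLUMNS PATH 'lax $.phones[*]' with the default outer-join plan:
    # an absent or empty array still contributes one all-null nested row.
    nested = col.get("phones") or [{}]
    for phone in nested:
        rows.append((
            pk,
            col.get("name"),
            # COALESCE of the outer column with the nested column
            col.get("phone#") or phone.get("phone#"),
            col.get("phonetype") or phone.get("phonetype"),
        ))
# rows now holds one tuple per phone, plus all-null phone rows for 4 and 5
```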


Proposed Changes

This section details the implementation to support JSON functions in Flink SQL. The design is mainly based on flink-table-planner-blink.

 SQL/JSON functions

This document mainly implements the Flink SQL JSON functions based on the SQL JSON functions already implemented in Calcite (CALCITE-2867).

In the following:

jsonValue is a character string containing a JSON value;

path is a character string containing a JSON path expression; the mode flag strict or lax should be specified at the beginning of path.

Query Functions

Use in SQL:

JSON_EXISTS(jsonValue, path [ { TRUE | FALSE | UNKNOWN | ERROR } ON ERROR ] )
    Whether jsonValue satisfies a search criterion described using JSON path expression path.

JSON_VALUE(jsonValue, path [ RETURNING type ] [ { ERROR | NULL | DEFAULT expr } ON EMPTY ] [ { ERROR | NULL | DEFAULT expr } ON ERROR ] )
    Extract an SQL scalar from jsonValue using JSON path expression path.

JSON_QUERY(jsonValue, path [ { WITHOUT [ ARRAY ] | WITH [ CONDITIONAL | UNCONDITIONAL ] [ ARRAY ] } WRAPPER ] [ { ERROR | NULL | EMPTY ARRAY | EMPTY OBJECT } ON EMPTY ] [ { ERROR | NULL | EMPTY ARRAY | EMPTY OBJECT } ON ERROR ] )
    Extract a JSON object or JSON array from jsonValue using JSON path expression path.


Note: The ON ERROR and ON EMPTY clauses define the fallback behavior of the function when an error is thrown or a null value is about to be returned.

The ARRAY WRAPPER clause defines how to represent a JSON array result in the JSON_QUERY function. The following examples compare the wrapper behaviors.

Use in TableApi:

STRING.jsonExists(path)
    Whether the JSON string satisfies a search criterion described using JSON path expression path. The Table API currently only supports a path string parameter and does not support the ON ERROR clause.

STRING.jsonValue(path)
    Extract an SQL scalar from the JSON string using JSON path expression path. The Table API currently only supports a path string parameter and does not support the ON ERROR clause.

STRING.jsonQuery(path)
    Extract a JSON object or JSON array from the JSON string using JSON path expression path. The Table API currently only supports a path string parameter and does not support the ON ERROR clause.


Example Data:

{"a": "[1,2]", "b": [1,2], "c": "hi"}


Comparison:

| OPERATOR | $.a | $.b | $.c |
| --- | --- | --- | --- |
| JSON_EXISTS | true | true | true |
| JSON_VALUE | [1, 2] | error | hi |
| JSON_QUERY WITHOUT ARRAY WRAPPER | error | [1, 2] | error |
| JSON_QUERY WITH UNCONDITIONAL ARRAY WRAPPER | [ "[1,2]" ] | [ [1,2] ] | [ "hi" ] |
| JSON_QUERY WITH CONDITIONAL ARRAY WRAPPER | [ "[1,2]" ] | [1,2] | [ "hi" ] |
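The two WITH ... ARRAY WRAPPER behaviors from the comparison above can be sketched as follows (illustrative Python helpers, not Calcite/Flink code); note how the string "[1,2]" stays a string, while the real array [1,2] is only wrapped unconditionally:

```python
doc = {"a": "[1,2]", "b": [1, 2], "c": "hi"}

def with_unconditional_wrapper(value):
    """WITH UNCONDITIONAL ARRAY WRAPPER: always wrap the result in an array."""
    return [value]

def with_conditional_wrapper(value):
    """WITH CONDITIONAL ARRAY WRAPPER: wrap only if the result is not
    already an array."""
    return value if isinstance(value, list) else [value]

with_unconditional_wrapper(doc["a"])  # -> ['[1,2]'] (a string, so it is wrapped)
with_unconditional_wrapper(doc["b"])  # -> [[1, 2]]
with_conditional_wrapper(doc["b"])    # -> [1, 2]    (already an array)
with_conditional_wrapper(doc["c"])    # -> ['hi']
```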

Constructor Functions

Use in SQL:

JSON_OBJECT( { [ KEY ] name VALUE value [ FORMAT JSON ] | name : value [ FORMAT JSON ] } * [ { NULL | ABSENT } ON NULL ] )
    Construct a JSON object using a series of key (name) and value (value) pairs.

JSON_OBJECTAGG( { [ KEY ] name VALUE value [ FORMAT JSON ] | name : value [ FORMAT JSON ] } [ { NULL | ABSENT } ON NULL ] )
    Aggregate function to construct a JSON object using a key (name) and value (value) pair.

JSON_ARRAY( { value [ FORMAT JSON ] } * [ { NULL | ABSENT } ON NULL ] )
    Construct a JSON array using a series of values (value).

JSON_ARRAYAGG( value [ FORMAT JSON ] [ ORDER BY orderItem [, orderItem ]* ] [ { NULL | ABSENT } ON NULL ] )
    Aggregate function to construct a JSON array using a value (value).


Use in TableApi:

jsonObject(ANY1, ANY2, ...)
    Construct a JSON object using a series of key (name) and value (value) pairs. The Table API currently only supports JSON string operations.

jsonObjectAgg(ANY1, ANY2)
    Aggregate function to construct a JSON object using a key (name) and value (value) pair. The Table API currently only supports JSON string operations.

jsonArray(ANY1, ANY2, ...)
    Construct a JSON array using a series of values (value). The Table API currently only supports JSON string operations.

jsonArrayAgg(ANY)
    Aggregate function to construct a JSON array using a value (value). The Table API currently only supports JSON string operations.


Note: The flag FORMAT JSON indicates that the value is formatted as a JSON character string. When FORMAT JSON is used, the value is parsed from a JSON character string into an SQL structured value.

The ON NULL clause defines how the JSON output represents null values. The default null behavior of JSON_OBJECT and JSON_OBJECTAGG is NULL ON NULL, and for JSON_ARRAY and JSON_ARRAYAGG it is ABSENT ON NULL.

If an ORDER BY clause is provided, JSON_ARRAYAGG sorts the input rows into the specified order before performing the aggregation.

Comparison:

| OPERATOR | RESULT |
| --- | --- |
| JSON_OBJECT('id', 87, 'name', 'carrot') | {"id": 87, "name": "carrot"} |
| JSON_ARRAY(1, 'abc', NULL, TRUE NULL ON NULL) | [1, "abc", null, true] |
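The ON NULL defaults can be sketched with hypothetical Python helpers (json_object and json_array below are illustrations of the clause semantics, not the SQL functions themselves):

```python
def json_object(*pairs, null_on_null=True):
    """Sketch of JSON_OBJECT's ON NULL clause: NULL ON NULL (the default)
    keeps null-valued keys, ABSENT ON NULL drops them."""
    it = iter(pairs)
    obj = dict(zip(it, it))                 # consume alternating key, value
    if not null_on_null:
        obj = {k: v for k, v in obj.items() if v is not None}
    return obj

def json_array(*values, absent_on_null=True):
    """Sketch of JSON_ARRAY's ON NULL clause: ABSENT ON NULL (the default)
    drops nulls, NULL ON NULL keeps them."""
    return [v for v in values if v is not None] if absent_on_null else list(values)

json_object("id", 87, "name", None)                     # -> {'id': 87, 'name': None}
json_object("id", 87, "name", None, null_on_null=False) # -> {'id': 87}
json_array(1, "abc", None, True)                        # -> [1, 'abc', True]
```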


Simple Data Table T1:

| o_id | attribute | value |
| --- | --- | --- |
| 2 | color | red |
| 2 | fabric | silk |
| 3 | color | green |
| 3 | shape | square |

SELECT o_id, JSON_OBJECTAGG(attribute, value) FROM t1 GROUP BY o_id;

| OPERATOR | o_id | AFTER AGG |
| --- | --- | --- |
| JSON_OBJECTAGG(attribute, value) | 2 | {"color": "red", "fabric": "silk"} |
| JSON_OBJECTAGG(attribute, value) | 3 | {"color": "green", "shape": "square"} |


SELECT o_id, JSON_ARRAYAGG(attribute) AS attributes FROM t1 GROUP BY o_id;

| OPERATOR | o_id | AFTER AGG |
| --- | --- | --- |
| JSON_ARRAYAGG(attribute) | 2 | ["color", "fabric"] |
| JSON_ARRAYAGG(attribute) | 3 | ["color", "shape"] |
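The two GROUP BY aggregations above can be simulated with a plain group-by sketch (illustrative Python, not Flink code):

```python
from collections import defaultdict

t1 = [
    (2, "color", "red"),
    (2, "fabric", "silk"),
    (3, "color", "green"),
    (3, "shape", "square"),
]

objects, arrays = defaultdict(dict), defaultdict(list)
for o_id, attribute, value in t1:
    objects[o_id][attribute] = value   # JSON_OBJECTAGG(attribute, value)
    arrays[o_id].append(attribute)     # JSON_ARRAYAGG(attribute)

dict(objects)  # -> {2: {'color': 'red', 'fabric': 'silk'}, 3: {'color': 'green', 'shape': 'square'}}
dict(arrays)   # -> {2: ['color', 'fabric'], 3: ['color', 'shape']}
```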


Comparison Operators

Applications will frequently want to ensure that the data they expect to consume as JSON data is, indeed, JSON data. The IS JSON predicate determines whether the value of a specified string does or does not conform to the structural rules for JSON. The syntax of the IS JSON predicate is:

Use in SQL:

jsonValue IS JSON [ VALUE ]
    Whether jsonValue is a JSON value.

jsonValue IS NOT JSON [ VALUE ]
    Whether jsonValue is not a JSON value.

jsonValue IS JSON SCALAR
    Whether jsonValue is a JSON scalar value.

jsonValue IS NOT JSON SCALAR
    Whether jsonValue is not a JSON scalar value.

jsonValue IS JSON OBJECT
    Whether jsonValue is a JSON object.

jsonValue IS NOT JSON OBJECT
    Whether jsonValue is not a JSON object.

jsonValue IS JSON ARRAY
    Whether jsonValue is a JSON array.

jsonValue IS NOT JSON ARRAY
    Whether jsonValue is not a JSON array.


Use in TableApi:

ANY.isJsonValue()
    Whether the value is a JSON value. Table API functions currently only support JSON strings as input parameters.

ANY.isNotJsonValue()
    Whether the value is not a JSON value. Table API functions currently only support JSON strings as input parameters.

ANY.isJsonScalar()
    Whether the value is a JSON scalar value. Table API functions currently only support JSON strings as input parameters.

ANY.isNotJsonScalar()
    Whether the value is not a JSON scalar value. Table API functions currently only support JSON strings as input parameters.

ANY.isJsonObject()
    Whether the value is a JSON object. Table API functions currently only support JSON strings as input parameters.

ANY.isNotJsonObject()
    Whether the value is not a JSON object. Table API functions currently only support JSON strings as input parameters.

ANY.isJsonArray()
    Whether the value is a JSON array. Table API functions currently only support JSON strings as input parameters.

ANY.isNotJsonArray()
    Whether the value is not a JSON array. Table API functions currently only support JSON strings as input parameters.



The test example:

tester.checkBoolean("'{}' is json value", true);
tester.checkBoolean("'{]' is json value", false);
tester.checkBoolean("'{}' is json object", true);
tester.checkBoolean("'[]' is json object", false);
tester.checkBoolean("'{}' is json array", false);
tester.checkBoolean("'[]' is json array", true);
tester.checkBoolean("'100' is json scalar", true);
tester.checkBoolean("'[]' is json scalar", false);
tester.checkBoolean("'{}' is not json value", false);
tester.checkBoolean("'{]' is not json value", true);
tester.checkBoolean("'{}' is not json object", false);
tester.checkBoolean("'[]' is not json object", true);
tester.checkBoolean("'{}' is not json array", true);
tester.checkBoolean("'[]' is not json array", false);
tester.checkBoolean("'100' is not json scalar", false);
tester.checkBoolean("'[]' is not json scalar", true);
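The same classifications can be sketched in Python using json.loads (an illustration of the predicate semantics, assuming IS JSON VALUE means "the string parses as JSON"; classify is an invented helper, not Flink code):

```python
import json

def classify(s):
    """Report which of the IS JSON predicates (VALUE / OBJECT / ARRAY /
    SCALAR) the string satisfies; IS NOT JSON ... is the negation."""
    try:
        parsed = json.loads(s)
    except ValueError:
        return {"value": False, "object": False, "array": False, "scalar": False}
    return {
        "value": True,
        "object": isinstance(parsed, dict),
        "array": isinstance(parsed, list),
        "scalar": not isinstance(parsed, (dict, list)),
    }

classify("{}")   # -> value/object True, array/scalar False
classify("{]")   # -> everything False (not valid JSON)
classify("100")  # -> value/scalar True, object/array False
```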


Implementation Details

Table Api interface proposal

I suggest that the first version of the Table API does not implement the PASSING syntax. Start with a simple implementation and add further functionality later if it is required.

Let's take the JSON_EXISTS implementation as an example.

BuiltInFunctionDefinitions.java

// json functions
public static final BuiltInFunctionDefinition JSON_EXISTS =
    new BuiltInFunctionDefinition.Builder()
        .name("json_exists")
        .kind(SCALAR)
        .outputTypeStrategy(TypeStrategies.MISSING)
        .build();


expressionDsl.scala

/**
  * Returns whether a JSON string satisfies a given search criterion.
  * E.g. "{\"foo\":\"bar\"}".jsonExists('strict $.foo') returns true.
  */
def jsonExists(path: Expression): Expression = unresolvedCall(JSON_EXISTS, expr, path)


DirectConvertRule.java

// json functions

DEFINITION_OPERATOR_MAP.put(BuiltInFunctionDefinitions.JSON_EXISTS, FlinkSqlOperatorTable.JSON_EXISTS);


FlinkSqlOperatorTable.java

// JSON FUNCTIONS

public static final SqlFunction JSON_EXISTS = SqlStdOperatorTable.JSON_EXISTS;


BuiltInMethods.scala

val JSON_EXISTS = Types.lookupMethod(classOf[JsonFunctions], "jsonExists",
  classOf[String], classOf[String])


FunctionGenerator.scala

addSqlFunctionMethod(
  JSON_EXISTS,
  Seq(VARCHAR, VARCHAR),
  BuiltInMethods.JSON_EXISTS)


StringCallGen.scala

case JSON_EXISTS if operands.size == 2 &&
  isCharacterString(operands.head.resultType) &&
  isCharacterString(operands(1).resultType) =>
  methodGen(BuiltInMethods.JSON_EXISTS)


PlannerExpressionConverter.scala

case JSON_EXISTS =>
  assert(args.size == 2)
  JsonExists(args.head, args.last)


jsonExpressions.scala

package org.apache.flink.table.planner.expressions

import org.apache.flink.api.common.typeinfo.BasicTypeInfo._
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.table.planner.typeutils.TypeInfoCheckUtils
import org.apache.flink.table.planner.validate.{ValidationFailure, ValidationResult, ValidationSuccess}

case class JsonExists(child: PlannerExpression, path: PlannerExpression)
  extends PlannerExpression with InputTypeSpec {

  override private[flink] def resultType: TypeInformation[_] = BOOLEAN_TYPE_INFO

  override private[flink] def children: Seq[PlannerExpression] = Seq(child, path)

  override private[flink] def expectedTypes: Seq[TypeInformation[_]] =
    Seq(STRING_TYPE_INFO, STRING_TYPE_INFO)

  override def toString: String = s"json_exists(${children.mkString(",")})"

  override private[flink] def validateInput(): ValidationResult = {
    // Both the JSON value and the path must be character strings.
    if (!TypeInfoCheckUtils.isString(child.resultType) ||
        !TypeInfoCheckUtils.isString(path.resultType)) {
      ValidationFailure(s"json_exists requires (STRING, STRING), got " +
        s"($child : ${child.resultType}, $path : ${path.resultType})")
    } else {
      ValidationSuccess
    }
  }
}


ScalarTypesTestBase.scala

testData.setField(55, "{\"foo\":\"bar\"}")

/* 55 */ Types.STRING)


ScalarFunctionsTest.scala

  //-------------------------------------------------------------------
  // JSON functions
  //-------------------------------------------------------------------

  @Test
  def testJsonExists(): Unit = {
    testAllApis(
      'f55.jsonExists("strict $.foo"),
      "f55.jsonExists('strict $.foo')",
      "json_exists(f55, 'strict $.foo')",
      "true")
  }


JsonITCase.scala

/**
  * Tests for the JSON Table API.
  */
@RunWith(classOf[Parameterized])
class JsonITCase(mode: StateBackendMode) extends StreamingWithStateTestBase(mode) {

  // input data
  val data = List(
    ("{\"foo\":\"bar\"}"),
    ("flink")
  )

  @Test
  def testJsonExists(): Unit = {
    val stream = env.fromCollection(data)
    val table = stream.toTable(tEnv, 'str)
    val resultTable = table
      .select('str, 'str.jsonExists("strict $.foo"))
    val sink = new TestingAppendSink
    resultTable.toAppendStream[Row].addSink(sink)
    env.execute()
    val expected = mutable.MutableList("flink,false", "{\"foo\":\"bar\"},true")
    assertEquals(expected.sorted, sink.getAppendResults.sorted)
  }
}



Migration Plan and Compatibility

This is a new feature for Flink SQL, so no migration is needed. The implementation sequence is as follows:

1. Implement Comparison Operators.

2. Implement Query Functions.

3. Implement Constructor Functions.

References

[1] A more detailed description of the JSON standard can be found in ISO/IEC TR 19075-6:2017 (c067367_ISO_IEC_TR_19075-6_2017)

[2] https://issues.apache.org/jira/browse/CALCITE-2867

[3] https://calcite.apache.org/docs/reference.html#json-functions
