Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

The source code can be downloaded from https://github.com/esgyn/code-examples/tree/master/src/main/java/org/trafodion/examples/udrs/udfs/table_valued/group_concat

JSON Column-izer

When stored in a database, JSON data is typically stored in a character column that holds the entire document, unless the JSON has been previously parsed into name-value pairs, which then presents different questions of how to store it and access it. 

A Hive table DDL can depict a JSON layout including its embedded structs and arrays.  For example, this JSON 

{"Employee" : {"Person" : {"First name" : "John", "Middle Name" : "Quincy", "Last Name" : "Smith" }, "Phones" : {"Home" : "555-555-1212", "Mobile" : "888-555-1212"} } }

...would be declared in Hive similar to the following (but column names need adjustment - Hive has problems with backticked column names within a struct):

CREATE TABLE  t1 (employee STRUCT <     

person STRUCT <

              `first name` : string,

              `middle name` : string,

              `last name` : string >,     

phones STRUCT <

              home : string,

              mobile : string > 

>)  

ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

The serde interprets the JSON and provides access to individual JSON elements.  Whether you load data into this (INTERNAL) table or define it as an EXTERNAL table and link it to file(s) in HDFS, physically, the JSON is still a single string of characters - it is not stored as multiple columns.  A HiveQL SELECT statement would be like the following with JsonSerDe resolving the "column" reference to extract the correct value.  Note that use of hierarchical naming is essential.

SELECT employee.person.`first name` FROM t1;

However, when a variety of JSON formats occur within the same data stream, it becomes very cumbersome to define a single table with all the tags that accommodates all the possible forms.  Challenges such as these in making JSON query-able led to this UDF.

This is a Table Mapping UDF that expects a single column containing a JSON document.  Parameters passed in the call identify JSON tags whose values are to be output.  For this, we show creating a new table in Hive (or a similar one can be created in Trafodion).

> hive

CREATE TABLE t2 (jsondata STRING);

LOAD DATA LOCAL PATH '<json text input file>'  OVERWRITE INTO TABLE t2;

or

> hive

CREATE EXTERNAL TABLE t2 (jsondata STRING) ROW FORMAT DELIMITED STORED AS TEXTFILE LOCATION '<hdfs dir containing data>';

(files in the hdfs dir are implicitly linked to the table)

Applying the JSON layout above, one sample invocation of the UDF is

SELECT * FROM udf( unjson( TABLE(select * from hive.hive.t2), 'employee.person.last name', 'employee.person.first name', 'employee.phones.mobile' ));

The SELECT-list within the TABLE() specification should resolve to a single column containing the JSON.  The table referenced can be any Trafodion-supported type.  There is no defined limit on the number of tag parameters included in the call.  Note that the tag names match how they are spelled in the JSON document (case insensitive).  The output will be three columns for each row, either with the extracted values or nulls. 

The source can be downloaded from <HERE> (once I get it uploaded).

Trafodion steps after java compiling, jar'ing, and uploading to every node of the Trafodion cluser:

> sqlci

create library <your library name> file '<the full path name of your jar>';

create table_mapping function <what you want> ()
external name '<your class>'                       -- name of your class
language java
library <your library name>;

Manageability

 

 

Tools

 

SQL Scripts and Queries

 

Miscellaneous