
Status

Current state: Work In Progress

Discussion thread: here

JIRA

Motivation

There have been a few motivations for this work. Firstly, while leveraging KnoxShell classes and scripting to test Apache Knox releases, it became obvious that there was no easy way to actually test Hive integrations.

This led me to write a simple JDBC-based Groovy script for SQL access to HiveServer2 through Knox, available at [1].

This article, and the use of the KnoxLine script in demos at DataWorks Summit, drew interest in making it an actual feature of KnoxShell.

While investigating this as a feature, it became clear that accomplishing KNOX-174 [2] required a better table rendering capability.

In developing KnoxShellTable [3] and using its fluent API within the Groovy-based KnoxShell environment, it became evident that this would be a powerful tool within KnoxShell for all sorts of tabular data.

Usecases

  • UC-1: JDBC Resultset Representations
  • UC-2: CSV Representations
  • UC-3: General Table Operations
    • Builders
    • Joining
    • Sorting, Selecting, Filtering, Calculations
  • UC-4: Persistence and Publishing
  • UC-5: KnoxLine SQL Shell (leveraging UC-1)
  • UC-6: Custom GroovySh Commands

UC-1: JDBC Resultset Representations

As mentioned in the motivation section, the KnoxLine SQL client requires a tabular representation of the data from a SQL/JDBC ResultSet.

By moving code from the original KnoxLine script into a dedicated table builder for JDBC, we can easily represent SQL results as a KnoxShellTable.

It can create the columns based on the metadata of the ResultSet, accurately represent the data, and perform type-specific operations, sorts, etc.

UC-2: CSV Representations

Another dedicated table builder is provided for creating a table from a CSV file that is imported via URL.

Combined with the general table operations and the ability to join with other KnoxShellTable representations, this allows CSV data to be combined with JDBC datasets, filtered, and republished as a new dataset or report to be rendered or even re-executed later.
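As a rough illustration of what a header-aware CSV import involves (a hypothetical sketch, not the KnoxShellTable implementation; the class name is made up, and the parsing below ignores quoting and escaped commas):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified header-aware CSV ingestion: the first line optionally
// becomes the headers, every remaining line becomes a row.
class CsvSketch {
    final List<String> headers = new ArrayList<>();
    final List<List<String>> rows = new ArrayList<>();

    static CsvSketch parse(List<String> lines, boolean withHeaders) {
        CsvSketch table = new CsvSketch();
        int start = 0;
        if (withHeaders && !lines.isEmpty()) {
            table.headers.addAll(Arrays.asList(lines.get(0).split(",")));
            start = 1;
        }
        for (int i = start; i < lines.size(); i++) {
            table.rows.add(Arrays.asList(lines.get(i).split(",")));
        }
        return table;
    }
}
```

A real importer must also handle quoted fields with embedded commas, which is why a dedicated builder is worth having.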

UC-3: General Table Operations

Builders

KnoxShellTable has a number of dedicated builders that have a fluent API for building table representations from various sources:

  • JDBC
  • CSV
  • Join
  • JSON

JDBC

ports = KnoxShellTable.builder().jdbc().
connect("jdbc:hive2://knox-host:8443/;ssl=true;transportMode=http;httpPath=topology/cdp-proxy-api/hive").
driver("org.apache.hive.jdbc.HiveDriver").
username("lmccay").pwd("xxxx").
sql("select * FROM ports");

Running the above within KnoxShell will submit the provided SQL to HiveServer2, then create a new KnoxShellTable instance and assign it to the "ports" variable, representing the border ports-of-entry data.

CSV

crossings = KnoxShellTable.builder().csv().
withHeaders().
url("file:///home/lmccay/Border_Crossing_Entry_Data.csv")

Running the above within KnoxShell will import a CSV file from local disk, then create a new KnoxShellTable instance and assign it to the "crossings" variable.

Join

crossings = KnoxShellTable.builder().join().
left(ports).
right(crossings).
on("code","Port Code")

Running the above within KnoxShell will join the two tables, matching each left row against each right row on equal values in the "code" and "Port Code" columns.
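The matching semantics can be sketched as a simple nested-loop equi-join (a hypothetical illustration with made-up names, not the KnoxShellTable internals):

```java
import java.util.ArrayList;
import java.util.List;

// Nested-loop equi-join: for each left row, emit a combined row for
// every right row whose join-column value matches the left row's.
class JoinSketch {
    static List<List<String>> join(List<List<String>> left, int leftCol,
                                   List<List<String>> right, int rightCol) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> l : left) {
            for (List<String> r : right) {
                if (l.get(leftCol).equals(r.get(rightCol))) {
                    List<String> combined = new ArrayList<>(l);
                    combined.addAll(r);
                    out.add(combined);
                }
            }
        }
        return out;
    }
}
```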

JSON

tornados = KnoxShellTable.builder().json().
url("file:///home/lmccay/.knoxshell/.tables/tornados.json")

Running the above within KnoxShell will rematerialize a table that was persisted as JSON and assign it to a local "tornados" variable.

Sorting

tornados.sort("state")

The above will sort the tornados table by the "state" column.

Selecting

tornados.select("state,cat,inj,fat,date,month,day,year")

The above will return and render a new table with only the selected subset of columns.

Filtering

tornados.filter().name("fat").greaterThan(0)

The above will return and render a table with only those tornados that resulted in one or more fatalities.

Fluent API

The above operations can be combined in a natural, fluent manner:

tornados.select("state,cat,inj,fat,date,month,day,year").
  filter().name("fat").greaterThan(0).
  sort("state")

The above combines the operations by streaming them into each other in one line: the selection of only certain columns, the filtering to only those events with more than zero fatalities, and a sort that is much more efficient for operating on the reduced table.
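The chaining style works because each operation returns a new table. A tiny fluent-table sketch makes the pattern concrete (hypothetical class and method shapes; column typing is greatly simplified compared to KnoxShellTable):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Minimal fluent table: every operation returns a new FluentTable, so
// select/filter/sort can be streamed into one another in a single line.
class FluentTable {
    final List<Map<String, Object>> rows;

    FluentTable(List<Map<String, Object>> rows) { this.rows = rows; }

    // keep only the named columns, preserving the order given
    FluentTable select(String... cols) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            Map<String, Object> projected = new LinkedHashMap<>();
            for (String c : cols) projected.put(c, row.get(c));
            out.add(projected);
        }
        return new FluentTable(out);
    }

    // keep only rows whose value in the given column passes the predicate
    FluentTable filter(String col, Predicate<Object> p) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (p.test(row.get(col))) out.add(row);
        }
        return new FluentTable(out);
    }

    // sort rows by the natural ordering of the given column's values
    @SuppressWarnings("unchecked")
    FluentTable sort(String col) {
        List<Map<String, Object>> out = new ArrayList<>(rows);
        out.sort((a, b) -> ((Comparable<Object>) a.get(col)).compareTo(b.get(col)));
        return new FluentTable(out);
    }
}
```

With such a design, `table.select("state", "fat").filter("fat", v -> (Integer) v > 0).sort("state")` mirrors the KnoxShell one-liner above.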

UC-4: Persistence and Publishing

Being able to create tables, combine them with other datasets, filter them, add new columns based on calculations between columns, etc. is all great for creating tables in memory and working with them.

We also want to be able to persist these tables in a KnoxShellTable canonical JSON format of its own and be able to reload the same datasets later.

We also want to be able to take a given dataset and publish it as a brand new CSV file that can be pushed into HDFS, saved to local disk, written to cloud storage, etc.

In addition, we may want to be able to write it directly to Hive or another JDBC datasource.

JSON

tornados.toJSON()

The above will return and render a JSON representation of the tornados KnoxShellTable, including headers, rows, and (optionally) title and callHistory.

CSV

tornados.toCSV()

The above will return and render a CSV representation of the tornados KnoxShellTable, including headers (if present) and all rows.

Note that title and callHistory, which are KnoxShellTable-specific, are excluded and lost unless the table is also saved as JSON.
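The CSV export essentially joins headers and row values with commas (a simplified, hypothetical sketch that, like the import sketch, ignores quoting of embedded commas):

```java
import java.util.List;

// Sketch of CSV export: headers first (if present), then one line per
// row, values joined by commas.
class CsvExportSketch {
    static String toCSV(List<String> headers, List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        if (headers != null && !headers.isEmpty()) {
            sb.append(String.join(",", headers)).append('\n');
        }
        for (List<String> row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }
}
```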

UC-5: KnoxLine SQL Shell (leveraging UC-1)

The need for a dedicated shell with only SQL results now seems less pressing, since the same capability is available through the Groovy-based KnoxShell, where the results can also be used for additional work.

We should determine as a community whether we really need to have such a SQL specific shell or not.

[1] is the blog post that describes this capability.

Another KIP will represent additional KnoxShell improvements regarding KnoxShellTable that will make the use of the JDBC builder much easier and show why this particular usecase may not be required.

I will reference that KIP here once it is written.

UC-6: Custom GroovySh Commands

The Groovy shell exposes the ability to extend its available commands, helping automate scripting or coding that you would otherwise need to do programmatically over and over.

By providing custom commands for KnoxShellTable operations, builders, and manipulation, we can greatly simplify what would otherwise require the fluent API of KnoxShellTable plus Groovy/Java code for saving state, etc.

Commands to be proposed:

  • Datasources (:datasource|:ds) CRUD and select operations for a set of JDBC datasources that are persisted to disk
  • SQL (:SQL|:sql) SQL query execution with persisted SQL history per datasource
  • JSON and CSV (:JSON|:json, :CSV|:csv) Import and Export from CSV and JSON formats
  • Table (:Table|:table) General table operations and aggregation/summaries of a table or datasets
  • Mount (:Mount|:mnt) Ability to mount a Knox and topology with metadata describing the exposed UIs, APIs and datasources to a given name from metadata exposed by a Knox API
  • POSIX style filesystem commands for HDFS and cloud storage (:ls, :rm, :mkdir, :chmod, :chown, :cat, :put, etc)

Resources

  1. SQL Client Example using KnoxShell in Apache Knox
  2. KNOX-174 (Apache Knox Jira)
  3. KnoxShellTable (Apache Knox Jira)