...

Current State: [UNDER DISCUSSION]

Discussion Thread: [...] https://lists.apache.org/thread.html/6f638139bb77019a649ec7034783a650e1f558ef75acc1dda991d573@%3Cdev.zeppelin.apache.org%3E

JIRA: ZEPPELIN-2019


2. Motivation

Apache Zeppelin provides valuable features for table manipulation, such as built-in visualizations, pivoting, and CSV download. However, these features are limited by table size: they currently run in the browser, and the number of rows is capped (configurable, 1000 rows by default). Moving these computations from the browser to the backend is therefore a starting point for handling large data and for improving pivoting, filtering, full CSV download, pagination, and other functionality.

Furthermore, tables currently can’t be shared across interpreter processes. For example, a table from the JDBC interpreter isn’t accessible from the SparkSQL or Python interpreters. The idea here is to extend the existing Zeppelin resource pool to share Table resources across interpreters. This would also allow one central Table menu for accessing and viewing information about registered Table resources.

...

4. Public Interfaces

4.1. Interfaces for TableData related classes

The TableData interface defines methods for handling a table resource. Each interpreter can implement its own TableData. We can’t provide a single global TableData class for all interpreters because each interpreter uses different storage and a different mechanism to export/import data.
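To make the per-interpreter idea concrete, here is a minimal Python sketch. It is not the interface defined by this proposal: the method names `columns` and `rows`, the two implementations, and the `preview` helper are all assumptions chosen for illustration. It only shows how consumer code can depend on one interface while each implementation keeps its own storage.

```python
from abc import ABC, abstractmethod
from itertools import islice
from typing import Iterator, List, Sequence


class TableData(ABC):
    """Hypothetical minimal table interface (names are assumptions, not the proposal's API)."""

    @abstractmethod
    def columns(self) -> List[str]:
        """Return the column names in order."""

    @abstractmethod
    def rows(self) -> Iterator[Sequence[object]]:
        """Return a fresh iterator over the rows."""


class InMemoryTableData(TableData):
    """Hypothetical implementation that keeps rows in a Python list."""

    def __init__(self, columns, rows):
        self._columns = list(columns)
        self._rows = [tuple(r) for r in rows]

    def columns(self):
        return list(self._columns)

    def rows(self):
        return iter(self._rows)


class TsvTableData(TableData):
    """Hypothetical implementation backed by tab-separated text with a header line."""

    def __init__(self, text):
        self._lines = text.strip("\n").split("\n")

    def columns(self):
        return self._lines[0].split("\t")

    def rows(self):
        # Rows are parsed lazily, one line at a time.
        return (tuple(line.split("\t")) for line in self._lines[1:])


def preview(table: TableData, limit: int) -> list:
    """Return at most `limit` rows; works with any TableData implementation."""
    return [tuple(row) for row in islice(table.rows(), limit)]
```

Because `preview` only touches the interface, the same call works whether the rows live in memory, in text, or (in a real interpreter) behind a JDBC cursor or a Spark DataFrame.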

...

4.2. Example Implementation: ZeppelinResourcePool as Spark Data Source

(image copied from https://databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html)

 

Spark supports pluggable data sources. We can make Zeppelin’s DistributedResourcePool a Spark data source using the Spark DataSource API. Please refer to these articles for more information.

 

4.2.1. BaseRelation Implementation

...

  • For interpreters which use SQL

    • provide an interpreter option: create TableData whenever executing a paragraph

    • or provide new interpreter magic for it: %spark.sql_share, %jdbc.mysql_share, …

    • or automatically put all table results into the resource pool if they are not heavy (e.g., keeping only the query, or just a reference to the RDD)

    • If the interpreter supports runtime interpreter parameters, we can use this syntax: %jdbc(share=true) to specify whether to share the table result or not

  • For interpreters which use a programming language (e.g., Python)

    • provide API like z.put()

      Code Block (Scala)
      // infer the instance type and convert it to the predefined `TableData` subclass, such as `SparkDataFrameTableData`
      z.put("myTable01", myDataFrame01)
      
      // or require the user to pass a `TableData` subclass explicitly
      val myTableData01 = new SparkRDDTableData(myRdd01)
      z.put("myTable01", myTableData01)
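The two `z.put()` variants above (infer the type, or accept an explicit `TableData`) could be combined in one entry point. The following Python sketch shows one way to do that, under stated assumptions: `ResourcePool`, `register_converter`, `ListOfDictsTableData`, and `is_records` are hypothetical names invented for this illustration and are not part of Zeppelin's API.

```python
class TableData:
    """Hypothetical base class standing in for the proposal's TableData."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]


class ListOfDictsTableData(TableData):
    """Hypothetical wrapper around a list of dict records."""

    def __init__(self, records):
        columns = list(records[0].keys()) if records else []
        super().__init__(columns, [[rec.get(c) for c in columns] for rec in records])


def is_records(value):
    """True if `value` is a list whose items are all dicts."""
    return isinstance(value, list) and all(isinstance(r, dict) for r in value)


class ResourcePool:
    """Hypothetical in-process pool; a real pool would span interpreter processes."""

    def __init__(self):
        self._resources = {}
        self._converters = []  # (predicate, factory) pairs, checked in order

    def register_converter(self, predicate, factory):
        self._converters.append((predicate, factory))

    def put(self, name, value):
        # An explicit TableData is stored as-is; anything else goes through inference.
        if not isinstance(value, TableData):
            value = self._infer(value)
        self._resources[name] = value

    def get(self, name):
        return self._resources[name]

    def _infer(self, value):
        for predicate, factory in self._converters:
            if predicate(value):
                return factory(value)
        raise TypeError("no TableData converter for " + type(value).__name__)
```

With this shape, supporting a new instance type means registering one more converter, and unsupported values fail loudly instead of being stored in a form other interpreters cannot read.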

       

...

The issues discussed above can be implemented in the following order of priority:

  • ZEPPELIN-TBD: Adding pivot, filter methods to TableData

  • ZEPPELIN-TBD: ResourceRegistry

  • ZEPPELIN-TBD: Rest API for resource pool

  • ZEPPELIN-TBD: UI for Table page

  • ZEPPELIN-TBD: Apply pivot, filter methods for built-in visualizations

  • ZEPPELIN-TBD: SparkTableData, SparkSQLTableData, JDBCTableData, etc.

  • ZEPPELIN-2029: ACL for ResourcePool

  • ZEPPELIN-2022: Zeppelin resource pool as a Spark Data Source

...

  • Watch / Unwatch: automatic paragraph updating for streaming data representation.

  • ZEPPELIN-1494: Bind JDBC result to a dataset on the Zeppelin context

  • Ability to construct a table result from the resource pool in language interpreters (e.g., Python)

    • Let’s assume that we can build a pandas data frame using TableData

      Code Block (Python)
      # in the Python interpreter
      
      t = z.get("tableResourceName")  # returns an object that has `hasNext` and `next`
      p = PandasTableData(t)
      
      # use p.pandasInstance …
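As a rough illustration of what a wrapper like `PandasTableData` would have to do, the sketch below drains an object exposing `hasNext` and `next` into a column-oriented dict. `ListTableResource` is a hypothetical stand-in for whatever `z.get(...)` returns, and its `columns` attribute is an assumption; only `hasNext` and `next` come from the text above. The sketch avoids importing pandas so it runs anywhere; a dict of equal-length lists is a form that `pandas.DataFrame(...)` accepts directly.

```python
class ListTableResource:
    """Hypothetical stand-in for the object returned by z.get(...)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)  # assumed attribute, not stated in the proposal
        self._rows = list(rows)
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._rows)

    def next(self):
        row = self._rows[self._pos]
        self._pos += 1
        return row


def to_columns(resource):
    """Drain the resource row by row into a dict mapping column name to values."""
    data = {name: [] for name in resource.columns}
    while resource.hasNext():
        row = resource.next()
        for name, value in zip(resource.columns, row):
            data[name].append(value)
    return data
```

Note that draining consumes the iterator, so a wrapper that needs to read the table more than once would have to cache the result or request a new resource.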