1. Status
Current State: [UNDER DISCUSSION]
Discussion Thread: [...]
JIRA: ZEPPELIN-2019
2. Motivation
Apache Zeppelin provides valuable features for table manipulation, such as built-in visualizations, pivoting, and CSV download. However, these features are limited by table size: they currently run on the browser side, and the number of rows is capped (configurable, 1000 rows by default). Moving these computations from the browser to the backend is therefore a starting point for handling large data and improving pivoting, filtering, full CSV download, pagination, and other functionality.
Furthermore, tables currently can't be shared across interpreter processes. For example, a table from the JDBC interpreter isn't accessible from the SparkSQL or Python interpreters. So the idea here is to extend Zeppelin's ResourcePool to share Table resources across interpreters. This would also allow one central Table menu for accessing and viewing the information of registered Table resources.
Thus the critical question is: "How can Zeppelin support large data handling and share tables across interpreters?" The following already-resolved issues provide clues to solving the problem.
ZEPPELIN-753: added an abstraction for table data
ZEPPELIN-2020: implemented remote method invocation
Based on these works, this proposal aims to build a mechanism for handling table resources in the backend and to design an API for the resource pool. This will enable Zeppelin to
register the table result as a shared resource
list all available (registered) tables
preview tables, including their meta information (e.g. columns, types, ...)
download registered tables as CSV and other formats
pivot / filter in the backend to transform larger data
cross-join tables from different interpreters (e.g. the Spark interpreter uses a table result generated from the JDBC interpreter)
For more future work tasks, please refer to the 6. Potential Future Work section.
3. Proposed Changes
3.1. Overview: Sharing a table resource between different interpreters
This diagram shows how the Spark interpreter can query a table generated from the JDBC interpreter.
In an interpreter (A), a newly created table result can be registered as a resource.
Since every resource registered in an interpreter's resource pool can be searched via `DistributedResourcePool` and supports remote method invocation, other interpreters (B) can use it.
Let's say the JDBCInterpreter creates a table result and keeps it (as a JDBCTableData) in its resource pool.
Then the SparkInterpreter can fetch its rows and columns via remote method invocation. If Zeppelin registers the distributed resource pool as a Spark Data Source, the SparkInterpreter can use all table resources in Zeppelin smoothly (e.g. querying the table in SparkSQL like a normal table).
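The sharing flow above can be sketched in miniature. In this hedged sketch, a plain map stands in for the `DistributedResourcePool`, and direct method calls stand in for remote method invocation; all class, field, and resource names here are illustrative, not Zeppelin's actual API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharingSketch {
  // Stand-in for a registered table resource; in Zeppelin this would be a
  // TableData implementation whose methods are invoked remotely.
  static class TableResource {
    final List<String> columns;
    final List<List<Object>> rows;

    TableResource(List<String> columns, List<List<Object>> rows) {
      this.columns = columns;
      this.rows = rows;
    }
  }

  // Plain map standing in for the DistributedResourcePool.
  static final Map<String, TableResource> pool = new HashMap<>();

  public static void main(String[] args) {
    // Interpreter A (e.g. JDBC) registers its table result under a name.
    pool.put("jdbc.users", new TableResource(
        Arrays.asList("id", "name"),
        Arrays.asList(Arrays.asList((Object) 1, "alice"))));

    // Interpreter B (e.g. Spark) looks the resource up by name and reads its
    // columns/rows; in Zeppelin these reads would cross process boundaries
    // via remote method invocation.
    TableResource shared = pool.get("jdbc.users");
    System.out.println(shared.columns); // prints [id, name]
  }
}
```

The key design point is that interpreter B never needs interpreter A's storage or driver; it only needs the resource name and the method-invocation channel.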
3.2. Overview: How an interpreter can handle table resources
Here is a more detailed view explaining how one interpreter can handle its TableData implementation with the resource pool.
4. Public Interfaces
4.1. Interfaces for TableData related classes
The TableData interface defines methods for handling a table resource. Each interpreter can implement its own TableData. The reason we can't make one global TableData class for all interpreters is that each interpreter uses different storage and a different mechanism to export/import data.
Class | How it gets table data
---|---
InterpreterResultTableData | Contains the actual data in memory
Interpreter-specific TableData (e.g. SparkTableData, SparkSQLTableData, ...) | Knows how to reproduce the original table data (e.g. keeps the query in the case of JDBC, SparkSQL)
4.1.1. Additional methods for TableData
```java
public interface TableData {
  ...

  /**
   * Filter the input `TableData` based on columns.
   */
  public TableData filter(List<String> columnNames);

  /**
   * Pivot the input `TableData` for visualizations.
   */
  public TableData pivot(List<String> keyColumns,
                         List<String> groupColumns,
                         List<String> valueColumns);

  ...
}
```
Each interpreter can implement its own TableData class. For example,
the SparkInterpreter can have a SparkTableData, which
points to an RDD to get the table result
implements filter and pivot using Spark RDD APIs
the JDBCInterpreter can have a JDBCTableData, which
keeps the query to reproduce the table result
implements filter and pivot using a query with additional `where` and `group by` clauses.
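The JDBC case can be illustrated by query rewriting: instead of materializing rows, a JDBCTableData could derive a new query from the one it keeps. This is a hedged sketch, not Zeppelin's actual implementation; the class name, the helper signatures, and the single-value-column `SUM` pivot are all simplifying assumptions.

```java
import java.util.Arrays;
import java.util.List;

public class JdbcQueryRewriter {
  // filter(columns): project the stored query onto the selected columns by
  // wrapping it as a subquery.
  static String filter(String baseQuery, List<String> columnNames) {
    return "SELECT " + String.join(", ", columnNames)
        + " FROM (" + baseQuery + ") t";
  }

  // pivot(keys, value): aggregate one value column grouped by the key
  // columns (simplified relative to the proposed pivot() signature).
  static String pivot(String baseQuery, List<String> keyColumns, String valueColumn) {
    String keys = String.join(", ", keyColumns);
    return "SELECT " + keys + ", SUM(" + valueColumn + ") AS " + valueColumn
        + " FROM (" + baseQuery + ") t GROUP BY " + keys;
  }

  public static void main(String[] args) {
    // prints: SELECT region, amount FROM (SELECT * FROM sales) t
    System.out.println(filter("SELECT * FROM sales", Arrays.asList("region", "amount")));
  }
}
```

Because the result of each rewrite is itself a query, filter and pivot compose without ever pulling the full table into the interpreter process.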
Some interpreters (e.g. the ShellInterpreter) might not be connected to external storage. In this case, those interpreters can use the InterpreterResultTableData class, which holds the rows in memory.
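To make the contract concrete, here is a minimal in-memory sketch along the lines of InterpreterResultTableData, showing only the column-projection `filter`. The nested class names and method bodies are illustrative assumptions, not Zeppelin's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableDataSketch {
  // Trimmed-down view of the TableData contract from section 4.1.1.
  interface TableData {
    List<String> columns();
    List<List<Object>> rows();
    TableData filter(List<String> columnNames);
  }

  // In-memory implementation: holds the actual data, like
  // InterpreterResultTableData.
  static class InMemoryTableData implements TableData {
    private final List<String> columns;
    private final List<List<Object>> rows;

    InMemoryTableData(List<String> columns, List<List<Object>> rows) {
      this.columns = columns;
      this.rows = rows;
    }

    public List<String> columns() { return columns; }
    public List<List<Object>> rows() { return rows; }

    // filter(columns): project every row onto the requested columns and
    // return a new TableData, leaving this one untouched.
    public TableData filter(List<String> columnNames) {
      List<List<Object>> projected = new ArrayList<>();
      for (List<Object> row : rows) {
        List<Object> newRow = new ArrayList<>();
        for (String c : columnNames) {
          newRow.add(row.get(columns.indexOf(c)));
        }
        projected.add(newRow);
      }
      return new InMemoryTableData(columnNames, projected);
    }
  }

  public static void main(String[] args) {
    TableData t = new InMemoryTableData(
        Arrays.asList("name", "age", "city"),
        Arrays.asList(
            Arrays.asList((Object) "alice", 30, "SF"),
            Arrays.asList((Object) "bob", 25, "NY")));
    // prints [[alice, SF], [bob, NY]]
    System.out.println(t.filter(Arrays.asList("name", "city")).rows());
  }
}
```

A storage-backed implementation (SparkTableData, JDBCTableData) would keep the same interface but replace the in-memory loop with RDD operations or query rewriting.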
4.2. Example Implementation: ZeppelinResourcePool as Spark Data Source