Helium - Turning Zeppelin into a data analytics application platform
Motivation
Zeppelin provides a pluggable Interpreter architecture, which lets it support a wide variety of backend systems.
Each interpreter abstracts the complexity of its underlying computing framework (e.g. SparkInterpreter abstracts a Spark cluster) behind its own interface (e.g. SparkInterpreter exposes Scala, SQL, and Python).
There is also a powerful feature called the "Angular Display system" that lets users create their own front-end interfaces that interact with an interpreter.
And there is a "dependency loader" that lets users load libraries from a remote repository.
Putting it all together, one could imagine a full application platform on top of Apache Zeppelin.
So what I propose is a framework, code-named Helium, that turns Zeppelin into a data analytics application platform by:
- Leveraging computing resources provided by Interpreters
- Generalizing the dependency loader
- Providing an SDK on top of the Angular Display system
- Adding a package repository
What is a Helium Application?
Helium Application = View + Algorithm + Access to Resources
View
Anything you want to display inside a Zeppelin notebook.
Can be any standard HTML, CSS, and JavaScript.
View and Algorithm can interact.
Algorithm
The code you want to run, which can be any code that runs on the JVM.
Resource
Provided by an interpreter or by another Helium Application.
Every interpreter automatically provides the result of its last run.
Additionally, interpreters can provide their own resources (e.g. SparkContext).
Any user code in a Helium Application can also provide any resource it wants.
A resource can be any Java object.
So it can be data, it can be an abstraction of computing (e.g. SparkContext), it can be anything.
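To make the idea of a resource concrete, here is a minimal sketch of a pool that holds arbitrary Java objects under names. The class ResourcePool, its methods, and the resource names "lastResult" and "commitCount" are all hypothetical and chosen only for illustration; the actual PoC may organize resources differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the pool of resources living in an interpreter process.
class ResourcePool {
    private final Map<String, Object> resources = new LinkedHashMap<>();

    // Register a resource under a name, replacing any previous entry.
    void put(String name, Object resource) {
        resources.put(name, resource);
    }

    // Look up a resource; returns null when the name is unknown.
    Object get(String name) {
        return resources.get(name);
    }

    // Names of all resources, in insertion order.
    Set<String> names() {
        return resources.keySet();
    }
}

class ResourcePoolDemo {
    public static void main(String[] args) {
        ResourcePool pool = new ResourcePool();
        // An interpreter publishing the result of its last run (name is made up).
        pool.put("lastResult", "word\tcount\nzeppelin\t3");
        // An application publishing a resource of its own (name is made up).
        pool.put("commitCount", Integer.valueOf(42));
        System.out.println(pool.names());
    }
}
```

Because values are plain Objects, the same pool could hold a table of data, a SparkContext, or anything else, at the cost of a cast on the consuming side.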
How a Helium Application runs
Applications are packaged into Jars and published to a Maven repository.
A spec file in the package registry is also required.
Then, depending on the Resources available in the resource pool, Zeppelin automatically suggests Applications that the user can run.
When the user selects an Application, it is downloaded and run in the interpreter process where the resource exists.
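The suggestion step can be sketched as a simple filter: an application is offered when everything it requires is present in the pool. This sketch assumes resources are matched by name; matching by Java type would be an equally plausible design. AppSpec, Suggester, and the resource names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified spec entry: an application name plus required resource names.
class AppSpec {
    final String name;
    final Set<String> requiredResources;

    AppSpec(String name, String... requiredResources) {
        this.name = name;
        this.requiredResources = new HashSet<>(Arrays.asList(requiredResources));
    }
}

class Suggester {
    // Return the names of applications whose required resources are all available.
    static List<String> suggest(List<AppSpec> registry, Set<String> availableResources) {
        List<String> runnable = new ArrayList<>();
        for (AppSpec spec : registry) {
            if (availableResources.containsAll(spec.requiredResources)) {
                runnable.add(spec.name);
            }
        }
        return runnable;
    }
}

class SuggesterDemo {
    public static void main(String[] args) {
        List<AppSpec> registry = Arrays.asList(
            new AppSpec("wordcloud", "tableResult"),
            new AppSpec("sparkmon", "sparkContext"));
        // Only a table result is in the pool, so only wordcloud qualifies.
        Set<String> available = new HashSet<>(Arrays.asList("tableResult"));
        System.out.println(Suggester.suggest(registry, available));
    }
}
```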
SDK
A user Application extends the org.apache.zeppelin.helium.Application class in the SDK.
The SDK provides a development mode, so you can run an application inside Zeppelin without a full deployment.
In development mode, an application automatically re-reads its view as the HTML/CSS/JavaScript resources change, without a restart.
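As a rough picture of what extending the base class could look like, the sketch below defines its own stand-in Application class rather than using the real SDK. The method run, its signature, and the resource name "tableResult" are assumptions made for illustration; the real org.apache.zeppelin.helium.Application may expose a different interface.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the SDK base class; the real
// org.apache.zeppelin.helium.Application may expose different methods.
abstract class Application {
    // Algorithm: consume resources and return the view as an HTML string.
    abstract String run(Map<String, Object> resources);
}

// Example application: counts data rows in a tab-separated table result.
class RowCountApp extends Application {
    @Override
    String run(Map<String, Object> resources) {
        Object table = resources.get("tableResult"); // resource name is made up
        if (!(table instanceof String)) {
            return "<div>No table result available</div>";
        }
        String text = ((String) table).trim();
        // The first line is treated as the header, the remaining lines as rows.
        int rows = text.isEmpty() ? 0 : text.split("\n").length - 1;
        return "<div>Rows: " + rows + "</div>";
    }
}

class RowCountDemo {
    public static void main(String[] args) {
        Map<String, Object> resources = new HashMap<>();
        resources.put("tableResult", "word\tcount\nfoo\t1\nbar\t2");
        System.out.println(new RowCountApp().run(resources));
    }
}
```

The example keeps the three parts visible: the returned HTML is the View, the row counting is the Algorithm, and the map argument is the access to Resources.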
Here's a short video showing how the SDK works
Package Repository and spec file
A Helium Application is packaged as a standard Jar file, therefore it can be distributed through a Maven repository.
The Package Repository is simply a collection of spec files. Each spec file provides:
- Name of the Application
- Artifact name in the Maven repository
- Resources the application requires
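The three pieces of information above could be held in memory roughly as follows. The on-disk format of a spec file is not defined here, so this sketch only models the parsed result. PackageSpec, its field names, and the example coordinate are hypothetical, and the artifact is assumed to be written as a Maven coordinate of the form groupId:artifactId:version.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory form of one spec file; field names are illustrative.
class PackageSpec {
    final String name;            // name of the application
    final String artifact;        // assumed form: groupId:artifactId:version
    final List<String> requires;  // resources the application requires

    PackageSpec(String name, String artifact, List<String> requires) {
        this.name = name;
        this.artifact = artifact;
        this.requires = Collections.unmodifiableList(requires);
    }

    // Split the Maven coordinate into its three parts, rejecting malformed values.
    String[] coordinateParts() {
        String[] parts = artifact.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected groupId:artifactId:version, got: " + artifact);
        }
        return parts;
    }
}

class PackageSpecDemo {
    public static void main(String[] args) {
        PackageSpec spec = new PackageSpec(
            "Wordcloud",
            "org.example:zeppelin-wordcloud:0.1.0", // made-up coordinate
            Arrays.asList("tableResult"));
        System.out.println(spec.name + " -> " + Arrays.toString(spec.coordinateParts()));
    }
}
```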
The package repository is going to be maintained as a separate Git repository with its own homepage (like spark-packages.org for Spark packages), so any user can add their applications there without PMC review, which scales well.
There will be a bot that automatically merges pull requests with spec files into the master branch of the repo.
I propose the following repository:
https://github.com/zeppelin-project/helium-packages
Implementation
There is a proof-of-concept implementation:
https://github.com/Leemoonsoo/incubator-zeppelin/tree/helium
Application examples
I have created some example applications based on the PoC implementation.
Git commit data - data source
https://github.com/Leemoonsoo/zeppelin-gitcommitdata
Wordcloud - visualizes the paragraph's table result
https://github.com/Leemoonsoo/zeppelin-wordcloud
SparkMon - application that accesses Spark
https://github.com/Leemoonsoo/zeppelin-sparkmon
Video
Here's a video of the three example applications