Helium - Turning Zeppelin into a data analytics application platform

Motivation

Zeppelin provides a pluggable Interpreter architecture, which results in a wide variety of supported backend systems.
Each interpreter abstracts the complexity of the underlying computing framework (e.g. SparkInterpreter abstracts a Spark cluster) behind its own interface (e.g. SparkInterpreter provides Scala/SQL/Python as its interface).

There is also a powerful feature called the "Angular Display system" that lets users create their own front-end interface that interacts with the interpreter.
And there is a "dependency loader" that lets them load libraries from a remote repository.

Putting it all together, one could imagine a full application platform on top of Apache Zeppelin.
So what I propose is a framework, code-named Helium, that turns Zeppelin into a data analytics application platform by:

- Leveraging computing resources provided by Interpreters
- Generalizing the dependency loader
- Providing an SDK on top of the Angular Display system
- Adding a package repository

What is a Helium Application?

Helium Application = View + Algorithm + Access to Resources

View

Anything you want to display inside a Zeppelin notebook.
It can be any standard HTML, CSS, and JavaScript.
The View and the Algorithm can interact.

Algorithm

The code you want to run, which can be any code that runs on the JVM.

Resource

Provided by an interpreter or by another Helium Application.

Every interpreter automatically provides the result of its last run.
Additionally, interpreters can provide their own resources (e.g. SparkContext).
Any user code in a Helium Application can also provide any resource it wants.

A resource can be any Java object.
So it can be data, it can be an abstraction of computing (e.g. SparkContext), it can be anything.
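
To make the idea of "any Java object as a resource" concrete, here is a minimal sketch of a named resource pool. It is not taken from the PoC code: the class name ResourcePoolSketch, its methods, and the resource names "lastResult" and "rowCount" are all hypothetical and used only for illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: a pool that maps resource names to arbitrary Java objects.
// None of these names come from the Zeppelin codebase.
final class ResourcePoolSketch {
    private final Map<String, Object> resources = new ConcurrentHashMap<>();

    // Interpreters or applications register a resource under a name.
    void put(String name, Object resource) {
        resources.put(name, resource);
    }

    // Look up a resource and return it only if it has the expected type.
    <T> Optional<T> get(String name, Class<T> type) {
        Object value = resources.get(name);
        return type.isInstance(value) ? Optional.of(type.cast(value)) : Optional.empty();
    }

    boolean contains(String name) {
        return resources.containsKey(name);
    }

    public static void main(String[] args) {
        ResourcePoolSketch pool = new ResourcePoolSketch();
        pool.put("lastResult", "word\tcount\nzeppelin\t3"); // e.g. a paragraph's table result
        pool.put("rowCount", 1);
        System.out.println(pool.get("lastResult", String.class).isPresent()); // prints true
        System.out.println(pool.get("rowCount", String.class).isPresent());   // prints false (wrong type)
    }
}
```

The typed lookup is one possible design choice: since a resource can be anything, a consumer would likely need some way to check that what it finds is what it expects.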

How a Helium Application runs

Applications are packaged into Jars and published to a Maven repository.
A spec file in the package registry is also required.

Then, depending on the Resources available in the resource pool, Zeppelin automatically suggests possible Applications that the user can run.
When the user selects an Application, it is downloaded and run in the interpreter process where the resource exists.
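
The suggestion step described above can be sketched as a simple matching rule: an Application is suggested when every resource it requires is present in the pool. This is an assumed rule, not the PoC's actual logic, and the names SuggestionSketch, AppSpec, and the resource names used below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of "suggest applications based on available resources".
final class SuggestionSketch {

    // Illustrative registry entry: an application name plus the resource names it requires.
    static final class AppSpec {
        final String name;
        final Set<String> requiredResources;

        AppSpec(String name, String... requiredResources) {
            this.name = name;
            this.requiredResources = new HashSet<>(Arrays.asList(requiredResources));
        }
    }

    // Assumed rule: suggest an application only if all of its required resources are available.
    static List<String> suggest(List<AppSpec> registry, Set<String> availableResources) {
        List<String> suggestions = new ArrayList<>();
        for (AppSpec spec : registry) {
            if (availableResources.containsAll(spec.requiredResources)) {
                suggestions.add(spec.name);
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        List<AppSpec> registry = Arrays.asList(
            new AppSpec("wordcloud", "lastResult"),     // resource names are assumptions
            new AppSpec("sparkmon", "sparkContext"));
        Set<String> available = new HashSet<>(Arrays.asList("lastResult"));
        System.out.println(suggest(registry, available)); // prints [wordcloud]
    }
}
```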

SDK

A user Application extends the org.apache.zeppelin.helium.Application class in the SDK.

The SDK provides a development mode, so you can run an application inside Zeppelin without a full deployment.
In development mode, an application automatically re-reads its view as the HTML/CSS/JavaScript resources change, without a restart.

Here's a short video of how the SDK works
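
The proposal does not say how development mode detects view changes. One simple way to get this behavior, shown here purely as an assumption, is to compare the view file's last-modified time on each access and re-read the file only when it has changed. The class ViewReloaderSketch is hypothetical and is not part of the SDK.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;

// Hypothetical sketch: reload a view file when its modification time changes.
final class ViewReloaderSketch {
    private final Path viewFile;
    private FileTime lastLoaded; // modification time seen at the last load
    private String cachedView;   // contents read at the last load

    ViewReloaderSketch(Path viewFile) {
        this.viewFile = viewFile;
    }

    // Returns the current view, re-reading the file only if it changed since the last load.
    String currentView() throws IOException {
        FileTime modified = Files.getLastModifiedTime(viewFile);
        if (cachedView == null || !modified.equals(lastLoaded)) {
            cachedView = new String(Files.readAllBytes(viewFile), StandardCharsets.UTF_8);
            lastLoaded = modified;
        }
        return cachedView;
    }
}
```

A caller would construct one reloader per view file and call currentView() each time the view is rendered, so edits to the file show up without restarting the application.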

Package Repository and spec file

A Helium Application is packaged into a standard Jar file, therefore it can be distributed through a Maven repository.
The Package Repository is actually a collection of spec files. Each spec file provides the following information:

- Name of Application
- Artifact name in maven repository
- Resources this application requires
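
The spec file format is not defined in this proposal. As an illustration only, the sketch below assumes a Java properties-style file with three hypothetical keys (name, artifact, requires) that mirror the three items listed above; the example artifact coordinates are made up.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical sketch of a spec file entry; the keys and format are assumptions.
final class SpecFileSketch {
    final String name;                    // name of the application
    final String artifact;                // Maven artifact, e.g. groupId:artifactId:version
    final List<String> requiredResources; // resources this application requires

    SpecFileSketch(String name, String artifact, List<String> requiredResources) {
        this.name = name;
        this.artifact = artifact;
        this.requiredResources = requiredResources;
    }

    // Parses properties-style text with the keys "name", "artifact" and "requires".
    static SpecFileSketch parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String requires = props.getProperty("requires", "").trim();
        List<String> resources = requires.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(requires.split("\\s*,\\s*"));
        return new SpecFileSketch(props.getProperty("name"), props.getProperty("artifact"), resources);
    }

    public static void main(String[] args) {
        SpecFileSketch spec = parse(
            "name=wordcloud\n"
            + "artifact=org.example:zeppelin-wordcloud:0.1.0\n" // made-up coordinates
            + "requires=lastResult\n");
        System.out.println(spec.name + " -> " + spec.artifact + " requires " + spec.requiredResources);
    }
}
```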

The package repository is going to be maintained as a separate Git repository with its own homepage (like spark-packages.org for Spark packages), so any user can add their applications there without PMC review, which scales well.
There will be a bot that automatically merges pull requests with spec files into the master branch of the repo.

I propose the repository
https://github.com/zeppelin-project/helium-packages

Implementation

There is a proof-of-concept implementation.
https://github.com/Leemoonsoo/incubator-zeppelin/tree/helium

Application examples

I have created some example applications based on the PoC implementation.

Git commit data - datasource
https://github.com/Leemoonsoo/zeppelin-gitcommitdata

Wordcloud - visualizes the paragraph's table result
https://github.com/Leemoonsoo/zeppelin-wordcloud

SparkMon - application that accesses Spark
https://github.com/Leemoonsoo/zeppelin-sparkmon

Video

Here's a video of the three example applications
