
Project Proposal

[ZEPPELIN-684: Data mining, create notebooks with Analytics]


Name: Anish Singh

Email: anish18sun@gmail.com

Background:

Zeppelin, a web-based notebook, is a powerful tool for interactive data analysis that provides visualization and an easy-to-use interface on top of existing back-end analytics engines such as Spark, Flink, and Hive. A notebook is the primary way to analyze data in Zeppelin, using the available interpreters. Currently, only a few notebooks exist to demonstrate Zeppelin's visualization and analytics capabilities. However, as the number and variety of public data sets grows, more notebooks are needed to demonstrate Zeppelin's capabilities across various datasets.

Description:

While researching on the Internet, I found many promising public data sets:

  • Transportation data sets for understanding road accidents and disasters, made available by the European Union.

  • Data sets from supermarket stores for tracking customers in-store and doing real-time analytics.

  • Health-related data sets for tracking people's health and for genomic sequencing. MIRAGE (minimum information required for a glycomics experiment) is another field for which public data is available.

  • World financial data, such as balance of payments and other economic data, made available by the International Monetary Fund.

  • Data sets on online community interaction made available by Stanford University.

  • Environmental data about the changing climate and CO2 emissions, made available by the World Bank.

  • Data sets about people's usage of digital content and of the Internet, made available by the World Bank.

  • Other miscellaneous data sets are available on the websites of the World Bank, the European Union, and the United States government.

> Alex: are you familiar with the CommonCrawl dataset http://commoncrawl.org and the Warcbase project https://github.com/lintool/warcbase? It would be very cool to include those as well. It might also be better to pick fewer but more generic data sets that have a community around them which might be interested in and benefit from your work.

 

My main objective in the project would be to take up as many of the above-mentioned data sets as possible and create a notebook for each of them using the existing interpreter support. This would involve first examining the datasets to decide which interpreter works best for each one, and then writing the notebook. Helium functionality may also be added to enhance the notebooks.
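The dataset-examination step above can be sketched in plain Python. The sample data and column names below are invented for illustration; a real notebook would instead load a public data set through a Zeppelin interpreter such as Spark. A quick per-column profile (missing values, numeric vs. text) is one way to decide which parameters and visualizations a notebook should use:

```python
import csv
import io

# Hypothetical sample standing in for a public data set (e.g. a World Bank CSV).
SAMPLE = """country,year,co2_kt
France,2010,361273
France,2011,338805
Germany,2010,789561
Germany,2011,
"""

def is_number(value):
    """Return True if the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def profile(csv_text):
    """Per-column stats (missing count, numeric or not) to guide notebook design."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    stats = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        present = [v for v in values if v.strip()]
        stats[field] = {
            "missing": len(values) - len(present),
            "numeric": all(is_number(v) for v in present),
        }
    return stats

print(profile(SAMPLE))
```

Running this on the sample flags `co2_kt` as numeric with one missing value, which in a real notebook would suggest charting it over `year` after filtering incomplete rows.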

Deliverables:

> Alex: can you please have 2 sections here for mid-term and final deliverables? I.e. "a minimum of N notebooks before mid-term (at least K using Helium) and N after (at least L packaged through Helium)".

At the end of the project, the deliverables would include:

  • A set of notebooks for the data sets proposed above.

  • Documentation for the notebooks.

  • Results of any tests, and fixes for bugs encountered during the development phase.

Schedule:

April 22 – May 22: This time would be used to interact as much as possible with the community and the mentors, and to learn more about the project in general (how the various components work) and about suggested implementation methods in particular (such as the prospects of including Helium in the project).

May 23 – June 25: During the first half of the coding period, I would create at least two notebooks, in the order of priority chosen by the mentors. For each notebook, the timeline breaks down as follows:

  • Analysis of the dataset to decide the parameters for the notebook (2 days).

  • Writing the code for the notebook (10 days).

  • Documentation for the notebook (1–2 days).

June 26 – July 30: During the second half of the coding period, I would create at least two more notebooks, in the order decided by the mentors. For each notebook, the timeline breaks down as follows:

  • Analysis of the data set (2 days).

  • Writing the code for the notebook (8–9 days).

  • Documentation for the notebook (1–2 days).

The timeline breakdown for the second half of the project is the same as for the first half, although I expect that once the basics are learned and grasped, work in the second half will go more quickly than in the first. The four notebooks are not an upper bound; as mentioned above, if time permits I would not hesitate to create more notebooks on other data sets.

July 30 – Pencils Down: This period would be used for improving existing documentation, testing, bug fixing, and other enhancements to the notebooks created.

Other Commitments:

I will have my end-semester exams from April 25 – May 9, so I would only be able to begin the community engagement period from May 10 onwards. For the rest of the summer, I would be completely free, with no commitments other than this project. I could easily give 14 – 15 hours each day to the project to ensure its completion.

> Alex: let's keep realistic expectations here of no more than 8 working hours a day.

About Me:

I am a second-year undergraduate student majoring in Computer Science and Engineering at the LNM Institute of Information Technology. I have a passion for mathematics and computer science (especially big data analytics) and want to build my career on it. I have been learning the Java programming language since the 8th grade and have a lot of experience with object-oriented programming. Recently (December 2015), I came across Apache Spark as a data analytics engine while attempting to develop a share-price prediction program using the k-nearest neighbors algorithm, which got me interested in the Apache big data ecosystem. I spent January 2016 learning Scala so that I could write programs using Spark.

During the summer of 2015, I successfully completed a project given to me by the Computer Society of India (a college club): a game written in C++ using the Qt libraries. The project sparked my interest in open source. Link to the project on GitHub: https://github.com/anish18sun/Hyper-visor.git

I am deeply interested in the project and, given the chance, I would work as hard as possible to achieve the objectives laid down by the community. I want to remain a part of the Apache community, and my help and contributions will continue even after the Summer of Code period.
