Project Proposal

[Zeppelin – 684 : Data mining, create notebooks with Analytics]


Name : Anish Singh

Email : anish18sun@gmail.com

...

Many candidate emerging data sets exist, as I found while researching on the Internet :

  1. Emerging data sets

...

  1. made available by the European Union

...

  1. on a number of subjects such as transportation, education, communication, population, economy and health.

  2. Health-related data sets to track the health of people and to do genomic sequencing. MIRAGE (minimum information required for a glycomics (molecule) experiment) is another field for which public data is available.

  3. World financial data, such as the Balance of Payments and other economic data, made available by the International Monetary Fund.

  4. Data

...

  1. sets made available by Stanford University on topics such as on-line community interaction.

  2. Data sets made available by the

...

Environment data about the changing climate, CO2 emissions made available by the World Bank.

...

Data set about the usage of digital content by people and usage of Internet made available by the World Bank.

  1. World Bank on various subjects such as poverty, income, population, growth (in GDP), environment (CO2 emissions) and disease patterns across the world.

  2. Amazon Web Services public datasets provide a huge resource of datasets, such as the Common Crawl dataset, which can be analyzed for almost any information on the web using tools such as the Warcbase project

...

  1. .

My main objective in the project would be to take up as many of the above-mentioned data sets as possible and create a notebook for each of them using the existing support for the various interpreters. This would involve first examining the datasets to decide which interpreter would work best for which dataset, and then writing the notebook. Helium functionality may also be added to enhance the notebooks.


The main interpreters that I propose to use in the project are Spark and Flink. Spark has a variety of powerful features that make it suitable for the analysis of datasets. Spark's MLlib machine learning library may be used to build regression models of the datasets and to predict values for the test data based on the training data set. Regression analysis may be achieved through built-in classes such as 'LinearRegressionWithSGD' available in MLlib. Other forms of analysis (such as classification and clustering) may also be performed using these libraries.
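As a rough illustration of the regression workflow described above (fit a model on training data, then check its predictions on held-out test data), here is a minimal, self-contained Python sketch. It does not use Spark. The function names `train_linear_sgd` and `mean_squared_error`, the step size, and the synthetic data are all assumptions made for illustration; in an actual notebook the fitting step would be delegated to MLlib rather than written by hand.

```python
import random


def train_linear_sgd(points, step_size=0.05, iterations=500, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error.

    Hypothetical helper for illustration; not part of Spark or Zeppelin.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(points)
    for _ in range(iterations):
        rng.shuffle(data)  # visit the samples in a new random order each pass
        for x, y in data:
            error = (w * x + b) - y
            # Gradient of 0.5 * error**2 with respect to w and b
            w -= step_size * error * x
            b -= step_size * error
    return w, b


def mean_squared_error(points, w, b):
    """Average squared prediction error of the line w*x + b on the points."""
    return sum(((w * x + b) - y) ** 2 for x, y in points) / len(points)


# Synthetic, noise-free data generated from y = 2x + 1 (illustrative only,
# not taken from any of the data sets listed in this proposal).
samples = [(i / 10.0, 2.0 * (i / 10.0) + 1.0) for i in range(20)]

# Simple alternating split into training and test sets.
train, test = samples[::2], samples[1::2]

w, b = train_linear_sgd(train)
test_mse = mean_squared_error(test, w, b)
print("slope=%.4f intercept=%.4f test MSE=%.6f" % (w, b, test_mse))
```

The same split-train-evaluate pattern would apply to a real data set, with the data loading and the model fitting handled by the chosen interpreter.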

Helium, which is a pluggable tool on top of Zeppelin, may be used to run packaged user code as an application inside a notebook, based on the resources available in the Zeppelin resource pool. It may be used to create custom visualizations or enhancements to the analysis performed on the datasets for which the notebooks are created.

Deliverables :

Deliverables before the mid-term would include : 

  • A set of at least 2 notebooks for the above proposed datasets with at least 1 using Helium functionality.

At the end of the project, the deliverables would include :

  • A set of at least 6 notebooks for the above proposed data sets with at least 2 using Helium functionality.

  • Documentation for the notebooks.

  • Results of any tests and bug fixes that were encountered during the development phase.

I also propose to keep a blog to document my progress on the creation of the notebooks throughout the development cycle. The blog would contain detailed explanations of the various models implemented for the analysis of the datasets, and it would welcome suggestions from anyone interested in commenting or advising on the notebooks being created or the approach taken to implement them.

Schedule :

April 22 – May 22 : This time would be used to interact as much as possible with the community and the mentors, and to learn more about the project in general (the working of its various components) and about suggested methods of implementation in particular (such as the prospects of including Helium in the project).

...

The timeline breakdown for the second half of the project is the same as for the first half, although I expect that once things are learned and grasped, work in the second half would be quicker than in the first. The creation of six notebooks is not an upper bound; as mentioned above, if time permits I would not hesitate to create more notebooks (more than six) on other data sets.

July 30 – Pencils Down : This period would be used to improve existing documentation, testing, bug fixing, and other enhancements on the notebooks created.

...

I have my end-of-semester exams from April 25 – May 9, so I would only be able to begin the community engagement period from May 10 onwards. For the rest of the summer, I would be completely free, with no commitments other than this project. I would easily be able to give 8 hours each day to the project to ensure its completion at all costs.

...