Proposal -- Support Kudu as one of Tajo’s storage [TAJO-2046]

Google Summer of Code 2016

Project Proposal [TAJO-2046]

Support Kudu as one of Tajo’s storage

Table of Contents

Background
Deliverables
Implementation
Timeline
Community Engagement
Further Development
Other Commitments
About Me

Background

As data volumes have grown rapidly, storing and analyzing massive datasets has become an important issue for many enterprises. For this reason, distributed systems such as Apache Hadoop, Apache HBase, and Apache Cassandra have been introduced. Hadoop offers high-throughput sequential access but is weak at updating individual records and at efficient random access. Systems like HBase and Cassandra, on the other hand, are good at low-latency record-level reads and writes but weak at sequential read throughput. Apache Kudu is a new storage system designed to fill the gap between these two kinds of systems: it aims for both high-throughput sequential access and low-latency random access.

This document proposes adding support for Kudu as one of Tajo’s storage backends.

Deliverables

A new submodule that supports Kudu as one of Tajo’s storage backends
Documentation on establishing a connection between Tajo and Kudu
A user guide: how to integrate Kudu with Tajo
Unit tests and their results

Implementation

The figure above shows how the storage module connects to external storage systems. The storageManager runs inside the Tajo Worker and connects to external storage to fetch data.

The following issues must be considered when implementing the submodule that connects Tajo and Kudu:

Implement KuduScanner and KuduAppender

Split read

Consider how to read only the part of the data specified by the given fragment.

Type conversion

Kudu data types and Tajo’s internal representation must be compatible.
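As a rough illustration of the type conversion concern, the sketch below maps one set of type names to another in both directions. Both enums are local stand-ins written for this example, not the real Kudu or Tajo classes, and the listed types are an assumed subset. One detail worth noting is that a name such as INT8 can mean an 8-bit integer on one side and an 8-byte integer on the other, so the mapping cannot rely on names alone.

```java
import java.util.EnumMap;
import java.util.Map;

/** Illustrative type mapping. Both enums are local stand-ins, not real Kudu or Tajo classes. */
final class KuduTajoTypeMapping {

    /** Assumed subset of the column types a Kudu table can declare (sizes in bits). */
    enum KuduType { BOOL, INT8, INT16, INT32, INT64, FLOAT, DOUBLE, STRING, BINARY }

    /** Assumed subset of Tajo's data types (sizes in bytes). */
    enum TajoType { BOOLEAN, INT1, INT2, INT4, INT8, FLOAT4, FLOAT8, TEXT, BLOB }

    private static final Map<KuduType, TajoType> KUDU_TO_TAJO = new EnumMap<>(KuduType.class);
    private static final Map<TajoType, KuduType> TAJO_TO_KUDU = new EnumMap<>(TajoType.class);

    /** Records one pair in both lookup tables so the mapping stays symmetric. */
    private static void register(KuduType kudu, TajoType tajo) {
        KUDU_TO_TAJO.put(kudu, tajo);
        TAJO_TO_KUDU.put(tajo, kudu);
    }

    static {
        register(KuduType.BOOL, TajoType.BOOLEAN);
        register(KuduType.INT8, TajoType.INT1);    // 8-bit integer
        register(KuduType.INT16, TajoType.INT2);   // 16-bit integer
        register(KuduType.INT32, TajoType.INT4);   // 32-bit integer
        register(KuduType.INT64, TajoType.INT8);   // 64-bit integer
        register(KuduType.FLOAT, TajoType.FLOAT4);
        register(KuduType.DOUBLE, TajoType.FLOAT8);
        register(KuduType.STRING, TajoType.TEXT);
        register(KuduType.BINARY, TajoType.BLOB);
    }

    /** Converts a Kudu-side type to its Tajo-side counterpart. */
    static TajoType toTajo(KuduType kuduType) {
        TajoType result = KUDU_TO_TAJO.get(kuduType);
        if (result == null) {
            throw new IllegalArgumentException("No Tajo type for " + kuduType);
        }
        return result;
    }

    /** Converts a Tajo-side type to its Kudu-side counterpart. */
    static KuduType toKudu(TajoType tajoType) {
        KuduType result = TAJO_TO_KUDU.get(tajoType);
        if (result == null) {
            throw new IllegalArgumentException("No Kudu type for " + tajoType);
        }
        return result;
    }

    public static void main(String[] args) {
        for (KuduType kuduType : KuduType.values()) {
            System.out.println(kuduType + " -> " + toTajo(kuduType));
        }
    }
}
```

Keeping the two lookup tables in one `register` call is a design choice of this sketch: it makes it harder for the scanner (Kudu to Tajo) and the appender (Tajo to Kudu) to disagree about a type.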

Projection push down

Tajo needs to be able to read only the columns that a query requires.
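The idea of projection push down can be sketched as follows. This is a self-contained simulation, not Tajo or Kudu code: the class, the method names, and the in-memory rows are all hypothetical. In an actual implementation, the projected column list would presumably be handed to the Kudu client (its scanner builder accepts projected column names) so that unneeded columns never leave the storage layer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative projection push down over in-memory rows; all names are hypothetical. */
final class ProjectionPushDownSketch {

    /** Returns the columns the query references, in table-schema order. */
    static List<String> projectedColumns(List<String> tableColumns, Set<String> referenced) {
        if (!tableColumns.containsAll(referenced)) {
            throw new IllegalArgumentException("Query references an unknown column");
        }
        List<String> projected = new ArrayList<>();
        for (String column : tableColumns) {
            if (referenced.contains(column)) {
                projected.add(column);
            }
        }
        return projected;
    }

    /** Simulates a storage-side scan that materializes only the projected columns. */
    static List<Object[]> scan(List<String> tableColumns, List<Object[]> rows,
                               List<String> projected) {
        int[] indexes = new int[projected.size()];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = tableColumns.indexOf(projected.get(i));
        }
        List<Object[]> result = new ArrayList<>();
        for (Object[] row : rows) {
            Object[] narrowed = new Object[indexes.length];
            for (int i = 0; i < indexes.length; i++) {
                narrowed[i] = row[indexes[i]];
            }
            result.add(narrowed);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tableColumns = Arrays.asList("id", "name", "age", "city");
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {1, "kim", 30, "Seoul"});
        rows.add(new Object[] {2, "lee", 25, "Busan"});

        // A query such as "SELECT id, city FROM t" references only two columns.
        Set<String> referenced = new HashSet<>(Arrays.asList("city", "id"));
        List<String> projected = projectedColumns(tableColumns, referenced);

        for (Object[] row : scan(tableColumns, rows, projected)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```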

Implement KuduTableSpace

Split generation

Decide on a rule for dividing the data for distributed processing.

Implement KuduFragment

Holds the information about which part of the data each task will process.
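To make the relationship between split generation, fragments, and split reads more concrete, here is a minimal sketch under stated assumptions. The `Fragment` class, the one-fragment-per-tablet rule, the string keys, and the use of an empty string for an unbounded range end are all choices made for this example; the actual KuduFragment and KuduTableSpace designs are still to be decided during the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative split generation; none of these classes are real Tajo or Kudu types. */
final class KuduSplitSketch {

    /** Hypothetical fragment: the slice of a table that a single task processes. */
    static final class Fragment {
        final String tableName;
        final String startKey; // inclusive; "" means unbounded (a convention of this sketch)
        final String endKey;   // exclusive; "" means unbounded
        final String host;     // server holding the data, usable as a locality hint

        Fragment(String tableName, String startKey, String endKey, String host) {
            this.tableName = tableName;
            this.startKey = startKey;
            this.endKey = endKey;
            this.host = host;
        }

        /** True if the key falls inside this fragment's range. */
        boolean contains(String key) {
            boolean afterStart = startKey.isEmpty() || key.compareTo(startKey) >= 0;
            boolean beforeEnd = endKey.isEmpty() || key.compareTo(endKey) < 0;
            return afterStart && beforeEnd;
        }
    }

    /** Assumed rule: one fragment per tablet. Each tablet is {startKey, endKey, host}. */
    static List<Fragment> getSplits(String tableName, List<String[]> tablets) {
        List<Fragment> fragments = new ArrayList<>();
        for (String[] tablet : tablets) {
            fragments.add(new Fragment(tableName, tablet[0], tablet[1], tablet[2]));
        }
        return fragments;
    }

    /** Split read: returns only the keys that belong to the given fragment. */
    static List<String> read(Fragment fragment, List<String> allKeys) {
        List<String> result = new ArrayList<>();
        for (String key : allKeys) {
            if (fragment.contains(key)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> tablets = new ArrayList<>();
        tablets.add(new String[] {"", "m", "host-a"});
        tablets.add(new String[] {"m", "", "host-b"});
        List<String> keys = Arrays.asList("apple", "mango", "zebra");

        for (Fragment fragment : getSplits("t", tablets)) {
            System.out.println(fragment.host + " reads " + read(fragment, keys));
        }
    }
}
```

Because the ranges are half-open and do not overlap, every key is read by exactly one task, which is the property any split rule chosen for the real module would also need.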

Timeline

April 22 - May 22

Get in touch with the Tajo and the Kudu communities.
Analyze other storage modules like tajo-storage-hbase or tajo-storage-hdfs.
Analyze Kudu architecture.
Draft the architecture.

May 22 - June 22

Confirm the architecture.
Start implementing the actual code.
Implement KuduScanner/KuduAppender.
Implement KuduTableSpace (part).

June 23 - July 30

Implement KuduTableSpace (completion).
Implement KuduFragment.
Write unit tests.

July 30 - August 15

Fix minor bugs.
Write documentation.

Community Engagement

Engineers working on Apache Tajo communicate through the mailing list (dev@tajo.apache.org).

I can ask for opinions and get advice there.

Project issues, including bugs and new features, are tracked in JIRA (http://issues.apache.org/jira/browse/TAJO).
The code is managed in a GitHub repository (http://github.com/apache/tajo).

When contributors upload code and send a pull request, the committers review it and decide whether to merge it.
If the code is rejected for some reason, the committers leave a comment.

Further Development

When development of the submodule is finished, I will not stop contributing to the Tajo project. With the experience gained from this program, I will keep developing similar modules, such as a MySQL connector. Of course, I will also keep an eye on the Kudu module in case bugs appear in the future.

Other Commitments

I have no other commitments and can focus on this project every day.

About Me

Lim, Byunghoon

Email : seian.hoon@gmail.com

Computer Engineering

Kyunghee University

South Korea

I am Byunghoon Lim, an undergraduate student majoring in Computer Engineering at Kyunghee University in South Korea. My main interests lie in distributed computation and data analysis with machine learning. I am familiar with C++, Java, and Python. I have completed the following projects, among others: cosine-similarity-based item recommendation using MongoDB and AWS EMR, and a movie recommendation system on the Hadoop ecosystem.

Participating in the Apache Tajo project by implementing this issue will be an honor for me.
