Table of Contents

  1. Background

  2. Deliverables

  3. Implementation

  4. Timeline

  5. About Me


Background

       As data volumes have grown rapidly, storing and analyzing massive data sets has become an important issue for many enterprises. For this reason, distributed systems such as Apache Hadoop, Apache HBase, and Apache Cassandra have been introduced. Hadoop provides high-throughput sequential access, but is weak at record-level updates and efficient random access. On the other hand, systems like HBase and Cassandra are good at low-latency record-level reads and writes, but weak at sequential read throughput. Apache Kudu is a new storage system designed to fill the gap between these two kinds of systems. Kudu aims for both high-throughput sequential access and low-latency random access.

       This document proposes adding Kudu support to Tajo as one of its storage backends.

Deliverables

  1. A new submodule that supports Kudu as one of Tajo’s storage backends

  2. Documentation on establishing a connection between Tajo and Kudu

  3. A user guide: how to integrate Kudu with Tajo

  4. Unit tests and results

Implementation

       The following issues must be considered when implementing the submodule that connects Tajo and Kudu:


  • Implement KuduScanner and KuduAppender

    • Split read

      • Consider how to read only the part of the data specified by the given fragment.

    • Type conversion

      • Data types and their internal representations should be compatible between Tajo and Kudu.

    • Projection push down

      • Tajo needs to be able to read only the necessary columns.

  • Implement KuduTableSpace

    • Split generation

      • Decide on a rule for dividing data for distributed processing.

  • Implement KuduFragment

    • Contains information on which part of the data each task will process.
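
       The components listed above can be illustrated with a minimal, self-contained Java sketch. It is only an illustration under stated assumptions: the classes TabletInfo and Fragment and all method names are hypothetical stand-ins, not the actual Tajo or Kudu APIs. The split rule (one fragment per Kudu tablet) and the type mapping are assumed choices that the confirmed architecture may well change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these classes are real Tajo or Kudu APIs.
class KuduSplitSketch {

    /** Stand-in for one Kudu tablet: a key range plus the hosts serving it. */
    static final class TabletInfo {
        final String startKey;    // inclusive; "" means unbounded
        final String endKey;      // exclusive; "" means unbounded
        final List<String> hosts;

        TabletInfo(String startKey, String endKey, List<String> hosts) {
            this.startKey = startKey;
            this.endKey = endKey;
            this.hosts = hosts;
        }
    }

    /** Stand-in for KuduFragment: the part of a table that one task processes. */
    static final class Fragment {
        final String tableName;
        final String startKey;
        final String endKey;
        final List<String> hosts; // locality hints for task scheduling

        Fragment(String tableName, String startKey, String endKey, List<String> hosts) {
            this.tableName = tableName;
            this.startKey = startKey;
            this.endKey = endKey;
            this.hosts = hosts;
        }
    }

    /** Split generation (assumed rule): one fragment per tablet. */
    static List<Fragment> generateSplits(String tableName, List<TabletInfo> tablets) {
        List<Fragment> fragments = new ArrayList<>();
        for (TabletInfo t : tablets) {
            fragments.add(new Fragment(tableName, t.startKey, t.endKey, t.hosts));
        }
        return fragments;
    }

    /** Split read: true if a row key falls inside the fragment's key range. */
    static boolean contains(Fragment f, String key) {
        boolean afterStart = f.startKey.isEmpty() || key.compareTo(f.startKey) >= 0;
        boolean beforeEnd = f.endKey.isEmpty() || key.compareTo(f.endKey) < 0;
        return afterStart && beforeEnd;
    }

    /** Projection push down: keep only requested columns, in table-schema order. */
    static List<String> projectColumns(List<String> schema, List<String> targets) {
        List<String> projected = new ArrayList<>();
        for (String column : schema) {
            if (targets.contains(column)) {
                projected.add(column);
            }
        }
        return projected;
    }

    /** Type conversion: Kudu type name to Tajo type name (assumed mapping). */
    static String toTajoType(String kuduType) {
        switch (kuduType) {
            case "BOOL":   return "BOOLEAN";
            case "INT8":   return "INT1";
            case "INT16":  return "INT2";
            case "INT32":  return "INT4";
            case "INT64":  return "INT8";
            case "FLOAT":  return "FLOAT4";
            case "DOUBLE": return "FLOAT8";
            case "STRING": return "TEXT";
            case "BINARY": return "BLOB";
            default:
                throw new IllegalArgumentException("Unsupported Kudu type: " + kuduType);
        }
    }

    public static void main(String[] args) {
        List<TabletInfo> tablets = Arrays.asList(
            new TabletInfo("", "m", Arrays.asList("host1")),
            new TabletInfo("m", "", Arrays.asList("host2")));
        for (Fragment f : generateSplits("users", tablets)) {
            System.out.println(f.tableName + " [" + f.startKey + ", " + f.endKey + ") on " + f.hosts);
        }
        System.out.println(projectColumns(Arrays.asList("id", "name", "age"), Arrays.asList("age", "id")));
        System.out.println(toTajoType("INT32"));
    }
}
```

       Aligning fragments with tablets is one plausible rule because each task can then be scheduled close to the data it reads. A finer-grained rule that splits large tablets further would fit the same Fragment shape.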



Timeline

  • April 22 - May 22

    • Get in touch with the Tajo and Kudu communities.

    • Analyze other storage modules like tajo-storage-hbase or tajo-storage-hdfs.

    • Analyze Kudu architecture.

    • Draft the architecture.

  • May 22 - June 22

    • Confirm the architecture.

    • Start implementing the actual code.

    • Implement KuduScanner/KuduAppender

    • Implement KuduTableSpace (part)

  • June 23 - July 30

    • Implement KuduTableSpace (completion)

    • Implement KuduFragment

    • Unit Tests

  • July 30 - August 15

    • Fix minor bugs

    • Write documentation.

About Me

Lim, Byunghoon

Email: seian.hoon@gmail.com

Computer Engineering

Kyunghee University

South Korea


       I am Byunghoon Lim, an undergraduate student majoring in Computer Engineering at Kyunghee University in South Korea. My main interests lie in distributed computation and data analysis with machine learning. I am familiar with C++, Java, and Python. I have worked on projects such as cosine-similarity-based item recommendation using MongoDB and AWS EMR, and a movie recommendation system on the Hadoop ecosystem.

       It would be an honor to participate in the Apache Tajo project by working on this issue.
