Table of Contents
- Background
- Deliverables
- Implementation
- Timeline
- Results for the Apache community
- Further Development
- About me
- Other commitments
- Community Engagement
1. Project Background
Apache Tajo (Future of Data Warehouse) is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed to provide low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources. [http://tajo.apache.org/] Tajo currently embeds HDFS, S3, OpenStack, HBase, and RDBMS storage plugins, so users can connect those data sources to Apache Tajo. [http://tajo.apache.org/docs/current/storage_plugins/overview.html]
MongoDB is an open-source, cross-platform, document-oriented NoSQL database. MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON). Like other NoSQL databases, it supports dynamic schema design, allowing documents in a collection to have different fields and structures.
As mentioned in the first paragraph, Apache Tajo embeds several storage plugins (https://github.com/apache/tajo/tree/master/tajo-storage). This project proposes to add a MongoDB storage plugin to tajo-storage. Implementing the new module tajo-storage-mongodb (the storage plugin for MongoDB) will be the major part of the project.
2. Deliverables
Completed tajo-storage-mongodb module.
Unit tests - test code that checks connectivity to a MongoDB database
Maven Build Configuration for the new module
A “MongoDB Integration” tutorial page for the Apache Tajo docs (example: https://tajo.apache.org/docs/current/hbase_integration.html)
Documentation of the module architecture
3. Implementation
The purpose of the storage service is to connect to the underlying storage system and provide a clear interface to the upper layers of Tajo.
Reference: http://www.slideshare.net/jihoonson/query-optimization-in-apache-tajo?next_slideshow=1
The tajo-storage-mongodb module will contain the following important components:
MongodbTableSpace
MongodbFragment
MongodbAppender
MongodbScanner
Test code to be used in unit testing
It will also contain other supporting components that are specific to MongoDB. Those components will be used to handle database access. The MongoDB Java Driver (https://docs.mongodb.org/ecosystem/drivers/java/) will be used as the database driver in these implementations.
AbstractMongodbQueryExecutor
This class will work as an adapter to the MongoDB driver interface.
MongodbQueryExecutor (the concrete class)
- It will use the MongoDB Java Driver (http://mongodb.github.io/mongo-java-driver/3.0/driver/) to access the database.
MongodbConnectionInfo
Besides implementing the tajo-storage-mongodb module, the configuration modules will need to be updated to support MongoDB. The project task is to implement the modules above.
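To illustrate how the supporting components listed above could fit together, here is a minimal sketch. Everything in it is an assumption made for illustration: the class and method names, the URI form mongodb://host[:port]/database, and the two operations on the executor are hypothetical and are not taken from Tajo or from the MongoDB Java Driver. An in-memory executor stands in for the concrete driver-backed class so that the example runs without any external dependency.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; every name here is hypothetical. */
final class MongodbPluginSketch {

    /** Holds the host, port, and database name parsed from a connection URI. */
    static final class ConnectionInfo {
        static final int DEFAULT_PORT = 27017; // MongoDB's default port
        final String host;
        final int port;
        final String database;

        private ConnectionInfo(String host, int port, String database) {
            this.host = host;
            this.port = port;
            this.database = database;
        }

        /** Parses URIs of the assumed form mongodb://host[:port]/database. */
        static ConnectionInfo parse(String uri) {
            URI parsed = URI.create(uri); // throws IllegalArgumentException if malformed
            String path = parsed.getPath();
            if (!"mongodb".equals(parsed.getScheme()) || parsed.getHost() == null
                    || path == null || path.length() < 2) {
                throw new IllegalArgumentException(
                        "Expected mongodb://host[:port]/database but got: " + uri);
            }
            int port = parsed.getPort() == -1 ? DEFAULT_PORT : parsed.getPort();
            return new ConnectionInfo(parsed.getHost(), port, path.substring(1));
        }
    }

    /** Adapter boundary: the rest of the plugin depends on this class, not on the driver. */
    abstract static class AbstractQueryExecutor {
        protected final ConnectionInfo info;

        AbstractQueryExecutor(ConnectionInfo info) {
            this.info = info;
        }

        abstract void insert(String collection, Map<String, Object> document);

        abstract List<Map<String, Object>> findAll(String collection);
    }

    /** In-memory stand-in; a real subclass would delegate to the MongoDB Java Driver. */
    static final class InMemoryQueryExecutor extends AbstractQueryExecutor {
        private final Map<String, List<Map<String, Object>>> collections = new HashMap<>();

        InMemoryQueryExecutor(ConnectionInfo info) {
            super(info);
        }

        @Override
        void insert(String collection, Map<String, Object> document) {
            // Store a copy so later changes by the caller do not affect stored data.
            collections.computeIfAbsent(collection, k -> new ArrayList<>())
                       .add(new HashMap<>(document));
        }

        @Override
        List<Map<String, Object>> findAll(String collection) {
            List<Map<String, Object>> stored = collections.get(collection);
            if (stored == null) {
                return new ArrayList<>();
            }
            return new ArrayList<>(stored);
        }
    }

    public static void main(String[] args) {
        ConnectionInfo info = ConnectionInfo.parse("mongodb://localhost/tajo_test");
        AbstractQueryExecutor executor = new InMemoryQueryExecutor(info);
        Map<String, Object> doc = new HashMap<>();
        doc.put("name", "tajo");
        executor.insert("users", doc);
        System.out.println(info.host + ":" + info.port + "/" + info.database);
        System.out.println(executor.findAll("users").size() + " document(s)");
    }
}
```

Keeping the driver behind an abstract class like this would let the Scanner and Appender be unit tested against a stand-in, while only the concrete executor needs a running MongoDB instance.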
Here is an abstract diagram that describes the module implementation.
4. Timeline
On the advice of my mentors (Jihoon Son | JaeHwa Jung), I have already set up the development environment in an Ubuntu virtual machine. The IDE used is IntelliJ IDEA. I am going to follow the schedule below during the community bonding period and the coding period.
Community Bonding Period : Maintain regular discussions with the mentors and work through the material and guidance they provide. Go through all the storage drivers again and study their architecture. Discuss the most suitable architecture for the MongoDB storage driver with the mentors.
There will be around 4 weeks from the start of coding (23rd May) until the start of mid-term evaluations (20th June).
Week 1 : Finalize the architecture and complete the class structure. Create dummy classes and methods without writing the actual implementation. Suitable class, attribute, and method names will be decided at this step, which will make the implementation phase easier.
Week 2 : Implement the MongodbConnectionInfo class and check connectivity with MongoDB.
Week 3 : Implement Fragment and TableSpace and test them. Also implement the required functions in the supporting components.
Week 4 : Implement the Scanner and test reading from a MongoDB database.
There will be around 7 weeks from the mid-term evaluations (28th June) to the suggested 'pencil down' date (15th August).
Week 1 : Fix any issues in the current implementation. Test the Scanner.
Week 2 : Implement the Appender.
Week 3 : Test the Appender and the complete tajo-storage-mongodb module. Start writing the documentation.
Week 4 : Complete the “MongoDB Integration” page in the Tajo docs.
Week 5 : Test all the functionality of the driver and write the documentation on the architecture of the module.
Week 6 : Fix bugs and improve the quality of the code.
Week 7 : Kept free as a buffer in case of an emergency.
In addition, I will blog regularly about my work on my personal blog throughout the program.
5. Results for the Apache community
The result for the Apache community will be MongoDB support in Tajo storage. Tajo users will be able to integrate MongoDB stores with Tajo cluster instances. MongoDB is a very popular database system, so adding MongoDB to tajo-storage will be helpful to Tajo and the Apache community.
6. Further Development
Tuning the storage plugin for better performance - After the implementation, the code should be adjusted and tuned for better performance. For example, some projections and filters can be pushed down to MongoDB so that less data is transferred to Tajo.
Implementing other storage plugins - If this project succeeds, I would be glad to continue and implement other storage plugins, such as Apache Kudu, Apache Cassandra, or Google Bigtable.
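As a rough illustration of the push-down idea mentioned under performance tuning, the sketch below turns one numeric comparison and a list of projected columns into MongoDB-style filter and projection documents. The operators `$eq`, `$gt`, and `$lt` are standard MongoDB query operators, but the class, enum, and method names are hypothetical, and how Tajo would hand predicates and column lists to the plugin is an assumption that is not covered here. The documents are built as plain JSON strings to keep the example free of dependencies, and column names are not escaped.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Illustrative sketch only; names are hypothetical and not part of Tajo. */
final class PushDownSketch {

    /** A few comparison operators mapped to their MongoDB query operators. */
    enum Op {
        EQ("$eq"), GT("$gt"), LT("$lt");

        final String mongoOperator;

        Op(String mongoOperator) {
            this.mongoOperator = mongoOperator;
        }
    }

    /** Builds a filter document for "column op value" with a numeric value. */
    static String toFilter(String column, Op op, long value) {
        return "{\"" + column + "\": {\"" + op.mongoOperator + "\": " + value + "}}";
    }

    /** Builds a projection document that includes only the given columns. */
    static String toProjection(List<String> columns) {
        StringJoiner joiner = new StringJoiner(", ", "{", "}");
        for (String column : columns) {
            joiner.add("\"" + column + "\": 1");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Corresponds to: SELECT name, age FROM users WHERE age > 30
        System.out.println(toFilter("age", Op.GT, 30));
        System.out.println(toProjection(Arrays.asList("name", "age")));
    }
}
```

With documents like these passed to the database, MongoDB would return only the matching documents and the requested fields, instead of the Scanner reading every document and Tajo filtering afterwards.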
7. About Me
Personal Details
Name: Janaka Chathuranga
Contact: +94713315725
GitHub: https://github.com/janakact
Twitter: http://twitter.com/janakact
Programming Background
My main interest is C++, because it was the language I learned programming with, but I also have good experience in Java. In addition, I have studied MongoDB on my own. I strongly believe I have the skill set required to complete this project, and I will be glad to research and study any other technologies the project requires.
8. Other commitments
- Semester 5 End Exams - 11th of July to 25th of July.
- I believe it will not be a big issue and I will be able to continue the project during this time.
- Part Time Tutoring - I do part time tutoring, 6 hrs per week.
9. Community Engagement
- The Tajo developers mailing list (dev@tajo.apache.org) is used for questions and discussions related to development.
- It is better to start a thread and discuss before starting any important development.
- JIRA is used to manage the development process (for Tajo it is Agile).
- This proposal is on the issue “Add MongoDB to Tajo Storage” [TAJO-2079].
- It is a great place to discuss examples, new features, sub-tasks, and bugs. https://issues.apache.org/jira/browse/TAJO
- The main branch of Tajo will be updated by making a pull request to the GitHub repository github.com/apache/tajo
- Wiki. https://cwiki.apache.org/confluence/display/TAJO
- The wiki will be a great source of information during the project period.
- I have also posted my proposal on the wiki, on the page “Add MongoDB to Tajo Storage - Proposal”. If you have any questions, feel free to add a comment there.