Table of Contents

  1. Project Background
  2. Deliverables
  3. Implementation
  4. Timeline
  5. Results for the Apache community
  6. Further Development
  7. About Me
  8. Other Commitments
  9. Community Engagement

1. Project Background

Apache Tajo (Future of Data Warehouse) is a robust, relational, distributed big data warehouse system for Apache Hadoop. Tajo is designed to provide low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources (http://tajo.apache.org/). Tajo currently ships storage plugins for HDFS, S3, OpenStack, HBase, and RDBMSs, so users can connect those data sources to Apache Tajo (http://tajo.apache.org/docs/current/storage_plugins/overview.html).

MongoDB is an open-source, cross-platform, document-oriented NoSQL database. MongoDB eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas (MongoDB calls the format BSON). Like other NoSQL databases, it supports dynamic schema design, allowing documents in a collection to have different fields and structures.

As mentioned in the first paragraph, Apache Tajo embeds several storage plugins (https://github.com/apache/tajo/tree/master/tajo-storage). This project proposes to add a MongoDB storage plugin to tajo-storage. Implementing the new module, tajo-storage-mongodb (the storage plugin for MongoDB), will be the major part of the project.

2. Deliverables 

  1. Completed tajo-storage-mongodb module.

  2. Unit Testing - Test Code to check connectivity to a MongoDB database

  3. Maven Build Configuration for the new module

  4. A “MongoDB Integration” tutorial page for the Apache Tajo docs (example: https://tajo.apache.org/docs/current/hbase_integration.html)

  5. Documentation of the module architecture

3. Implementation

The purpose of the storage service is to connect to the underlying storage system and provide a clear interface to the upper layers of Tajo.


Reference: http://www.slideshare.net/jihoonson/query-optimization-in-apache-tajo?next_slideshow=1

The tajo-storage-mongodb module will contain the following important components:

  • MongodbTableSpace 

  • MongodbFragment 

  • MongodbAppender

  • MongodbScanner

  • Test code to be used in unit testing

The module will also contain other supporting components that are specific to MongoDB. These components will be used to handle database access. The Java MongoDB Driver (https://docs.mongodb.org/ecosystem/drivers/java/) will be used as the database driver in these implementations.

  • AbstractMongodbQueryExecutor

    • This class will work as an adapter to the MongoDB driver interface.

  • MongodbQueryExecutor (the concrete class)

  • MongodbConnectionInfo
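
The relationship between these supporting classes could be sketched roughly as follows. This is only an illustration under stated assumptions: the class names MongodbConnectionInfo and AbstractMongodbQueryExecutor come from the list above, but every field and method (fromUri, findAll, insert) is hypothetical, and InMemoryMongodbQueryExecutor is a stand-in invented here so the sketch runs without a MongoDB server. The real MongodbQueryExecutor would delegate the same operations to the Java MongoDB Driver.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical value object holding what is needed to reach one database.
final class MongodbConnectionInfo {
    final String host;
    final int port;
    final String database;

    private MongodbConnectionInfo(String host, int port, String database) {
        this.host = host;
        this.port = port;
        this.database = database;
    }

    // Parses URIs of the form mongodb://host[:port]/database (assumed format).
    static MongodbConnectionInfo fromUri(String uri) {
        URI parsed = URI.create(uri);
        if (!"mongodb".equals(parsed.getScheme())) {
            throw new IllegalArgumentException("Not a mongodb:// URI: " + uri);
        }
        String host = parsed.getHost();
        String path = parsed.getPath();
        if (host == null || path == null || path.length() <= 1) {
            throw new IllegalArgumentException("Host and database are required: " + uri);
        }
        // 27017 is MongoDB's default port.
        int port = parsed.getPort() == -1 ? 27017 : parsed.getPort();
        return new MongodbConnectionInfo(host, port, path.substring(1));
    }
}

// Adapter target: the narrow interface the rest of the plugin would depend on,
// so that driver-specific types do not leak into the scanner or appender.
abstract class AbstractMongodbQueryExecutor {
    protected final MongodbConnectionInfo connectionInfo;

    protected AbstractMongodbQueryExecutor(MongodbConnectionInfo connectionInfo) {
        this.connectionInfo = connectionInfo;
    }

    abstract List<Map<String, Object>> findAll(String collection);

    abstract void insert(String collection, Map<String, Object> document);
}

// In-memory stand-in used only in this sketch (assumption, not part of the proposal).
final class InMemoryMongodbQueryExecutor extends AbstractMongodbQueryExecutor {
    private final Map<String, List<Map<String, Object>>> collections = new LinkedHashMap<>();

    InMemoryMongodbQueryExecutor(MongodbConnectionInfo connectionInfo) {
        super(connectionInfo);
    }

    @Override
    List<Map<String, Object>> findAll(String collection) {
        List<Map<String, Object>> stored = collections.get(collection);
        List<Map<String, Object>> copy = new ArrayList<>();
        if (stored != null) {
            copy.addAll(stored);
        }
        return copy;
    }

    @Override
    void insert(String collection, Map<String, Object> document) {
        List<Map<String, Object>> stored = collections.get(collection);
        if (stored == null) {
            stored = new ArrayList<>();
            collections.put(collection, stored);
        }
        // Copy the document so later changes by the caller do not affect stored data.
        stored.add(new LinkedHashMap<>(document));
    }
}
```

Keeping the abstract class free of driver types would also let unit tests swap in a stand-in like the one above when no MongoDB instance is available.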

Besides implementing the tajo-storage-mongodb module, the configuration modules will need to be updated to support MongoDB. The project task will be to implement the modules above.
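
To illustrate the central job of the scanner component, here is a minimal sketch of turning schemaless documents into fixed-width rows. It is not Tajo's actual Scanner interface: the class name DocumentRowScanner, the use of plain Map objects for documents, and the pull-style next() that returns null at the end are all assumptions made for this example. A field that a document does not contain simply becomes null in the row, which is one possible way to handle MongoDB's dynamic schemas.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: maps documents onto the columns of a table schema.
final class DocumentRowScanner {
    private final List<String> columns;
    private final Iterator<Map<String, Object>> documents;

    DocumentRowScanner(List<String> columns, Iterable<Map<String, Object>> documents) {
        this.columns = new ArrayList<>(columns);
        this.documents = documents.iterator();
    }

    // Returns the next row, or null when all documents have been read.
    Object[] next() {
        if (!documents.hasNext()) {
            return null;
        }
        Map<String, Object> document = documents.next();
        Object[] row = new Object[columns.size()];
        for (int i = 0; i < row.length; i++) {
            // Map.get returns null for absent keys, so missing fields become NULL values.
            row[i] = document.get(columns.get(i));
        }
        return row;
    }
}
```

Fields that exist in a document but not in the table schema are ignored by this sketch; whether the real plugin should do the same is a design question for the architecture discussion.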

Here is a diagram which describes the module implementation. (Please note that it is a highly abstract diagram. For example, the classes may implement interfaces which are not shown in the diagram.)



4. Timeline 

On the advice of my mentors (Jihoon Son and JaeHwa Jung), I have already set up the development environment in an Ubuntu virtual machine, using IntelliJ IDEA as the IDE. I am going to follow the schedule below during the community bonding period and the coding period.

Community Bonding Period : Maintaining regular discussions with the mentors and working on the material and guidance they provide. Going through all the storage drivers again and studying their architecture. Discussing the most suitable architecture for the MongoDB storage driver with the mentors.

There will be around 4 weeks from the start of coding (23rd May) until the start of mid-term evaluations (20th June).

Week 1 : Finalize the architecture and complete the class structure. Create skeleton classes and methods without writing the actual implementation. Suitable class, attribute, and method names will be decided at this step, which will make the implementation work easier.

Week 2 : Implement the MongodbConnectionInfo class and check connectivity with MongoDB.

Week 3 : Implement Fragment and TableSpace and test them. This also requires implementing the necessary functions in the supporting components.

Week 4 : Implement the Scanner and test reading from a MongoDB database.


There will be around 7 weeks from the mid-term evaluations (28th June) to the suggested 'pencils down' date (15th August).

Week 1 : Fix any issues with the current implementation. Test the scanner.

Week 2 : Implement the Appender.

Week 3 : Test the appender and the complete tajo-storage-mongodb module. Start writing the documentation.

Week 4 : Complete the “MongoDB Integration” page in the Tajo docs.

Week 5 : Test all the functionality of the driver and create documentation on the architecture of the module.

Week 6 : Fix bugs and improve the quality of the code.

Week 7 : Kept free as a buffer in case of an emergency.

In addition, I will blog regularly about my work on my personal blog throughout the program.

5. Results for the Apache community

The result for the Apache community will be MongoDB support in Tajo storage. Tajo users will be able to integrate MongoDB storage into Tajo cluster instances. MongoDB is a very popular database system, so adding MongoDB to tajo-storage will be really helpful to Tajo and the Apache community.

6. Further Development

Tuning the storage plugin for better performance - After the implementation, the code should be adjusted and tuned for better performance. For example, some projections and filters can be pushed down to MongoDB so that less data has to be transferred to Tajo.
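
As a rough illustration of push-down, a query such as SELECT name, age FROM users WHERE age > 30 could be served by sending MongoDB a filter document and a projection document instead of scanning the whole collection. The sketch below only builds those two documents as JSON text. The class and method names are hypothetical, it handles a single integer comparison, and it does not show how Tajo would hand predicates to a storage plugin. The operators ($eq, $gt, and so on) and the projection syntax are standard MongoDB query syntax.

```java
import java.util.List;
import java.util.StringJoiner;

// Hypothetical sketch: translates one SQL comparison and a column list
// into MongoDB filter and projection documents (as JSON strings).
final class MongodbPushdownSketch {

    // Maps a SQL comparison operator to the matching MongoDB query operator.
    static String mongoOperator(String sqlOperator) {
        switch (sqlOperator) {
            case "=":  return "$eq";
            case "<>": return "$ne";
            case ">":  return "$gt";
            case ">=": return "$gte";
            case "<":  return "$lt";
            case "<=": return "$lte";
            default:
                throw new IllegalArgumentException("Unsupported operator: " + sqlOperator);
        }
    }

    // Builds a filter such as {"age": {"$gt": 30}}.
    static String filter(String column, String sqlOperator, long value) {
        return "{\"" + column + "\": {\"" + mongoOperator(sqlOperator) + "\": " + value + "}}";
    }

    // Builds a projection that includes only the requested columns.
    // MongoDB returns _id by default, so it is excluded unless requested.
    static String projection(List<String> columns) {
        StringJoiner fields = new StringJoiner(", ", "{", "}");
        for (String column : columns) {
            fields.add("\"" + column + "\": 1");
        }
        if (!columns.contains("_id")) {
            fields.add("\"_id\": 0");
        }
        return fields.toString();
    }
}
```

For the example query, filter("age", ">", 30) yields {"age": {"$gt": 30}} and projection for the columns name and age yields {"name": 1, "age": 1, "_id": 0}.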

Implementing other storage plugins - Once this project succeeds, I would be glad to continue and implement other storage plugins, such as Apache Kudu, Apache Cassandra or Google Bigtable.

7. About Me

 Personal Details

Programming Background

My main interest is C++ because it was the language I used to learn programming, but I also have good experience with Java. In addition, I have studied MongoDB on my own. I strongly believe I have the skill set required to complete this project, and I will be glad to research and study any other technologies the project requires.

8. Other Commitments

  • Semester 5 End Exams - 11th of July to 25th of July.
    • I believe it will not be a big issue and I will be able to continue the project during this time.
  • Part Time Tutoring - I do part time tutoring, 6 hrs per week.

9. Community Engagement

  1. The Tajo developers mailing list (dev@tajo.apache.org) is used for questions and discussions related to development.
    • It is better to start a thread and discuss before starting any important development.
  2. JIRA is used to manage the development process (for Tajo it is Agile).
  3. The main branch of Tajo will be updated by making a pull request to the GitHub repository github.com/apache/tajo
  4. Wiki. https://cwiki.apache.org/confluence/display/TAJO
    • Wiki will be a great source of information during the project time period.
    • I have also created my proposal on the wiki. If there are any questions, feel free to add a comment there: Add MongoDB to Tajo Storage - Proposal
  5. During the coding period I am planning to communicate with the mentors using Google Hangouts.

 

 
