...

Design

Since OODT consists of different components such as the file manager, resource manager and workflow manager, each of those components has its own configuration files and locations. This is complex to manage and causes problems when the platform is distributed across multiple servers or geographic locations. The objective of this project is therefore to migrate OODT configuration to an optional Zookeeper module, so that OODT components can register themselves in the Zookeeper ensemble and maintain each component's state regardless of the scale of the cluster. The proposed Zookeeper module will minimize the manual configuration required when setting up OODT components.

Introduction

Apache Object Oriented Data Technology (OODT) is an open source data management system framework originally developed at NASA's Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives. OODT provides three core components:

  1. File Manager - Responsible for tracking file locations and their metadata, and for transferring files from a staging area to controlled-access storage.

  2. Workflow Manager - Responsible for capturing data flow and control flow for complex processes and allowing for reproducibility and the construction of scientific pipelines.

  3. Resource Manager - Responsible for handling allocation of workflow tasks and other jobs to underlying resources.

Apart from these, OODT includes several other components such as the file crawler (CAS-crawler), the push/pull framework (CAS-push/pull) and CAS-PGE (Catalog and Archive Service Production Generation Executive).

In addition to the details given in the abstract, this module will make use of inherited configuration at the component level. Consider the file manager, for example: almost all configuration of file manager instances is identical. Therefore, new file managers that come up later will inherit the configuration of the initial file managers, which almost eliminates the manual configuration required when adding new nodes to the cluster.

Deliverables

  • Completed distributed configuration management module 

    • Implemented using Apache Zookeeper as the underlying distributed coordination system and Apache Curator as the client to connect to Zookeeper.

  • Unit tests and integration tests (if possible) for the new module

    • Tests to check the correct functionality of distributed configuration management module.

    • Integration tests with simulated OODT component behaviors.

  • Documentation on “How to use the distributed configuration management module”

    • By default, OODT will use the file-based configuration that is currently available. In order to enable the distributed configuration management module, users will have to perform a few configuration steps.

  • Developer documentation

    • Explaining the architecture and the reasons for each design decision with corresponding design diagrams.

Design and Implementation

Module Architecture

Since the file manager is the most critical component requiring distributed configuration management, I will use it as the reference component when describing the design.

OODT-Zookeeper-1.jpg (class diagram of the proposed configuration manager design)

 

As shown in the class diagram above, I will make use of the factory design pattern to obtain the matching configuration manager for a component. There are two types of configuration managers:

  1. Standalone configuration manager - created with the current configuration management logic. This configuration manager reads configuration from local files such as .properties files.

  2. Distributed configuration manager - the new addition, which stores configuration in Zookeeper and fetches it when needed. New components coming up on the fly will be able to fetch configuration from Zookeeper and use it without having to be configured manually every time.

ConfigurationFactory will return the corresponding configuration manager by looking at a system property. That is, if users want to use the distributed configuration manager, they should set a system property (say org.apache.oodt.conf.zookeeper=true) indicating that the component should use distributed configuration management, so that the ConfigurationFactory returns a DistributedConfigurationManager as the configuration manager. Calling the loadConfiguration() method of the returned configuration manager loads the system properties. The ConfigurationFactory class will take the component name and the configuration file names in its getConfigurationManager() method. The component name will be used to identify similar components in the cluster, and the configuration files will be used to fetch the configuration and to store it in Zookeeper.
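
Below is a minimal sketch of how a component might use this factory, based on the class and method names described above; the exact signatures (for example, whether the configuration file names are passed as a list) are assumptions and may differ in the final implementation.

    import java.util.Arrays;
    import java.util.List;

    public class FileManagerBootstrap {

        public static void main(String[] args) throws Exception {
            // Opt in to distributed configuration; without this property the
            // factory would return the standalone (file based) manager instead.
            System.setProperty("org.apache.oodt.conf.zookeeper", "true");

            // The component name identifies similar components in the cluster and
            // the file names tell the manager which configuration to fetch/store.
            List<String> configFiles = Arrays.asList("etc/filemgr.properties");
            ConfigurationManager manager =
                    ConfigurationFactory.getConfigurationManager("file-mgr", configFiles);

            // Loads the published configuration (e.g. as system properties)
            // before the component starts up.
            manager.loadConfiguration();
        }
    }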

Later, this configuration manager class can be used to query the available components in the cluster and their configurations. This will also allow the developer to check which components are currently active and which are not (through ephemeral nodes in zookeeper). The ZNode structure in zookeeper is described in the next section.

Zookeeper (ZNode) Structure 

OODT-Zookeeper-2.jpg (ZNode structure diagram)

All the information related to OODT components will be stored under the ZNode /oodt, and a separate ZNode will be created for each instance in the cluster. In the structure above, /oodt/node1 and /oodt/node2 are such examples, where node1 and node2 are the names of two nodes in the cluster. Inside those nodes, separate ZNodes will be created, as shown, for each component that is running inside that instance. The file manager (file-mgr) and resource manager (resmgr) ZNodes in the above structure are therefore dedicated to the file manager and resource manager components of node1.
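
As a rough illustration of this structure (and of the ephemeral nodes mentioned earlier for tracking active components), the following Curator-based sketch shows how a component could register itself under /oodt/<node>/<component>. This is not the actual module code; the class and method names here are purely illustrative.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.zookeeper.CreateMode;

    public class ComponentRegistrar {

        /**
         * Registers a component under /oodt/<node>/<component> as an ephemeral
         * ZNode, so the entry disappears automatically when the component's
         * Zookeeper session ends (i.e. when the component goes down).
         */
        public static void register(CuratorFramework client, String node, String component)
                throws Exception {
            String path = String.format("/oodt/%s/%s", node, component);
            client.create()
                    .creatingParentsIfNeeded()
                    .withMode(CreateMode.EPHEMERAL)
                    .forPath(path);
        }
    }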

A separate module will be created for these configuration management implementations. As shown in the design, since I'm using a ConfigurationManager interface that acts as the API for configuration managers, this design can be further extended in the future to support other distributed databases and distributed coordination systems such as etcd (https://coreos.com/etcd/). Any module/component that wants to make use of this new configuration management mechanism should have a dependency on this module. As mentioned previously, a system property will determine at runtime whether a component uses the distributed or the standalone implementation of the configuration manager. I will be using Apache Curator as the client to connect to Apache Zookeeper.

What I have done so far

I have started implementing the design proposed above. As the initial step, I defined the ConfigurationManager API. While implementing it, I realized that the two implementations of that interface also share several major properties; therefore, I changed ConfigurationManager to an abstract class. Furthermore, I have written the code to connect to the Zookeeper ensemble through Apache Curator's CuratorFramework class. All this work can be found in my OODT fork, https://github.com/IMS94/oodt. The code implemented so far should also resolve any ambiguity in the design given above.
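
For reference, connecting to a Zookeeper ensemble through Curator typically looks like the following sketch. This is only an illustration of the approach (the connect string and retry policy values are placeholders), not the code in my fork.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class ZookeeperConnection {

        /** Creates and starts a Curator client for the given Zookeeper ensemble. */
        public static CuratorFramework connect(String connectString) {
            CuratorFramework client = CuratorFrameworkFactory.builder()
                    .connectString(connectString)                  // e.g. "host1:2181,host2:2181"
                    .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                    .build();
            client.start();
            return client;
        }
    }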

I have added a separate module for OODT configuration management. Therefore, any module that wants to use this feature should add this module as a dependency. Currently, I have added it as a dependency of the file manager module.

Time line

 

...

28th February - 3rd April

...

  • Getting familiar with the OODT project.

  • Understanding how each component is configured.

  • Coming up with a draft design.

  • Writing proposal and refining the proposal based on feedback.

...

4th May

...

  • Accepted projects are announced

...

5th May - 30th May

...

  • Reviewing the design in detail with my mentor. This includes cross-validating the design against the actual requirements.

  • Improving the design to minimize the manual configuration required and to allow components to register/deregister on the fly.

...

20th May – 26th June

...

  • Week 1 & 2

    • Implementing the core of the ConfigurationManager API will be the major task.

    • As I have already started on that, refining and implementing the logic inside DistributedConfigurationManager will continue.

  • Week 3 

    • Creating proper test cases in parallel with the implementation in order to test the functionality of the DistributedConfigurationManager. Tests will be created using the curator-test package with test Zookeeper clusters (see the testing sketch after this timeline block).

  • Week 4 & 5  

    • Reviewing the architecture and the implemented classes to decide what the responsibilities of each class should be. My mentor's feedback and suggestions will guide which functionality should be exposed by the APIs.
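
As a rough illustration of the testing approach mentioned under week 3 above, curator-test can start an in-process Zookeeper server for tests; the sketch below is only indicative and the test body is a placeholder.

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.RetryOneTime;
    import org.apache.curator.test.TestingServer;
    import org.junit.Test;

    public class DistributedConfigurationManagerTest {

        @Test
        public void testAgainstInProcessZookeeper() throws Exception {
            // curator-test spins up an in-process Zookeeper server for the test.
            try (TestingServer zookeeper = new TestingServer()) {
                CuratorFramework client = CuratorFrameworkFactory
                        .newClient(zookeeper.getConnectString(), new RetryOneTime(500));
                client.start();
                // ... exercise the DistributedConfigurationManager against this server ...
                client.close();
            }
        }
    }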

...

26th June - 30th June

...

  • Preparing for phase 1 evaluations.

  • Finalizing the implemented functionality and tests, and preparing them for submission.

  • Submissions for phase 1 evaluations

...

30th June - 28th July

...

  • Week 1 & 2 

    • Adding more functionality to the concrete implementations of ConfigurationManager.

    • Adding proper state management to maintain consistency at runtime.

    • Improving the API exposed by ConfigurationManager.

  • Week 3 

    • Implementing and improving test cases written in phase 1

    • Reviewing the implementation once again with the mentor to identify required further improvements.

  • Week 4

    • Submissions for phase 2 evaluations.

...

28th July - 27th August

...

  • Week 1 & 2 

    • Based on the improvements identified, making the required improvements to functionality and APIs.

    • Continuously reviewing and cross-validating the design, the implementation and my mentor's objectives so that the implementation converges on the actual requirements.

    • In parallel, I will start on the “How to use” documentation and the developer documentation while preparing the required design diagrams.

  • Week 3 

    • Validating the implementation with improved test cases (mostly integration tests similar to those from earlier weeks).

    • Validating the documentation with my mentor and improving it based on his feedback.

  • Final Week

    • Mostly kept free to be used in case of emergency.

    • Refining and refactoring the code (if required) and documentation for the final submission.

    • Making the final submission

...

29th August

...

  • Project timeline ends with submission of deliverables.

About Me

I, Imesha Sudasingha (S.A.I.M. Sudasingha), am a final-year undergraduate at the University of Moratuwa, Sri Lanka. Learning new technologies, reading about the latest technologies, applying the knowledge and concepts I have learned to real-world applications, and following best practices are my most outstanding characteristics. Apart from that, I like working with different people so that I can gain more knowledge through their advice and experience. When it comes to developing a piece of software or a module, designing the architecture is the phase I like most. I personally believe that the design is the most critical thing when developing software. The same attitude led me to choose this project, since it requires a lot of design and architectural decisions. Apart from that, my previous experience with Apache Zookeeper and distributed systems helped me to understand this project.

As I have made several open source contributions, explained later in this proposal, I wanted to make a larger contribution. Therefore, I decided to contribute to the Apache Software Foundation, since I have been using many Apache open source products/libraries (e.g. Apache2 Server, Zookeeper, HttpComponents, Maven, Tomcat and Curator). This project caught my attention at first glance since it relates to Zookeeper and Java, which are among my most familiar technologies.

Experience in Zookeeper/Curator and Distributed Systems 

I was an intern at AdroitLogic Lanka (PVT) Ltd (www.adroitlogic.com), where I wrote the entire cluster management module of their new product stack, project-x/ultraESB-x (https://www.adroitlogic.com/products/ultraesb/). That module included a distributed command framework written using Zookeeper's watcher mechanism, as well as a failover support implementation. The documentation written for that module is available here. The API of the module I wrote can be found here. The module I have written will replace the current failover support system of the 2nd most critical system in the Singapore Stock Exchange. Furthermore, I have been writing about Apache Zookeeper and Apache Curator (Apache Curator in 5 minutes, Network Partitioning in Zookeeper).

Open source contributions 

I have contributed to Apache Curator twice (pull requests https://github.com/apache/curator/pull/175 and https://github.com/apache/curator/pull/177), one of which was an improvement to their curator-test module allowing test servers to bind to network interfaces other than just localhost.

Other contributions

stackoverflow.com 

I'm an active user of Stack Overflow, where I have gained 2159 reputation (as of 23rd March 2017) within two years, most of it through answering questions. Having java and apache-zookeeper among the most popular tags in my Stack Overflow profile shows that I have considerable knowledge of Java and Apache Zookeeper.

 Apache Zookeeper and Curator Mailing lists

I have been active on both the Apache Curator and Zookeeper user and dev mailing lists for some time. I was mostly asking questions about the design of Zookeeper while implementing the cluster management module at AdroitLogic (PVT) Ltd, as mentioned above. Several emails I sent in those threads can be found in the Apache Curator mail archives for November.

Other commitments during GSoC period

Usually I have lectures on 3 days per week, which leaves 4 full days for my own work. That was a main reason for applying to GSoC this summer, as I wanted to do something useful within this period. I will also have a month-long vacation in July. For all these reasons, I can spend around 40-50 hours per week on my GSoC project from 30th May to 27th August, when the GSoC coding period officially takes place.

Contact Information 

LinkedIn - www.linkedin.com/in/imeshasudasingha

Github - https://github.com/IMS94

Stackoverflow - http://stackoverflow.com/users/4012073/imesha-sudasingha

Twitter - https://twitter.com/Imesha94

Medium.com - https://medium.com/@Imesha94

Why Me?

As I have described throughout the proposal, I have a lot of experience with Zookeeper and Apache Curator. In my experience, the most critical thing when working on distributed systems is handling the edge cases. That is, we should worry not about the “happy day” scenario, but about network inconsistencies and session handling.

Furthermore, I have a good idea of what needs to be done in this project, and I think that is reflected in the proposed architecture and design. Please note that what I have presented is only a draft design. The actual implementation will be more detailed and consistent, as my mentor and I will review each step to refine the outcome as much as possible.

I have been an active open source contributor and an active participant on several Apache mailing lists. Therefore, I have a good understanding of how the Apache ecosystem and the open source culture work. For all these reasons, I am confident that I can complete this project in the best possible manner, adding more value to the OODT project in the future.

References 

...

Apache Zookeeper - https://zookeeper.apache.org

...

Apache Curator  - http://curator.apache.org

...

CoreOS etcd   - https://coreos.com/etcd/

Most of the configuration parameters and files are common to all instances of the same OODT component. Therefore, the following ZNode structure is adopted, where the configuration related to each individual component type is stored in a separate ZNode subtree as shown below.

(ZNode structure diagram for per-component configuration)

The Idea

On a high level, there can be multiple projects running different sets of file managers, workflow managers and so on. To store configuration for different projects separately in Zookeeper, the concept of projects is introduced and the root ZNode is divided into subtrees per project. Under each project, configuration is stored for the different OODT components (file manager, resource manager, ...). Basically, there are two types of configuration files that need to be stored: properties files and other configuration files (such as XML files). Of those, the properties within the properties files are only loaded once (normally at initialization); therefore, they are treated as a special case in contrast to other configuration files. To achieve this, two subtrees are created, as shown above, for properties files and other configuration files for each component. If you take the file manager (the file-mgr ZNode), it has two child nodes, as mentioned, to store properties files and other configuration files.

What are the other child nodes (etc, policy) in the ZNode structure?

This is where the design comes into play. Suppose we have a mime-types.xml file for our file manager. If you have configured the file manager manually, you may have seen that there are many directories within an OODT distribution (file-mgr, res-mgr, workflow, etc.). Within these directories (say, within the file-mgr directory) there is another set of directories such as bin, lib and policy. As you would expect, bin and lib include the executables and libraries, etc includes the major configuration files (e.g. filemgr.properties), and policy includes several other files required for configuration purposes (e.g. cmd-line-actions.xml). Therefore, when we do distributed configuration management, we have to make sure that every instance downloading configuration from Zookeeper gets all of these properties and configuration files and stores them in the correct directories, so that the corresponding components can pick them up at runtime.

To make sure that all the configuration files will be available within a predefined directory at runtime, we store each configuration file in a ZNode whose path is the same as the path where that file should live at runtime. If we take the mime-types.xml file, it should be available (locally) within the ${FILEMGR_HOME}/etc/ directory. Therefore, to identify where the corresponding configuration file should be stored locally relative to ${FILEMGR_HOME}, we take the ZNode path relative to the file-mgr/configuration/ ZNode as the storage location. So, when downloaded, the content of the ZNode oodt/components/file-mgr/configuration/etc/mime-types.xml will be stored in the ${FILEMGR_HOME}/etc/mime-types.xml file. That is the basic idea of how configuration will be published, and how it will be downloaded and stored.
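
The download/store step could look roughly like the sketch below, which maps a ZNode path under the component's configuration subtree onto a local path under ${FILEMGR_HOME}. The class name, the exact root path constant and the use of an environment variable are illustrative assumptions, not the module's actual API.

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.curator.framework.CuratorFramework;

    public class ConfigurationDownloadSketch {

        private static final String CONFIG_ROOT = "/oodt/components/file-mgr/configuration";

        /**
         * Downloads one configuration file from Zookeeper and stores it at the
         * matching location under ${FILEMGR_HOME}; for example, the ZNode
         * .../configuration/etc/mime-types.xml becomes ${FILEMGR_HOME}/etc/mime-types.xml.
         */
        public static void download(CuratorFramework client, String znodePath) throws Exception {
            byte[] content = client.getData().forPath(znodePath);
            String relativePath = znodePath.substring(CONFIG_ROOT.length() + 1);
            Path target = Paths.get(System.getenv("FILEMGR_HOME"), relativePath);
            Files.createDirectories(target.getParent());
            Files.write(target, content);
        }
    }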

DistributedConfigurationPublisher is responsible for publishing configuration to Zookeeper initially. Once configuration has been published, any OODT component running on any cluster node can fetch it through the DistributedConfigurationManager class. A CLI tool is available to publish/verify/clear configuration in Zookeeper. To learn more about configuration publishing, please read the documentation on Distributed Configuration Management.
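
Conceptually, the publish step is the reverse of the download above: a local configuration file is written into the ZNode path where the configuration managers expect to find it. The sketch below illustrates that idea with plain Curator calls; it is not the actual DistributedConfigurationPublisher API.

    import java.nio.file.Files;
    import java.nio.file.Path;

    import org.apache.curator.framework.CuratorFramework;

    public class ConfigurationPublishSketch {

        /**
         * Publishes one local configuration file to the given ZNode path,
         * creating parent ZNodes as required, or overwriting the existing data.
         */
        public static void publish(CuratorFramework client, Path localFile, String znodePath)
                throws Exception {
            byte[] content = Files.readAllBytes(localFile);
            if (client.checkExists().forPath(znodePath) == null) {
                client.create().creatingParentsIfNeeded().forPath(znodePath, content);
            } else {
                client.setData().forPath(znodePath, content);
            }
        }
    }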

Future Developments

Extending distributed configuration management to a distributed command framework

At the moment, even with distributed configuration enabled:
  1. We have to log in to a remote server.
  2. Install/unpack the corresponding OODT component.
  3. Start it (with no manual configuration, since configuration is downloaded on the fly); the ZK_CONNECT_STRING environment variable needs to be set prior to that.
  4. If we need to restart a component, we have to log in to that server as well.

If we can extend our Zookeeper-based configuration management to a command framework, we can simply restart/refresh an entire component or its configuration as required with a simple terminal command from a local machine.

Introducing distributed configuration management to crawler and pcs packages

At the moment, distributed configuration management only supports the 3 main components of OODT: the file manager, resource manager and workflow manager. It would be great if this feature were introduced to the above-mentioned packages as well.

Allow file manager clients to query multiple file managers as one

Currently, file storage and data archiving require something like an NFS mount. Once file managers are configured, they are not aware of the other file managers operating in the cluster. If we can make the file managers aware of each other, we can then extend that so that clients are able to query a set of file managers as if they were one.

...