...

Consider an end user who is searching for additional sources for their project, where the data they need has not been provisioned into HDFS and is still on the source systems. However, these data sources are already catalogued in another metadata repository. To be valuable, Apache Atlas's catalog search needs to reach data and metadata repositories beyond Hadoop in order to locate all available data. Once the end user has identified interesting sources, they may then request that the data be provisioned into HDFS for further analysis. The VDC project will introduce the frameworks, integration and adapter capability to allow a more enterprise-wide view of the potential data sources, plus a metadata-driven connector framework for connecting to both data and metadata repositories. These frameworks are part of the open metadata and governance story.

...

Walk-through of the VDC use case

In the initial MVP for VDC, we are focusing on metadata replication between open metadata repositories to support the catalog query request. Later in 2017, we will add federated queries across metadata repositories to broaden the catalog search and potentially reduce the replication of metadata between the repositories.

Figure 1: Catalog self service UI

Figure 1 shows a mock-up of the catalog search UI that the VDC supports. A person can enter search queries, and a list of potential data sources is displayed on the left-hand side of the screen. Selecting one of the search results displays more details of the metadata for that entry in the top right-hand side of the screen. Underneath it is a preview of the data, shown only if the end user has permission to access the data.

At the start of the use case, details of the data repositories, the mappings to the business glossary terms and the security classifications are managed in IBM's Information Governance Catalog. This is shown in Figure 2.

Figure 2: IBM's Information Governance Catalog (IGC) holding data lake metadata

The first step is to replicate the metadata from IGC to Apache Atlas so it can be extended to support the virtual views.

This is shown in Figure 3.

Figure 3: Replicating metadata from IGC to Atlas

Since IGC remains the master copy of the original metadata, the replication must be ongoing so that Atlas remains up to date with the latest metadata from IGC.

Thus the replication capability listens for IGC events and converts them into OMRS events that can then be used to drive updates through the OMRS connector API to the Apache Atlas repository.
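The listen-and-convert step described above can be sketched as follows. This is a minimal illustration only: `IgcEvent`, `OmrsEvent`, the action and event type names, the `igc:` identifier prefix and `InMemoryRepository` are all hypothetical stand-ins, not the real IGC event format, OMRS event format or OMRS connector API.

```python
from dataclasses import dataclass, field


@dataclass
class IgcEvent:
    """Hypothetical shape of a change event from the source catalog."""
    asset_id: str
    action: str  # assumed values: "CREATE", "MODIFY", "DELETE"
    properties: dict = field(default_factory=dict)


@dataclass
class OmrsEvent:
    """Hypothetical shape of the repository-neutral event."""
    event_type: str
    entity_guid: str
    properties: dict = field(default_factory=dict)


# Assumed mapping from source actions to neutral event types.
ACTION_TO_EVENT_TYPE = {
    "CREATE": "NEW_ENTITY",
    "MODIFY": "UPDATED_ENTITY",
    "DELETE": "DELETED_ENTITY",
}


def to_omrs_event(igc_event):
    """Convert one source event; return None for actions we do not map."""
    event_type = ACTION_TO_EVENT_TYPE.get(igc_event.action)
    if event_type is None:
        return None
    return OmrsEvent(event_type, f"igc:{igc_event.asset_id}",
                     dict(igc_event.properties))


class InMemoryRepository:
    """Stand-in for the target repository reached through a connector."""

    def __init__(self):
        self.entities = {}

    def apply(self, event):
        if event.event_type == "DELETED_ENTITY":
            self.entities.pop(event.entity_guid, None)
        else:
            # Create the entity if needed, then merge in the new properties.
            self.entities.setdefault(event.entity_guid, {}).update(
                event.properties)


def replicate(igc_events, repository):
    """Convert each source event and apply it; return the number applied."""
    applied = 0
    for igc_event in igc_events:
        omrs_event = to_omrs_event(igc_event)
        if omrs_event is not None:
            repository.apply(omrs_event)
            applied += 1
    return applied
```

Keeping the conversion (`to_omrs_event`) separate from the update (`repository.apply`) mirrors the split in the text between producing neutral events and driving updates through a connector.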

Figure 4: Building information views with the virtualizer

The virtualizer is an optional component of Atlas that receives notifications from Apache Atlas through the Information View OMAS event topic and builds logical tables in Gaian as well as information view metadata in Atlas.

Gaian is an open source information virtualization technology. The virtualizer is written to be modular so calls to a different virtualization technology can be made at this point with a small change to the virtualizer.
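One way to picture this modularity is a virtualizer that depends only on a small backend interface. The sketch below is an assumption about how such a design could look, not the actual virtualizer code: `VirtualizationBackend`, `RecordingBackend`, `Virtualizer` and the notification fields `table` and `columns` are hypothetical names.

```python
from abc import ABC, abstractmethod


class VirtualizationBackend(ABC):
    """Hypothetical interface the virtualizer calls to build logical tables."""

    @abstractmethod
    def create_logical_table(self, name, columns):
        """Create a logical table and return its name."""


class RecordingBackend(VirtualizationBackend):
    """Stand-in backend that only records what it was asked to create.

    A real backend would issue calls to Gaian or another virtualization
    technology here.
    """

    def __init__(self):
        self.tables = {}

    def create_logical_table(self, name, columns):
        self.tables[name] = list(columns)
        return name


class Virtualizer:
    """Builds a logical table and a matching information view record."""

    def __init__(self, backend):
        self.backend = backend
        self.information_views = []

    def on_notification(self, notification):
        # Assumed notification fields: "table" and "columns".
        table = self.backend.create_logical_table(
            notification["table"], notification["columns"])
        view = {"view": f"view_{table}", "source_table": table}
        self.information_views.append(view)
        return view
```

Under this assumed design, switching virtualization technology means supplying a different `VirtualizationBackend` implementation while the notification handling stays unchanged.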

The aim of the MVP is to prove out the use of Apache Atlas as a manager for an information virtualization technology.

Using a similar technique, the synchronization processes for Apache Ranger learn from the Governance Action OMAS that the Information Views have been created or changed in Apache Atlas. They push the metadata needed to control access to the Ranger server, which then configures the Ranger plugins in Gaian.

Figure 5: Configuring enforcement points in Gaian using Apache Ranger

The Ranger plugins in Gaian cache all of the metadata they need to make access decisions based on the user information passed on a request.
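The idea of deciding from a local cache can be illustrated with a small sketch. It assumes a simple group-based policy model and deny-by-default behaviour; `CachedPolicyEnforcer`, `refresh` and `is_allowed` are hypothetical names and do not represent the Apache Ranger plugin API or its policy format.

```python
class CachedPolicyEnforcer:
    """Hypothetical enforcement point that decides from locally cached policies."""

    def __init__(self):
        # Maps a resource name to the set of groups allowed to read it.
        self._policies = {}

    def refresh(self, policies):
        """Replace the cache with the latest policies pushed from the server."""
        self._policies = {
            resource: set(groups) for resource, groups in policies.items()
        }

    def is_allowed(self, user_groups, resource):
        """Allow access only if the user shares a group with the resource policy.

        Resources with no cached policy are denied (an assumed default).
        """
        allowed_groups = self._policies.get(resource, set())
        return bool(allowed_groups & set(user_groups))
```

Because the decision uses only the cached policies and the user information on the request, no call back to the policy server is needed at request time in this sketch.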

The system is now configured. Changes to the IGC metadata will ripple through Atlas, Virtualizer, Ranger and Gaian so they are consistent and up-to-date.

Figure 6: Requesting catalog information from Atlas

When the end user makes a search request, or clicks on a search result to see more detail, the request and response pass through the Catalog OMAS to Apache Atlas. See Figure 6.

When the data preview is requested, Gaian is called to extract the data. The Ranger plugins validate the access request, allowing Gaian to retrieve the data from the data lake. See Figure 7.

Figure 8 summarizes the whole end-to-end flow.

Figure 7: Requesting data from Gaian

Figure 8: VDC end-to-end flow (MVP1)

...
