The virtual data connector (VDC) project is managed through a Jira issue.

It supports two basic use cases:

  • An end user wishes to find some interesting data by looking in the Apache Atlas Metadata Catalog.
  • When they have found the data source they want, they wish to preview its data values to verify that it really is the data they need.

These use cases seem simple, but they raise three very important questions that create an explosion of requirements in Apache Atlas.  The first question is:

What metadata is required to describe the data sources in such a way that the end user can accurately locate the data sources they need - assuming they are not familiar with the content and organization of the data sources?

Typically the end user would want to use meaningful business terms to describe the data they need. They may also want to see related descriptions of the data, the profile of its data values and its lineage.  Other information, such as the owners/stewards of the data, the organization they come from and any license associated with the data, would also be relevant.  To provide this information, the VDC project needs to expand the types defined in Apache Atlas; extend the glossary so that it supports categories and other types of semantic relationships that help the end user locate the right data; and provide a new catalog API and interface for discovering data based on these values.

The second question is:

What is the security model that determines which metadata and data each end user can see?

Specifically, how should access be controlled, particularly in a self-service, data exploration environment where data sourced from many different systems and organizations needs to be accessed in order to discover new uses and interesting patterns in the data?   In the VDC project we are providing a single endpoint for accessing data (this is the virtual data connector itself) that uses an Apache Ranger plugin to control access.  This control expands on the tag-based security access introduced in Apache Atlas release 0.7 in order to provide security access based on both the confidentiality classification tags (e.g. PII and SPI tags) and the subject area of the data.  An additional plug-in is added to Apache Atlas to control access to metadata based on whether an end user is allowed to discover a data source's metadata.
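The combination of confidentiality tags and subject area described above can be illustrated with a minimal sketch. It is not the Apache Ranger or Apache Atlas API: the `DataSource`, `User` and `can_access` names, and the rule that a user must be cleared for every tag on a source, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    # Hypothetical description of a catalogued data source.
    name: str
    subject_area: str
    tags: frozenset = frozenset()  # confidentiality tags, e.g. {"PII"}


@dataclass(frozen=True)
class User:
    # Hypothetical description of what an end user is entitled to.
    name: str
    subject_areas: frozenset
    cleared_tags: frozenset


def can_access(user, source):
    """Allow access only if both checks pass (assumed policy):
    the source is in one of the user's subject areas, and the user
    is cleared for every confidentiality tag on the source."""
    in_subject_area = source.subject_area in user.subject_areas
    cleared_for_tags = source.tags <= user.cleared_tags
    return in_subject_area and cleared_for_tags


analyst = User("analyst", frozenset({"finance"}), frozenset({"PII"}))
ledger = DataSource("ledger", "finance", frozenset({"PII"}))
patients = DataSource("patients", "health", frozenset({"PII", "SPI"}))

print(can_access(analyst, ledger))    # True
print(can_access(analyst, patients))  # False
```

In this sketch a source with no tags is visible to anyone in the right subject area; a real policy could equally default to denying access.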

Finally, the third question is:

Where is the metadata and the data actually stored?

Consider the case where the end user is searching for additional sources for their project and the data that they need has not been provisioned into HDFS - it is still on the source systems.   However, these data sources are already catalogued in another metadata repository.  To be valuable, Apache Atlas's Catalog search needs to be able to cast its search to reach data and metadata repositories beyond Hadoop in order to locate all available data.   Once the end user has identified interesting sources, they may then request that the data be provisioned into HDFS for further analysis.  The VDC project will introduce the frameworks, integration and adapter capability to allow a more enterprise-wide view of the potential data sources, plus a metadata-driven connector framework for connecting to both data and metadata repositories.
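The idea of casting one search across several repositories can be sketched as follows. This is only an illustration under stated assumptions: `make_repository`, `federated_search` and the sample asset descriptions are hypothetical, and each repository is modelled as a simple in-memory search function, not as a real Apache Atlas or external catalog connector.

```python
def make_repository(assets):
    """Return a search function over an in-memory list of asset descriptions
    (a stand-in for a real metadata repository adapter)."""
    def search(term):
        needle = term.lower()
        return [asset for asset in assets if needle in asset.lower()]
    return search


def federated_search(term, repositories):
    """Send the same search to every repository and label each hit
    with the name of the repository it came from."""
    hits = []
    for repo_name, search in repositories.items():
        for asset in search(term):
            hits.append((repo_name, asset))
    return hits


repositories = {
    "atlas": make_repository(["customer orders (HDFS)", "web logs (HDFS)"]),
    "external_catalog": make_repository(["customer master (source system)"]),
}

for repo_name, asset in federated_search("customer", repositories):
    print(repo_name, "->", asset)
```

Keeping the origin of each hit lets the end user see which sources are already in HDFS and which would still need to be provisioned.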

 


 

 

 

 
