Area 6 - Discovery


Area 6 provides structures that allow automated metadata discovery servers to add annotations to assets in the metadata repository. Metadata discovery servers run different types of analysis. An analysis may run just once (for example, when the asset is created), on demand, or in response to an event or schedule. A particular type of analysis is implemented in a discovery service. A discovery service contains one or more discovery steps. Each step performs some analysis that may result in an annotation for one or more assets. The annotations from a particular run of a discovery service are grouped together into a discovery analysis report. The annotations may be reviewed and approved by a steward. The steward may convert an annotation to a hardened metadata type, or flag the annotation as invalid. When the discovery service is rerun, the new annotations can be matched to the annotations from the previous run, and the steward's earlier actions affect how the new annotations are processed.

Apache Atlas has an Open Discovery Framework (ODF) that supports the development and execution of discovery services. The ODF runs as a metadata discovery server. ODF discovery services use connectors from the Open Connector Framework (OCF) to connect to the data assets and access the known metadata about them.

Figure 1 shows the different packages for discovery.

Figure 1: Area 6 packages for metadata discovery

Metadata discovery server

The metadata discovery server uses the metadata repository to manage its configuration and the configuration of the discovery services that are deployed to it.

Figure 2: Metadata structures to store ODF configuration

Each metadata discovery server is represented in the metadata repository as a Server instance with a classification of "MetadataDiscoveryServer".

Each type of discovery service is represented as an instance of a ServiceCapability specialization called DiscoveryService.

These entities hold the configuration used by a metadata discovery server.

Discovery analysis reports

Each time a discovery service runs, it creates a collection of annotations. These annotations are managed in a cache by the ODF so that later steps in the discovery service can access the annotations from earlier steps. When the service completes, the annotations are published to the metadata repository as a discovery analysis report. Note that the report is linked to both the server and the discovery service, because a discovery service may be deployed to multiple metadata discovery servers.

Figure 3: Results of a discovery service

The DiscoveryAnalysisReport is the report header. It identifies the date of the report and the parameters used. It may also include a name and description that are supplied by the initiator of the discovery service run.

The DiscoveryServerReport links the report to the metadata discovery server that ran the discovery service and the DiscoveryServiceReport links it to the discovery service.
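The run-and-publish flow described in this section can be sketched in Python as follows. This is a minimal illustration rather than ODF code: DiscoveryAnalysisReport mirrors the type named above, but run_discovery_service, the step signature, the field names, and the server and service names are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple


@dataclass
class Annotation:
    annotation_type: str
    summary: str
    analysis_step: str = ""  # filled in by the runner below


@dataclass
class DiscoveryAnalysisReport:
    # Report header: links to server and service, date, and parameters.
    server_name: str
    service_name: str
    parameters: Dict[str, str]
    created: datetime
    annotations: List[Annotation] = field(default_factory=list)


# Hypothetical step signature: a step sees the annotations produced so far
# and returns the new annotations it has created.
Step = Callable[[List[Annotation]], List[Annotation]]


def run_discovery_service(
    server_name: str,
    service_name: str,
    steps: List[Tuple[str, Step]],
    parameters: Dict[str, str],
) -> DiscoveryAnalysisReport:
    cache: List[Annotation] = []  # in-memory cache shared by the steps
    for step_name, step in steps:
        for annotation in step(list(cache)):
            annotation.analysis_step = step_name
            cache.append(annotation)
    # On completion, everything from this run is published as one report.
    return DiscoveryAnalysisReport(
        server_name=server_name,
        service_name=service_name,
        parameters=dict(parameters),
        created=datetime.now(timezone.utc),
        annotations=cache,
    )


def extract_schema(previous: List[Annotation]) -> List[Annotation]:
    return [Annotation("SchemaAnalysis", "found 3 columns")]


def profile_columns(previous: List[Annotation]) -> List[Annotation]:
    # A later step can read what earlier steps placed in the cache.
    return [Annotation("DataProfile", f"saw {len(previous)} earlier annotation(s)")]


report = run_discovery_service(
    "discovery-server-1",
    "table-profiler",
    [("extract_schema", extract_schema), ("profile_columns", profile_columns)],
    {"sampleSize": "100"},
)
```

Because the report carries both the server name and the service name, the same service deployed to a second server would produce reports that remain distinguishable.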

Annotations

Annotations capture the discovered characteristics of an asset. They are created by the analysis steps in the discovery service. The attributes of the annotation capture the details of the discovery processing. The subclasses of Annotation capture specific details of the discovered metadata. Each annotation is linked to the discovery analysis report it was generated from. It also links to each asset that the annotation relates to. Strings are used in many of the attributes to keep the model open for discovery service developers and the tools that process the annotations.

Figure 4: Base structure for an annotation and its links to an asset and the discovery analysis report

annotationType - descriptive string that acts as an identifier for the specific annotation type. This is a simple means to sub-type any one of the annotation subclasses.
summary - a human readable string to describe the annotation.
confidence - an indicator of the certainty that the annotation is correct.
expression - this attribute is used to provide more detail on how the asset is related to the annotation.
explanation - another description field to assist human analysts reviewing the discovery results.
analysisStep - identifier of the step in the discovery service that detected the annotation.
jsonProperties - the properties that were used to initiate the discovery service.
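As an illustration, the attributes listed above could be carried in a simple Python record such as the one below. The attribute names come from the list; the Python types, the 0 to 100 range for confidence, the validation, and the report_id and asset_ids link fields are assumptions made for this sketch and are not taken from the type definitions.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    annotation_type: str   # annotationType: string identifier for the sub-type
    summary: str           # human-readable description
    confidence: int        # ASSUMPTION: expressed here as a 0-100 percentage
    expression: str        # how the asset relates to the annotation
    explanation: str       # extra description for human reviewers
    analysis_step: str     # analysisStep: step that produced the annotation
    json_properties: str   # jsonProperties: properties as JSON text
    report_id: str         # hypothetical link to the discovery analysis report
    asset_ids: List[str] = field(default_factory=list)  # hypothetical asset links

    def __post_init__(self) -> None:
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")
        # json.loads raises a ValueError subclass if the text is not valid JSON.
        json.loads(self.json_properties)


example = Annotation(
    annotation_type="PostalCodeColumn",
    summary="Column looks like a postal code",
    confidence=85,
    expression="matches pattern in 85% of sampled rows",
    explanation="Pattern match on a sample of values",
    analysis_step="classify_columns",
    json_properties=json.dumps({"sampleSize": 100}),
    report_id="report-1",
    asset_ids=["asset-1"],
)
```

Keeping annotation_type as a free string, as the model does, lets a discovery service introduce new kinds of annotation without a schema change.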

Annotation Reviews

The annotations associated with an asset can be seen by people and tools querying the associated asset, servers or discovery service. Often, however, the analysis within a discovery service can only make recommendations based on the information within the asset. Where annotations refer to information that is used for governance, they need to be approved and converted into classifications or related metadata. The annotation review records how each discovered annotation has been actioned in the metadata server and which steward approved it.

Figure 5: Recording annotation reviews
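One way to picture how a review influences a later run is the sketch below. It is an assumed policy, not behavior defined by the model: the matching key (annotation type plus asset), the two review statuses, and the three outcome buckets are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class ReviewStatus(Enum):
    APPROVED = "approved"  # steward converted the annotation to hardened metadata
    INVALID = "invalid"    # steward flagged the annotation as wrong


@dataclass(frozen=True)
class Annotation:
    annotation_type: str
    asset_id: str
    summary: str


@dataclass(frozen=True)
class AnnotationReview:
    annotation: Annotation
    steward: str
    status: ReviewStatus
    comment: str = ""


def triage_rerun(
    new_annotations: List[Annotation],
    previous_reviews: List[AnnotationReview],
) -> Dict[str, List[Annotation]]:
    # Index earlier decisions by the (hypothetical) matching key.
    reviewed = {
        (r.annotation.annotation_type, r.annotation.asset_id): r.status
        for r in previous_reviews
    }
    result: Dict[str, List[Annotation]] = {
        "pending": [],
        "already_approved": [],
        "suppressed": [],
    }
    for annotation in new_annotations:
        status = reviewed.get((annotation.annotation_type, annotation.asset_id))
        if status is ReviewStatus.APPROVED:
            result["already_approved"].append(annotation)
        elif status is ReviewStatus.INVALID:
            result["suppressed"].append(annotation)
        else:
            result["pending"].append(annotation)  # needs a steward's attention
    return result


old_ok = Annotation("EmailColumn", "asset-1", "looks like email addresses")
old_bad = Annotation("PhoneColumn", "asset-2", "looks like phone numbers")
reviews = [
    AnnotationReview(old_ok, "steward-a", ReviewStatus.APPROVED),
    AnnotationReview(old_bad, "steward-a", ReviewStatus.INVALID, "these are order ids"),
]
rerun = [
    Annotation("EmailColumn", "asset-1", "looks like email addresses"),
    Annotation("PhoneColumn", "asset-2", "looks like phone numbers"),
    Annotation("DateColumn", "asset-3", "looks like ISO dates"),
]
triaged = triage_rerun(rerun, reviews)
```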

The types that follow provide more specialized annotations.

Schema Extraction

One of the simplest discovery processes for relational data is to extract the schema details from the asset through the JDBC connector getMetadata() API. Other connectors or data sources may also provide APIs for schema extraction. The schema is first added as an annotation. It is then either matched with an existing schema, or a new schema is created (see area 5). This matching may be completely automated, or it may be done with a steward's assistance.

Figure 6: Capturing the schema of an asset
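The idea of schema extraction can be illustrated without a JDBC driver. The sketch below swaps in Python's built-in sqlite3 module and SQLite's PRAGMA table_info in place of the JDBC connector, so it shows the technique rather than the ODF implementation. The extract_schema function and the shape of the returned dictionary are assumptions made for the example.

```python
import sqlite3
from typing import Dict, List


def extract_schema(connection: sqlite3.Connection) -> Dict[str, List[dict]]:
    """Return {table name: [column descriptions]} for every user table."""
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    schema: Dict[str, List[dict]] = {}
    for table in tables:
        # PRAGMA arguments cannot be bound as parameters, so quote the name.
        quoted = '"' + table.replace('"', '""') + '"'
        columns = connection.execute(f"PRAGMA table_info({quoted})").fetchall()
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        schema[table] = [
            {"name": c[1], "type": c[2], "primary_key": bool(c[5])}
            for c in columns
        ]
    return schema


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, email TEXT)")
schema = extract_schema(connection)
connection.close()
```

The resulting dictionary plays the role of the schema annotation: it can be compared with schemas already in the repository before a new one is created.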

Profiling

Profiling analysis looks at the data values in the asset and summarizes their characteristics.

Figure 7: Capturing the characteristics of the values in a data asset
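As a rough illustration of what a profiling step might compute, the sketch below summarizes a single column of values. The function name and the particular statistics chosen are assumptions; the model itself does not prescribe which characteristics a discovery service records.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Summarize the values of one column; None stands for a missing value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


profile = profile_column([3, 1, None, 3, 7])
```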

Semantic Discovery

Semantic discovery attempts to determine the meaning of the data values in the asset.

Figure 8: Uncovering the meaning of data

Relationship Discovery

Relationship discovery identifies relationships between different assets (or parts of assets), such as two columns that have a foreign key relationship.

Figure 9: Uncovering relationships in data
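A simple heuristic for the foreign key case is value containment: if every value in one column also appears in another column whose values are unique, the pair is a candidate relationship. The sketch below implements only that heuristic, under the assumption that column values are available in memory. The function name and the "table.column" naming are hypothetical, and a real discovery service would need sampling and further checks to limit false positives.

```python
from typing import Any, Dict, List, Tuple


def foreign_key_candidates(columns: Dict[str, List[Any]]) -> List[Tuple[str, str]]:
    """Return (child column, parent column) pairs that pass the containment test."""
    candidates: List[Tuple[str, str]] = []
    for parent_name, parent_values in columns.items():
        parent_non_null = [v for v in parent_values if v is not None]
        parent_set = set(parent_non_null)
        # A parent must be non-empty and unique to act as a key.
        if not parent_set or len(parent_set) != len(parent_non_null):
            continue
        for child_name, child_values in columns.items():
            if child_name == parent_name:
                continue
            child_set = {v for v in child_values if v is not None}
            if child_set and child_set <= parent_set:
                candidates.append((child_name, parent_name))
    return candidates


found = foreign_key_candidates(
    {
        "customer.id": [1, 2, 3],
        "order.customer_id": [1, 1, 3],
        "order.total": [9.5, 20.0, 9.5],
    }
)
```

Each candidate pair would be recorded as a relationship annotation for a steward to confirm, since containment alone does not prove an intended relationship.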

Classification Discovery

Classification discovery adds suggestions for how the data could be classified. These annotations are the discovery engine equivalent of the Informal Tag shown in 0350 - Feedback in Area 3.

Figure 10: Suggested classifications

Measurements

Measurements capture a snapshot of the physical dimensions and activity levels at a particular moment in time.

Figure 11: Measurements

Request for Action

A request for action (RfA) is used to trigger the Governance Action Framework (GAF). It is used when the discovery service performs a test on the data (such as a discovery rule), or has discovered an anomaly in the data landscape compared to its metadata, that potentially needs action from a steward or curator. The GAF is configured to respond to RfAs.
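The pattern of a discovery rule raising an RfA that a configured handler then responds to can be sketched as follows. Nothing here is GAF API: the RequestForAction fields, the null-ratio rule, the "quality-review" action type, and the ActionDispatcher class are all hypothetical names invented for the illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class RequestForAction:
    action_type: str  # hypothetical key used to select a handler
    asset_id: str
    summary: str


def check_null_ratio(
    asset_id: str, values: List[Any], max_null_ratio: float
) -> Optional[RequestForAction]:
    """Discovery rule: raise an RfA when too many values are missing."""
    if not values:
        return None
    ratio = sum(v is None for v in values) / len(values)
    if ratio > max_null_ratio:
        return RequestForAction(
            "quality-review", asset_id, f"{ratio:.0%} of values are null"
        )
    return None


class ActionDispatcher:
    """Stand-in for a framework configured to respond to RfAs."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[RequestForAction], None]] = {}

    def register(
        self, action_type: str, handler: Callable[[RequestForAction], None]
    ) -> None:
        self._handlers[action_type] = handler

    def dispatch(self, rfa: RequestForAction) -> bool:
        handler = self._handlers.get(rfa.action_type)
        if handler is None:
            return False  # nothing is configured for this kind of request
        handler(rfa)
        return True


steward_queue: List[RequestForAction] = []
dispatcher = ActionDispatcher()
dispatcher.register("quality-review", steward_queue.append)

rfa = check_null_ratio("asset-1", [1, None, None, 4], max_null_ratio=0.25)
handled = dispatcher.dispatch(rfa) if rfa is not None else False
```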
