Area 6 provides structures for automated metadata discovery servers to add annotations to assets in the metadata repository. Metadata discovery servers run different types of analysis. An analysis may run just once (for example, when the asset is created), on demand, or in response to an event or schedule. Each type of analysis is implemented in a discovery service. A discovery service contains one or more discovery steps. Each step performs some analysis that may result in an annotation for one or more assets. The annotations from a particular run of a discovery service are grouped together into a discovery analysis report.
The annotations may be reviewed and approved by a steward. The steward may convert an annotation to a hardened metadata type, or flag the annotation as invalid. When the discovery service is rerun, the new annotations can be matched to the annotations from the previous run. The steward's actions affect how the new annotations are processed.
Apache Atlas has an Open Discovery Framework (ODF) that supports the development and execution of discovery services. The ODF runs as a metadata discovery server. ODF discovery services use connectors from the Open Connector Framework (OCF) to connect to the data assets and access the known metadata about them.
Figure 1 shows the different packages for discovery.
Figure 1: Area 6 packages for metadata discovery
Metadata discovery server
The metadata discovery server uses the metadata repository to manage its configuration and the configuration of the discovery services that are deployed to it.
Figure 2: Metadata structures to store ODF configuration
Each metadata discovery server is represented in the metadata repository as a Server instance with a classification of "MetadataDiscoveryServer". Each type of discovery service is represented as an instance of a ServiceCapability specialization called DiscoveryService. These entities hold the configuration used by a metadata discovery server.
Discovery analysis reports
Each time a discovery service runs, it creates a collection of annotations. The ODF manages these annotations in a cache so that later steps in the discovery service can access the annotations from previous steps. When the service completes, the annotations are published to the metadata repository as a discovery analysis report. Note that the report is linked to both the server and the discovery service, since a discovery service may be deployed to multiple metadata discovery servers.
Figure 3: Results of a discovery service
The DiscoveryAnalysisReport is the report header. It identifies the date of the report and the parameters used. It may also include a name and description that is supplied by the initiator of the discovery service run. The DiscoveryServerReport links the report to the metadata discovery server that ran the discovery service and the DiscoveryServiceReport links it to the discovery service.
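The flow described above (steps sharing an annotation cache, then publishing a report) can be sketched as follows. This is a minimal illustration in Python, not the ODF API: the class fields, the `run_discovery_service` function, and the two example steps are all hypothetical names chosen for this sketch, and the two links are modeled as plain strings rather than relationships.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class Annotation:
    annotation_type: str      # free-form string, e.g. "SchemaExtraction"
    asset_guids: List[str]    # the assets this annotation relates to
    summary: str = ""


@dataclass
class DiscoveryAnalysisReport:
    server_name: str          # stands in for the DiscoveryServerReport link
    service_name: str         # stands in for the DiscoveryServiceReport link
    run_date: datetime
    parameters: Dict[str, str]
    annotations: List[Annotation] = field(default_factory=list)


# A step receives the asset identifier and a copy of the annotations cached
# so far, and returns any new annotations it produced.
DiscoveryStep = Callable[[str, List[Annotation]], List[Annotation]]


def run_discovery_service(server_name: str, service_name: str, asset_guid: str,
                          steps: List[DiscoveryStep],
                          parameters: Dict[str, str] = None) -> DiscoveryAnalysisReport:
    cache: List[Annotation] = []  # in-memory cache shared across steps
    for step in steps:
        cache.extend(step(asset_guid, list(cache)))
    # "Publishing" here just means returning the report to the caller.
    return DiscoveryAnalysisReport(server_name, service_name,
                                   datetime.now(timezone.utc),
                                   dict(parameters or {}), cache)


def extract_schema_step(asset_guid, previous):
    return [Annotation("SchemaExtraction", [asset_guid], "2 columns found")]


def profile_step(asset_guid, previous):
    # A later step can read what earlier steps placed in the cache.
    seen = [a.annotation_type for a in previous]
    return [Annotation("Profiling", [asset_guid], "ran after: " + ", ".join(seen))]


report = run_discovery_service("discovery-server-1", "example-service",
                               "asset-001", [extract_schema_step, profile_step])
```

Passing each step a copy of the cache keeps a step from altering annotations produced earlier in the run; a real implementation may make a different choice.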
Annotations
Annotations capture the discovered characteristics of an asset. They are created by the analysis steps in the discovery service. The attributes of the annotation capture the details of the discovery processing. The sub-classes of Annotation capture specific details of the discovered metadata. Each annotation is linked to the discovery analysis report it was generated from. It also links to each asset that the annotation relates to. Strings are used in many of the attributes to keep the model open for discovery service developers and the tools that process them.
Figure 4: Base structure for an annotation and its links to an asset and the discovery analysis report
Annotation Reviews
The annotations associated with an asset can be seen by people and tools querying the associated asset, server or discovery service. However, the analysis within a discovery service can often only make recommendations based on the information within the asset. Where annotations refer to information that is used for governance, they need to be approved and converted into classifications or related metadata. The annotation review records how each discovered annotation has been actioned in the metadata server and which steward approved it.
Figure 5: Recording annotation reviews
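As an illustration of how a recorded review could influence a rerun, the sketch below matches new annotations to earlier reviews by a key and decides what to do with each. Everything here is an assumption made for the example: the `ReviewOutcome` values, the `AnnotationReview` fields, the matching key, and the triage policy are hypothetical and are not taken from the model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List, Tuple

# Hypothetical matching key: (annotation type, asset identifier).
AnnotationKey = Tuple[str, str]


class ReviewOutcome(Enum):
    APPROVED = "approved"   # steward converted it to hardened metadata
    INVALID = "invalid"     # steward flagged the annotation as wrong


@dataclass
class AnnotationReview:
    annotation_key: AnnotationKey
    steward: str
    outcome: ReviewOutcome
    comment: str = ""


def triage_rerun(new_keys: Iterable[AnnotationKey],
                 reviews: List[AnnotationReview]) -> Dict[AnnotationKey, str]:
    """Decide how to treat each annotation from a rerun (one possible policy)."""
    outcome_by_key = {r.annotation_key: r.outcome for r in reviews}
    decisions = {}
    for key in new_keys:
        outcome = outcome_by_key.get(key)
        if outcome is None:
            decisions[key] = "pending review"     # never seen before
        elif outcome is ReviewOutcome.INVALID:
            decisions[key] = "suppressed"         # steward already rejected it
        else:
            decisions[key] = "already actioned"   # steward already approved it
    return decisions


reviews = [
    AnnotationReview(("Classification", "asset-001"), "steward-a", ReviewOutcome.APPROVED),
    AnnotationReview(("Relationship", "asset-001"), "steward-a", ReviewOutcome.INVALID),
]
decisions = triage_rerun(
    [("Classification", "asset-001"), ("Relationship", "asset-001"), ("Profiling", "asset-001")],
    reviews,
)
```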
The types that follow provide more specialized annotations.
Schema Extraction
One of the simplest discovery processes for relational data is to extract the schema details from the asset through the JDBC connector getMetadata() API. Other connectors or data sources may also provide APIs for schema extraction. The schema is first added as an annotation. It is then either matched with an existing schema or used to create a new schema (see area 5). This matching may be completely automated, or carried out with a steward's assistance.
Figure 6: Capturing the schema of an asset
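A minimal sketch of schema extraction follows. It uses Python's built-in `sqlite3` module as a stand-in for a JDBC connection, reading the database's own catalog rather than calling a JDBC metadata API. The `extract_schema_annotation` function and the shape of the returned dictionary are assumptions made for illustration, not part of the model.

```python
import sqlite3


def extract_schema_annotation(connection: sqlite3.Connection,
                              asset_guid: str, table_name: str) -> dict:
    """Read column details from the catalog and wrap them as an annotation."""
    # pragma_table_info is SQLite's table-valued form of PRAGMA table_info
    # (available from SQLite 3.16); pk is non-zero for primary key columns.
    rows = connection.execute(
        "SELECT name, type, pk FROM pragma_table_info(?)", (table_name,)
    ).fetchall()
    return {
        "annotation_type": "SchemaExtraction",   # hypothetical type name
        "asset_guid": asset_guid,
        "columns": [
            {"name": name, "type": declared_type, "primary_key": pk > 0}
            for name, declared_type, pk in rows
        ],
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
annotation = extract_schema_annotation(conn, "asset-001", "customer")
```

The resulting annotation could then be compared with existing schemas, either automatically or by a steward, as described above.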
Profiling
Profiling analysis looks at the data values in the asset and summarizes their characteristics.
Figure 7: Capturing the characteristics of the values in a data asset
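To make "summarizes their characteristics" concrete, the sketch below profiles the values of a single column. The chosen statistics and the `profile_column` name are illustrative assumptions; a real profiling annotation may capture different or additional measures.

```python
from collections import Counter
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Summarize a column's values; None is treated as a missing value."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "value_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(counts),
        "min_value": min(non_null) if non_null else None,
        "max_value": max(non_null) if non_null else None,
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }


profile = profile_column([3, 7, None, 7, 1])
```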
Semantic Discovery
Semantic discovery attempts to determine the meaning of the data values in the asset.
Figure 8: Uncovering the meaning of data
Relationship Discovery
Relationship discovery identifies relationships between different assets (or parts of assets), such as two columns that have a foreign key relationship.
Figure 9: Uncovering relationships in data
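One simple heuristic for spotting a possible foreign key is an inclusion check: every non-null value in the candidate child column must appear in a parent column whose values are unique. The sketch below shows that check only; the function name is hypothetical, and a positive result is a suggestion for review, not proof of a relationship.

```python
from typing import Any, Iterable


def is_foreign_key_candidate(child_values: Iterable[Any],
                             parent_values: Iterable[Any]) -> bool:
    """Return True if child values are all contained in a unique parent column."""
    parent = [v for v in parent_values if v is not None]
    parent_set = set(parent)
    if len(parent_set) != len(parent):
        return False  # parent column has duplicates, so it cannot act as a key
    child_set = {v for v in child_values if v is not None}
    # An empty child column gives no evidence either way, so report False.
    return bool(child_set) and child_set <= parent_set


orders_customer_id = [1, 2, 2, None]
customers_id = [1, 2, 3]
candidate = is_foreign_key_candidate(orders_customer_id, customers_id)
```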
Classification Discovery
Classification discovery adds suggestions for how the data could be classified. These annotations are the discovery engine equivalent of the Informal Tag shown in 0350 - Feedback in Area 3.
Figure 10: Suggested classifications
Measurements
Measurements capture a snapshot of the physical dimensions and activity levels at a particular moment in time.
Figure 11: Measurements
Request for Action
A request for action (RfA) is used to trigger the Governance Action Framework (GAF). It is used when the discovery service performs a test on the data (such as a discovery rule), or has discovered an anomaly in the data landscape compared to its metadata that potentially needs action from a steward or curator. The GAF is configured to respond to RfAs.