The Griffin measure module reads data from data sources, calculates data quality metrics as the rules describe, and then persists the metrics.

Measure of batch data

For a batch data source, the Griffin measure process simply follows this main flow.

In batch mode, we first read data from the data source through a data connector, then calculate DQ as the rules describe; after the calculation, the metrics are persisted.

For example, in the batch accuracy case, Griffin reads data from Hive, calculates according to the rules on a Spark cluster, and then persists the metrics into Elasticsearch.
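As an illustration, a minimal batch-accuracy sketch in Spark Scala might look like the following; the Hive table names, join columns, and Elasticsearch index are illustrative assumptions, not Griffin's actual code or configuration.

import org.apache.spark.sql.SparkSession

object BatchAccuracySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-accuracy-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // 1. Read source and target data from Hive through the data connector.
    val source = spark.table("demo_src")   // illustrative Hive table names
    val target = spark.table("demo_tgt")

    // 2. Calculate DQ as the accuracy rule describes: a source record is "matched"
    //    if a record with the same key fields exists in the target;
    //    accuracy = matched / total.
    val total    = source.count()
    val matched  = source.join(target, Seq("id", "age", "desc"), "left_semi").count()
    val accuracy = if (total == 0) 1.0 else matched.toDouble / total

    // 3. Persist the metric into Elasticsearch
    //    (requires the elasticsearch-spark connector on the classpath).
    val metric = Seq((System.currentTimeMillis(), total, matched, accuracy))
      .toDF("tmst", "total", "matched", "accuracy")
    metric.write
      .format("org.elasticsearch.spark.sql")
      .option("es.resource", "griffin/accuracy")   // illustrative index name
      .save()

    spark.stop()
  }
}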

 

Measure of streaming data

For a streaming data source, the Griffin measure process is also based on this main flow, with some additional parts as follows.

For a streaming data source, we might not be able to process the data as fast as it arrives, so we need a temporary data repository.

We first dump the streaming data as batches, with an info cache to record the dump information; the following steps are just like batch mode.
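A minimal sketch of this dump step, assuming a Kafka topic named "source" and an illustrative HDFS dump path; here the streaming checkpoint plays the role of a simple info cache, while Griffin itself maintains a dedicated info cache for the dump information.

import org.apache.spark.sql.{DataFrame, SparkSession}

object StreamingDumpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-dump-sketch")
      .getOrCreate()

    // Read the streaming data from Kafka.
    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")   // illustrative brokers
      .option("subscribe", "source")                         // illustrative topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")

    // Dump each micro-batch to its own HDFS path so the later batch jobs can read it.
    val dumpBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
      batch.write
        .mode("overwrite")
        .parquet(s"hdfs:///griffin/dump/source/batch_$batchId")   // illustrative path

    // The checkpoint location remembers which offsets have already been dumped,
    // serving here as a stand-in for the info cache.
    val query = stream.writeStream
      .foreachBatch(dumpBatch)
      .option("checkpointLocation", "hdfs:///griffin/dump/source/_checkpoint")
      .start()

    query.awaitTermination()
  }
}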

The streaming job is split into batch jobs, which are calculated in the DQ calculator. Considering the latency in a streaming job, we sometimes need to dump the updated batch data for the next batch job, to guarantee that no data leaks out of the calculation for streaming data.

The updated batch data is the data that needs to be recalculated in the next batch job. This always happens when there are multiple streaming data sources, for example when measuring accuracy between two data sources that both come from Kafka.

For example, in the streaming accuracy case, Griffin reads data from Kafka and dumps it into temporary HDFS storage, splitting the work into batch jobs. These batch jobs only calculate the batches of data from HDFS, so the consuming task and the calculating task are separated and the calculation does not delay the consumption. In every batch job, we calculate the newly received data of this batch period together with the updated data of old batches that have not yet expired. In this way, we guarantee the streaming data is calculated for a sufficient time, and only the expired data is discarded.
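A minimal sketch of one such batch job, assuming the dumped source and target batches are already loaded as DataFrames; the key column, the "tmst" timestamp field, and the expiry window are illustrative assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object StreamingAccuracyBatchSketch {
  // One batch job: check the newly received source data plus the not-yet-expired
  // unmatched data of old batches ("updated batch data") against the target.
  def runBatch(newSrc: DataFrame,
               oldUnmatched: DataFrame,
               tgt: DataFrame,
               nowMs: Long,
               expireMs: Long): (DataFrame, Double) = {
    // Discard the out-of-time data; keep the rest for recalculation.
    val srcToCheck = newSrc.unionByName(
      oldUnmatched.filter(col("tmst") > nowMs - expireMs))

    val matched   = srcToCheck.join(tgt, Seq("id"), "left_semi")   // illustrative key column
    val unmatched = srcToCheck.join(tgt, Seq("id"), "left_anti")

    val total    = srcToCheck.count()
    val accuracy = if (total == 0) 1.0 else matched.count().toDouble / total

    // The unmatched records are dumped again as updated batch data,
    // so the next batch job recalculates them until they expire.
    (unmatched, accuracy)
  }
}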
