Problem

We have two user profile datasets, and we want to know how many records match between them on user_id, first_name, and last_name.
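To make the goal concrete, here is a minimal sketch of the check with two small hypothetical in-memory datasets (the field names mirror the Avro files used later; the sample rows are invented for illustration):

```python
# Hypothetical sample data; in the real scenario these come from Avro files.
src = [
    {"user_id": 1, "first_name": "Ann",  "last_name": "Lee"},
    {"user_id": 2, "first_name": "Bob",  "last_name": "Ray"},
    {"user_id": 3, "first_name": "Cara", "last_name": "Fox"},
]
tgt = [
    {"user_id": 1, "first_name": "Ann", "last_name": "Lee"},
    {"user_id": 2, "first_name": "Bob", "last_name": "Ray"},
]

def is_match(s, t):
    # A source record matches a target record when all three fields agree.
    return (s["user_id"] == t["user_id"]
            and s["first_name"] == t["first_name"]
            and s["last_name"] == t["last_name"])

# Count source records that have at least one matching target record.
matched = sum(1 for s in src if any(is_match(s, t) for t in tgt))
total = len(src)
miss = total - matched
print(matched, miss, total)  # → 2 1 3
```

Griffin performs this kind of comparison at scale on Spark; the rest of this page shows how to express it declaratively.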

Solution

We need to tell Griffin where the datasets are and what the match assertion is.

Implementation

Create an accuracy measure configuration in JSON that Griffin can understand:

{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "connectors": [
        {
          "type": "avro",
          "version": "1.7",
          "config": {
            "file.name": "users_info_src.avro"
          }
        }
      ]
    },
    {
      "name": "tgt",
      "connectors": [
        {
          "type": "avro",
          "version": "1.7",
          "config": {
            "file.name": "users_info_target.avro"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "src.user_id = tgt.user_id AND upper(src.first_name) = upper(tgt.first_name) AND src.last_name = tgt.last_name",
        "details": {
          "source": "src",
          "target": "tgt",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "metric": {
          "name": "accu"
        },
        "record": {
          "name": "missRecords"
        }
      }
    ]
  }
}
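The griffin-dsl accuracy rule above is conceptually a join between src and tgt: a source row counts as missed when no target row satisfies the rule (note the case-insensitive comparison on first_name). Griffin compiles the rule to Spark SQL; the SQLite sketch below, with hypothetical sample rows, only illustrates the join semantics and is not Griffin's actual execution plan:

```python
import sqlite3

# In-memory tables standing in for the src and tgt Avro datasets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (user_id INTEGER, first_name TEXT, last_name TEXT);
CREATE TABLE tgt (user_id INTEGER, first_name TEXT, last_name TEXT);
INSERT INTO src VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cara', 'Fox');
INSERT INTO tgt VALUES (1, 'ANN', 'Lee'), (2, 'Bob', 'Ray');
""")

# Source rows with no matching target row are the "miss" records;
# the WHERE clause mirrors the griffin-dsl rule, upper() included.
miss_count = conn.execute("""
    SELECT COUNT(*) FROM src
    WHERE NOT EXISTS (
      SELECT 1 FROM tgt
      WHERE src.user_id = tgt.user_id
        AND UPPER(src.first_name) = UPPER(tgt.first_name)
        AND src.last_name = tgt.last_name
    )
""").fetchone()[0]
total_count = conn.execute("SELECT COUNT(*) FROM src").fetchone()[0]
matched_count = total_count - miss_count
print(miss_count, total_count, matched_count)  # → 1 3 2
```

The three counts correspond to the miss_count, total_count, and matched_count names configured in the "details" block.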


Deploy

With the measure module package, we can submit the job to a Spark cluster to calculate the metric:

spark-submit --class org.apache.griffin.measure.Application \
--master yarn-client --queue default \
measure.jar \
env.json config.json local,local


Verification

After the calculation, the result is persisted as configured in env.json.

Supported persist types include log, HDFS, HTTP POST, and MongoDB.
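For reference, the persist section of env.json might look like the following sketch (illustrative values only; check the Griffin configuration guide for the exact schema of your version):

```
"persist": [
  {
    "type": "log",
    "config": {
      "max.log.lines": 100
    }
  },
  {
    "type": "hdfs",
    "config": {
      "path": "hdfs:///griffin/persist"
    }
  }
]
```

Each entry writes the computed metric (and, for accuracy, the missRecords) to one sink, so several sinks can be active at once.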
