Problem

We have two user profile datasets, and we want to know how many records match between them on user_id, first_name, and last_name.
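To make the goal concrete, here is a minimal sketch of the check with two small hypothetical in-memory datasets (the field names mirror the Avro files used later; the sample rows are invented for illustration):

```python
# Hypothetical sample data; in the real scenario these come from Avro files.
src = [
    {"user_id": 1, "first_name": "Ann",  "last_name": "Lee"},
    {"user_id": 2, "first_name": "Bob",  "last_name": "Ray"},
    {"user_id": 3, "first_name": "Cara", "last_name": "Fox"},
]
tgt = [
    {"user_id": 1, "first_name": "Ann", "last_name": "Lee"},
    {"user_id": 2, "first_name": "Bob", "last_name": "Ray"},
]

def is_match(s, t):
    # A source record matches a target record when all three fields agree.
    return (s["user_id"] == t["user_id"]
            and s["first_name"] == t["first_name"]
            and s["last_name"] == t["last_name"])

# Count source records that have at least one matching target record.
matched = sum(1 for s in src if any(is_match(s, t) for t in tgt))
total = len(src)
miss = total - matched
print(matched, miss, total)  # → 2 1 3
```

Griffin performs this kind of comparison at scale on Spark; the rest of this page shows how to express it declaratively.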

Solution

We need to tell Griffin where the datasets are and what the match assertion is.

Implementation

Create an accuracy measure configuration in JSON that Griffin can understand:

{
  "name": "accu_batch",
  "process.type": "batch",
  "data.sources": [
    {
      "name": "src",
      "connectors": [
        {
          "type": "avro",
          "version": "1.7",
          "config": {
            "file.name": "users_info_src.avro"
          }
        }
      ]
    },
    {
      "name": "tgt",
      "connectors": [
        {
          "type": "avro",
          "version": "1.7",
          "config": {
            "file.name": "users_info_target.avro"
          }
        }
      ]
    }
  ],
  "evaluate.rule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "src.user_id = tgt.user_id AND upper(src.first_name) = upper(tgt.first_name) AND src.last_name = tgt.last_name",
        "details": {
          "source": "src",
          "target": "tgt",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "metric": {
          "name": "accu"
        },
        "record": {
          "name": "missRecords"
        }
      }
    ]
  }
}
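The griffin-dsl accuracy rule above is conceptually a join between src and tgt: a source row counts as missed when no target row satisfies the rule (note the case-insensitive comparison on first_name). Griffin compiles the rule to Spark SQL; the SQLite sketch below, with hypothetical sample rows, only illustrates the join semantics and is not Griffin's actual execution plan:

```python
import sqlite3

# In-memory tables standing in for the src and tgt Avro datasets.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (user_id INTEGER, first_name TEXT, last_name TEXT);
CREATE TABLE tgt (user_id INTEGER, first_name TEXT, last_name TEXT);
INSERT INTO src VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cara', 'Fox');
INSERT INTO tgt VALUES (1, 'ANN', 'Lee'), (2, 'Bob', 'Ray');
""")

# Source rows with no matching target row are the "miss" records;
# the WHERE clause mirrors the griffin-dsl rule, upper() included.
miss_count = conn.execute("""
    SELECT COUNT(*) FROM src
    WHERE NOT EXISTS (
      SELECT 1 FROM tgt
      WHERE src.user_id = tgt.user_id
        AND UPPER(src.first_name) = UPPER(tgt.first_name)
        AND src.last_name = tgt.last_name
    )
""").fetchone()[0]
total_count = conn.execute("SELECT COUNT(*) FROM src").fetchone()[0]
matched_count = total_count - miss_count
print(miss_count, total_count, matched_count)  # → 1 3 2
```

The three counts correspond to the miss_count, total_count, and matched_count names configured in the "details" block.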


Deploy

With the measure module package, we can submit the job to a Spark cluster to calculate the metric:

spark-submit --class org.apache.griffin.measure.Application \
--master yarn-client --queue default \
measure.jar \
env.json config.json local,local


Verification

After the calculation, the result is persisted as configured in env.json.

Supported persist types include log, HDFS, HTTP POST, and MongoDB.
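For reference, the persist section of env.json might look like the following sketch (illustrative values only; check the Griffin configuration guide for the exact schema of your version):

```
"persist": [
  {
    "type": "log",
    "config": {
      "max.log.lines": 100
    }
  },
  {
    "type": "hdfs",
    "config": {
      "path": "hdfs:///griffin/persist"
    }
  }
]
```

Each entry writes the computed metric (and, for accuracy, the missRecords) to one sink, so several sinks can be active at once.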
