Batch accuracy measures the accuracy between batch data sources. Hive tables and Avro files are currently supported.

This guide walks through Griffin's batch accuracy measure, using Hive table data.

 

Step 1. Prepare environment and data

Jar file: measure-0.1.6-incubating.jar

Environment requirement: a cluster with Hadoop, Spark, and Hive.
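
A quick way to confirm the components are available on the machine you will submit from (any recent Hadoop/Spark/Hive combination compatible with your Griffin build should do):

hadoop version
spark-submit --version
hive --version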

Data requirement: two tables in Hive, one of which is the data to be measured (called the source), and the other the single source of truth (called the target).

The accuracy between these two tables is calculated as: (count of source records matched in target) / (total count of source records) * 100%. For example, if the source partition holds 1000 records and 950 of them match a target record, the accuracy is 95%.
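
As a minimal sketch, the two tables might be prepared as follows. The demo_src/demo_tgt names, the user_id/first_name/last_name columns, and the dt/hr partitioning match the config and rule in Step 2; the column types are assumptions:

hive -e "
CREATE TABLE IF NOT EXISTS default.demo_src (
  user_id    BIGINT,
  first_name STRING,
  last_name  STRING
)
PARTITIONED BY (dt INT, hr INT);

-- give the target the same schema and partitioning
CREATE TABLE IF NOT EXISTS default.demo_tgt LIKE default.demo_src;
"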

 

Step 2. Provide measure config file

config.json

{
  "name": "accuracy",
  "process.type": "batch",

  "data.sources": [
    {
      "name": "source",
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
          	"database": "default",
          	"table.name": "demo_src",
            "where": "dt=20180101 AND hr=12"
          }
        }
      ]
    },
    {
      "name": "target",
      "connectors": [
        {
          "type": "hive",
          "version": "1.2",
          "config": {
          	"database": "default",
          	"table.name": "demo_tgt",
            "where": "dt=20180101 AND hr=12, dt=20180101 AND hr=13"
          }
        }
      ]
    }
  ],
 
  "evaluateRule": {
    "rules": [
      {
        "dsl.type": "griffin-dsl",
        "dq.type": "accuracy",
        "rule": "src.user_id = tgt.user_id AND upper(src.first_name) = upper(tgt.first_name) AND src.last_name = tgt.last_name",
        "details": {
          "source": "src",
          "target": "tgt",
          "miss": "miss_count",
          "total": "total_count",
          "matched": "matched_count"
        },
        "metric": {
          "name": "accu"
        },
        "record": {
          "name": "missRecords"
        }
      }
    ]
  }
}
  • "name" is the name of this measurement.
  • "type" should be "accuracy" here. 
  • "source" and "target" are the data source configurations.
  • "evaluateRule" is the evaluation rule of accuracy, "rules" describes the mapping rule of accuracy.

 

Step 3. Provide env config file

env.json

{
  "spark": {
    "log.level": "INFO",
    "config": {}
  },

  "persist": [
    {
      "type": "log",
      "config": {
        "max.log.lines": 10
      }
    },
    {
      "type": "hdfs",
      "config": {
        "path": "hdfs:///griffin/test",
        "max.lines.per.file": 10000
      }
    }
  ]
}

Two persist types are configured here: the metric will be printed in the log, and both the metric and the miss records will be persisted to HDFS.

 

Step 4. Submit the Spark job.

spark-submit --master yarn --queue default --class org.apache.griffin.measure.Application \
 measure-0.1.6-incubating.jar \
 <path of env.json> <path of config.json> "local,local"

The third parameter, "local,local", indicates where env.json and config.json are located; each value can be "local", "hdfs", or "raw" (meaning the argument itself is the raw JSON string).
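
For instance, to read both config files from HDFS instead (the hdfs:///griffin/conf path is just an example):

hdfs dfs -mkdir -p hdfs:///griffin/conf
hdfs dfs -put env.json config.json hdfs:///griffin/conf/
spark-submit --master yarn --queue default --class org.apache.griffin.measure.Application \
 measure-0.1.6-incubating.jar \
 hdfs:///griffin/conf/env.json hdfs:///griffin/conf/config.json "hdfs,hdfs"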

 

Step 5. Check the output files.

Check the metric file and the miss records file under "hdfs:///griffin/test", the persist path configured in env.json.
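
Standard HDFS shell commands are enough here; the exact directory layout and file names under the persist path vary by Griffin version, so list first and then inspect what you find. A metric record carries the names configured in config.json (accu, total_count, miss_count, matched_count).

# list everything the job wrote under the configured persist path
hdfs dfs -ls -R hdfs:///griffin/test

# then inspect a metric or missRecords file from the listing (placeholder path)
hdfs dfs -cat hdfs:///griffin/test/<metric-or-missRecords-file>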
