...

No Format
"evaluateRule": {
  "rules": [
    {
      "dsl.type": "griffin-dsl",
  "rules": [
    {"dq.type": "accuracy",
      "typename": "accuracyaccu",
      "rule": "source.name = target.name AND source.age = target.age",
      "details": {
        "source": "source",
        "target": "target"
      },
      "metric": {
        "name": "accu"
      },
      "record": {
        "name": "missRecords"
      }
    }
  ]
}

In the backend, Griffin will first translate the griffin-dsl rules into spark-sql, like the rules below.

Griffin can also accept spark-sql rules directly:

No Format
"evaluateRuleevaluate.rule": {
  "rules": [
    {
      "dsl.type": "spark-sql",
  "rules": [
    {
      "name": "miss.recordmissRecords",
      "rule": "SELECT source.name, source.age* FROM source LEFT JOIN target ON coalesce(source.name, 'null') = coalesce(target.name, 'null') AND coalesce(source.age, 'null') = coalesce(target.age, 'null') WHERE (NOT (source.name IS NULL AND source.age IS NULL)) AND (target.name IS NULL AND target.age IS NULL)",
      "record": {
        "persist.typename": "recordmissRecords"
      }
    },
    {
      "dsl.type": "spark-sql",
      "name": "miss._count",
      "rule": "SELECT COUNTcount(*) as miss FROM miss",`missRecords`"
    },
    {
      "persistdsl.type": "spark-sql",
 "metric     "name": "total_count",
      "rule": "SELECT count(*) as total FROM source"
    },
    {
      "dsl.type": "spark-sql",
      "name": "total.countaccu",
      "rule": "SELECT COUNT(*) FROM source",`total_count`.`total` AS `total`, coalesce(`miss_count`.`miss`, 0) AS `miss`, (`total` - `miss`) AS `matched` FROM `total_count` FULL JOIN `miss_count`",
      "metric": {
        "persist.typename": "metricaccu"
      }
    }
  ] 
}

In the backend, Griffin will execute these sql statements step by step, and persist the metric and record results if they are configured.
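
As an illustration, here is a minimal sketch of such an execution loop on Spark, assuming the rules have already been translated to spark-sql; the SqlRule class and executeRules function are hypothetical names for this sketch, not Griffin's actual internal API.

No Format
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical rule model mirroring the "evaluate.rule" config above.
case class SqlRule(name: String, rule: String)

// Assumes "source" and "target" are already registered as temp views
// by the data connectors.
def executeRules(spark: SparkSession, rules: Seq[SqlRule]): Map[String, DataFrame] =
  rules.foldLeft(Map.empty[String, DataFrame]) { (results, r) =>
    val df = spark.sql(r.rule)          // run this step as a spark-sql task
    df.createOrReplaceTempView(r.name)  // later steps can reference it by name
    results + (r.name -> df)
  }

Registering each step's result as a temp view is what lets the final "accu" rule join the `total_count` and `miss_count` results computed in earlier steps.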

 

Griffin Measure Process Design

...

Spark SQL will be the calculation engine to execute the sql tasks step by step.

The result generator will collect the results of the DQ dimensions from the spark sql outputs, including metrics and records.

A metric is the result of a DQ calculation, such as {"total": 100, "miss": 2, "matched": 98} for accuracy. It is always small data, so users can persist it in any way.
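
As a sketch of how such a metric could be extracted, assuming the final "accu" step yields a single-row data frame (collectMetric is a hypothetical helper, not Griffin's API):

No Format
import org.apache.spark.sql.DataFrame

// A metric like the "accu" result is a single small row (total, miss,
// matched), so collecting it to the driver is cheap.
def collectMetric(df: DataFrame): Map[String, Long] = {
  val row = df.head()
  df.columns.zipWithIndex.map { case (col, i) => col -> row.getLong(i) }.toMap
}
// yields e.g. Map("total" -> 100L, "miss" -> 2L, "matched" -> 98L)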

Records are the intermediate records produced during the DQ calculation, such as the missing records found by an accuracy rule. They are always big data, so by default they are persisted only on HDFS.
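
Correspondingly, a minimal sketch of a default record sink, writing the missRecords data frame to HDFS (the function name, output path, and json format are assumptions for illustration):

No Format
import org.apache.spark.sql.DataFrame

// Records can be huge, so they are written out in a distributed way
// instead of being collected to the driver.
def persistRecords(df: DataFrame, hdfsPath: String): Unit =
  df.write.mode("append").json(hdfsPath)

// e.g. persistRecords(missRecords, "hdfs:///griffin/persist/accu/missRecords")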

 

Conclusion

The spark sql engine will support more types of DQ problems, and will be continuously maintained by the spark community. Griffin will focus on the DQ dimensions, supporting more DQ problem requirements, as well as streaming processes and multiple data source types.