
The Griffin measure module needs to support multiple types of DQ measurements, so the DSL also needs to cover most DQ requests.

In the old version of Griffin, the measure DSL is like the WHERE clause of SQL. It supports the accuracy use case well, but for the profiling use case it cannot describe most DQ requests.

For example, in the old version, the DSL for the accuracy use case looks like this:

$source.id = $target.id AND $source.name = $target.name AND $source.age > $target.age

And for the null value detection use case:

$source.id = null

But for the enum value detection use case, it has to be written as multiple rule statements:

$source.color = "RED";
$source.color = "BLUE";
$source.color = "YELLOW";
$source.color NOT IN ("RED", "BLUE", "YELLOW")

This makes it hard to describe profiling rules or other DQ requests.

Therefore, we want to support spark-sql syntax directly as one of the Griffin DSL types.

 

Griffin DSL in the New Version

In the Griffin job configuration file, the "evaluateRule" field will look like this to support the griffin-dsl type:

"evaluateRule": {
  "dsl.type": "griffin-dsl",
  "rules": [
    {
      "type": "accuracy",
      "rule": "source.name = target.name AND source.age = target.age"
    }
  ]
}
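A profiling rule could take the same shape. This is a hedged sketch only: the "profiling" type value and the aggregation expression syntax (source.age.max() style) are assumptions about the new DSL, not a settled design.

"evaluateRule": {
  "dsl.type": "griffin-dsl",
  "rules": [
    {
      "type": "profiling",
      "rule": "source.id.count(), source.age.max(), source.age.min()"
    }
  ]
}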

In the backend, Griffin translates these rules to spark-sql first, producing rules like the ones below.

The spark-sql type can also be used directly:

"evaluateRule": {
  "dsl.type": "spark-sql",
  "rules": [
    {
      "name": "miss.record",
      "rule": "SELECT source.name, source.age FROM source LEFT JOIN target ON coalesce(source.name, 'null') = coalesce(target.name, 'null') AND coalesce(source.age, 'null') = coalesce(target.age, 'null') WHERE (NOT (source.name IS NULL AND source.age IS NULL)) AND (target.name IS NULL AND target.age IS NULL)",
      "persist.type": "record"
    }, {
      "name": "miss.count",
      "rule": "SELECT COUNT(*) FROM miss",
      "persist.type": "metric"
    }, {
      "name": "total.count",
      "rule": "SELECT COUNT(*) FROM source",
      "persist.type": "metric"
    }
  ]
}
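With the spark-sql type, the enum value detection case above also collapses into a single rule instead of four statements. A minimal sketch (the rule name "invalid.color.count" is illustrative):

"evaluateRule": {
  "dsl.type": "spark-sql",
  "rules": [
    {
      "name": "invalid.color.count",
      "rule": "SELECT COUNT(*) FROM source WHERE source.color NOT IN ('RED', 'BLUE', 'YELLOW')",
      "persist.type": "metric"
    }
  ]
}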

In the backend, Griffin executes the SQL statements and persists the results according to each rule's "persist.type" and "name".
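For instance, the metric-type results above might be persisted as name/value pairs keyed by each rule's "name"; the storage format and the counts here are illustrative assumptions:

{
  "miss.count": 2,
  "total.count": 100
}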

 

Griffin Measure Process Design

For the spark-sql type, we collect the data sources first, run the calculation through Spark SQL, then generate the results.

For the griffin-dsl type, we translate the rules into spark-sql type rules first, then follow the spark-sql process.

The Griffin process will be as follows (see the sketch after this list):

1. The data connector generates data frames from the data source configuration.

2. The rule adaptor generates Spark SQL commands from griffin-dsl or spark-sql rules.

3. Spark SQL serves as the calculation engine to execute the SQL tasks.

4. The result generator produces the DQ dimension results from the Spark SQL outputs.
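A minimal Scala sketch of this pipeline, reusing the accuracy rules from the spark-sql example above. The data paths, JSON format, and local Spark session are assumptions for illustration, not part of the design:

import org.apache.spark.sql.SparkSession

object DqProcessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dq-sketch").master("local[*]").getOrCreate()

    // Step 1: data connector - load data frames from the configured sources
    // (the JSON paths here are illustrative).
    spark.read.json("/data/source").createOrReplaceTempView("source")
    spark.read.json("/data/target").createOrReplaceTempView("target")

    // Step 2: rule adaptor - the rules arrive as plain SQL strings.
    val missRecordSql =
      """SELECT source.name, source.age FROM source LEFT JOIN target
        |ON coalesce(source.name, 'null') = coalesce(target.name, 'null')
        |AND coalesce(source.age, 'null') = coalesce(target.age, 'null')
        |WHERE (NOT (source.name IS NULL AND source.age IS NULL))
        |AND (target.name IS NULL AND target.age IS NULL)""".stripMargin

    // Step 3: Spark SQL executes each rule; each result is registered under
    // its "name" so later rules can reference it (e.g. "miss").
    spark.sql(missRecordSql).createOrReplaceTempView("miss")
    val missCount = spark.sql("SELECT COUNT(*) FROM miss").head.getLong(0)
    val totalCount = spark.sql("SELECT COUNT(*) FROM source").head.getLong(0)

    // Step 4: result generator - persist records and metrics by persist.type
    // (printed here in place of a real persistence layer).
    println(s"miss.count = $missCount, total.count = $totalCount")
    spark.stop()
  }
}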

 

Conclusion

The Spark SQL engine supports more types of DQ requests and is continuously maintained by the Spark community. Griffin will focus on the DQ dimensions, supporting more DQ requirements, as well as streaming processes and multiple data source types.
