The Griffin measure module needs to support multiple types of DQ measurements, so the DSL also needs to cover most DQ requests.

In the old version of Griffin, the measure DSL works like the WHERE clause of SQL. It supports accuracy well, but for the profiling use case it cannot describe most DQ requests.

For example, in the old version, the DSL for the accuracy use case looks like this:

$source.id = $target.id AND $source.name = $target.name AND $source.age > $target.age

And for the null value detection use case it looks like this:

$source.id = null

But for the enum value detection use case, it has to be written as multiple rule statements:

$source.color = "RED";
$source.color = "BLUE";
$source.color = "YELLOW";
$source.color NOT IN ("RED", "BLUE", "YELLOW")

That makes it hard to describe profiling rules or other DQ requests.

Therefore, we want to support spark-sql syntax directly, as one of the Griffin DSL types.
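For example, with spark-sql syntax the enum value detection above shrinks to a single profiling statement (a sketch; the cnt alias is illustrative):

SELECT color, count(*) AS cnt FROM source GROUP BY color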

 

Griffin DSL of version 0.2.0

In the Griffin job configuration file, the "evaluateRule" field will look like this to support the griffin DSL:

"evaluateRule": {
  "rules": [
    {
      "dsl.type": "griffin-dsl",
      "dq.type": "accuracy",
      "name": "accu",
      "rule": "source.name = target.name AND source.age = target.age",
      "details": {
        "source": "source",
        "target": "target"
      },
      "metric": {
        "name": "accu"
      },
      "record": {
        "name": "missRecords"
      }
    }
  ]
}

In the backend, Griffin will first translate these rules into spark-sql rules like the ones below.

Griffin can also support spark-sql rules directly:

"evaluate.rule": {
  "rules": [
    {
      "dsl.type": "spark-sql",
      "name": "missRecords",
      "rule": "SELECT source.* FROM source LEFT JOIN target ON coalesce(source.name, '') = coalesce(target.name, '') AND coalesce(source.age, '') = coalesce(target.age, '') WHERE (NOT (source.name IS NULL AND source.age IS NULL)) AND (target.name IS NULL AND target.age IS NULL)",
      "record": {
        "name": "missRecords"
      }
    },
    {
      "dsl.type": "spark-sql",
      "name": "miss_count",
      "rule": "SELECT count(*) as miss FROM `missRecords`"
    },
    {
      "dsl.type": "spark-sql",
      "name": "total_count",
      "rule": "SELECT count(*) as total FROM source"
    },
    {
      "dsl.type": "spark-sql",
      "name": "accu",
      "rule": "SELECT `total_count`.`total` AS `total`, coalesce(`miss_count`.`miss`, 0) AS `miss`, (`total` - `miss`) AS `matched` FROM `total_count` FULL JOIN `miss_count`",
      "metric": {
        "name": "accu"
      }
    }
  ] 
}

In the backend, Griffin will execute the SQL statements, and persist the metrics and records if configured.
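Profiling requests plug into the same structure; for instance, the null value detection from the old DSL could be configured as a spark-sql rule entry in the "rules" array (a sketch following the 0.2.0 structure above; the rule and metric names are illustrative):

{
  "dsl.type": "spark-sql",
  "name": "id_null_count",
  "rule": "SELECT count(*) AS `id_null_count` FROM source WHERE id IS NULL",
  "metric": {
    "name": "id_null_count"
  }
}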


Griffin DSL of version 0.2.1

In the DQ config file of the new Griffin version 0.2.1, the "evaluateRule" field is modified slightly, wrapping the "metric" and "record" fields inside an "out" array:

"evaluateRule": {
  "rules": [
    {
      "dsl.type": "griffin-dsl",
      "dq.type": "accuracy",
      "name": "accu",
      "rule": "source.name = target.name AND source.age = target.age",
      "details": {
        "source": "source",
        "target": "target"
      },
      "out": [
        {
          "type": "metric",
          "name": "accu"
        },
        {
          "type": "record",
          "name": "missRecords"
        }
      ]
    }
  ]
}
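Spark-sql rules use the same layout; for example, the final accu rule from the 0.2.0 example would carry its metric output in the "out" array (a sketch based on the structures above):

{
  "dsl.type": "spark-sql",
  "name": "accu",
  "rule": "SELECT `total_count`.`total` AS `total`, coalesce(`miss_count`.`miss`, 0) AS `miss`, (`total` - `miss`) AS `matched` FROM `total_count` FULL JOIN `miss_count`",
  "out": [
    {
      "type": "metric",
      "name": "accu"
    }
  ]
}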

 

Griffin Measure Process Design

For the spark-sql type, we will collect the data sources first, calculate through Spark SQL, then generate the results.

For the griffin-dsl type, we will translate the rules into spark-sql type rules, then follow the spark-sql type process.

The Griffin process will be as follows:

The data connector will generate data frames from the data source configuration.

The rule adaptor will generate Spark SQL statements from the griffin-dsl or spark-sql rules.

Spark SQL will be the calculation engine, executing the SQL tasks step by step.

The result generator will collect the DQ results from the Spark SQL outputs, including metrics and records.

A metric is the result of a DQ measurement, such as {"total": 100, "miss": 2, "matched": 98} for accuracy. It is always small data, so users can persist it in any way.

Records are the detailed records produced during the DQ calculation, such as the missing records of accuracy. They are typically large, so by default they are only persisted on HDFS.

 

Conclusion

The Spark SQL engine supports more types of DQ problems and is continuously improved by the Spark community. Griffin will focus on the DQ dimensions, supporting more DQ requirements as well as streaming processes and multiple data source types.
