The Griffin measure module needs to support multiple types of DQ measurements, so the DSL also needs to cover most DQ requests.
In the old version of Griffin, the measure DSL works like the WHERE clause of a SQL statement. It supports accuracy well, but for profiling use cases it cannot describe most DQ requests.
For example, in the old version, the DSL for an accuracy use case looks like this:
$source.id = $target.id AND $source.name = $target.name AND $source.age > $target.age
And the DSL for a null value detection use case looks like this:
$source.id = null
But an enum value detection use case has to be expressed as multiple rule statements:
$source.color = "RED"; $source.color = "BLUE"; $source.color = "YELLOW"; $source.color NOT IN ("RED", "BLUE", "YELLOW")
Such a DSL is hard to use for describing profiling rules or other DQ requests.
Therefore, we want to support spark-sql syntax directly as one type of Griffin DSL.
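For instance, with spark-sql the enum value detection above can be expressed as a single statement instead of four rules (a sketch, assuming the source data is registered as a table named source):
SELECT color, COUNT(*) AS cnt FROM source GROUP BY color
One pass returns the count for every color value, so both the expected enum values and any unexpected ones appear directly in the result.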
Griffin DSL in the new version
In the Griffin job configuration file, the "evaluateRule" field for the griffin-dsl type looks like this:
"evaluateRule": { "dsl.type": "griffin-dsl", "rules": [ "$source.name = $target.name AND $source.age = $target.age" ] }
In the backend, Griffin first translates these rules into spark-sql, like the rules shown below.
The "evaluateRule" field can also carry spark-sql directly:
"evaluateRule": { "dsl.type": "spark-sql", "rules": [ "SELECT COUNT(*) FROM $source LEFT JOIN $target ON coalesce($source.name, 'null') = coalesce($target.name, 'null') AND coalesce($source.age, 'null') = coalesce($target.age, 'null') WHERE (NOT ($source.name IS NULL AND $source.age IS NULL)) AND ($target.name IS NULL AND $target.age IS NULL)", "SELECT COUNT(*) FROM $source" ] }
In the backend, Griffin replaces $source and $target with the real table names and generates the final spark-sql statements.
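For example, assuming the source and target data frames are registered as temp tables named source and target (illustrative names), the first rule above becomes:
SELECT COUNT(*) FROM source LEFT JOIN target ON coalesce(source.name, 'null') = coalesce(target.name, 'null') AND coalesce(source.age, 'null') = coalesce(target.age, 'null') WHERE (NOT (source.name IS NULL AND source.age IS NULL)) AND (target.name IS NULL AND target.age IS NULL)
This statement counts the source records that fail to match any target record, i.e. the miss count; combined with the total count from the second rule, the accuracy result can be derived.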
Griffin Measure Process Design
For the spark-sql type, we collect the data sources first, calculate through Spark SQL, then generate the results.
For the griffin-dsl type, we first translate the rules into spark-sql type rules, then follow the spark-sql process.
The Griffin process works as follows (a minimal sketch of the pipeline appears after this list):
The data connector generates a data frame from the data source configuration.
The rule adaptor generates Spark SQL commands from griffin-dsl or spark-sql rules.
Spark SQL acts as the calculation engine and executes the SQL tasks.
The result generator produces the DQ dimension results from the Spark SQL output.
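To make these steps concrete, here is a minimal sketch in Scala of how the pipeline could be wired together on Spark. The file paths, view names, and result handling are illustrative assumptions, not Griffin's actual implementation; the SQL strings are the translated rules from the accuracy example above.

import org.apache.spark.sql.{DataFrame, SparkSession}

object DqPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dq-measure-sketch").getOrCreate()

    // 1. Data connector: build data frames from the data source configuration
    //    (paths are hypothetical; real connectors may read Hive, Avro, Kafka, etc.).
    val source: DataFrame = spark.read.json("hdfs:///data/source")
    val target: DataFrame = spark.read.json("hdfs:///data/target")
    source.createOrReplaceTempView("source")
    target.createOrReplaceTempView("target")

    // 2. Rule adaptor output: griffin-dsl rules already translated to spark-sql,
    //    with $source/$target replaced by the temp view names.
    val missRule =
      """SELECT COUNT(*) FROM source LEFT JOIN target
        |ON coalesce(source.name, 'null') = coalesce(target.name, 'null')
        |AND coalesce(source.age, 'null') = coalesce(target.age, 'null')
        |WHERE (NOT (source.name IS NULL AND source.age IS NULL))
        |AND (target.name IS NULL AND target.age IS NULL)""".stripMargin
    val totalRule = "SELECT COUNT(*) FROM source"

    // 3. Spark SQL as the calculation engine: execute the sql tasks.
    val missCount = spark.sql(missRule).collect().head.getLong(0)
    val totalCount = spark.sql(totalRule).collect().head.getLong(0)

    // 4. Result generator: derive the DQ dimension (accuracy) from the raw counts.
    val matchedCount = totalCount - missCount
    println(s"accuracy: $matchedCount / $totalCount")

    spark.stop()
  }
}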
Conclusion
The Spark SQL engine supports more types of DQ requests and is continuously maintained by the Spark community. This lets Griffin focus on the DQ dimensions themselves, supporting more DQ requirements as well as streaming processes and multiple data source types.