...

  1. To provide an abstraction layer over query engines
    The current Spark-only implementation limits Griffin's adoption, since a company may standardize on other query engines at the company level. It is too heavy to set up a Spark environment just to run Griffin.

  2. To support a common data quality workflow: collect-evaluate-alert
    In most real scenarios, measuring by itself is not the goal; the collect-evaluate-alert workflow is.
      A user should be able to define a use case, including:
        1. to collect data quality metrics
        2. to evaluate the data quality
        3. to define the alerting action
        4. to integrate the three steps into a single job/UoW (by scheduler)
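The four steps above can be sketched as a single job. This is an illustrative sketch only, not Griffin's actual API; the names (`Metric`, `collect`, `run_job`, the 0.05 threshold) are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical sketch of the collect-evaluate-alert workflow; all names
# and values here are illustrative, not part of Griffin's API.

@dataclass
class Metric:
    name: str
    value: float

def collect() -> Metric:
    # Step 1: collect a data quality metric, e.g. a null rate
    # computed by whichever query engine backs the dataset.
    return Metric("null_rate", 0.03)

def evaluate(metric: Metric, threshold: float = 0.05) -> bool:
    # Step 2: evaluate the metric against the data quality requirement.
    return metric.value <= threshold

def alert(metric: Metric) -> None:
    # Step 3: the alerting action, e.g. notifying stakeholders.
    print(f"ALERT: {metric.name}={metric.value} violates the requirement")

def run_job() -> bool:
    # Step 4: integrate the three steps into one job/UoW,
    # to be invoked by a scheduler.
    metric = collect()
    ok = evaluate(metric)
    if not ok:
        alert(metric)
    return ok
```

In a real deployment the scheduler (event- or time-triggered, as discussed below) would invoke `run_job` per use case.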

...

  • During the define phase, the next generation architecture should use more expressive rules to define data quality requirements. A SQL-based rule is a good candidate for defining data quality: it is abstract, yet also concrete. It is abstract enough that we can dispatch data quality rules to different query engines, and concrete enough that all data quality stakeholders can understand the rules and align on them easily.
  • During the define phase, data quality should be uniformly defined across different scenarios such as batch, near-realtime, and realtime.
  • During the measure phase, the next generation Griffin should standardize data quality pipelines into distinct stages: a collecting stage, an evaluating stage, and an alerting stage. This makes it easy for different data platform teams to integrate with Griffin at different stages.
  • During the collect phase, the next generation Griffin should not be coupled to any particular query engine; it should be able to dispatch/route requests to different query engines (Spark, Hive, Flink, Presto) based on the data quality rules or data catalogs involved.
  • During the evaluate phase, the next generation Griffin should support different schedule strategies, such as event-based triggers or time-based triggers.
  • During the evaluate phase, the next generation Griffin should provide standardized solutions such as anomaly detection algorithms, since in most cases the related stakeholders need our support to define what counts as an anomaly.
  • Last but not least, the next generation Griffin should provide data quality reports/scorecards for requirements at different levels.
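Because a SQL-based rule is engine-agnostic, the dispatch/route idea above can be sketched as a simple mapping from catalog to engine. This is a hypothetical sketch under assumed names (`SQL_RULE`, `dispatch`, the per-engine runner functions); it is not Griffin's actual dispatching code.

```python
from typing import Callable, Dict

# A hypothetical SQL-based data quality rule: count null order ids.
SQL_RULE = "SELECT COUNT(*) FROM orders WHERE order_id IS NULL"

# Stand-in runners; a real implementation would submit the SQL to the
# engine's client (Spark session, Presto/Trino connection, etc.).
def run_on_spark(sql: str) -> str:
    return f"spark://{sql}"

def run_on_presto(sql: str) -> str:
    return f"presto://{sql}"

ENGINES: Dict[str, Callable[[str], str]] = {
    "spark": run_on_spark,
    "presto": run_on_presto,
}

def dispatch(sql: str, catalog_engine: str) -> str:
    # Route the same SQL rule to whichever engine backs the data catalog,
    # so the rule definition stays decoupled from any particular engine.
    return ENGINES[catalog_engine](sql)
```

The rule text never changes; only the routing decision does, which is what keeps the define phase decoupled from the measure phase.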

...