What Griffin should focus on

Griffin is a generic framework that enables users to measure and monitor data quality in an easy and extensive manner.

Old Griffin architecture problems

What are the pain points in the current version of Griffin from an architectural perspective?

  1. To have an abstraction layer over the query engine
    The current Spark implementation limits Griffin's adoption, since a company may use a different query engine at the company level. Setting up a Spark environment just to run Griffin is too heavy.

  2. To support a common data quality workflow: measure-evaluate-alert
    In most real scenarios, measuring alone is not the goal; the full measure-evaluate-alert workflow is.
      Users should be able to define a use case that includes:
        1. measuring a data quality metric
        2. evaluating the data quality trigger
        3. defining the alerting action
        4. integrating the three steps into a single job/UoW (unit of work), run by a scheduler
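The two pain points above can be sketched together as follows. This is only an illustration, not Griffin's actual API: the names `QueryEngine`, `InMemoryEngine`, `UseCase`, the metric string, and the 0.05 threshold are all hypothetical. It shows a use case bundling the measure, evaluate, and alert steps into one unit of work that runs against any engine implementing a small interface.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class QueryEngine(Protocol):
    """Hypothetical engine abstraction: anything that can compute a metric."""

    def measure(self, metric_query: str) -> float: ...


class InMemoryEngine:
    """Toy stand-in for Spark, Hive, etc.; returns precomputed metric values."""

    def __init__(self, values: dict[str, float]) -> None:
        self._values = values

    def measure(self, metric_query: str) -> float:
        return self._values[metric_query]


@dataclass
class UseCase:
    """One unit of work covering measure, evaluate, and alert."""

    name: str
    metric_query: str                     # step 1: what to measure
    trigger: Callable[[float], bool]      # step 2: when the value is a problem
    alert: Callable[[str, float], None]   # step 3: what to do about it

    def run(self, engine: QueryEngine) -> bool:
        """Step 4: run all three steps as one job; True if an alert fired."""
        value = engine.measure(self.metric_query)
        if self.trigger(value):
            self.alert(self.name, value)
            return True
        return False


# Illustrative use case: alert when more than 5% of customer ids are null.
alerts: list[str] = []
use_case = UseCase(
    name="orders_null_rate",
    metric_query="null_rate(orders.customer_id)",
    trigger=lambda v: v > 0.05,
    alert=lambda name, v: alerts.append(f"{name} breached: {v}"),
)
fired = use_case.run(InMemoryEngine({"null_rate(orders.customer_id)": 0.12}))
```

Because `UseCase.run` depends only on the `measure` method, a scheduler could invoke the same use case against a different engine without changing the definition.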

Next generation Griffin architecture considerations

One mission of Griffin is to reduce MTTD (mean time to detect). With that in mind:

  • During the define phase, the next generation architecture should use more expressive rules to define data quality requirements. SQL-based rules are a good candidate because they are both abstract and concrete: abstract enough that the same rule can be dispatched to different query engines, and concrete enough that all data quality stakeholders can understand the rules and align on them easily.
  • During the define phase, data quality should be defined uniformly across different scenarios such as batch, near-realtime, and realtime.
  • During the measure phase, the next generation Griffin should standardize measure pipelines into distinct stages: a recording stage, a checking stage, and an alerting stage. This makes it easy for different data platform teams to integrate with Griffin at any of these stages.
  • During the measure phase, the next generation Griffin should not be coupled to any particular query engine; it should be able to dispatch requests to different query engines (Spark, Hive, Flink, Presto) depending on the data quality rule.
  • During the measure phase, the next generation Griffin should support different scheduling strategies, such as event-based triggers or time-based triggers.
  • During the analyze phase, the next generation Griffin should provide standardized solutions, such as anomaly detection algorithms, since in most cases the related stakeholders need our support to define what counts as an anomaly.
  • Last but not least, the next generation Griffin should provide data quality reports/scorecards for requirements at different levels.
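As a rough sketch of the define and measure considerations above, the example below expresses a rule as plain SQL, dispatches it to a named engine, and passes the result through recording, checking, and alerting stages. Everything here is an assumption made for illustration: the `Rule` fields, the `engines` registry, the `run_pipeline` function, and the `orders` table are hypothetical, and SQLite stands in for Spark, Hive, Flink, or Presto only so that the example runs on its own.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical SQL-based data quality rule."""

    name: str
    sql: str          # query expected to return a single numeric value
    engine: str       # name of the engine the rule is dispatched to
    max_value: float  # checking threshold: larger values fail the check


def make_sqlite_engine(conn: sqlite3.Connection) -> Callable[[str], float]:
    """Wrap a SQLite connection as an engine: SQL text in, one number out."""

    def run(sql: str) -> float:
        row = conn.execute(sql).fetchone()
        return float(row[0])

    return run


def run_pipeline(rule, engines, metrics, alerts) -> bool:
    """Run one rule through the three stages; returns True if the check passed."""
    value = engines[rule.engine](rule.sql)  # dispatch to the chosen engine
    metrics.append((rule.name, value))      # recording stage
    passed = value <= rule.max_value        # checking stage
    if not passed:                          # alerting stage
        alerts.append(f"{rule.name}: {value} exceeds {rule.max_value}")
    return passed


# Sample data: two of the four orders are missing a customer id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10), (2, None), (3, 30), (4, None)],
)

engines = {"sqlite": make_sqlite_engine(conn)}
rule = Rule(
    name="orders_missing_customer",
    sql="SELECT COUNT(*) - COUNT(customer_id) FROM orders",
    engine="sqlite",
    max_value=0,
)

metrics: list[tuple[str, float]] = []
alerts: list[str] = []
passed = run_pipeline(rule, engines, metrics, alerts)
```

Keeping the stages as separate steps means a platform team could, for example, consume only the recorded metrics and apply its own checking or alerting logic.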

Next generation Griffin architecture proposal



