Through this tutorial, you will be able to build a Griffin dev environment and go through the whole Griffin data quality process as below:

  • explore data assets,
  • create measures,
  • schedule measures,
  • execute measures in compute clusters and emit metrics,
  • navigate metrics in dashboard.

Dev dependencies

Java :

We prefer Java 8, but Java 7 is fine for us.

Maven : 

Prerequisite version is 3.2.5

Scala

Prerequisite version is 2.10

Angular

We are using 1.5.8

Node

We are using 6.0.0+

Bower

 npm install -g bower


Env dependencies

Hadoop

Prerequisite version is 2.6.0

Hive

Prerequisite version is 1.2.1

Spark

Prerequisite version is 1.6.x

Mysql

Prerequisite version is 5.0

Elasticsearch

Prerequisite version is 5.x.x

Make sure your Elasticsearch instance is accessible over HTTP.


Livy

Griffin submits jobs to Spark through Livy ( http://livy.io/quickstart.html )

#Livy has a bug (https://issues.cloudera.org/browse/LIVY-94), so we need to put these three jars into hdfs
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
#Then add the following configuration to the json when submitting spark jobs through livy.
"jars": [ "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar", "hdfs:///livy/datanucleus-core-3.2.10.jar", "hdfs:///livy/datanucleus-rdbms-3.2.9.jar" ]

#We need hive to be accessible in the spark application, which requires some configuration when submitting through livy.
#livy.conf
livy.repl.enableHiveContext = true
#hive-site.xml should be put into hdfs so it is accessible to the spark cluster. For example, we can put it as below:
hdfs:///livy/hive-site.xml
#Then add the following configuration to the json when submitting spark jobs through livy.
"conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
#or like this
"files": [ "hdfs:///livy/hive-site.xml" ]


Setup Dev Env

Git clone

git clone https://github.com/apache/incubator-griffin.git

Project layout

There are three modules in Griffin:

measure : core algorithms for calculating metrics along different measure dimensions.

#app
org.apache.griffin.measure.Application

service : web service for data assets, measure metadata, and job scheduling.

#spring boot app
org.apache.griffin.core.GriffinWebApplication

ui : front end

Update several files to reflect your dev env

create a griffin working directory in hdfs
hdfs dfs -mkdir -p <griffin working dir>
init quartz tables by service/src/main/resources/Init_quartz.sql
mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql


update service/src/main/resources/application.properties
spring.datasource.url = jdbc:mysql://<MYSQL-IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>

hive.metastore.uris = thrift://<HIVE-IP>:9083
hive.metastore.dbname = <hive database name>    # default is "default"
update measure/src/main/resources/env.json with your Elasticsearch instance, and copy env.json to the griffin working directory in hdfs.
/*Please update with your Elasticsearch instance*/
"api": "http://<ES-IP>:9200/griffin/accuracy"

update service/src/main/resources/sparkJob.properties file
sparkJob.file = hdfs://<griffin working directory>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin working directory>/env.json
sparkJob.jars_1 = hdfs://<pathTo>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<pathTo>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<pathTo>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<LIVY-IP>:8998/batches


update ui/js/services/service.js
#make sure you can access ES over http
ES_SERVER = "http://<ES-IP>:9200"

Build


cd incubator-griffin
mvn clean install -DskipTests
#cp jars to hdfs griffin working dir
cp measure/target/measure-0.1.3-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin working dir>

Run

#Please find the service jar with its version in the target folder.
java -jar service/target/service.xxx.jar
#open it in your browser
http://<YOUR-IP>:8080


Run with data prepared

  1. Click "Data Assets" at the top right corner to view all existing data assets.
    We've prepared two data assets in https://github.com/apache/incubator-griffin/tree/master/docker/griffin_demo/prep/data; load that data into Hive, and you can see all the table metadata in Hive.

  2. Click the "Measures" button at the top left corner to view all measures; you can also create a new DQ measurement with the following steps.

    1. Click the "Create Measure" button at the top left corner and choose the top left block "Accuracy"; at present we only support the accuracy type.
    2. Choose Source: find "demo_src" in the left tree, select some or all attributes in the right block, and click "Next".
    3. Choose Target: find "demo_tgt" in the left tree, select the attributes matching the source data asset in the right block, and click "Next".
    4. Map Source to Target: select the "Source Fields" of each row to match the corresponding field in the target table, e.g. id maps to id, age maps to age, desc maps to desc.
      Finish all the mappings and click "Next".
    5. Fill out the required fields; "Organization" is the group of this measurement.
      Submit and save, and you will see your new DQ measurement in the measures list.
  3. Now that you've created a new DQ measurement, it needs to be scheduled to run in the docker container. Click the "Jobs" button to view all jobs; at present there are no jobs, so you need to create a new one. Click the "Create Job" button at the top left corner and fill out all the fields as below.

    "Source Partition": YYYYMMdd-HH
    "Target Partition": YYYYMMdd-HH
    "Measure Name": <choose the measure you just created>
    "Start After(s)": 0
    "Interval": 300
    

    "Source Partition" and "Target Partition" specify the partition pattern of the demo data, which is based on the timestamp; "Start After(s)" means the job will start after n seconds; "Interval" is the job interval in seconds. In the example above, the job will run every 5 minutes.

    Wait about 1 minute; after the calculation, the results will be published to the web UI, and you can view the dashboard by clicking "DQ Metrics" at the top right corner.
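As a quick illustration, the partition pattern YYYYMMdd-HH used above corresponds to the strftime format %Y%m%d-%H; with GNU date (assumed here) you can see which hourly partition a given epoch timestamp falls into:

```shell
# YYYYMMdd-HH maps to the strftime format %Y%m%d-%H.
# With GNU date, print the partition for a fixed epoch second (UTC):
date -u -d @1515146471 +%Y%m%d-%H   # -> 20180105-10
```

On BSD/macOS date the equivalent invocation would use `-r 1515146471` instead of `-d @1515146471`.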

License Header File

Each source file should include the following Apache License header

Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.



12 Comments

  1. Hi William:
    Must the Hive version be 1.2.1 or newer? When I use hive-1.1.0-cdh5.8.2, I can submit jobs to the yarn cluster, but the error says the tables don't exist.

    thanks.


    1. could you show us the error log?


      1. Hi William:

        The error log is as follows:

        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xy180-wecloud-85:46076 (size: 1725.0 B, free: 1060.0 MB)
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
        18/01/05 18:01:11 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1145 ms on xy180-wecloud-85 (1/1)
        18/01/05 18:01:11 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
        18/01/05 18:01:11 INFO DAGScheduler: ResultStage 0 (collect at RuleAdaptorGroup.scala:41) finished in 1.152 s
        18/01/05 18:01:11 INFO DAGScheduler: Job 0 finished: collect at RuleAdaptorGroup.scala:41, took 1.365230 s
        18/01/05 18:01:11 INFO Application$: process init success
        18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.20.180.89:46719 in memory (size: 1725.0 B, free: 1036.5 MB)
        18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xy180-wecloud-85:46076 in memory (size: 1725.0 B, free: 1060.0 MB)
        18/01/05 18:01:11 INFO ContextCleaner: Cleaned accumulator 1
        18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dw.expert_std_df
        18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
        18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dw tbl=expert_std_df
        18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dw tbl=expert_std_df
        18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471675 end=1515146471689 duration=14 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
        18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dw.expert_std_df fails: Table not found: `wedw_dw`.`expert_std_df`; line 1 pos 22
        18/01/05 18:01:11 WARN DataSource: load data source [source] fails
        18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dwd.expert_basic_dz
        18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
        18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dwd tbl=expert_basic_dz
        18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dwd tbl=expert_basic_dz
        18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471725 end=1515146471726 duration=1 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
        18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dwd.expert_basic_dz fails: Table not found: `wedw_dwd`.`expert_basic_dz`; line 1 pos 23
        18/01/05 18:01:11 WARN DataSource: load data source [target] fails
        18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_tables from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
        18/01/05 18:01:11 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
        18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
        18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_tables start=1515146471849 end=1515146471865 duration=16 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=0 error=false>
        18/01/05 18:01:11 INFO Application$: process run success
        18/01/05 18:01:12 INFO SparkUI: Stopped Spark web UI at http://10.20.180.89:54831
        18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Shutting down all executors
        18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
        18/01/05 18:01:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
        18/01/05 18:01:12 INFO MemoryStore: MemoryStore cleared
        18/01/05 18:01:12 INFO BlockManager: BlockManager stopped
        18/01/05 18:01:12 INFO BlockManagerMaster: BlockManagerMaster stopped
        18/01/05 18:01:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!

        1. Where did you get the table wedw_dwd.expert_basic_dz?

          Did you create it via the UI, or did you specify the table in the json manually?

          Lionel Liu could you look at this?


          Thanks,

          William


          1. Hi William,

            I have tried both ways, and the errors are the same. I can query the table via Hive or SparkSQL.

            thanks,

            Neil

            1. Hi Neil,

              You mean you've tried submitting the job directly using measure.jar as in https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure-batch-sample.md, right? That runs via the spark-submit command directly, not through livy.

              The issue seems to be that the spark application cannot access your hive tables. There are a few things to check.

              First, check that the spark context has generated the hive context as its sql context successfully; you can find that in the log. If it fails, you need to confirm that spark can access hive-site.xml.

              Second, spark should be able to access hive; you can check in spark-shell whether it can access this specific table.

              Third, if you can access this hive table in spark-shell, try submitting the griffin job via the spark-submit command directly; it should behave the same as the spark-shell way. Note that if you submit the spark job through livy, it runs in cluster mode, not client mode, so hive-site.xml needs to be in hdfs, and the livy conf file should set livy.repl.enableHiveContext = true, like this: https://github.com/bhlx3lyx7/griffin-docker/blob/master/griffin_env/conf/livy/livy.conf


              Hope this can help you.


              Thanks,

              Lionel Liu


              1. Hi Lionel,

                Thanks for your reply.

                 I have submitted a griffin job successfully via spark-submit; that way I need to set '--files <hdfs://hive-site.xml>'. If I want to submit jobs using the Griffin Web UI in spark cluster mode, how do I configure hive-site.xml?

                 Another question: if the partition field of the hive table is not 'dt', how can I change the field using the Griffin Web UI? I can change it by editing config.json, but that seems inconvenient.

                Thanks

                Neil Wang


                1. Hi Lionel,

                  Please ignore the first question; I have resolved it by putting hive-site.xml into hadoop's config folder.

                  thanks

                  1. Hi Neil,

                    Your question hits the point. The web UI can currently only handle 'dt' and 'hour' partitions, so it is not a full solution. We've enhanced the scheduler in service and web UI; you will see it in the next version within this month.


                    Thanks,

                    Lionel Liu

                    1. Hi Lionel,

                      Our team is interested in Griffin and we look forward to the next version.

                      To test the web UI, I created two tables partitioned by 'dt' and 'hour'. I can see the result in ES after computing, but I can't see the dashboard. Can you offer me some suggestions?

                      Thanks,

                      Neil Wang


                      1. Great, I think you've mostly succeeded. In the latest version 0.1.6 we didn't update the UI part, so it doesn't completely fit the backend; it gets metrics from ES directly, so the client needs to access ES, by modifying ES_SERVER in service.service.ts.

                        We will update the UI in our new version; the metrics will be fetched through the service module, which will be a better solution.

                      2. Hi neil wang,

                        Griffin is going to release a new version, 0.2.0, with a job scheduling process in the UI and some enhancements to the measure engine.

                        You could try our new docker image `bhlx3lyx7/svr_msr:0.2.0` following this guide document: https://github.com/apache/incubator-griffin/blob/master/griffin-doc/docker/griffin-docker-guide.md

                        Your team might have some data quality requirements; what are your use cases? I think we can have a talk about it. Would you like to schedule a meeting with us, or talk by email through the dev list: dev@griffin.incubator.apache.org ?

                        Thanks,

                        Lionel Liu