By following this tutorial, you will be able to set up a Griffin dev environment and go through the whole Griffin data quality process as below:
- explore data assets,
- create measures,
- schedule measures,
- execute measures in compute clusters and emit metrics
- navigate metrics in dashboard.
Dev dependencies
Java:
We prefer Java 8, but Java 7 is fine for us.
Maven:
Prerequisite version is 3.2.5
Scala:
Prerequisite version is 2.10
Angular:
We are using 1.5.8
Node:
We are using 6.0.0+
Bower:
npm install -g bower
Env dependencies
Hadoop
Prerequisite version is 2.6.0
Hive
Prerequisite version is 1.2.1
Spark
Prerequisite version is 1.6.x
MySQL
Prerequisite version is 5.0
Elasticsearch
Prerequisite version is 5.x.x
Make sure you can access your Elasticsearch instance via the HTTP protocol.
Livy
Griffin submits jobs to Spark via Livy (http://livy.io/quickstart.html).
#livy has a bug (https://issues.cloudera.org/browse/LIVY-94), so we need to put these three jars into HDFS:
#  datanucleus-api-jdo-3.2.6.jar
#  datanucleus-core-3.2.10.jar
#  datanucleus-rdbms-3.2.9.jar
#Then you need to add the following configuration to the JSON when submitting Spark jobs through Livy:
"jars": [
  "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar",
  "hdfs:///livy/datanucleus-core-3.2.10.jar",
  "hdfs:///livy/datanucleus-rdbms-3.2.9.jar"
]
#We need Hive accessible in the Spark application, which requires some configuration when submitting through Livy.
#livy.conf
livy.repl.enableHiveContext = true
#hive-site.xml should be put into HDFS, to be accessible to the Spark cluster. For example:
#  hdfs:///livy/hive-site.xml
#Then you need to add the following configuration to the JSON when submitting Spark jobs through Livy:
"conf": {
  "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml"
}
#or like this
"files": [ "hdfs:///livy/hive-site.xml" ]
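Putting the pieces above together, a complete Livy batch body might look like the following sketch. The jar name, working-directory paths, and the className (taken from the measure module's main class mentioned later in this guide) are examples; adjust them to your environment. The snippet writes the body to a file and sanity-checks that it is valid JSON before you POST it to Livy's /batches endpoint:

```shell
# Hypothetical Livy batch body combining the datanucleus-jar and hive-site.xml
# workarounds above. All HDFS paths here are examples; adjust to your cluster.
cat > livy-batch-body.json <<'EOF'
{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "jars": [
    "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar",
    "hdfs:///livy/datanucleus-core-3.2.10.jar",
    "hdfs:///livy/datanucleus-rdbms-3.2.9.jar"
  ],
  "conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
}
EOF
# Validate the body as JSON before submitting it to http://<LIVY-IP>:8998/batches
python -m json.tool livy-batch-body.json > /dev/null && echo "valid JSON"
```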
Setup Dev Env
Git clone
git clone https://github.com/apache/incubator-griffin.git
Project layout
There are three modules in Griffin:
measure : core algorithms for calculating metrics along different measure dimensions.
#main class: org.apache.griffin.measure.Application
service : web service for data assets, measure metadata, and job schedulers.
#Spring Boot app: org.apache.griffin.core.GriffinWebApplication
ui : front end
Update several files to reflect your dev env
Create a Griffin working directory in HDFS:
hdfs dfs -mkdir -p <griffin working dir>
Initialize the Quartz tables using service/src/main/resources/Init_quartz.sql:
mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql
update service/src/main/resources/application.properties
spring.datasource.url = jdbc:mysql://<MYSQL-IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>
hive.metastore.uris = thrift://<HIVE-IP>:9083
hive.metastore.dbname = <hive database name>   # default is "default"
Update measure/src/main/resources/env.json with your Elasticsearch instance, and copy env.json to the Griffin working directory in HDFS.
/* Please update to your Elasticsearch instance */
"api": "http://<ES-IP>:9200/griffin/accuracy"
update service/src/main/resources/sparkJob.properties file
sparkJob.file = hdfs://<griffin working directory>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin working directory>/env.json
sparkJob.jars_1 = hdfs://<pathTo>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<pathTo>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<pathTo>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<LIVY-IP>:8998/batches
update ui/js/services/service.js
#make sure you can access ES over HTTP
ES_SERVER = "http://<ES-IP>:9200"
Build
cd incubator-griffin
mvn clean install -DskipTests

#cp jars to the hdfs griffin working dir
cp measure/target/measure-0.1.3-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin working dir>
Run
#Please find the service jar with the version in the target folder.
java -jar service/target/service.xxx.jar
#then open in your browser
#http://<YOUR-IP>:8080
Run with data prepared
Click "Data Assets" at the top right corner to see all the existing data assets.
We've prepared two data assets in https://github.com/apache/incubator-griffin/tree/master/docker/griffin_demo/prep/data. Load that data into Hive, and you will see all the table metadata in Hive. Click the "Measures" button at the top left corner to see all the measures; you can also create a new DQ measurement by following these steps.
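As an illustration, loading the demo data into Hive could look like the following sketch. The column list matches the id/age/desc fields mapped in the steps below, and the dt/hour partition fields match the YYYYMMdd-HH pattern used by the jobs; the file format, field delimiter, and local path are assumptions, so adapt them to the actual demo files.

```sql
-- Hypothetical DDL/load for the demo source table; demo_tgt is analogous.
-- `desc` is backquoted because it collides with a Hive keyword.
CREATE TABLE IF NOT EXISTS demo_src (
  id     BIGINT,
  age    INT,
  `desc` STRING
)
PARTITIONED BY (dt STRING, hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA LOCAL INPATH 'demo_src/delta_src'
INTO TABLE demo_src PARTITION (dt = '20180105', hour = '10');
```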
- Click the "Create Measure" button at the top left corner and choose the top left block "Accuracy"; currently we only support the accuracy type.
- Choose Source: find "demo_src" in the left tree, select some or all attributes in the right block, and click "Next".
- Choose Target: find "demo_tgt" in the left tree, select the attributes matching the source data asset in the right block, and click "Next".
- Mapping Source and Target: select the "Source Fields" of each row to match the corresponding field in the target table, e.g. id maps to id, age maps to age, desc maps to desc.
- Finish all the mapping and click "Next".
- Fill in the required fields; "Organization" is the group of this measurement.
- Submit and save; you can see your new DQ measurement in the measures list.
Now that you've created a new DQ measurement, it needs to be scheduled to run in the docker container. Click the "Jobs" button to see all the jobs; currently there is no job, so you need to create a new one. Click the "Create Job" button at the top left corner and fill in all the blocks as below.
"Source Partition": YYYYMMdd-HH
"Target Partition": YYYYMMdd-HH
"Measure Name": <choose the measure you just created>
"Start After(s)": 0
"Interval": 300
The source and target partitions describe the partition pattern of the demo data, which is based on timestamp. "Start After(s)" means the job will start after n seconds, and "Interval" is the job interval in seconds. In the example above, the job will run every 5 minutes.
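As a concrete illustration, the YYYYMMdd-HH pattern corresponds to the dt/hour partitions of the demo tables, and with GNU date you can compute which partition a given timestamp falls into (the epoch value below is just an example):

```shell
# Map an epoch timestamp to the YYYYMMdd-HH partition pattern (requires GNU date).
ts=1515146471                                   # example: 2018-01-05 10:01:11 UTC
echo "dt=$(date -u -d "@${ts}" +%Y%m%d) hour=$(date -u -d "@${ts}" +%H)"
echo "partition=$(date -u -d "@${ts}" +%Y%m%d-%H)"
```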
Wait for about 1 minute; after the calculation, results will be published to the web UI, and you can view the dashboard by clicking "DQ Metrics" at the top right corner.
License Header File
Each source file should include the following Apache License header:
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
12 Comments
neil wang
Hi William:
Does the Hive version have to be 1.2.1 or newer? When I use hive-1.1.0-cdh5.8.2, I can submit jobs to the YARN cluster, but the error says the tables don't exist.
thanks.
William Guo
Could you show us the error log?
neil wang
Hi William:
The error log is as follows:
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xy180-wecloud-85:46076 (size: 1725.0 B, free: 1060.0 MB)
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:11 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1145 ms on xy180-wecloud-85 (1/1)
18/01/05 18:01:11 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/01/05 18:01:11 INFO DAGScheduler: ResultStage 0 (collect at RuleAdaptorGroup.scala:41) finished in 1.152 s
18/01/05 18:01:11 INFO DAGScheduler: Job 0 finished: collect at RuleAdaptorGroup.scala:41, took 1.365230 s
18/01/05 18:01:11 INFO Application$: process init success
18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.20.180.89:46719 in memory (size: 1725.0 B, free: 1036.5 MB)
18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xy180-wecloud-85:46076 in memory (size: 1725.0 B, free: 1060.0 MB)
18/01/05 18:01:11 INFO ContextCleaner: Cleaned accumulator 1
18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dw.expert_std_df
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dw tbl=expert_std_df
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dw tbl=expert_std_df
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471675 end=1515146471689 duration=14 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dw.expert_std_df fails: Table not found: `wedw_dw`.`expert_std_df`; line 1 pos 22
18/01/05 18:01:11 WARN DataSource: load data source [source] fails
18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dwd.expert_basic_dz
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dwd tbl=expert_basic_dz
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dwd tbl=expert_basic_dz
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471725 end=1515146471726 duration=1 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dwd.expert_basic_dz fails: Table not found: `wedw_dwd`.`expert_basic_dz`; line 1 pos 23
18/01/05 18:01:11 WARN DataSource: load data source [target] fails
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_tables from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_tables start=1515146471849 end=1515146471865 duration=16 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=0 error=false>
18/01/05 18:01:11 INFO Application$: process run success
18/01/05 18:01:12 INFO SparkUI: Stopped Spark web UI at http://10.20.180.89:54831
18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
18/01/05 18:01:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/05 18:01:12 INFO MemoryStore: MemoryStore cleared
18/01/05 18:01:12 INFO BlockManager: BlockManager stopped
18/01/05 18:01:12 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/05 18:01:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
William Guo
Where do you get the table wedw_dwd.expert_basic_dz?
Did you create it via the UI, or did you specify the table manually in the JSON?
Lionel Liu, could you look at this?
Thanks,
William
neil wang
Hi William,
I have tried both ways, and the errors are the same. I can query the table via Hive or SparkSQL.
thanks,
Neil
Lionel Liu
Hi Neil,
You mean you've tried submitting the job directly using measure.jar as in https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure-batch-sample.md, right? That runs via the spark-submit command directly, not through Livy.
The issue seems to be that the Spark application cannot access your Hive tables. There are a few things to check.
First, check whether the Spark context has successfully generated the Hive context as the SQL context; you can find that in the log. If it fails, you need to confirm that Spark can access hive-site.xml.
Second, Spark should be able to access Hive; you can check this in spark-shell to test whether it can access this specific table.
Third, if you can access this Hive table in spark-shell, you can try to submit the Griffin job via the spark-submit command directly; it should behave the same as the spark-shell way. Note that if you submit a Spark job through Livy, it runs in cluster mode, not client mode, so hive-site.xml needs to be in HDFS, and the Livy conf file should set livy.repl.enableHiveContext = true, like this: https://github.com/bhlx3lyx7/griffin-docker/blob/master/griffin_env/conf/livy/livy.conf
Hope this can help you.
Thank,
Lionel Liu
neil wang
Hi Lionel,
Thanks for your reply.
I have submitted the Griffin job via the spark-submit command successfully; in that way I need to set '--files <hdfs://hive-site.xml>'. If I want to submit jobs using the Griffin web UI in cluster mode, how do I configure hive-site.xml?
Another question: if the partition field of the Hive table is not 'dt', how can I change the field using the Griffin web UI? I can change it by editing config.json, but that seems inconvenient.
Thanks
Neil Wang
neil wang
Hi Lionel,
Please ignore the first question; I resolved it by putting hive-site.xml into Hadoop's config folder.
thanks
Lionel Liu
Hi Neil,
Your question hits the point. The web UI can only deal with 'dt' and 'hour' partitions, so it cannot be a full solution. We've enhanced the scheduler in the service and web UI; you will see it in the next version later this month.
Thanks,
Lionel Liu
neil wang
Hi Lionel,
Our team is interested in Griffin and we look forward to the next version.
To test the web UI, I created two tables partitioned by 'dt' and 'hour'. I can see the computed result in ES, but I can't see the dashboard. Can you offer me some suggestions?
Thanks,
Neil Wang
Lionel Liu
Great, I think you've mostly succeeded. For the latest version 0.1.6, we didn't update the UI part, so it doesn't completely fit the backend; it gets metrics from ES directly, so the client needs to access ES by modifying ES_SERVER in service.service.ts.
We will update the UI in our new version; the metrics will then be fetched through the service module, which will be a better solution.
Lionel Liu
Hi neil wang,
Griffin is going to release a new version, 0.2.0, with a job scheduling process in the UI and some enhancements to the measure engine.
You could try our new docker image `bhlx3lyx7/svr_msr:0.2.0` as this guide document https://github.com/apache/incubator-griffin/blob/master/griffin-doc/docker/griffin-docker-guide.md
Your team might have some data quality requirements. What are your use cases? I think we can have a talk about it. Would you like to schedule a meeting with us or talk via email through the dev list: dev@griffin.incubator.apache.org?
Thanks,
Lionel Liu