Setup Development Env
Through this tutorial, you will be able to set up a Griffin dev environment and go through the whole Griffin data quality process as below:
- explore data assets,
- create measures,
- schedule measures,
- execute measures in compute clusters and emit metrics,
- navigate metrics in the dashboard.
Dev dependencies
Java
We prefer Java 8, but Java 7 also works.
Maven
You can download the latest Maven from http://maven.apache.org/maven/download.cgi. The prerequisite version is 3.2.5.
Scala
You can download Scala from https://www.scala-lang.org/download/install.html. The prerequisite version is 2.10.
Angular
We are using 1.5.8
Node
We are using 6.0.0+
Bower
```
npm install -g bower
```
Env dependencies
Hadoop
The prerequisite version is 2.6.0.
Hive
The prerequisite version is 1.2.1.
Spark
The prerequisite version is 1.6.x.
MySQL
The prerequisite version is 5.0.
Elasticsearch
The prerequisite version is 5.x.x.
Make sure you can access your Elasticsearch instance over HTTP.
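As a quick sanity check (a minimal sketch; replace <ES-IP> with your own Elasticsearch host), you can hit the root endpoint over HTTP:
```
# Should return a JSON document with the cluster name and version if Elasticsearch is reachable.
curl http://<ES-IP>:9200
```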
Livy
Griffin submits jobs to Spark through Livy (http://livy.io/quickstart.html).
```
# Livy has a bug (https://issues.cloudera.org/browse/LIVY-94), so we need to put these three jars into HDFS:
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar

# Then add this configuration to the JSON when submitting Spark jobs through Livy:
"jars": [ "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar", "hdfs:///livy/datanucleus-core-3.2.10.jar", "hdfs:///livy/datanucleus-rdbms-3.2.9.jar" ]

# We need Hive accessible in the Spark application, which requires some configuration when submitting through Livy.
# livy.conf
livy.repl.enableHiveContext = true

# hive-site.xml should be put into HDFS, to be accessible by the Spark cluster. For example, we can put it as below:
hdfs:///livy/hive-site.xml

# Then add this configuration to the JSON when submitting Spark jobs through Livy:
"conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
# or like this
"files": [ "hdfs:///livy/hive-site.xml" ]
```
Setup Dev Env
Git clone
```
git clone https://github.com/apache/incubator-griffin.git
```
Build
```
cd incubator-griffin
mvn clean install -DskipTests
```
Dev
Project layout
There are three modules in Griffin:
measure : core algorithms for calculating metrics along different measure dimensions.
```
# app
org.apache.griffin.measure.Application
```
service : web service for data assets, measure metadata, and job schedulers.
```
# spring boot app
org.apache.griffin.core.GriffinWebApplication
```
ui : front end
Update several files to reflect your dev env
Create a Griffin working directory in HDFS:
```
hdfs dfs -mkdir -p <griffin working dir>
```
Initialize the Quartz tables using service/src/main/resources/Init_quartz.sql:
```
mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql
```
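If the quartz database does not exist yet, you can create it first; a minimal sketch, assuming your MySQL user has the privilege to create databases:
```
# Create the quartz database before loading the schema.
mysql -u username -p -e "CREATE DATABASE IF NOT EXISTS quartz;"
```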
Update service/src/main/resources/application.properties:
```
spring.datasource.url = jdbc:mysql://<MYSQL-IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>
hive.metastore.uris = thrift://<HIVE-IP>:9083
hive.metastore.dbname = <hive database name>    # default is "default"
```
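To confirm that the Hive metastore configured above is reachable from your dev machine, a minimal check (assumes nc is installed):
```
# Exit code 0 means the metastore thrift port answers.
nc -z <HIVE-IP> 9083 && echo "hive metastore reachable"
```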
Update measure/src/main/resources/env.json with your Elasticsearch instance, and copy env.json to the Griffin working directory in HDFS (see the command after the snippet below).
```
/* Please update to point at your Elasticsearch instance */
"api": "http://<ES-IP>:9200/griffin/accuracy"
```
Update the service/src/main/resources/sparkJob.properties file:
```
sparkJob.file = hdfs://<griffin working directory>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin working directory>/env.json
sparkJob.jars_1 = hdfs://<pathTo>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<pathTo>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<pathTo>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<LIVY-IP>:8998/batches
```
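You can verify that the Livy endpoint configured in sparkJob.uri is up; a minimal check, replacing <LIVY-IP> with your Livy host:
```
# Should return a JSON document listing batch sessions if Livy is reachable.
curl http://<LIVY-IP>:8998/batches
```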
Update ui/js/services/service.js:
```
// make sure you can access Elasticsearch over HTTP
ES_SERVER = "http://<ES-IP>:9200"
```
Build
```
cd incubator-griffin
mvn clean install -DskipTests

# cp jar to the hdfs griffin working dir
cp measure/target/measure-0.1.3-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin working dir>
```
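After the copy, you can confirm that both griffin-measure.jar and env.json are present; a minimal check against the working directory created earlier:
```
# Both griffin-measure.jar and env.json should show up here.
hdfs dfs -ls <griffin working dir>
```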
Run
```
# Please find the service jar with version in the target folder.
java -jar service/target/service.xxx.jar

# Open from your browser:
http://<YOUR-IP>:8080
```
Run with data prepared
Click "Data Assets" at the top right corner, to watch all the exist data assets.
We've prepared two data asset in https://github.com/apache/incubator-griffin/tree/master/docker/griffin_demo/prep/data, put those data into hive, you can see all the table metadata in Hive.Click "Measures" button at the top left corner to watch all the measures here, and you can also create a new DQ measurement by following steps.
- Click "Create Measure" button at the top left corner, choose the top left block "Accuracy", at current we only support accuracy type.
- Choose Source: find "demo_src" in the left tree, select some or all attributes in the right block, click "Next".
- Choose Target: find "demo_tgt" in the left tree, select the matching attributes with source data asset in the right block, click "Next".
- Mapping Source and Target: select "Source Fields" of each row, to match the corresponding field in target table, e.g. id maps to id, age maps to age, desc maps to desc.
Finish all the mapping, click "Next". - Fill out the required table as required, "Organization" is the group of this measurement.
Submit and save, you can see your new DQ measurement created in the measures list.
Now that you've created a new DQ measurement, it needs to be scheduled to run in the docker container. Click the "Jobs" button to view all the jobs; at the moment there is no job, so you need to create a new one. Click the "Create Job" button at the top left corner and fill out all the blocks as below.
"Source Partition": YYYYMMdd-HH "Target Partition": YYYYMMdd-HH "Measure Name": <choose the measure you just created> "Start After(s)": 0 "Interval": 300
"Source Partition" and "Target Partition" describe the partition pattern of the demo data, which is based on a timestamp. "Start After(s)" means the job will start after n seconds, and "Interval" is the job interval in seconds. In the example above, the job will run every 5 minutes.
Wait for about 1 minute; after the calculation, the results will be published to the web UI, and you can view the dashboard by clicking "DQ Metrics" at the top right corner.
License Header File
Each source file should include the following Apache License header:
```
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied.  See the License for the
specific language governing permissions and limitations
under the License.
```