By following this tutorial, you will be able to set up a Griffin dev environment and go through the whole Griffin data quality process as below:
- explore data assets,
- create measures,
- schedule measures,
- execute measures in compute clusters and emit metrics
- navigate metrics in dashboard.
Dev dependencies
Java:
We prefer Java 8, but Java 7 is fine for us.
Maven:
Prerequisite version is 3.2.5
Scala:
Prerequisite version is 2.10
Angular:
We are using 1.5.8
Node:
We are using 6.0.0+
Bower:
npm install -g bower
Env dependencies
Hadoop
Prerequisite version is 2.6.0
Hive
Prerequisite version is 1.2.1
Spark
Prerequisite version is 1.6.x
MySQL
Prerequisite version is 5.0
Elasticsearch
Prerequisite version is 5.x.x
Make sure you can access your Elasticsearch instance via the HTTP protocol.
Livy
Griffin submits jobs to Spark via Livy (http://livy.io/quickstart.html).
#livy has a bug (https://issues.cloudera.org/browse/LIVY-94), so we need to put these three jars into HDFS:
#  datanucleus-api-jdo-3.2.6.jar
#  datanucleus-core-3.2.10.jar
#  datanucleus-rdbms-3.2.9.jar
#Then you need to add the following configuration to the JSON when submitting Spark jobs through Livy:
"jars": [
  "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar",
  "hdfs:///livy/datanucleus-core-3.2.10.jar",
  "hdfs:///livy/datanucleus-rdbms-3.2.9.jar"
]
#We need Hive accessible in the Spark application, which requires some configuration when submitting through Livy.
#livy.conf
livy.repl.enableHiveContext = true
#hive-site.xml should be put into HDFS, to be accessible to the Spark cluster. For example:
#  hdfs:///livy/hive-site.xml
#Then you need to add the following configuration to the JSON when submitting Spark jobs through Livy:
"conf": {
  "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml"
}
#or like this
"files": [ "hdfs:///livy/hive-site.xml" ]
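Putting the pieces above together, a complete Livy batch body might look like the following sketch. The jar name, working-directory paths, and the className (taken from the measure module's main class mentioned later in this guide) are examples; adjust them to your environment. The snippet writes the body to a file and sanity-checks that it is valid JSON before you POST it to Livy's /batches endpoint:

```shell
# Hypothetical Livy batch body combining the datanucleus-jar and hive-site.xml
# workarounds above. All HDFS paths here are examples; adjust to your cluster.
cat > livy-batch-body.json <<'EOF'
{
  "file": "hdfs:///griffin/griffin-measure.jar",
  "className": "org.apache.griffin.measure.Application",
  "jars": [
    "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar",
    "hdfs:///livy/datanucleus-core-3.2.10.jar",
    "hdfs:///livy/datanucleus-rdbms-3.2.9.jar"
  ],
  "conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
}
EOF
# Validate the body as JSON before submitting it to http://<LIVY-IP>:8998/batches
python -m json.tool livy-batch-body.json > /dev/null && echo "valid JSON"
```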
Setup Dev Env
Git clone
git clone https://github.com/apache/incubator-griffin.git
Project layout
There are three modules in Griffin:
measure : core algorithms for calculating metrics along different measure dimensions.
#main class: org.apache.griffin.measure.Application
service : web service for data assets, measure metadata, and job schedulers.
#Spring Boot app: org.apache.griffin.core.GriffinWebApplication
ui : front end
Update several files to reflect your dev env
Create a Griffin working directory in HDFS:
hdfs dfs -mkdir -p <griffin working dir>
Initialize the Quartz tables using service/src/main/resources/Init_quartz.sql:
mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql
update service/src/main/resources/application.properties
spring.datasource.url = jdbc:mysql://<MYSQL-IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>
hive.metastore.uris = thrift://<HIVE-IP>:9083
hive.metastore.dbname = <hive database name>   # default is "default"
Update measure/src/main/resources/env.json with your Elasticsearch instance, and copy env.json to the Griffin working directory in HDFS.
/* Please update to your Elasticsearch instance */
"api": "http://<ES-IP>:9200/griffin/accuracy"
update service/src/main/resources/sparkJob.properties file
sparkJob.file = hdfs://<griffin working directory>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin working directory>/env.json
sparkJob.jars_1 = hdfs://<pathTo>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<pathTo>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<pathTo>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<LIVY-IP>:8998/batches
update ui/js/services/service.js
#make sure you can access ES over HTTP
ES_SERVER = "http://<ES-IP>:9200"
Build
cd incubator-griffin
mvn clean install -DskipTests

#cp jars to the hdfs griffin working dir
cp measure/target/measure-0.1.3-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin working dir>
Run
#Please find the service jar with the version in the target folder.
java -jar service/target/service.xxx.jar
#then open in your browser
#http://<YOUR-IP>:8080
Run with data prepared
Click "Data Assets" at the top right corner to see all the existing data assets.
We've prepared two data assets in https://github.com/apache/incubator-griffin/tree/master/docker/griffin_demo/prep/data. Load that data into Hive, and you will see all the table metadata in Hive. Click the "Measures" button at the top left corner to see all the measures; you can also create a new DQ measurement by following these steps.
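As an illustration, loading the demo data into Hive could look like the following sketch. The column list matches the id/age/desc fields mapped in the steps below, and the dt/hour partition fields match the YYYYMMdd-HH pattern used by the jobs; the file format, field delimiter, and local path are assumptions, so adapt them to the actual demo files.

```sql
-- Hypothetical DDL/load for the demo source table; demo_tgt is analogous.
-- `desc` is backquoted because it collides with a Hive keyword.
CREATE TABLE IF NOT EXISTS demo_src (
  id     BIGINT,
  age    INT,
  `desc` STRING
)
PARTITIONED BY (dt STRING, hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

LOAD DATA LOCAL INPATH 'demo_src/delta_src'
INTO TABLE demo_src PARTITION (dt = '20180105', hour = '10');
```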
- Click the "Create Measure" button at the top left corner and choose the top left block "Accuracy"; currently we only support the accuracy type.
- Choose Source: find "demo_src" in the left tree, select some or all attributes in the right block, and click "Next".
- Choose Target: find "demo_tgt" in the left tree, select the attributes matching the source data asset in the right block, and click "Next".
- Mapping Source and Target: select the "Source Fields" of each row to match the corresponding field in the target table, e.g. id maps to id, age maps to age, desc maps to desc.
- Finish all the mapping and click "Next".
- Fill in the required fields; "Organization" is the group of this measurement.
- Submit and save; you can see your new DQ measurement in the measures list.
Now that you've created a new DQ measurement, it needs to be scheduled to run in the docker container. Click the "Jobs" button to see all the jobs; currently there is no job, so you need to create a new one. Click the "Create Job" button at the top left corner and fill in all the blocks as below.
"Source Partition": YYYYMMdd-HH
"Target Partition": YYYYMMdd-HH
"Measure Name": <choose the measure you just created>
"Start After(s)": 0
"Interval": 300
The source and target partitions describe the partition pattern of the demo data, which is based on timestamp. "Start After(s)" means the job will start after n seconds, and "Interval" is the job interval in seconds. In the example above, the job will run every 5 minutes.
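As a concrete illustration, the YYYYMMdd-HH pattern corresponds to the dt/hour partitions of the demo tables, and with GNU date you can compute which partition a given timestamp falls into (the epoch value below is just an example):

```shell
# Map an epoch timestamp to the YYYYMMdd-HH partition pattern (requires GNU date).
ts=1515146471                                   # example: 2018-01-05 10:01:11 UTC
echo "dt=$(date -u -d "@${ts}" +%Y%m%d) hour=$(date -u -d "@${ts}" +%H)"
echo "partition=$(date -u -d "@${ts}" +%Y%m%d-%H)"
```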
Wait for about 1 minute; after the calculation, results will be published to the web UI, and you can view the dashboard by clicking "DQ Metrics" at the top right corner.
License Header File
Each source file should include the following Apache License header:
Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
12 Comments
neil wang
Hi William:
Does the Hive version have to be 1.2.1 or newer? When I use hive-1.1.0-cdh5.8.2, I can submit jobs to the YARN cluster, but the error says the tables don't exist.
thanks.
William Guo
Could you show us the error log?
neil wang
Hi William:
The error log is as follows:
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on xy180-wecloud-85:46076 (size: 1725.0 B, free: 1060.0 MB)
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:10 WARN AmIpFilter: Could not find proxy-user cookie, so user will not be set
18/01/05 18:01:11 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1145 ms on xy180-wecloud-85 (1/1)
18/01/05 18:01:11 INFO YarnClusterScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/01/05 18:01:11 INFO DAGScheduler: ResultStage 0 (collect at RuleAdaptorGroup.scala:41) finished in 1.152 s
18/01/05 18:01:11 INFO DAGScheduler: Job 0 finished: collect at RuleAdaptorGroup.scala:41, took 1.365230 s
18/01/05 18:01:11 INFO Application$: process init success
18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 10.20.180.89:46719 in memory (size: 1725.0 B, free: 1036.5 MB)
18/01/05 18:01:11 INFO BlockManagerInfo: Removed broadcast_0_piece0 on xy180-wecloud-85:46076 in memory (size: 1725.0 B, free: 1060.0 MB)
18/01/05 18:01:11 INFO ContextCleaner: Cleaned accumulator 1
18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dw.expert_std_df
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dw tbl=expert_std_df
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dw tbl=expert_std_df
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471675 end=1515146471689 duration=14 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dw.expert_std_df fails: Table not found: `wedw_dw`.`expert_std_df`; line 1 pos 22
18/01/05 18:01:11 WARN DataSource: load data source [source] fails
18/01/05 18:01:11 INFO HiveBatchDataConnector: SELECT * FROM wedw_dwd.expert_basic_dz
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_table from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_table : db=wedw_dwd tbl=expert_basic_dz
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_table : db=wedw_dwd tbl=expert_basic_dz
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_table start=1515146471725 end=1515146471726 duration=1 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=-1 error=true>
18/01/05 18:01:11 ERROR HiveBatchDataConnector: load hive table wedw_dwd.expert_basic_dz fails: Table not found: `wedw_dwd`.`expert_basic_dz`; line 1 pos 23
18/01/05 18:01:11 WARN DataSource: load data source [target] fails
18/01/05 18:01:11 INFO PerfLogger: <PERFLOG method=get_tables from=org.apache.hadoop.hive.metastore.RetryingHMSHandler>
18/01/05 18:01:11 INFO HiveMetaStore: 0: get_tables: db=default pat=.*
18/01/05 18:01:11 INFO audit: ugi=pgxl ip=unknown-ip-addr cmd=get_tables: db=default pat=.*
18/01/05 18:01:11 INFO PerfLogger: </PERFLOG method=get_tables start=1515146471849 end=1515146471865 duration=16 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=0 retryCount=0 error=false>
18/01/05 18:01:11 INFO Application$: process run success
18/01/05 18:01:12 INFO SparkUI: Stopped Spark web UI at http://10.20.180.89:54831
18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Shutting down all executors
18/01/05 18:01:12 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
18/01/05 18:01:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/01/05 18:01:12 INFO MemoryStore: MemoryStore cleared
18/01/05 18:01:12 INFO BlockManager: BlockManager stopped
18/01/05 18:01:12 INFO BlockManagerMaster: BlockManagerMaster stopped
18/01/05 18:01:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
William Guo
Where do you get the table wedw_dwd.expert_basic_dz?
Did you create it via the UI, or did you specify the table manually in the JSON?
Lionel Liu, could you look at this?
Thanks,
William
neil wang
Hi William,
I have tried both ways, and the errors are the same. I can query the table via Hive or SparkSQL.
thanks,
Neil
Lionel Liu
Hi Neil,
You mean you've tried submitting the job directly using measure.jar as in https://github.com/apache/incubator-griffin/blob/master/griffin-doc/measure-batch-sample.md, right? That runs via the spark-submit command directly, not through Livy.
The issue seems to be that the Spark application cannot access your Hive tables. There are a few things to check.
First, check whether the Spark context has successfully generated the Hive context as the SQL context; you can find that in the log. If it fails, you need to confirm that Spark can access hive-site.xml.
Second, Spark should be able to access Hive; you can check this in spark-shell to test whether it can access this specific table.
Third, if you can access this Hive table in spark-shell, you can try to submit the Griffin job via the spark-submit command directly; it should behave the same as the spark-shell way. Note that if you submit a Spark job through Livy, it runs in cluster mode, not client mode, so hive-site.xml needs to be in HDFS, and the Livy conf file should set livy.repl.enableHiveContext = true, like this: https://github.com/bhlx3lyx7/griffin-docker/blob/master/griffin_env/conf/livy/livy.conf
Hope this can help you.
Thank,
Lionel Liu
neil wang
Hi Lionel,
Thanks for your reply.
I have submitted the Griffin job via the spark-submit command successfully; in that way I need to set '--files <hdfs://hive-site.xml>'. If I want to submit jobs using the Griffin web UI in cluster mode, how do I configure hive-site.xml?
Another question: if the partition field of the Hive table is not 'dt', how can I change the field using the Griffin web UI? I can change it by editing config.json, but that seems inconvenient.
Thanks
Neil Wang
neil wang
Hi Lionel,
Please ignore the first question; I resolved it by putting hive-site.xml into Hadoop's config folder.
thanks
Lionel Liu
Hi Neil,
Your question hits the point. The web UI can only deal with 'dt' and 'hour' partitions, so it cannot be a full solution. We've enhanced the scheduler in the service and web UI; you will see it in the next version later this month.
Thanks,
Lionel Liu
neil wang
Hi Lionel,
Our team is interested in Griffin and we look forward to the next version.
To test the web UI, I created two tables partitioned by 'dt' and 'hour'. I can see the computed result in ES, but I can't see the dashboard. Can you offer me some suggestions?
Thanks,
Neil Wang
Lionel Liu
Great, I think you've mostly succeeded. For the latest version 0.1.6, we didn't update the UI part, so it doesn't completely fit the backend; it gets metrics from ES directly, so the client needs to access ES by modifying ES_SERVER in service.service.ts.
We will update the UI in our new version; the metrics will then be fetched through the service module, which will be a better solution.
Lionel Liu
Hi neil wang,
Griffin is going to release a new version, 0.2.0, with a job scheduling process in the UI and some enhancements to the measure engine.
You could try our new docker image `bhlx3lyx7/svr_msr:0.2.0` as this guide document https://github.com/apache/incubator-griffin/blob/master/griffin-doc/docker/griffin-docker-guide.md
Your team might have some data quality requirements. What are your use cases? I think we can have a talk about it. Would you like to schedule a meeting with us or talk via email through the dev list: dev@griffin.incubator.apache.org?
Thanks,
Lionel Liu