Setup Development Env

By following this tutorial, you will be able to build a Griffin dev environment and go through the whole Griffin data quality process as below:

  • explore data assets,
  • create measures,
  • schedule measures,
  • execute measures in compute clusters and emit metrics,
  • navigate metrics in dashboard.

Dev dependencies

Java

We prefer Java 8, but Java 7 works as well.

Maven

Prerequisite version is 3.2.5.

Scala

Prerequisite version is 2.10.

Angular

We are using 1.5.8

Node

We are using 6.0.0+

Bower

Code Block
languagebash
 npm install -g bower
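
To double-check the dev dependencies above, a quick sanity check is to print each tool's version; this assumes the tools are already on your PATH.

Code Block
languagebash
#quick sanity check of the dev dependencies (assumes they are on PATH)
java -version        #expect 1.8 (1.7 also works)
mvn -v               #expect 3.2.5
scala -version       #expect 2.10.x
node -v              #expect 6.0.0 or above
npm -v
bower -v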

 

Env dependencies

Hadoop

Prerequisite version is 2.6.0.

Hive

Prerequisite version is 1.2.1.

Spark

Prerequisite version is 1.6.x.

MySQL

Prerequisite version is 5.0.
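
To confirm the cluster-side versions above, you can run the client tools' version commands; this assumes the clients are installed on the machine you work from.

Code Block
languagebash
#confirm env dependency versions
hadoop version           #expect 2.6.0
hive --version           #expect 1.2.1
spark-submit --version   #expect 1.6.x
mysql --version          #expect 5.x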

Elasticsearch

Prerequisite version is 5.x.x.

Make sure you can access your Elasticsearch instance via the HTTP protocol.
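
For example, a simple way to verify HTTP access to the Elasticsearch instance (the host and port below are placeholders for your own setup):

Code Block
languagebash
#should return a json document with the cluster name and version
curl http://<ES-IP>:9200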

...

Code Block
languagetext
#livy has one bug (https://issues.cloudera.org/browse/LIVY-94), so we need to put these three jars into hdfs
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
#Then you need to add some configuration in json when submitting spark jobs through livy.
"jars": [ "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar", "hdfs:///livy/datanucleus-core-3.2.10.jar", "hdfs:///livy/datanucleus-rdbms-3.2.9.jar" ]

#We need hive accessible in the spark application, but some configuration is needed when submitting through livy.
#livy.conf
livy.repl.enableHiveContext = true
#hive-site.xml should be put into hdfs, to be accessible by the spark cluster. For example, we can put it as below:
hdfs:///livy/hive-site.xml
#Then you need to add some configuration in json when submitting spark jobs through livy.
"conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
#or like this
"files": [ "hdfs:///livy/hive-site.xml" ]

 

Setup Dev Env

Git clone

Code Block
languagebash
git clone https://github.com/apache/incubator-griffin.git

Project layout

There are three modules in Griffin:

...

Code Block
#spark application entry point of the measure module
org.apache.griffin.measure.batch.Application

 

service : web service for data assets, measure metadata, and job schedulers.

Code Block
languagebash
#spring boot app
org.apache.griffin.core.GriffinWebApplication

 

ui : front end 

 

Update several files to reflect your dev env

create a griffin working directory in hdfs
Code Block
languagebash
hdfs dfs -mkdir -p <griffin working dir>
initialize quartz tables with service/src/main/resources/Init_quartz.sql
Code Block
languagebash
mysql -u username -p quartz < service/src/main/resources/Init_quartz.sql
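
Note that the command above loads Init_quartz.sql into an existing database named quartz; if that database does not exist yet, create it first.

Code Block
languagebash
#create the quartz database if it does not exist yet
mysql -u username -p -e "CREATE DATABASE IF NOT EXISTS quartz;"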

 

update service/src/main/resources/application.properties
Code Block
languagetext
spring.datasource.url = jdbc:mysql://<MYSQL-IP>:3306/quartz?autoReconnect=true&useSSL=false
spring.datasource.username = <user name>
spring.datasource.password = <password>

hive.metastore.uris = thrift://<HIVE-IP>:9083
hive.metastore.dbname = <hive database name>    # default is "default"
update measure/src/main/resources/env.json with your Elasticsearch instance, and copy env.json to the griffin working directory in hdfs.
Code Block
languagejs
/*Please update as your elastic search instance*/
"api": "http://<ES-IP>:9200/griffin/accuracy"

update service/src/main/resources/sparkJob.properties file
Code Block
languagetext
sparkJob.file = hdfs://<griffin working directory>/griffin-measure.jar
sparkJob.args_1 = hdfs://<griffin working directory>/env.json
sparkJob.jars_1 = hdfs://<pathTo>/datanucleus-api-jdo-3.2.6.jar
sparkJob.jars_2 = hdfs://<pathTo>/datanucleus-core-3.2.10.jar
sparkJob.jars_3 = hdfs://<pathTo>/datanucleus-rdbms-3.2.9.jar
sparkJob.uri = http://<LIVY-IP>:8998/batches
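
For reference, the service submits measure jobs to Livy's batches endpoint using the properties above; a roughly equivalent manual submission is sketched below (the args are illustrative, since the measure config json is normally generated by the service).

Code Block
languagebash
#manual submission of the measure jar through livy's batches REST API (illustrative only)
curl -X POST -H "Content-Type: application/json" http://<LIVY-IP>:8998/batches -d '{
  "file": "hdfs://<griffin working directory>/griffin-measure.jar",
  "className": "org.apache.griffin.measure.batch.Application",
  "args": [ "hdfs://<griffin working directory>/env.json", "<measure config json>" ],
  "jars": [ "hdfs:///livy/datanucleus-api-jdo-3.2.6.jar", "hdfs:///livy/datanucleus-core-3.2.10.jar", "hdfs:///livy/datanucleus-rdbms-3.2.9.jar" ],
  "conf": { "spark.yarn.dist.files": "hdfs:///livy/hive-site.xml" }
}'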

 

update ui/js/services/service.js
Code Block
languagetext
#make sure you can access es by http
ES_SERVER = "http://<ES-IP>:9200"

Build

 

Code Block
languagebash
cd incubator-griffin
mvn clean install -DskipTests
#cp the measure jar into the hdfs griffin working dir
cp measure/target/measure-0.1.3-incubating-SNAPSHOT.jar measure/target/griffin-measure.jar
hdfs dfs -put measure/target/griffin-measure.jar <griffin working dir>

Run

Code Block
languagebash
#Please find the service jar with version in target folder.
java -jar service/target/service.xxx.jar
#open from your browser
http://<YOUR-IP>:8080

 

Run with data prepared

  1. Click "Data Assets" at the top right corner to view all the existing data assets.
    We've prepared two data assets in https://github.com/apache/incubator-griffin/tree/master/docker/griffin_demo/prep/data; load that data into Hive, and you can see all the table metadata in Hive (a sketch of the Hive table setup is given after this list).

  2. Click the "Measures" button at the top left corner to view all the measures, and you can also create a new DQ measurement by following the steps below.

    1. Click the "Create Measure" button at the top left corner and choose the top left block "Accuracy"; currently we only support the accuracy type.
    2. Choose Source: find "demo_src" in the left tree, select some or all attributes in the right block, click "Next".
    3. Choose Target: find "demo_tgt" in the left tree, select the matching attributes with source data asset in the right block, click "Next".
    4. Mapping Source and Target: for each row, select the "Source Fields" that match the corresponding field in the target table, e.g. id maps to id, age maps to age, desc maps to desc.
      After finishing all the mappings, click "Next".
    5. Fill out the required fields; "Organization" is the group this measurement belongs to.
      Submit and save, and you will see your new DQ measurement in the measures list.
  3. Now that you've created a new DQ measurement, it needs to be scheduled to run. Click the "Jobs" button to view all the jobs; currently there is no job, so you need to create a new one. Click the "Create Job" button at the top left corner and fill out all the blocks as below.

    "Source Partition": YYYYMMdd-HH
    "Target Partition": YYYYMMdd-HH
    "Measure Name": <choose the measure you just created>
    "Start After(s)": 0
    "Interval": 300
    

    "Source Partition" and "Target Partition" specify the partition pattern of the demo data, which is based on timestamp. "Start After(s)" means the job will start after n seconds, and "Interval" is the job interval in seconds. In the example above, the job will run every 5 minutes.

    Wait for about 1 minute; after the calculation finishes, the results will be published to the web UI, and you can view the dashboard by clicking "DQ Metrics" at the top right corner.
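
As referenced in step 1, below is a minimal sketch of how the prepared demo data could be set up in Hive. The table names demo_src and demo_tgt and the fields id, age, and desc come from the steps above; the partition columns, field delimiter, and data paths are assumptions that should be adapted to the actual demo data files.

Code Block
languagebash
#assumption-laden sketch of creating and loading one demo table in hive
#partition columns dt/hour are an assumption matching the YYYYMMdd-HH pattern used in the job config
hive -e "
CREATE TABLE IF NOT EXISTS demo_src (id BIGINT, age INT, \`desc\` STRING)
PARTITIONED BY (dt STRING, hour STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
LOAD DATA INPATH 'hdfs:///<path-to>/demo_src_data'
INTO TABLE demo_src PARTITION (dt='20180101', hour='00');
"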

License Header File

Each source file should include the following Apache License header:

Code Block
languagetext
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements.  See the NOTICE file
distributed with this work for additional information
regarding copyright ownership.  The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.