As a monitoring platform, Eagle is responsible not only for monitoring cluster/node health, but also for monitoring the apps (jobs) running on the cluster.
Following are some common job monitoring use cases on the Hadoop platform:
1) Job security monitoring: detect whether a job performs malicious data operations, such as accessing confidential data or deleting large amounts of data
2) Job performance monitoring: Does a job run slower this time compared with its historical runs? Does the job have a data skew issue that leads to one task running much slower than the other tasks?
To meet the above requirements, we designed the Eagle Storm running job spout, which supports the job security monitoring use case first.
The "running" in running job spout doesn't mean we only monitor running jobs; here "running" means "realtime". We also collect completed job information if we missed catching those jobs before they finished, due to issues like a Storm worker crash.
We also use ZooKeeper to store the list of already processed jobs; together with the Storm ACK mechanism, this lets the running job spout deliver at-least-once semantics.
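The bookkeeping itself is simple; the following is a minimal sketch of the idea, assuming Apache Curator for ZooKeeper access (the znode path and the class/method names are illustrative, not Eagle's actual code):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Illustrative only: track which job IDs have been fully processed so that,
// after a worker crash, the spout re-emits anything that was never acked.
public class ProcessedJobStore {
    private static final String ROOT = "/eagle/runningJobSpout/processedJobs"; // hypothetical znode path
    private final CuratorFramework client;

    public ProcessedJobStore(String zkConnect) {
        client = CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3));
        client.start();
    }

    // Called from the spout's ack() callback: the tuple for this job was
    // fully processed downstream, so remember it.
    public void markProcessed(String jobId) throws Exception {
        client.create().creatingParentsIfNeeded().forPath(ROOT + "/" + jobId);
    }

    // Checked before emitting: jobs already recorded here are skipped; jobs
    // that failed or were in flight during a crash are absent and will be
    // emitted again, which is why the delivery guarantee is at-least-once
    // rather than exactly-once.
    public boolean isProcessed(String jobId) throws Exception {
        return client.checkExists().forPath(ROOT + "/" + jobId) != null;
    }
}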
The Eagle running job spout collects the following data:
1) Running/Completed Job List
2) Job Detail Info
3) Job Configuration Info
4) Job Counters
Following are some interfaces
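Roughly, there is one fetch operation per kind of data listed above; the following is an illustrative sketch (all type and method names are hypothetical, not Eagle's actual interfaces):

import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fetch operations the running job spout needs;
// a real implementation would call the YARN ResourceManager / MR history
// server REST APIs to retrieve each kind of data.
public interface RunningJobFetcher {
    // 1) Running/Completed Job List
    List<String> fetchJobList() throws Exception;

    // 2) Job Detail Info (status, user, queue, start/end time, ...)
    Map<String, Object> fetchJobDetail(String jobId) throws Exception;

    // 3) Job Configuration Info
    Map<String, String> fetchJobConfiguration(String jobId) throws Exception;

    // 4) Job Counters (e.g. HDFS bytes read/written, task counters)
    Map<String, Long> fetchJobCounters(String jobId) throws Exception;
}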
Support Spark Job Monitoring
The Eagle running job spout picks up MR job monitoring as its first case, and we are considering supporting Spark job monitoring as well.
Spark on Yarn Environment Setup
Following are the steps to set up a test Spark on YARN environment:
1) Prepare: install HDFS, YARN, Java 7, and Scala 2.10
2) download spark: wget "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz"
3) Unpack it into /opt/spark-1.5.2-bin-hadoop2.6 and set the environment variables:
export SPARK_HOME=/opt/spark-1.5.2-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
4) Set the config for Spark jobs as shown below. Here we forward Spark applications' event logs to HDFS, so that the Spark history server can read the logs and expose RESTful APIs to report application status (the history server can report both running and completed applications):
spark.yarn.max_executor.failures 3
spark.yarn.applicationMaster.waitTries 10
spark.history.kerberos.keytab none
spark.yarn.preserve.staging.files False
spark.yarn.submit.file.replication 3
spark.history.kerberos.principal none
spark.yarn.historyServer.address <hostname>:18080
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.queue default
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.history.ui.port 18080
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.yarn.max.executor.failures 3
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.executor.memoryOverhead 384
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<hostname>:8020/directory
5) Set the history server config in /opt/spark-1.5.2-bin-hadoop2.6/bin/load-spark-env.sh:
export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://<hostname>:8020/directory"
6) Start the master, a slave, and the history server:
./sbin/start-master.sh hdfs://druid-test-host1-556191.slc01.dev.ebayc3.com:8020
./sbin/start-slave.sh spark://localhost:7077
./sbin/start-history-server.sh hdfs://<hostname>:8020
Spark Restful API for monitoring
Following are some Spark RESTful APIs:
List Spark applications: http://<hostname>:18080/api/v1/applications
Return the stage info of a specific application: http://<hostname>:18080/api/v1/applications/application_1452593058395_0008/stages
Return the job info of a specific application: http://<hostname>:18080/api/v1/applications/application_1452593058395_0008/jobs
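As a quick sanity check of the endpoints above, a client can issue a plain HTTP GET and read the JSON body; the sketch below uses only the JDK and is an illustrative probe, not Eagle's actual fetcher (replace <hostname> with a real host before running):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Minimal GET against the Spark history server REST API; prints the raw
// JSON list of applications returned by /api/v1/applications.
public class SparkApiProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://<hostname>:18080/api/v1/applications");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON array of application summaries
            }
        } finally {
            conn.disconnect();
        }
    }
}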