
As a monitoring platform, Eagle is responsible not only for monitoring cluster/node health, but also for monitoring the applications (jobs) running on the cluster.

The following are some common job monitoring use cases on a Hadoop platform:

1) Job security monitoring: does a job perform malicious data operations, such as accessing confidential data or deleting large amounts of data?

2) Job performance monitoring: does a job run slower than its historical runs? Does the job have a data skew issue that makes one task run much slower than the other tasks?

 

To meet the above requirements, we designed the Eagle Storm running job spout, which first supports the job security monitoring use case.

The "running" in running job spout doesn't mean we only monitoring running job, here "running" means "realtime", we also collect completed job information if we miss catching them before they finished due to issue like storm worker crash

We also use ZooKeeper to store the list of already processed jobs; together with Storm's ACK mechanism, this lets the running job spout deliver at-least-once semantics. A minimal sketch of this bookkeeping is shown below.
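
The following is only an illustration of the idea, not Eagle's actual implementation: it keeps the processed job list in ZooKeeper using Apache Curator, and the znode layout and class name are assumptions.

Processed job store (sketch)
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Tracks which job IDs have already been fully processed, so that after a
// Storm worker crash the spout can skip jobs it has already delivered.
// The znode layout (/eagle/runningjob/processed/<jobId>) is an assumption.
public class ProcessedJobStore {
   private static final String BASE_PATH = "/eagle/runningjob/processed";
   private final CuratorFramework client;

   public ProcessedJobStore(String zkQuorum) {
      client = CuratorFrameworkFactory.newClient(zkQuorum, new ExponentialBackoffRetry(1000, 3));
      client.start();
   }

   public boolean isProcessed(String jobId) throws Exception {
      return client.checkExists().forPath(BASE_PATH + "/" + jobId) != null;
   }

   // Called from the spout's ack() handler, i.e. a job is only marked processed
   // after the emitted tuples have been acknowledged downstream (at-least-once).
   public void markProcessed(String jobId) throws Exception {
      client.create().creatingParentsIfNeeded().forPath(BASE_PATH + "/" + jobId);
   }
}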

 

The Eagle running job spout collects the following data; a sketch of the spout workflow follows the list:

1) Running/Completed Job List

2) Job Detail Info

3) Job Configuration Info

4) Job Counters
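
The workflow below is a simplified sketch of how one spout cycle can tie these pieces together; apart from the interfaces defined in the next section, the class, method, and constant names here are assumptions for illustration only.

Running job spout workflow (sketch)
// One spout cycle (names are illustrative):
// 1. fetch the running/completed job list,
// 2. skip jobs already recorded as processed in ZooKeeper,
// 3. fetch job detail, configuration and counters for the remaining jobs,
// 4. hand the results to a callback that emits Storm tuples,
// 5. a job is only marked processed once its tuples are acked (at-least-once).
public void nextTuple() {
   try {
      List<Object> jobList = fetcher.getResource(JobConstants.ResourceType.JOB_LIST);
      for (Object job : jobList) {
         String jobId = jobIdOf(job);                  // hypothetical helper
         if (processedJobStore.isProcessed(jobId)) {
            continue;                                  // already delivered in an earlier cycle
         }
         for (JobConstants.ResourceType type : new JobConstants.ResourceType[] {
               JobConstants.ResourceType.JOB_DETAIL,   // constant names are assumptions
               JobConstants.ResourceType.JOB_CONFIGURATION,
               JobConstants.ResourceType.JOB_COUNTERS}) {
            callback.onJobRunningInformation(new JobContext(jobId), type,
                  fetcher.getResource(type, jobId));
         }
      }
   } catch (Exception e) {
      // a failed fetch is simply retried on the next nextTuple() invocation
   }
}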

 

Running Job Spout Design

The following are the main interfaces; a sketch of how they fit together appears after them.

ResourceFetcher
/**
 * Fetches job resources (running/completed job list, job detail, configuration, counters)
 * of the given type from the cluster.
 */
public interface ResourceFetcher {

   List<Object> getResource(JobConstants.ResourceType resourceType, Object... parameter) throws Exception;

}
ServiceURLBuilder
/**
 * Builds the service URL to fetch a resource from, given the request parameters.
 */
public interface ServiceURLBuilder {
   String build(String ... parameters);
}
RunningJobCallback
/**
 * Callback invoked when running job information is ready.
 */
public interface RunningJobCallback extends Serializable {

   /**
    * Called when a running job resource is ready.
    * @param jobContext context of the job the resource belongs to
    * @param type       type of the fetched resource
    * @param objects    the fetched resource objects
    */
   void onJobRunningInformation(JobContext jobContext, JobConstants.ResourceType type, List<Object> objects);
}
HAURLSelector
/**
 * Selects a usable service URL among HA endpoints (e.g. active/standby ResourceManager).
 */
public interface HAURLSelector {

   boolean checkUrl(String url);

   void reSelectUrl() throws IOException;

   String getSelectedUrl();
}
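
As an illustration of how these interfaces fit together, the sketch below shows a hypothetical ServiceURLBuilder for the YARN ResourceManager application-list REST endpoint. A ResourceFetcher implementation would ask an HAURLSelector for the currently usable ResourceManager base URL, build the full URL with such a builder, fetch and parse the response, and return the result to the spout. The class name and URL details are assumptions.

JobListServiceURLBuilder (sketch)
// Hypothetical builder for the YARN ResourceManager "list applications" REST endpoint.
// parameters[0] is the base ResourceManager URL chosen by an HAURLSelector,
// e.g. "http://<rm-host>:8088"; path and query string follow the standard RM REST API.
public class JobListServiceURLBuilder implements ServiceURLBuilder {
   @Override
   public String build(String... parameters) {
      String baseUrl = parameters[0];
      if (!baseUrl.endsWith("/")) {
         baseUrl += "/";
      }
      return baseUrl + "ws/v1/cluster/apps?states=RUNNING";
   }
}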

 

Support Spark Job Monitoring

The Eagle running job spout takes MapReduce job monitoring as its first case, and we are considering supporting Spark job monitoring as well.

Spark on Yarn Environment Setup

The following are the steps to set up a test Spark-on-YARN environment:

1) Prerequisites: install HDFS, YARN, Java 7, and Scala 2.10

2) download spark: wget "http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz"

3) Unpack it to /opt/spark-1.5.2-bin-hadoop2.6 and set the environment variables:

    export SPARK_HOME=/opt/spark-1.5.2-bin-hadoop2.6
    export PATH=$PATH:$SPARK_HOME/bin

4) Set the Spark job configuration. Here we forward the Spark applications' event logs to HDFS, so that the Spark history server can read the logs and expose RESTful APIs that report application status (the history server can report both running and completed applications).

/opt/spark-1.5.2-bin-hadoop2.6/conf/spark-defaults.conf
spark.yarn.max_executor.failures 3
spark.yarn.applicationMaster.waitTries 10
spark.history.kerberos.keytab none
spark.yarn.preserve.staging.files False
spark.yarn.submit.file.replication 3
spark.history.kerberos.principal none
spark.yarn.historyServer.address <hostname>:18080
spark.yarn.scheduler.heartbeat.interval-ms 5000
spark.yarn.queue default
spark.yarn.containerLauncherMaxThreads 25
spark.yarn.driver.memoryOverhead 384
spark.history.ui.port 18080
spark.yarn.services org.apache.spark.deploy.yarn.history.YarnHistoryService
spark.yarn.max.executor.failures 3
spark.history.provider org.apache.spark.deploy.yarn.history.YarnHistoryProvider
spark.yarn.executor.memoryOverhead 384
spark.eventLog.enabled true
spark.eventLog.dir hdfs://<hostname>:8020/directory

5) Set the history server config in /opt/spark-1.5.2-bin-hadoop2.6/bin/load-spark-env.sh:

export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://<hostname>:8020/directory"

6) ./sbin/start-master.sh hdfs://druid-test-host1-556191.slc01.dev.ebayc3.com:8020

    ./sbin/start-slave.sh spark://localhost:7077

    ./sbin/start-history-server.sh hdfs://<hostname>:8020

Spark Restful API for monitoring

The following are some Spark RESTful APIs useful for monitoring:

List Spark applications: http://<hostname>:18080/api/v1/applications (a minimal Java sketch for consuming this endpoint follows the sample output)

Spark Application List
[
	{
		id: "application_1452593058395_0008",
		name: "PySparkShell",
		attempts: [
			{
				startTime: "2016-01-13T09:55:43.701GMT",
				endTime: "2016-01-13T09:57:52.658GMT",
				sparkUser: "root",
				completed: true
			}
		]
	},
	{
		id: "application_1452593058395_0007",
		name: "PySparkShell",
		attempts: [
			{
				startTime: "2016-01-13T08:22:12.346GMT",
				endTime: "2016-01-13T09:48:25.615GMT",
				sparkUser: "root",
				completed: true
			}
		]
	},
	{
		id: "application_1452593058395_0006",
		name: "PySparkShell",
		attempts: [
			{
				startTime: "2016-01-12T15:27:49.038GMT",
				endTime: "2016-01-12T18:05:48.678GMT",
				sparkUser: "root",
				completed: false
			}
		]
	}	
]
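
As a rough illustration (not Eagle's actual fetcher), the sketch below reads this endpoint with java.net and Jackson and prints each application together with its latest attempt; the field names follow the sample output above, and the default host/port is an assumption.

Spark application list fetcher (sketch)
import java.net.URL;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Fetch the application list from the Spark history server REST API and print
// each application id with the user and completion flag of its first attempt.
public class SparkApplicationListFetcher {
   public static void main(String[] args) throws Exception {
      String historyServer = args.length > 0 ? args[0] : "http://localhost:18080";
      ObjectMapper mapper = new ObjectMapper();
      JsonNode apps = mapper.readTree(new URL(historyServer + "/api/v1/applications"));
      for (JsonNode app : apps) {
         JsonNode attempt = app.get("attempts").get(0);
         System.out.println(app.get("id").asText()
               + " user=" + attempt.get("sparkUser").asText()
               + " completed=" + attempt.get("completed").asBoolean());
      }
   }
}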

 

Return the stage info of a specific application: http://<hostname>:18080/api/v1/applications/application_1452593058395_0008/stages

Spark Application's Stage Info
[
	{
		status: "COMPLETE",
		stageId: 0,
		attemptId: 0,
		numActiveTasks: 0,
		numCompleteTasks: 2,
		numFailedTasks: 0,
		executorRunTime: 2256,
		inputBytes: 383,
		inputRecords: 16,
		outputBytes: 0,
		outputRecords: 0,
		shuffleReadBytes: 0,
		shuffleReadRecords: 0,
		shuffleWriteBytes: 0,
		shuffleWriteRecords: 0,
		memoryBytesSpilled: 0,
		diskBytesSpilled: 0,
		name: "count at <stdin>:1",
		details: "",
		schedulingPool: "default",
		accumulatorUpdates: [ ]
	},
	{
		status: "FAILED",
		stageId: 1,
		attemptId: 0,
		numActiveTasks: 1,
		numCompleteTasks: 0,
		numFailedTasks: 7,
		executorRunTime: 497,
		inputBytes: 1149,
		inputRecords: 55,
		outputBytes: 0,
		outputRecords: 0,
		shuffleReadBytes: 0,
		shuffleReadRecords: 0,
		shuffleWriteBytes: 0,
		shuffleWriteRecords: 0,
		memoryBytesSpilled: 0,
		diskBytesSpilled: 0,
		name: "sum at <stdin>:1",
		details: "",
		schedulingPool: "default",
		accumulatorUpdates: [ ]
	}
]

 

Return the job info of a specific application: http://<hostname>:18080/api/v1/applications/application_1452593058395_0008/jobs

Spark Application's job info
[
	{
		jobId: 1,
		name: "sum at <stdin>:1",
		submissionTime: "2016-01-13T09:56:43.335GMT",
		completionTime: "2016-01-13T09:56:43.710GMT",
		stageIds: [
			1
		],
		status: "FAILED",
		numTasks: 2,
		numActiveTasks: 1,
		numCompletedTasks: 0,
		numSkippedTasks: 0,
		numFailedTasks: 7,
		numActiveStages: 0,
		numCompletedStages: 0,
		numSkippedStages: 0,
		numFailedStages: 1
	},
	{
		jobId: 0,
		name: "count at <stdin>:1",
		submissionTime: "2016-01-13T09:56:07.496GMT",
		completionTime: "2016-01-13T09:56:09.299GMT",
		stageIds: [
			0
		],
		status: "SUCCEEDED",
		numTasks: 2,
		numActiveTasks: 0,
		numCompletedTasks: 2,
		numSkippedTasks: 2,
		numFailedTasks: 0,
		numActiveStages: 0,
		numCompletedStages: 1,
		numSkippedStages: 0,
		numFailedStages: 0
	}
]

Notes

The Spark History Server relies on the event logs written by Spark applications to report application status.

However, sometimes the logs are not updated correctly by Spark jobs. For example, the following job actually completed, but its log on HDFS still shows it as in progress (note the .inprogress suffix), which causes the Spark history server to report the wrong status. One way to detect this mismatch is sketched after the listing below.

The ResourceManager UI reports the application as finished:

ID: application_1452593058395_0006
User: root
Name: PySparkShell
Application Type: SPARK
Queue: default
StartTime: Tue, 12 Jan 2016 15:27:54 GMT
FinishTime: Tue, 12 Jan 2016 18:05:49 GMT
State: FINISHED
FinalStatus: SUCCEEDED
Tracking UI: History

hdfs dfs -ls /directory/
Found 4 items
-rwxrwx--- 3 root supergroup 13227 2016-01-12 15:27 /directory/application_1452593058395_0005
-rwxrwx--- 3 root supergroup 13227 2016-01-12 18:05 /directory/application_1452593058395_0006.inprogress
-rwxrwx--- 3 root supergroup 51025 2016-01-13 09:48 /directory/application_1452593058395_0007
-rwxrwx--- 3 root supergroup 67994 2016-01-13 09:57 /directory/application_1452593058395_0008
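
The sketch below (an illustration, not part of Eagle) shows one way to spot this: list the event log directory with the HDFS FileSystem API and flag applications whose log still carries the .inprogress suffix, so they can be cross-checked against the state the ResourceManager reports. The directory path defaults to the spark.eventLog.dir used above.

InProgressEventLogChecker (sketch)
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// List Spark event logs on HDFS and report applications whose log is still
// marked ".inprogress"; these may be finished jobs with stale logs and should
// be cross-checked against the ResourceManager state before trusting the
// history server's status.
public class InProgressEventLogChecker {
   public static void main(String[] args) throws Exception {
      String eventLogDir = args.length > 0 ? args[0] : "hdfs://<hostname>:8020/directory";
      FileSystem fs = FileSystem.get(URI.create(eventLogDir), new Configuration());
      for (FileStatus status : fs.listStatus(new Path(eventLogDir))) {
         String name = status.getPath().getName();
         if (name.endsWith(".inprogress")) {
            System.out.println("Possibly stale event log: " + name);
         }
      }
   }
}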

 
