
Description

As a useful monitoring tool, Eagle should provide additional monitoring information beyond the basic alerting definitions. By definition, an alerting policy already describes how to aggregate the monitoring stream input. Eagle should provide a similar definition for stream aggregation and for metric storage of the aggregated information. This page covers the design of

...

  • How the Eagle UI should be designed for monitoring

Case Study

1. Hadoop JMX metric aggregation and plotting

Such metrics are not meant for alerting (alerts are mostly defined on raw CPU/memory metrics), but they are useful for trend analysis.
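As an illustration (not an existing definition), a trend-analysis aggregation for this case could look like the following in the analytic DSL described later; the stream and field names (hadoopJmxMetricEventStream, host, value) are hypothetical, not part of any current schema.

Code Block
// Illustrative sketch only: aggregate a JMX metric per host into
// 5-minute batches for trend plotting. All names are hypothetical.
ec.fromKafka[HadoopJmxMetric]
    .groupBy(_.host)
    .query("""
        from hadoopJmxMetricEventStream#window.timeBatch(5 min)
        select host, avg(value) as avgValue
        group by host
        insert into jmxMetricTrendStream;
        """.stripMargin)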

2. Long-term LB capacity monitoring

The LB capacity trend grows steadily as more and more VIPs are added, which makes it useful for capacity planning. The metric window could be as long as weeks or even months, which means such metrics can be stored at coarse granularity.

3. Multiple source log/alert correlation

This is a typical case where Eagle provides value through distributed aggregation over multiple data sources. Such a feature is not covered by single-source alerting tools: Zabbix configuration is per host and lacks multi-source information; ES offers centralized log collection/search but lacks a streaming data processing framework; Druid offers storage and some pre-aggregation but lacks user-defined streaming processing on top of it. None of them supports joining multiple streams.
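For concreteness, the kind of multi-stream correlation missing from those tools could be expressed as a windowed join in Siddhi QL. This is a sketch only; the stream and field names are hypothetical, and the framework wiring that feeds two sources into one engine is not shown.

Code Block
from syslogStream#window.time(5 min) as s
  join alertStream#window.time(5 min) as a
  on s.host == a.host
select s.host, s.message, a.alertType
insert into correlatedAlertStream;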

 

Requirement

1. Users should be able to store metrics over a customized time window using the timeBatch window (see the sketch after this list).

...

4. DSL evaluation should be flexible enough to support building a window from historical data (for large time windows). This could be the same feature used in alerting.
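A minimal sketch of requirement 1, assuming the persist API introduced in the Persistence section below; the one-hour window, stream names, and metaDescriptor value are illustrative.

Code Block
// Sketch: aggregate into a user-defined timeBatch window and persist the
// result at that granularity. Names and the window size are examples.
ec.fromKafka[AuditLog]
    .query("""
        from hdfsAuditLogEventStream#window.timeBatch(1 hour)
        select user, count(timestamp) as eventCount
        group by user
        insert into hourlyUserActivityStream;
        """.stripMargin)
    .persist(metaDescriptor) // stored as one-hour data points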

Design

Analytic DSL Definition

Currently, alert definitions use the Siddhi DSL as the dialect. The analytic DSL will keep the same user experience.

...

Code Block
ec.fromKafka[AuditLog]
    .groupBy(_.user)
    .query("""
        from hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min)
        select user, count(timestamp) as aggValue
        group by user
        having aggValue >= 5
        insert into anotherAlertStream;
        """.stripMargin)
// hdfsAuditLogEventStream -> anotherAlertStream

Partition

Partitioning is the main concern when it comes to CEP handling. The analytic DSL above does not incorporate the partition itself.

...

Such framework-level map/reduce may require that the analytic behavior itself be expressible as simple map/reduce. This might require users to have more careful knowledge of how to write their logic in our DSL.
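As an illustration of the relationship, the per-user aggregation from the DSL example above can also be written with Siddhi's own partition syntax; a framework-level groupBy(_.user) applies the same key at the physical (bolt) level. The query below is illustrative only and reuses the earlier stream names.

Code Block
partition with (user of hdfsAuditLogEventStream)
begin
    from hdfsAuditLogEventStream#window.externalTime(timestamp, 10 min)
    select user, count(timestamp) as aggValue
    insert into perUserAggStream;
end;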

 

State management

State management stores and restores the state of the monitoring computation during stream processing. A couple of aspects are included:

  1. Whole-topology state store. Lightweight Asynchronous Snapshots for Distributed Dataflows is a valid system method to take a snapshot of a streaming system without halting the stream. 
  2. Single bolt status store
    In the current tech stack, Siddhi already supports specifying a persistence store to persist the engine state. The framework could periodically take a CEP engine snapshot and write it to an HBase store (see the sketch below). The bolt status store covers both the CEP engine state and outside-engine state (e.g., policy acceptance related information).
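A sketch of the bolt-level snapshot using Siddhi 3.x's persistence API. The HBase-backed store is an assumption, not existing code; InMemoryPersistenceStore stands in for it here, since an HBase store would implement the same PersistenceStore interface.

Code Block
import java.util.concurrent.{Executors, TimeUnit}
import org.wso2.siddhi.core.SiddhiManager
import org.wso2.siddhi.core.util.persistence.InMemoryPersistenceStore

object BoltSnapshotSketch extends App {
  val siddhiManager = new SiddhiManager()
  // Assumption: an HBase-backed store would implement the same
  // PersistenceStore interface; the in-memory store stands in for it.
  siddhiManager.setPersistenceStore(new InMemoryPersistenceStore())

  val runtime = siddhiManager.createExecutionPlanRuntime(
    """define stream hdfsAuditLogEventStream (user string, timestamp long);
      |from hdfsAuditLogEventStream#window.timeBatch(10 min)
      |select user, count(timestamp) as aggValue
      |group by user
      |insert into aggStream;""".stripMargin)
  runtime.start()

  // Framework-level periodic snapshot of the CEP engine state.
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
    new Runnable { override def run(): Unit = runtime.persist() },
    1, 1, TimeUnit.MINUTES)
}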


Recovery

For the stored state

  1. The whole-topology state is used to restore a topology when it is restarted.
  2. Single bolt status restore:
    When a bolt starts, it tries to load from the snapshot store the snapshot that matches the current bolt's policy acceptance (see the sketch below). 
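Continuing the persistence sketch above: on startup the bolt restores the latest snapshot via Siddhi's built-in restore call. The check that the snapshot matches the bolt's policy acceptance is assumed to happen at the framework level and is not shown.

Code Block
// On bolt startup: restore the CEP engine from the latest snapshot.
// The policy-acceptance match is assumed to be verified beforehand.
runtime.start()
runtime.restoreLastRevision()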

 

Historical data

This relates to how the analytic runtime handles data for long time windows. A supportive analytic runtime should be able to pre-load historical data, since the framework should be able to call the runtime to load/store snapshots at the correct time.
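A sketch of such a pre-load, continuing the runtime from the earlier sketch. loadHistoricalEvents is a hypothetical helper, not an existing Eagle API; it is assumed to read persisted events (e.g., from HBase) covering the window's time span.

Code Block
// Hypothetical pre-load: replay history into the engine before live traffic.
// loadHistoricalEvents and windowStartMillis are assumptions for this sketch.
val inputHandler = runtime.getInputHandler("hdfsAuditLogEventStream")
runtime.start()
for ((user, ts) <- loadHistoricalEvents(windowStartMillis, System.currentTimeMillis()))
  inputHandler.send(Array[AnyRef](user, Long.box(ts)))
// ...then switch to consuming live events from Kafka.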

 

State management could be a complete topic on its own. It is now covered at Policy State Management.

 

Persistence

As a general monitoring tool, Eagle is not meant to store every point of a metric; users have to define a time window to reduce the granularity of the metrics. This is the user-written CEP-QL above.

Code Block
ec.from().query(....).persist(MetaDescriptor) // by default, the Eagle HBase metric storage

Persist

...

Metadata Descriptor

The metadata descriptor is what the user uses to guide how the metrics are stored. It includes how the data should be persisted, such as the storage reference, row-key generation for HBase, or the Druid column mapping.

...

Code Block
{
    "id" : "meta-descriptor-1",
    "fields" : {
        "field3" : {
            "name" : "field3",
            "datatype" : "string"
        }
        // ... other fields
    },
    "storage" : {
        "hbase" : {
            "table" : "alertdef",
            "columnNameGenerator" : "alphabetGenerator",
            "prefix" : "alertdef",
            "serviceName" : "AlertDefinitionService",
            "timeseries" : false,
            "tags" : [
                "site",
                "dataSource",
                "alertExecutorId",
                "policyId",
                "policyType"
            ],
            "indexes" : {
                "Index_1_alertExecutorId" : {
                    "columns" : [ "alertExecutorId" ],
                    "unique" : true
                }
            }
        }
    }
}

...

 

Persist Storage

...

Metadata

This is the underlying storage where Eagle stores the metrics. The default is the HBase metric storage built into Eagle. Druid might also be an option.

...

Info

{
    "name" : "hbase-default",
    "type" : "hbase",
    "connectionString" : "",
    "props" : {...}
}

 

Metric API

Currently, the metric API is left highly coupled with the underlying storage: for HBase metrics, use the Eagle metric API; for Druid, use the Druid query API. Ideally, users could issue a SQL-style query to fetch the metrics (a hypothetical example follows).
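As an illustration of that ideal, a SQL-style query against the aggregated metrics could look like the following. This is hypothetical syntax, not an existing API; the stream and fields reuse the persist sketch from the Requirement section.

Code Block
-- Hypothetical SQL-style metric query (no such API exists yet): fetch a
-- week of the per-user hourly aggregate produced earlier.
SELECT user, eventCount
FROM hourlyUserActivityStream
WHERE site = 'sandbox'
  AND timestamp BETWEEN '2015-10-01 00:00' AND '2015-10-08 00:00'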

...