
Description

To be a useful monitoring tool, Eagle should provide additional monitoring information beyond the basic alerting definitions. An alerting policy already defines how to aggregate the monitoring stream input; Eagle should provide a similar definition for stream aggregation and for storing the aggregated information as metrics. This page covers the design of:

  • How a user could define aggregated (analytic) information based on a given stream (a time window would be enforced)
  • How Eagle supports persistence of the metrics (storage): different persistence strategies
  • How a user could leverage Eagle's API to fetch the metric information

Things not included in this spec:

  • How the Eagle UI should be designed for this monitoring

Case Study

1. Hadoop jmx metric aggregation and plot

Such metrics are not meant for alerting (alerting is mostly defined on the raw CPU/MEM side), but they are useful for trend analysis.

2. Long term LB capacity monitoring

The LB capacity trend mostly grows steadily as more and more VIPs are added, which makes it useful for capacity planning. The metric window could be as long as weeks or even months, which means such metrics could be stored at a coarse granularity.

3. Multiple source log/alert correlation

This is a typical case where Eagle provides value through distributed aggregation over multiple data sources. Such a feature is not covered by single-source alerting tools: Zabbix is configured per host and lacks multi-source information; ES provides centralized log collection/search but lacks a streaming data processing framework; Druid provides storage and some pre-aggregation but lacks user-defined streaming processing. None of them support multi-stream join operations.

 

Requirement

1. Users should be able to store metrics over a customized time window (e.g. a timeBatch window).

2. Users should be able to write their own stream processing logic without too much coding, since requiring users to write low-level stream handling code is messy.

3. The DSL should be flexible enough to support multi-stream joins.

4. DSL evaluation should be flexible enough to support building a window from historical data (for large time windows). This could be the same feature used in alerting.

Design

Analytic DSL Definition

Currently the alert definition uses the Siddhi DSL as its dialect. The analytic DSL would keep the same user experience.

One example of the stream analytic processing looks like:

// Consume the AuditLog stream from Kafka, partition by user,
// and evaluate the Siddhi query per partition.
ec.fromKafka[AuditLog]
    .groupBy(_.user)
    .query("""
        from hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp, 10 min)
        select user, count(timestamp) as aggValue
        group by user
        having aggValue >= 5
        insert into anotherAlertStream;
        """.stripMargin)
// hdfsAuditLogEventStream -> anotherAlertStream

Partition

Partitioning is the main concern when talking about CEP handling. The analytic DSL above does not incorporate partitioning itself.

Basic

A groupBy strategy should be chosen by the user so that related metrics are routed to the same processing node. This is the same way Eagle alerting works. A customized, balanced algorithm could also be adopted (see the balancing algorithm of alerting policy evaluation). A minimal sketch is shown below.
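For illustration, here is a minimal sketch of how routing might be expressed on top of the Scala DSL. The partitionOf helper and the partitionBy hook are hypothetical, not part of the current API:

// Hypothetical sketch: route events with the same key to the same
// evaluator instance, the same way alert executors use fields-grouping.
def partitionOf(user: String, numTasks: Int): Int =
  (user.hashCode & Int.MaxValue) % numTasks       // simple hash routing, same key -> same node

ec.fromKafka[AuditLog]
    .groupBy(_.user)                              // default: fields-grouping on the group key
    // .partitionBy(e => partitionOf(e.user, 8))  // hypothetical explicit/balanced override
    .query("...")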

 

Advanced

In extreme cases where an analytic DSL on a single node cannot handle the partitioned stream, we need to incorporate a Diamond-style map/reduce process.

An optional parallelism hint could be exposed to the user, if they want to specify the parallelism.

Note:

Such framework-level map/reduce requires that the analytic behavior itself can be expressed as a simple map/reduce. This requires the user to have more careful knowledge of how to write their logic in our DSL. A sketch of the parallelism hint is given below.
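As an illustration, a parallelism hint might look like the following. The parallelism and reduceQuery hooks are hypothetical, shown only to make the two-phase (partial/merge) evaluation concrete:

// Hypothetical sketch: run the partial count on 4 parallel evaluators,
// then merge the partial counts in a final reduce-style query.
ec.fromKafka[AuditLog]
    .groupBy(_.user)
    .parallelism(4)                               // hint: 4 partial evaluators
    .query("""
        from hdfsAuditLogEventStream#window.timeBatch(1 min)
        select user, count(timestamp) as partialCount
        group by user
        insert into partialAggStream;
        """)
    .reduceQuery("""
        from partialAggStream#window.timeBatch(1 min)
        select user, sum(partialCount) as aggValue
        group by user
        insert into userAggStream;
        """)

This works because a count decomposes into a sum of partial counts; not every aggregation decomposes this way, which is the restriction mentioned in the note above.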

 

State management

State management stores and restores the state of the monitoring pipeline during streaming processing. A couple of aspects are included:

  1. Whole-topology state store. "Lightweight Asynchronous Snapshots for Distributed Dataflows" is a valid system-level method to take a snapshot of a streaming system without halting the stream.
  2. Single bolt state store
    In the current tech stack, Siddhi already supports specifying a persistence store to persist the engine state. The framework could use this to periodically snapshot the CEP engine into an HBase store. The bolt state includes both the CEP engine state and outside-engine state (e.g. policy acceptance related information). See the sketch after this list.
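A minimal sketch of the bolt-level snapshot, assuming the Siddhi 3.x persistence API (setPersistenceStore / persist()); the HBase-backed store would be a hypothetical Eagle-side implementation of Siddhi's PersistenceStore interface:

import org.wso2.siddhi.core.SiddhiManager
import org.wso2.siddhi.core.util.persistence.InMemoryPersistenceStore

val siddhiManager = new SiddhiManager()
// Swap in an HBase-backed PersistenceStore implementation here.
siddhiManager.setPersistenceStore(new InMemoryPersistenceStore())

val executionPlan =
  """
  define stream hdfsAuditLogEventStream (timestamp long, user string, src string);
  from hdfsAuditLogEventStream#window.timeBatch(1 min)
  select user, count(timestamp) as aggValue group by user
  insert into userAggStream;
  """
val runtime = siddhiManager.createExecutionPlanRuntime(executionPlan)
runtime.start()

// Called from a framework-level timer: checkpoint the CEP engine state.
runtime.persist()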


Recovery

For the stored state:

  1. The whole-topology state would be used for restore when a topology is restarted.
  2. Single bolt state restore:
    When a bolt starts, it would try to load the snapshot from the snapshot store that matches the current bolt's policy acceptance. See the sketch after this list.
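Continuing the snapshot sketch above (same assumptions about the Siddhi 3.x API; matching the revision against policy acceptance is omitted), restore on bolt startup might look like:

// On bolt prepare(): rebuild the execution plan, start the engine,
// then restore the last checkpointed state before accepting new events.
val runtime = siddhiManager.createExecutionPlanRuntime(executionPlan)
runtime.start()
runtime.restoreLastRevision()   // no-op if no snapshot has been taken yet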

 

Persistence

As a general monitoring tool, Eagle is not meant to store every point of a metric into storage; the user has to define a time window to reduce the granularity of the metrics. This is the user-written CEP-QL shown above.

ec.from().query(....).persist(MetaDescriptor)  // by default the Eagle HBase metric storage
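Putting the pieces together, a persisted aggregation might look like the following sketch. The persist call and the MetaDescriptor.load helper are illustrative; only the Siddhi dialect is the existing one:

// Hypothetical end-to-end sketch: aggregate per user over a 5 minute
// timeBatch window and persist the result through a meta descriptor.
val descriptor = MetaDescriptor.load("meta-descriptor-1")   // see the descriptor section below

ec.fromKafka[AuditLog]
    .groupBy(_.user)
    .query("""
        from hdfsAuditLogEventStream#window.timeBatch(5 min)
        select user, count(timestamp) as aggValue
        group by user
        insert into userAggMetricStream;
        """)
    .persist(descriptor)   // defaults to the Eagle HBase metric storage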

Persist Metadata Descriptor

The meta descriptor is what the user uses to guide how the metrics are stored. It includes how the data should be persisted, such as the storage reference, row-key generation for HBase, or Druid column mapping.

The storage should be metadata-driven through a meta descriptor; the meta descriptor is the metadata the Eagle storage framework is based on.

Below is a storage meta descriptor that describes how to store this entity in HBase. If another storage option is supported, a new node under "storage" would be added and consumed by the storage framework.

{
   "id" : "meta-descriptor-1",
   "fields" : {
      "field3" : {
         "name" : "field3",
         "datatype" : "string"
      }
      // ... other fields
   },
   "storage" : {
      "hbase" : {
         "table" : "alertdef",
         "columnNameGenerator" : "alphabetGenerator",
         "prefix" : "alertdef",
         "serviceName" : "AlertDefinitionService",
         "timeseries" : false,
         "tags" : [
            "site",
            "dataSource",
            "alertExecutorId",
            "policyId",
            "policyType"
         ],
         "indexes" : {
            "Index_1_alertExecutorId" : {
               "columns" : [ "alertExecutorId" ],
               "unique" : true
            }
         }
      }
   }
}
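For illustration, the storage framework could model this descriptor with something like the following Scala sketch. The class and field names are hypothetical, derived from the JSON above:

// Hypothetical model of the persist meta descriptor shown above.
case class FieldDef(name: String, datatype: String)

case class IndexDef(columns: Seq[String], unique: Boolean)

case class HBaseStorage(
  table: String,
  columnNameGenerator: String,       // e.g. "alphabetGenerator"
  prefix: String,
  serviceName: String,
  timeseries: Boolean,
  tags: Seq[String],
  indexes: Map[String, IndexDef]
)

case class MetaDescriptor(
  id: String,
  fields: Map[String, FieldDef],
  storage: Map[String, HBaseStorage] // keyed by storage type, e.g. "hbase"
)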

 

Persist Storage Metadata

This is the underlying storage where Eagle stores the metrics. The default is the HBase metric storage built into Eagle; Druid might also be an option.

Storage is defined and stored in the Eagle backend HBase. The JSON below briefly describes the metadata.

{
   "name" : "hbase-default",
   "type" : "hbase",
   "connectionString" : "",
   "props" : { ... }
}

 

Metric API

Currently the metric API is left highly coupled with the underlying storage: for HBase metrics, use the Eagle metric API; for Druid, use the Druid query API. Ideally, the user might use a SQL-style query to fetch the metrics.
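As an illustration only, fetching an aggregated metric over HTTP might look like the sketch below. The host, path and query syntax are placeholders, not a confirmed Eagle REST contract:

import java.net.{URL, URLEncoder}
import scala.io.Source

// Hypothetical sketch: fetch aggregated metric points for a time range.
val query     = URLEncoder.encode("""GenericMetricService[@site="sandbox"]{*}""", "UTF-8")
val startTime = URLEncoder.encode("2016-01-01 00:00:00", "UTF-8")
val endTime   = URLEncoder.encode("2016-01-01 01:00:00", "UTF-8")

val url = new URL(
  s"http://eagle-service:9099/eagle-service/rest/entities" +
  s"?query=$query&startTime=$startTime&endTime=$endTime&pageSize=1000")

val response = Source.fromInputStream(url.openStream()).mkString
println(response)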

 

 
