Description
To be a useful monitoring tool, Eagle should provide monitoring information beyond basic alert definitions. An alerting policy already defines how to aggregate its monitoring input stream, so Eagle should offer a similar definition for stream aggregation, together with metric storage for the aggregated results. This page covers the design of:
...
- How the Eagle UI should be designed for monitoring
Case Study
1. Hadoop JMX metric aggregation and plotting
...
This is a typical case where Eagle adds value as distributed aggregation over multiple data sources. Single-source alerting tools do not cover this kind of feature: Zabbix configured per host lacks information from multiple sources; ES provides centralized log collection and search but lacks a streaming data processing framework; Druid provides storage and a form of pre-aggregation but lacks user-defined stream processing on top of it. None of them supports join operations across multiple streams.
Requirement
1. Users should be able to store metrics in a customized time window using a timeBatch window.
...
4. DSL evaluation should be flexible enough to build a window from historical data (for large time windows). This could be the same feature that is used in alerting.
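To make the first requirement concrete, the following is a minimal plain-Scala sketch of what a timeBatch (fixed, non-overlapping) window does to a metric stream before storage. It does not use Siddhi or any Eagle API; `MetricPoint`, `WindowedMetric` and `timeBatch` are hypothetical names introduced only for illustration, and timestamps are assumed to be non-negative epoch milliseconds.

```scala
object TimeBatchSketch {
  // Hypothetical raw metric point: metric name, event time (epoch millis), value.
  final case class MetricPoint(name: String, timestampMs: Long, value: Double)

  // Hypothetical stored row: one aggregate per (metric name, window start).
  final case class WindowedMetric(name: String, windowStartMs: Long, count: Long, sum: Double) {
    def avg: Double = sum / count
  }

  // Assign every point to the fixed window that contains its timestamp,
  // then aggregate each window into a single row.
  // Assumes non-negative timestamps so that % yields the offset into the window.
  def timeBatch(points: Seq[MetricPoint], windowMs: Long): Seq[WindowedMetric] = {
    require(windowMs > 0, "window size must be positive")
    points
      .groupBy(p => (p.name, p.timestampMs - (p.timestampMs % windowMs)))
      .map { case ((name, start), ps) =>
        WindowedMetric(name, start, ps.size.toLong, ps.map(_.value).sum)
      }
      .toSeq
      .sortBy(m => (m.name, m.windowStartMs))
  }
}
```

With a one-minute window, any number of raw points that fall in the same minute collapse into one stored row per metric, which is the reduction in stored volume the requirement asks for.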
Design
Analytic DSL Definition
Currently, alert definitions use the Siddhi DSL as the dialect. The analytic DSL would keep the same user experience.
...
Code Block |
---|
ec.fromKafka[AuditLog] .groupBy(_.user) .query( """ from hdfsAuditLogEventStream[(src == '/tmp/private')]#window.externalTime(timestamp,10 min) select user, count(timestamp) as aggValue group by user having aggValue >= 5 insert into anotherAlertStream; """.stripMargin) // hdfsAuditLogEventStream -> anotherAlertStream |
Partition
Partitioning is the main concern in CEP handling. The analytic DSL above does not itself incorporate partitioning.
...
This kind of framework-driven map/reduce may require that the analytic logic itself be expressible as a simple map/reduce. Users may therefore need a deeper understanding of how to write their logic in our DSL.
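As an illustration of what "able to support simple map/reduce" means for user logic, here is a hedged plain-Scala sketch of a per-user count computed as partial results per partition and then merged. It is not Eagle or Siddhi code; `Event`, `partialCounts`, `mergeCounts` and `countByUser` are hypothetical names, and the partition layout is simply assumed to be a sequence of event batches.

```scala
object PartitionedCountSketch {
  // Hypothetical event carrying only the fields this sketch needs.
  final case class Event(user: String, timestampMs: Long)

  // "Map" step: each partition counts its own events per user.
  def partialCounts(partition: Seq[Event]): Map[String, Long] =
    partition.groupBy(_.user).map { case (user, es) => user -> es.size.toLong }

  // "Reduce" step: partial counts merge by addition, so the result does not
  // depend on how events were spread across partitions.
  def mergeCounts(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (user, n)) => acc.updated(user, acc.getOrElse(user, 0L) + n) }

  // Run the map step on every partition, then fold the partial results together.
  def countByUser(partitions: Seq[Seq[Event]]): Map[String, Long] =
    partitions.map(partialCounts).foldLeft(Map.empty[String, Long])(mergeCounts)
}
```

A count merges cleanly because addition is associative and commutative. An aggregate such as an average would not merge directly; the user would need to carry a sum and a count through the map step and divide only at the end, which is the kind of extra care the paragraph above refers to.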
State management
State management stores and restores the monitoring state during stream processing. It involves a couple of aspects:
...
- The whole topology state would be used for restoring when a topology is restarted.
- Single bolt state restore:
When a bolt starts, it tries to load from the snapshot store those snapshots that match the policies the bolt currently accepts.
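The single-bolt restore described above can be sketched as follows. This is a minimal in-memory model under stated assumptions, not Eagle's or Storm's actual API: `Snapshot`, `SnapshotStore` and `Bolt` are hypothetical types, snapshots are assumed to be keyed by policy id, and the state is assumed to be a simple map of counters.

```scala
object SnapshotRestoreSketch {
  // Hypothetical snapshot: the saved state of one policy at a point in time.
  final case class Snapshot(policyId: String, takenAtMs: Long, state: Map[String, Long])

  // Hypothetical in-memory store; a real one would sit on durable storage.
  final class SnapshotStore {
    private var latest = Map.empty[String, Snapshot]

    // Keep only the most recent snapshot per policy; older ones are ignored.
    def save(s: Snapshot): Unit =
      if (latest.get(s.policyId).forall(_.takenAtMs <= s.takenAtMs)) latest += (s.policyId -> s)

    def load(policyId: String): Option[Snapshot] = latest.get(policyId)
  }

  // Hypothetical bolt that only restores state for the policies it accepts.
  final class Bolt(acceptedPolicies: Set[String]) {
    private var state = Map.empty[String, Map[String, Long]]

    // On start, load the snapshot of each accepted policy if one exists.
    // Policies without a snapshot simply start with no restored state.
    def start(store: SnapshotStore): Unit =
      state = acceptedPolicies.flatMap(id => store.load(id).map(s => id -> s.state)).toMap

    def stateOf(policyId: String): Option[Map[String, Long]] = state.get(policyId)
  }
}
```

Filtering by the bolt's accepted policies at load time means a bolt that is reassigned a different set of policies after a restart picks up only the state it is now responsible for.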
Exactly-once semantics
// TBD
Persistence
As a general monitoring tool, Eagle is not meant to store every metric point; users have to define a time window to reduce the granularity of the metrics. This is the user-written CEP-QL above.
...
Info |
---|
{ "name": "hbase-default", "type": "hbase", "connectionString" : "", "props": {...} } |
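One way such a storage definition could be consumed is to select a backend from its `type` field. The sketch below is an assumption-laden illustration, not Eagle's implementation: `StorageConfig`, `MetricStore`, `HBaseMetricStore`, `DruidMetricStore` and `storeFor` are hypothetical names, `props` is modelled as an opaque string map, and JSON parsing is left out entirely.

```scala
object StorageConfigSketch {
  // Mirrors the fields of the storage definition; storageType holds the "type" value.
  final case class StorageConfig(
      name: String,
      storageType: String,
      connectionString: String,
      props: Map[String, String])

  // Hypothetical backend abstraction; real stores would expose write/query methods.
  sealed trait MetricStore { def describe: String }

  final case class HBaseMetricStore(cfg: StorageConfig) extends MetricStore {
    def describe: String = s"hbase store '${cfg.name}'"
  }

  final case class DruidMetricStore(cfg: StorageConfig) extends MetricStore {
    def describe: String = s"druid store '${cfg.name}'"
  }

  // Pick a backend from the type field; unknown types are reported, not guessed.
  def storeFor(cfg: StorageConfig): Either[String, MetricStore] =
    cfg.storageType.toLowerCase match {
      case "hbase" => Right(HBaseMetricStore(cfg))
      case "druid" => Right(DruidMetricStore(cfg))
      case other   => Left(s"unsupported storage type: $other")
    }
}
```

Returning an `Either` keeps a misconfigured `type` visible to the caller instead of silently falling back to a default backend.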
Metric API
Currently, the metric API is left tightly coupled to the underlying storage: for HBase metrics, use the Eagle metric API; for Druid, use the Druid query API. Ideally, users would use SQL-style queries to get the metrics.