Introducing Apache Eagle

Eagle is a highly extensible, scalable monitoring and alerting platform, built on a flexible application framework and proven big data technologies such as Kafka, Spark and Storm. It ships a rich set of applications for big data platform monitoring, e.g. HDFS/HBase/YARN service health checks, JMX metrics, daemon logs, audit logs and YARN applications. External developers can define applications to monitor their own NoSQL stores or web servers, and publish them to the Eagle application repository at their own discretion. Eagle also provides a state-of-the-art alert engine that reports security breaches, service failures and application anomalies, and is highly customizable through alert policy definitions.

Terminology

Site
A virtual concept in Apache Eagle. You can use it to manage a group of application instances, and to tell the instances apart when the same application is installed multiple times.

Application
An application is the first-class citizen in Apache Eagle. It stands for an end-to-end monitoring/alerting solution, which usually contains the monitoring source onboarding, source schema specification, alerting policy and dashboard definition.

Stream
A stream is the input to the Alert Engine. Each application should define its own stream.

Data Activity Monitoring
The built-in application that monitors HDFS/HBase/Hive operations and allows users to define policies that capture security breaches in real time.

Alert Engine
A special application shared by all other monitoring applications. It reads data from Kafka, processes the data by applying policies in real time, and generates alert notifications.

Policy
A rule used by the Alert Engine to match the data input from Kafka.

Alert
If any data input to the Alert Engine meets a policy, the Alert Engine sends a message to a notification channel. These messages are called alerts.

Notification Channel
The channel to which alerts are sent. It can be the SMTP channel or the Kafka channel.
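
The way these terms fit together can be illustrated with a minimal Python sketch. It is not Eagle's actual API: the class names, the event fields and the stream name are hypothetical, a plain Python predicate stands in for a real policy definition, and an in-memory list stands in for the SMTP or Kafka channel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]  # one record read from a stream


@dataclass
class Policy:
    """A rule applied to each event of one stream (hypothetical shape)."""
    name: str
    stream: str
    matches: Callable[[Event], bool]


@dataclass
class AlertEngine:
    """Applies every policy to incoming events and publishes alerts."""
    policies: List[Policy]
    # A notification channel is modeled as any callable that accepts an alert.
    channels: List[Callable[[dict], None]] = field(default_factory=list)

    def process(self, stream: str, event: Event) -> List[dict]:
        alerts = []
        for policy in self.policies:
            if policy.stream == stream and policy.matches(event):
                alert = {"policy": policy.name, "stream": stream, "event": event}
                alerts.append(alert)
                for publish in self.channels:
                    publish(alert)
        return alerts


# Usage: alert when anyone deletes a file under a sensitive directory.
sent: List[dict] = []  # stands in for an SMTP or Kafka channel
engine = AlertEngine(
    policies=[
        Policy(
            name="sensitive_delete",
            stream="hdfs_audit_log_stream",
            matches=lambda e: e.get("cmd") == "delete"
            and str(e.get("src", "")).startswith("/secure/"),
        )
    ],
    channels=[sent.append],
)

engine.process("hdfs_audit_log_stream", {"user": "bob", "cmd": "open", "src": "/secure/a"})
engine.process("hdfs_audit_log_stream", {"user": "eve", "cmd": "delete", "src": "/secure/a"})
print(len(sent))  # 1: only the delete event matched the policy
```

Keeping the policy as data that the engine looks up for every event is what lets policies change without touching the engine itself.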

Key Qualities

Extensible
Eagle builds its core framework around the application concept: everything that has runtime logic is built as an application. Developers can easily build their own out-of-the-box monitoring applications with the Eagle application framework and deploy them into Eagle.
Scalable
The Eagle core team chose proven big data technologies to build the fundamental runtime, like the distributed
Real-time
A computing engine based on Storm or Spark Streaming allows Eagle to apply policies to input data and generate alerts in real time.
Dynamic
Eagle users can dynamically change their alert policies without any impact on the underlying runtime.
Easy-to-Use
Users can enable monitoring for a service within minutes by choosing a built-in monitoring application and configuring a few parameters for the service.
Non-Invasive
Apache Eagle uses out-of-the-box applications to monitor services, so the services themselves need no changes.

Example Use Cases

Data Activity Monitoring

Data activity represents how users explore the data provided by big data platforms. Analyzing data activity and alerting on insecure access are fundamental requirements for securing enterprise data. As data volume grows exponentially with Hadoop, Hive and Spark, understanding data activities for every user becomes extremely hard, let alone alerting on a single malicious event in real time among petabytes of streaming data per day.

Securing enterprise data starts with understanding data activities for every user. Apache Eagle (incubating, called Eagle in the following) integrates with many popular big data platforms, e.g. Hadoop, Hive, Spark and Cassandra. With Eagle, users can browse the data hierarchy, mark sensitive data and then create comprehensive policies to alert on insecure data access.

Job Performance Analysis

Running MapReduce jobs is the most popular way to analyze data in a Hadoop system. Analyzing job performance and providing tuning suggestions are critical for Hadoop system stability, job SLAs, resource usage, etc.
Eagle analyzes job performance with two complementary approaches. First, Eagle periodically takes snapshots of all running jobs through the YARN API. Second, Eagle continuously reads job lifecycle events immediately after a job completes. With the two approaches, Eagle can analyze a single job's trend, data skew problems, failure reasons, etc. More interestingly, Eagle can analyze the whole Hadoop cluster's performance by taking all jobs into account.
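
As an illustration of what a data skew check might look like, the following Python sketch flags tasks that run far longer than the typical task in the same job. This is an assumed heuristic for explanation only, not Eagle's actual analysis: the function name, the task ids and the threshold ratio of 3 are all hypothetical.

```python
from statistics import median
from typing import Dict, List


def skewed_tasks(durations: Dict[str, float], ratio: float = 3.0) -> List[str]:
    """Return ids of tasks whose duration exceeds `ratio` times the median."""
    if not durations:
        return []
    typical = median(durations.values())
    if typical <= 0:
        return []
    return sorted(task for task, secs in durations.items() if secs > ratio * typical)


# Hypothetical reduce-task durations in seconds for one completed job.
task_seconds = {"r_0": 40.0, "r_1": 42.0, "r_2": 38.0, "r_3": 400.0}
print(skewed_tasks(task_seconds))  # ['r_3']
```

The median is used rather than the mean so that one very slow task does not inflate the baseline it is compared against.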

Cluster Performance Analytics

It is critical to understand why a cluster performs badly. Is it because of some misbehaving jobs that were recently onboarded, a huge number of tiny files, or degrading namenode performance?
Eagle calculates per-minute resource usage from individual jobs in real time, e.g. CPU, memory, HDFS IO bytes and HDFS IO numOps, and also collects namenode JMX metrics. Correlating these metrics helps system administrators find the root cause of cluster slowness.
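
The per-minute roll-up can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Eagle's implementation: the sample layout (job id, epoch seconds, memory in MB) is hypothetical, and each job is assumed to report at most one sample per minute so that summing does not double count.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (job_id, epoch_seconds, memory_mb) samples; the field layout is hypothetical.
Sample = Tuple[str, int, float]


def usage_per_minute(samples: Iterable[Sample]) -> Dict[int, float]:
    """Sum memory usage across all jobs into buckets keyed by minute start."""
    buckets: Dict[int, float] = defaultdict(float)
    for _job_id, ts, memory_mb in samples:
        minute_start = ts - ts % 60  # truncate the timestamp to its minute
        buckets[minute_start] += memory_mb
    return dict(buckets)


samples = [
    ("job_1", 120, 1024.0),
    ("job_2", 130, 2048.0),
    ("job_1", 185, 1024.0),
]
print(usage_per_minute(samples))  # {120: 3072.0, 180: 1024.0}
```

A cluster-level series like this, keyed by minute, can then be lined up against namenode metrics collected over the same minutes.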
