Category | Dec. 9th | Dec. 23rd | Jan. 6th | Jan. 20th | Feb. 17th | Mar. 3rd | Mar. 17th | Mar. 31st
Service Availability   

1.1 Service availability trend dashboard

    
JPM 

4.1 Daily job failure report (Dec. 16th)

4.2 Job failure diagnostics (task failure category)

4.2 Task Failing Nodes List and bad node detection

4.3 SLA job precaution (regularly scheduled job identification)


4.4 Job comparison w.r.t. counters difference

4.5 Job optimization suggestions (GC, partition, compression, skew, ...)

4.6 Queue Monitoring

4.7 Spark Job Monitoring

 
Master Node JMX 

2.1 Migrate JMX policies from Eagle v0.3 to v0.5 for Apollo, Ares and Artemis

 

2.2 Master Node JMX Dashboard (Beta)

Slave Node JMX Metrics

3.1 JMX Metrics Retrieval Script

3.2 RS JMX Metrics on-boarding for Apollo HBase

 

3.3 RS JMX Metrics Alerting

 

3.4 RS JMX Dashboard (Beta)

    
Slave Node System Metrics  

5.1 System metrics data on-boarding

 

  

5.2 System metrics dashboard

  
DAM  

6.1 HDFS Audit log traffic monitoring

 

     
Service Log Processing        
Production Deployment 

7.1 Capability to integrate with SEC

      
Eagle Improvements

8.1 Stabilize alert engine

 

8.2 Simplify alert configuration

     