ID: IEP-35
Author: Nikolay Izhikov
Sponsor:
Created:
Status: IN PROGRESS - Phase 1,2 implemented.


Motivation

Currently, Ignite's monitoring API is incomplete and fragmented. The existing APIs use different protocols: JMX, Java API, SQL system views, text logs, etc.

From an administrator's point of view, it is impossible to understand what is going on in a running cluster:

  • Which user tasks are executed?
  • What resources are used by each task?

Because of this variety of protocols, it is very hard to export all existing metrics to an arbitrary monitoring system.

The goal of this IEP is to provide a way to answer three questions:

  • What is running inside the Ignite cluster?
    • An administrator should be able to list each user object that was created (or run) inside a cluster via every monitoring interface we will support (JMX, SQL, CLI, etc.).
    • An administrator should be able to identify the source of each user object via some ID or other user-provided info.
  • What is running slow?
    • If the execution of some user code violates a configured threshold, a handler for such events should be executed. By default, the handler should print a WARN log message with all available information about the slow piece of user-provided code.
  • What will be running slow?
    • We should provide a way to profile the cluster. Consider the following scenario:
      • Enable profiling mode.
      • Execute some arbitrary workload.
      • Collect profiling info.
      • Run an Ignite-provided tool that creates a report containing workload statistics. Examples of such tools are:
        • Oracle AWR
        • PostgreSQL pgBadger

Description:


Phase 1: What is running inside the Ignite cluster? + What is running slow?

1. We should add the following entities to Ignite:

  1. MetricRegistry - an Ignite subsystem that provides a set of sensors and lists. Examples:
    1. Cache,
    2. Compute,
    3. ServiceGrid,
    4. etc.
  2. Metric - a named number with a well-defined algorithm for calculating its value at any given moment in time.

    import java.util.Collection;
    import java.util.Map;

    class Metric {
        String name; // Metric name: EntryCount, MemoryAvailable, etc.
        long value; // Current value (may also be a double).
        Collection<Map.Entry<String, String>> labels; // Key-value labels: hostName, cacheName, etc.
    }

    class LongMetric extends Metric {
        long ts; // Timestamp of the last value update.
    }
  3. SystemView - a named list that contains info about Ignite objects. Examples: list of caches, list of transactions, list of nodes, list of running queries, last N queries, etc.
  4. MonitoringEvent - generated when some user-defined code violates a threshold.

    class MonitoringEvent<T> {
        MonitoringEventType type; // Event type.
        T info; // Event info; its type differs for different event types.
    }
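
As an illustration of how a MonitoringEvent might be produced, the sketch below times a piece of user code and passes an event to a handler when a configured threshold is exceeded. The names SlowTaskWatcher and thresholdNanos and the SLOW_TASK event type are hypothetical and are not part of the Ignite API; the default handler only logs a WARN-level message.

```java
import java.util.function.Consumer;
import java.util.logging.Logger;

/** Hypothetical event types; the IEP does not enumerate them. */
enum MonitoringEventType { SLOW_TASK }

/** Event raised when user-defined code violates a threshold. */
class MonitoringEvent<T> {
    final MonitoringEventType type; // Event type.
    final T info;                   // Event info; its type depends on the event type.

    MonitoringEvent(MonitoringEventType type, T info) {
        this.type = type;
        this.info = info;
    }
}

/** Hypothetical helper that times user code and reports threshold violations. */
class SlowTaskWatcher {
    private static final Logger LOG = Logger.getLogger(SlowTaskWatcher.class.getName());

    private final long thresholdNanos;
    private final Consumer<MonitoringEvent<String>> handler;

    /** Uses the default handler, which logs a WARN-level message. */
    SlowTaskWatcher(long thresholdNanos) {
        this(thresholdNanos, evt -> LOG.warning("Slow user code: " + evt.info));
    }

    SlowTaskWatcher(long thresholdNanos, Consumer<MonitoringEvent<String>> handler) {
        this.thresholdNanos = thresholdNanos;
        this.handler = handler;
    }

    /** Runs the task and invokes the handler if it took longer than the threshold. */
    void run(String taskName, Runnable task) {
        long start = System.nanoTime();
        try {
            task.run();
        }
        finally {
            long elapsed = System.nanoTime() - start;
            if (elapsed > thresholdNanos)
                handler.accept(new MonitoringEvent<>(MonitoringEventType.SLOW_TASK,
                    taskName + " took " + elapsed + " ns"));
        }
    }
}
```

A caller could replace the default handler with its own Consumer, for example to collect events instead of logging them.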

2. GridMetricManager, GridSystemViewManager: 

  1. GridMetricManager - should be able to store and query Ignite metrics.
  2. GridSystemViewManager - should be able to store and export SystemViews.
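
To make the "store and query" responsibility concrete, here is a minimal in-memory sketch. SimpleMetricStore and its methods are hypothetical names chosen for illustration; this is not the actual GridMetricManager, and it assumes that metrics are long counters addressed by a dotted name such as "cache.EntryCount".

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

/** Hypothetical in-memory metric store; not the actual GridMetricManager. */
class SimpleMetricStore {
    private final Map<String, AtomicLong> metrics = new ConcurrentHashMap<>();

    /** Registers the metric if it is absent and returns its counter. */
    AtomicLong register(String name) {
        return metrics.computeIfAbsent(name, k -> new AtomicLong());
    }

    /** Returns the current value, or an empty result if the metric is unknown. */
    OptionalLong value(String name) {
        AtomicLong m = metrics.get(name);
        return m == null ? OptionalLong.empty() : OptionalLong.of(m.get());
    }

    /** Returns the sorted names of all metrics that start with the given prefix. */
    List<String> names(String prefix) {
        return metrics.keySet().stream()
            .filter(n -> n.startsWith(prefix))
            .sorted()
            .collect(Collectors.toList());
    }
}
```

Subsystems would call register once and update the returned counter, while monitoring code would use value and names to query.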

3. Exporters:

Specific interfaces will be supported through exporters.
Exporters should work only with a read-only version of GridMetricManager and must not rely on any other knowledge of Ignite internals.

Examples of exporters:

  1. JMX
  2. HTTP
  3. SQL System View
  4. Log
  5. etc.
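
The separation between exporters and Ignite internals could look roughly like the following sketch. The interfaces ReadOnlyMetrics and MetricExporter and the class LineExporter are assumptions made for illustration, not the real exporter SPI; the only point shown is that an exporter receives a read-only view and nothing else.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical read-only view of the metric storage that is handed to exporters. */
interface ReadOnlyMetrics {
    /** Returns a snapshot of metric names and their current values. */
    Map<String, Long> snapshot();
}

/** Hypothetical exporter contract: an exporter sees only the read-only view. */
interface MetricExporter {
    void export(ReadOnlyMetrics metrics);
}

/** Exporter that renders each metric as a "name=value" line, e.g. for a log. */
class LineExporter implements MetricExporter {
    final List<String> lines = new ArrayList<>();

    @Override public void export(ReadOnlyMetrics metrics) {
        // Iteration order follows the map returned by the view.
        metrics.snapshot().forEach((name, val) -> lines.add(name + "=" + val));
    }
}
```

A JMX, HTTP or SQL exporter would implement the same single method and differ only in how it publishes the snapshot.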

Ignite objects/entities that should be listed in Phase 2

  1. A list of compute tasks:
    1. Closures
    2. Map-reduce jobs
    3. ComputeJob
    4. Scheduled tasks
  2. Service grid:
    1. A list of services with deployment status
  3. Caches
  4. Cache groups
  5. Cluster nodes
  6. SQL objects
    1. Schemas
    2. Tables
    3. Views
    4. Table columns
    5. View columns
    6. Indexes
  7. Queries:
    1. SQL
    2. Scan
    3. Text
    4. ContinuousQuery
  8. IgniteCache#invoke
  9. put, get, remove, replace, clear operations
  10. Transactions with lock list
  11. DataStreamers
  12. Explicit locks (IgniteCache#lock)
  13. DataStructures
    1. Queue
    2. Set
    3. AtomicLong
    4. AtomicReference
    5. CountDownLatch
    6. Sequence
    7. Semaphore
  14. Message topics (IgniteMessaging)
  15. Thin client connections.
  16. Machine Learning - ???

Internal Data Structures and Processes we should provide info for

  1. PME queue
  2. Service exchange queue
  3. Security events

Risks and Assumptions

These changes put backward compatibility at risk.

We should consider implementing this IEP as part of Ignite 3.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Proof-of-concept-td41904.html

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Current-API-Analysis-td41823.html

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Metrics-configuration-td42478.html

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-GridJobProcessorMetrics-migration-td42415.html#a42441

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Replace-RunningQueryManager-with-GridSystemViewManager-td43794.html

Gap analysis

Current monitoring APIs availability:

Monitoring completely unavailable:

  1. Compute Grid
    1. Some basic numbers are available in ClusterMetrics (getMaximumActiveJobs, getCurrentActiveJobs, etc.)
  2. Service Grid
  3. Data streamers
  4. Distributed Data Structures
  5. Ignite messaging (Ignite#message)
  6. 3rd-party storage
  7. ContinuousQuery
  8. MVCC transactions
  9. ML - What should be available?
  10. Explicit locks

Monitoring API available:

  1. Cache
    1. PDS + offheap memory
      1. Ignite#dataRegionMetrics
      2. Ignite#dataStorageMetrics
      3. Ignite#persistentStoreMetrics
    2. Queries
      1. IgniteCache#queryMetrics
      2. IgniteCache#queryDetailMetrics
      3. QueryHistoryMetrics
    3. IgniteCache#mxBean
    4. IgniteCache#localMxBean
  2. SQL
    1. LOCAL_SQL_RUNNING_QUERIES
    2. INDEXES
  3. Transactions
    1. JMX - TransactionMetricsMxBean
    2. JMX - TransactionMXBean
  4. ThinClients
    1. JMX - ClientProcessorMXBean
  5. IoStatisticsManager, IoStatisticsHolder
  6. GridJobMetricsProcessor
  7. IgniteMBeansManager
  8. IgniteSpiManagementMBean

Design Principles

  1. Sensors should contain only raw values. Ignite should not aggregate numeric metrics.
    Min, max, avg and other functions are the responsibility of the external monitoring system.
  2. Every user task should have an ID or name, provided by the user at start time, that allows monitoring info to be associated with user code.
    Users should be able to find their code reflected in monitoring.
  3. Every user task should carry a "connectionID" ("sessionID", "clientID") or a similar identifier.
    Users should be able to determine which connection (session, client) triggered a specific task.
  4. No computation to get current values. Sensor and list values should be updated when the corresponding events occur.
    When a sensor is queried, its value should simply be read from internal storage, with no additional computation.
  5. Users should be able to enable or disable any sensor group or list at runtime. Ignite should provide administrator interface(s) to enable or disable each sensor group or list separately.
    Disabled sensors and lists should incur no performance penalty.
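
Principles 1, 4 and 5 can be illustrated with a small sketch. ToggleableSensor is a hypothetical class, not an Ignite type. It stores a raw value that is updated when an event occurs, returns that value on read without computing anything, and can be switched off at runtime. Note that the disabled path here still costs one volatile read, so this only approximates the "no performance penalty" goal.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sensor that stores a raw value and can be toggled at runtime. */
class ToggleableSensor {
    private final AtomicLong val = new AtomicLong();
    private volatile boolean enabled = true;

    /** Called from the code path where the event occurs (e.g. an entry is added). */
    void onEvent(long delta) {
        if (enabled) // A disabled sensor costs a single volatile read.
            val.addAndGet(delta);
    }

    /** Returns the stored raw value; nothing is computed on read. */
    long value() {
        return val.get();
    }

    /** Enables or disables updates to this sensor. */
    void enabled(boolean enabled) {
        this.enabled = enabled;
    }
}
```

Aggregations such as min, max or avg are deliberately absent; under principle 1 they would be computed by the external monitoring system from the exported raw values.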

Reference Links

https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF027

https://www.oracle.com/technetwork/database/manageability/diag-pack-ow09-133950.pdf

https://github.com/darold/pgbadger
