
IDIEP-35

Author: Nikolay Izhikov
Sponsor:
Created:
Status: DRAFT


Motivation

Currently, Ignite has an incomplete, fragmented monitoring API. These APIs use different protocols: JMX, Java API, SQL system views, text logs, etc.

From the administrator's point of view, it's impossible to understand what is going on in a running cluster:

  • Which user tasks are executed?
  • What resources are used by each task?

It's very hard to export all existing metrics to an arbitrary monitoring system due to the variety of protocols.

The goal of this IEP is to provide a way to answer three questions:

  • What is running inside the Ignite cluster?
    • An administrator should be able to list every user object that was created (or ran) inside a cluster via every monitoring interface we will support (JMX, SQL, CLI, etc.)
    • An administrator should be able to identify the source of each user object via some ID or other user-provided info.
  • What is running slow?
    • If some user code execution violates configured thresholds, a handler for such events should be executed. By default, the handler should print a WARN log message with all available information about the slow piece of user-provided code.
  • What will be running slow?
    • We should provide a way to execute cluster profiling. Consider the following scenario:
      • Enable profiling mode.
      • Execute some arbitrary workload.
      • Collect profiling info.
      • Run an Ignite-provided tool that creates a report containing workload statistics. Examples of such tools are:
        • Oracle AWR
        • PostgreSQL pgBadger

Description


Phase 1: What is running inside the Ignite cluster? + What is running slow?

1. We should add the following entities to Ignite:

  1. MetricDomain - an Ignite subsystem that provides a set of sensors and lists (see the sketch after this list), for example:
    1. Cache,
    2. Compute,
    3. ServiceGrid,
    4. etc.
  2. Sensor - a named number with a well-defined algorithm to calculate its value at any given moment in time.

    class Sensor {
        String name; //EntryCount, MemoryAvailable, etc.
        long value; //or double
        Collection<Tuple2<String, String>> labels; //hostName, cacheName, etc.
    }

    class TimeSensor extends Sensor {
        long ts; //timestamp of the last value update.
    }
  3. List - a named list of strings that contains info about Ignite objects. Examples: list of caches, transaction list, list of nodes, list of running queries, last N queries, etc.
  4. MonitoringEvent - generated when some user-provided code violates a threshold.

    class MonitoringEvent<T> {
        MonitoringEventType type; //Event type.
        T info; //Event info. The type of info differs for different types of events.
    }
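
A minimal sketch of how a MetricDomain could tie a sensor and a list together. The class and method names below (CacheMetricDomain, entryCountSensor, cacheList) are illustrative assumptions, not a proposed API:

    import java.util.Collection;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical sketch only: a MetricDomain for caches that owns one
    // sensor (entry count) and one list (cache names).
    class CacheMetricDomain {
        // Sensor value: a raw counter updated when the corresponding event occurs.
        private final AtomicLong entryCount = new AtomicLong();

        // List entity: names of the caches known to this node.
        private final Collection<String> cacheNames = new ConcurrentLinkedQueue<>();

        void onEntryAdded()           { entryCount.incrementAndGet(); }
        void onEntryRemoved()         { entryCount.decrementAndGet(); }
        void onCacheStarted(String n) { cacheNames.add(n); }

        long entryCountSensor()        { return entryCount.get(); }
        Collection<String> cacheList() { return cacheNames; }
    }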

2. SensorProcessor, MonitoringEventProcessor: 

  1. SensorProcessor - should be able to store and query Ignite sensors.
  2. MonitoringEventProcessor - should be able to set up event listeners, watch for user code executions and route events.
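
A rough interface sketch for these two processors. The method names (register, sensors, listen, onEvent) are assumptions for illustration; real implementations would likely live among Ignite's internal processors (e.g. as GridProcessorAdapter subclasses):

    import java.util.Collection;
    import java.util.function.Consumer;

    // Hypothetical interfaces only.
    interface SensorProcessor {
        /** Stores or updates a sensor for the given domain. */
        void register(String domain, Sensor sensor);

        /** Returns the sensors of a domain, e.g. for an exposer to publish. */
        Collection<Sensor> sensors(String domain);
    }

    interface MonitoringEventProcessor {
        /** Subscribes a listener that is invoked when a threshold is violated. */
        void listen(MonitoringEventType type, Consumer<MonitoringEvent<?>> lsnr);

        /** Routes an event to the registered listeners. */
        void onEvent(MonitoringEvent<?> evt);
    }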

3. Exposers:

Specific admin interfaces will be supported through exposers.
An exposer should work only with the SensorProcessor and should not rely on any other knowledge of Ignite internals.

  1. PullExposer - this type of exposer should respond to user queries via some interface:
    1. JMX
    2. HTTP
    3. SQL
    4. Java
    5. etc.
  2. PushExposer - this type of exposer should export sensors and lists to some external system on a configured schedule.
    1. LogExposer
    2. Integrations with proprietary monitoring systems can be implemented as PushExposers.
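
For illustration, the LogExposer mentioned above could be sketched as follows; the scheduling and logging details are assumptions, and note that it depends only on the SensorProcessor interface sketched earlier:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.logging.Logger;

    // Hypothetical PushExposer that periodically dumps sensor values to the log.
    class LogExposer {
        private static final Logger log = Logger.getLogger(LogExposer.class.getName());

        private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

        LogExposer(SensorProcessor proc, String domain, long periodSec) {
            // Export the domain's sensors on the configured schedule.
            scheduler.scheduleAtFixedRate(
                () -> proc.sensors(domain).forEach(s -> log.info(s.name + " = " + s.value)),
                periodSec, periodSec, TimeUnit.SECONDS);
        }

        void stop() {
            scheduler.shutdown();
        }
    }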

List of APIs that should be covered in Phase 1

  1. Compute tasks:
    1. Closures
    2. Map-reduce jobs
    3. ComputeJob
    4. Scheduled tasks
  2. Service grid:
    1. Services with deployment status
  3. Queries:
    1. SQL
    2. Scan
    3. Text
    4. ContinuousQuery
  4. IgniteCache#invoke
  5. put, get, remove, replace, clear operations
  6. Transactions with lock list
  7. DataStreamers
  8. Explicit locks (IgniteCache#lock)
  9. DataStructures
    1. Queue
    2. Set
    3. AtomicLong
    4. AtomicReference
    5. CountDownLatch
    6. Sequence
    7. Semaphore
  10. Message topics (IgniteMessaging)
  11. Thin client connections.
  12. Machine Learning - ???

Internal Data Structures and Processes we should provide info for

  1. PME queue
  2. Service exchange queue
  3. Security events

Risks and Assumptions

Backward compatibility is at risk with these changes.

We should consider implementing this IEP as part of Ignite 3.

Discussion Links

// Links to discussions on the devlist, if applicable.


Gap analysis

Current monitoring API availability:

Monitoring is completely unavailable for:

  1. Compute Grid
    1. Some basic numbers are available in ClusterMetrics (getMaximumActiveJobs, getCurrentActiveJobs, etc.)
  2. Service Grid
  3. Data streamers
  4. Distributed Data Structures
  5. Ignite messaging (Ignite#message)
  6. 3rd-party storage
  7. ContinuousQuery
  8. MVCC transactions
  9. ML - What should be available?
  10. Explicit locks

A monitoring API is available for:

  1. Cache
    1. PDS + offheap memory
      1. Ignite#dataRegionMetrics
      2. Ignite#dataStorageMetrics
      3. Ignite#persistentStoreMetrics
    2. Queries
      1. IgniteCache#queryMetrics
      2. IgniteCache#queryDetailMetrics
      3. QueryHistoryMetrics
    3. IgniteCache#mxBean
    4. IgniteCache#localMxBean
  2. SQL
    1. LOCAL_SQL_RUNNING_QUERIES
    2. INDEXES
  3. Transactions
    1. JMX - TransactionMetricsMxBean
    2. JMX - TransactionMXBean
  4. ThinClients
    1. JMX - ClientProcessorMXBean
  5. IoStatisticsManager, IoStatisticsHolder

Design Principles

  1. Sensors should contain only raw values. No aggregation of numeric metrics on the Ignite side.
    Min, max, avg and other functions are a matter for the external monitoring system.
  2. Every user task should have an ID or name, provided by the user at start time, that allows associating monitoring info with user code.
    A user should be able to find his code reflected in monitoring.
  3. Every user task should carry a "connectionID" ("sessionID", "clientID") or similar identifier.
    A user should be able to tell that a specific task was triggered by a specific connection (session, client).
  4. No computation to get current values. We should update sensor and list values when the corresponding events occur.
    When a sensor is queried, we should only read its value from internal storage. No additional computation is involved.
  5. A user should be able to enable/disable any sensor group/list at runtime. Ignite should provide administrative interface(s) to enable/disable each sensor group or list separately.
    There should be no performance penalty for disabled sensors and lists.
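
A small sketch combining principles 4 and 5 (the class name ToggleableSensor and method names are illustrative): the value is updated when the event happens, reads return the stored value without computation, and a disabled sensor degenerates to a single volatile read.

    import java.util.concurrent.atomic.LongAdder;

    // Illustrative only: event-driven update plus a runtime enable/disable switch.
    class ToggleableSensor {
        private final LongAdder value = new LongAdder();

        private volatile boolean enabled = true;

        /** Called from the code path that produces the metric (principle 4). */
        void onEvent() {
            if (enabled)
                value.increment();
        }

        /** Query path: no computation, just read the stored value (principle 4). */
        long value() {
            return value.sum();
        }

        /** Runtime enable/disable switch (principle 5). */
        void enable(boolean on) {
            enabled = on;
        }
    }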

Reference Links

https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF027

https://www.oracle.com/technetwork/database/manageability/diag-pack-ow09-133950.pdf

https://github.com/darold/pgbadger

Tickets

// Links or report with relevant JIRA tickets.
