ID | IEP-35 |
Author | |
Sponsor | Nikolay Izhikov |
Created | |
Status | DRAFT |
Motivation
Currently, Ignite's monitoring API is incomplete and fragmented. The existing APIs use different protocols, such as JMX, the Java API, SQL system views, text logs, etc.
From an administrator's point of view, it is impossible to understand what is going on in a running cluster:
- Which user tasks are executed?
- What resources are used by each task?
It is also very hard to export all existing metrics to an arbitrary monitoring system because of the variety of protocols.
The goal of this IEP is to provide a way to answer three questions:
- What is running inside the Ignite cluster?
- An administrator should be able to list every user object that was created (or run) inside a cluster via every monitoring interface we support (JMX, SQL, CLI, etc.).
- An administrator should be able to identify the source of each user object via some ID or other user-provided info.
- What is running slow?
- If the execution of some user code violates a configured threshold, a handler for such events should be executed. By default, the handler should print a WARN log message with all available information about the slow piece of user-provided code.
- What will be running slow?
- We should provide a way to profile a cluster. Consider the following scenario:
- Enable profiling mode.
- Execute some arbitrary workload.
- Collect profiling info.
- Run an Ignite-provided tool that creates a report containing statistics about the workload. Examples of such tools are:
- Oracle AWR
- PostgreSQL pgBadger
Description:
Phase 1: What is running inside the Ignite cluster? + What is running slow?
1. We should add some entities in Ignite:
- MetricDomain - Ignite subsystem that provides some set of sensors and lists.
- Cache,
- Compute,
- ServiceGrid,
- etc.
- Sensor - a named number with a well-defined algorithm for calculating its value at any given moment in time.
import java.util.Collection;
import java.util.Map;

class Sensor {
    String name; // EntryCount, MemoryAvailable, etc.
    long value; // or double
    Collection<Map.Entry<String, String>> labels; // (key, value) pairs: hostName, cacheName, etc.
}
class TimeSensor extends Sensor {
    long ts; // timestamp of the last value update
}
- List - a named list of strings that contains info about Ignite objects. Examples: list of caches, list of transactions, list of nodes, list of running queries, last N queries, etc.
- MonitoringEvent - generated when some user-defined code violates a threshold.
class MonitoringEvent<T> {
    MonitoringEventType type; // Event type.
    T info; // Event info. The type of info differs for different types of events.
}
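To make the entities above more concrete, here is a minimal, self-contained sketch of how a labeled sensor and a typed monitoring event could look. It is an illustration only, not the proposed implementation: `LabeledSensor`, `SlowQueryInfo`, the `MonitoringEventType` constants, and the `name{key=value,...}` identifier format are all assumptions made for this example and are not defined by the IEP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical event types; the IEP does not enumerate them. */
enum MonitoringEventType { SLOW_QUERY, SLOW_COMPUTE_JOB }

/** Event whose payload type depends on the event type. */
final class MonitoringEvent<T> {
    final MonitoringEventType type;
    final T info;

    MonitoringEvent(MonitoringEventType type, T info) {
        this.type = type;
        this.info = info;
    }
}

/** Hypothetical payload for a slow query event. */
final class SlowQueryInfo {
    final String sql;
    final long durationMs;

    SlowQueryInfo(String sql, long durationMs) {
        this.sql = sql;
        this.durationMs = durationMs;
    }
}

/** A named number plus labels that say which object it describes. */
final class LabeledSensor {
    final String name;
    final Map<String, String> labels = new LinkedHashMap<>(); // keeps insertion order
    volatile long value;

    LabeledSensor(String name) {
        this.name = name;
    }

    /** Adds a label and returns this sensor so calls can be chained. */
    LabeledSensor label(String key, String val) {
        labels.put(key, val);
        return this;
    }

    /** Renders the sensor identity as name{key=value,...} (assumed format). */
    String id() {
        StringBuilder sb = new StringBuilder(name).append('{');
        boolean first = true;
        for (Map.Entry<String, String> e : labels.entrySet()) {
            if (!first)
                sb.append(',');
            sb.append(e.getKey()).append('=').append(e.getValue());
            first = false;
        }
        return sb.append('}').toString();
    }
}
```

Labels let one sensor name (for example EntryCount) be reused across caches and hosts, while the generic parameter on the event keeps the payload type-safe for each event type.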
2. SensorProcessor, MonitoringEventProcessor:
- SensorProcessor - should be able to store and query Ignite sensors.
- MonitoringEventProcessor - should be able to set up event listeners, watch for user code executions and route events.
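The two processors could be sketched roughly as follows. This is a simplified illustration under several assumptions: the class names `SketchSensorProcessor` and `SketchMonitoringEventProcessor`, the prefix-based query, the single global threshold, and the string-based event message are all hypothetical and chosen only to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Sketch of a sensor registry: stores named values and answers prefix queries. */
final class SketchSensorProcessor {
    private final Map<String, AtomicLong> sensors = new ConcurrentHashMap<>();

    /** Returns the sensor with the given name, creating it on first use. */
    AtomicLong sensor(String name) {
        return sensors.computeIfAbsent(name, n -> new AtomicLong());
    }

    /** Returns the sorted names of all sensors whose name starts with the prefix. */
    List<String> query(String prefix) {
        List<String> res = new ArrayList<>();
        for (String n : sensors.keySet()) {
            if (n.startsWith(prefix))
                res.add(n);
        }
        Collections.sort(res);
        return res;
    }
}

/** Sketch of event routing: notifies listeners when user code exceeds a threshold. */
final class SketchMonitoringEventProcessor {
    private final long thresholdMs;
    private final List<Consumer<String>> listeners = new CopyOnWriteArrayList<>();

    SketchMonitoringEventProcessor(long thresholdMs) {
        this.thresholdMs = thresholdMs;
    }

    /** Registers a listener that receives a message for every violation. */
    void listen(Consumer<String> listener) {
        listeners.add(listener);
    }

    /** Reports a finished execution; returns true if an event was routed to listeners. */
    boolean onFinished(String description, long durationMs) {
        if (durationMs <= thresholdMs)
            return false;
        String msg = "Slow user code: " + description + " took " + durationMs
            + " ms (threshold " + thresholdMs + " ms)";
        for (Consumer<String> l : listeners)
            l.accept(msg);
        return true;
    }
}
```

A default listener that prints the message at WARN level would correspond to the default handler described in the Motivation section.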
3. Exposers:
Specific admin interfaces will be supported through exposers.
An exposer should work only with SensorProcessor and should not rely on any other knowledge of Ignite internals.
- PullExposer - this type of exposer responds to user queries via some interface:
- JMX
- HTTP
- SQL
- Java
- etc.
- PushExposer - this type of exposer exports sensors and lists to some external system on a configured schedule:
- LogExposer
- Integration with a proprietary monitoring system can be implemented as a PushExposer.
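The difference between the two exposer types can be sketched as follows. This is a hedged illustration rather than a proposed API: `SketchPullExposer` and `SketchLogExposer` are hypothetical names, and a `Supplier<Map<String, Long>>` stands in for SensorProcessor so that the exposers depend on nothing else.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

/** Pull-style sketch: answers a query on demand. */
final class SketchPullExposer {
    private final Supplier<Map<String, Long>> sensors; // stand-in for SensorProcessor

    SketchPullExposer(Supplier<Map<String, Long>> sensors) {
        this.sensors = sensors;
    }

    /** Returns the current value of one sensor, or null if it is unknown. */
    Long get(String name) {
        return sensors.get().get(name);
    }
}

/** Push-style sketch: writes all sensors to a sink (for example, a log) on a schedule. */
final class SketchLogExposer implements AutoCloseable {
    private final Supplier<Map<String, Long>> sensors; // stand-in for SensorProcessor
    private final Consumer<String> sink;
    private final ScheduledExecutorService exec = Executors.newSingleThreadScheduledExecutor();

    SketchLogExposer(Supplier<Map<String, Long>> sensors, Consumer<String> sink) {
        this.sensors = sensors;
        this.sink = sink;
    }

    /** Performs a single export; sensors are sorted by name for stable output. */
    void exportOnce() {
        sink.accept(new TreeMap<>(sensors.get()).toString());
    }

    /** Starts periodic export with the given period in milliseconds. */
    void start(long periodMs) {
        exec.scheduleAtFixedRate(this::exportOnce, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    /** Stops the periodic export. */
    @Override public void close() {
        exec.shutdownNow();
    }
}
```

Passing `System.out::println` or a logger method as the sink would give a simple LogExposer, while a sink that sends data over the network would model integration with an external monitoring system.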
List of APIs that should be covered in Phase 1
- Compute tasks:
- Closures
- Map-reduce jobs
- ComputeJob
- Scheduled tasks
- Service grid:
- Services with deployment status
- Queries:
- SQL
- Scan
- Text
- ContinuousQuery
- IgniteCache#invoke
- put, get, remove, replace, clear operations
- Transactions with lock list
- DataStreamers
- Explicit locks (IgniteCache#lock)
- DataStructures
- Queue
- Set
- AtomicLong
- AtomicReference
- CountDownLatch
- Sequence
- Semaphore
- Message topics (IgniteMessaging)
- Thin client connections.
- Machine Learning - ???
Risks and Assumptions
These changes put backward compatibility at risk.
We should consider implementing this IEP as part of Ignite 3.
Discussion Links
// Links to discussions on the devlist, if applicable.
Gap analysis
Current monitoring APIs availability:
Monitoring completely unavailable:
- Compute Grid
- Service Grid
- Data streamers
- Distributed Data Structures
- Ignite messaging (Ignite#message)
- Third-party storage
- ContinuousQuery
- MVCC transactions
- ML - What should be available?
- Explicit locks
Monitoring API available:
- Cache
- PDS + offheap memory
- Ignite#dataRegionMetrics
- Ignite#dataStorageMetrics
- Ignite#persistentStoreMetrics
- Queries
- IgniteCache#queryMetrics
- IgniteCache#queryDetailMetrics
- QueryHistoryMetrics
- IgniteCache#mxBean
- IgniteCache#localMxBean
- SQL
- LOCAL_SQL_RUNNING_QUERIES
- INDEXES
- Transactions
- JMX - TransactionMetricsMxBean
- JMX - TransactionMXBean
- ThinClients
- JMX - ClientProcessorMXBean
Reference Links
https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF027
https://www.oracle.com/technetwork/database/manageability/diag-pack-ow09-133950.pdf
https://github.com/darold/pgbadger
Tickets
// Links or report with relevant JIRA tickets.