Nikolay Izhikov

ID	IEP-35
Author	Nikolay Izhikov
Sponsor
Created	Nikolay Izhikov
Status	IN PROGRESS - Phase 1,2 implemented.

Motivation

Currently, Ignite's monitoring API is incomplete and fragmented. The existing APIs use different protocols, such as JMX, Java API, SQL system views, text logs, etc.

From an administrator's point of view, it is impossible to understand what is going on in a running cluster:

Which user tasks are executed?
What resources are used by each task?

It's very hard to export all existing metrics to some arbitrary monitoring system due to a variety of protocols.

The goal of this IEP is to provide a way to answer three questions:

What is running inside the Ignite cluster?
- An administrator should be able to list every user object that was created (or run) inside the cluster via every monitoring interface we support (JMX, SQL, CLI, etc.).
- An administrator should be able to identify the source of each user object via some ID or other user-provided info.
What is running slow?
- If the execution of some user code violates a configured threshold, a handler for that event should be invoked. By default, the handler should print a WARN log message with all available information about the slow piece of user-provided code.
What will be running slow?
- We should provide a way to profile the cluster. Consider the following scenario:
  - Enable profiling mode.
  - Execute some arbitrary workload.
  - Collect profiling info.
  - Run an Ignite-provided tool that creates a report containing statistics of the workload. Examples of such tools are:
    - Oracle AWR
    - PostgreSQL pgBadger

Description:

Phase 1: What is running inside the Ignite cluster? + What is running slow?

1. We should add the following entities to Ignite:

MetricRegistry - an Ignite subsystem that provides a set of sensors and lists, for example:
1. Cache,
2. Compute,
3. ServiceGrid,
4. etc.

Metric - a named number with a well-defined algorithm for calculating its value at any given moment in time.

import java.util.Collection;
import java.util.Map;

class Metric {
    String name; // EntryCount, MemoryAvailable, etc.
    long value; // Or double.
    Collection<Map.Entry<String, String>> labels; // Label name/value pairs: hostName, cacheName, etc.
}

class LongMetric extends Metric {
    long ts; // Timestamp of the last value update.
}

SystemView - a named list that contains info about Ignite objects. Examples: list of caches, list of transactions, list of nodes, list of running queries, last N queries, etc.
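
As an illustration only, a system view could be modeled as a named list of rows that the owning subsystem maintains and that readers iterate over. The sketch below is not the Ignite implementation: `SystemViewSketch`, `CacheRow`, and their fields are hypothetical names chosen for this example.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical row type: one entry of a "caches" view. */
final class CacheRow {
    final String cacheName;
    final String cacheGroup;

    CacheRow(String cacheName, String cacheGroup) {
        this.cacheName = cacheName;
        this.cacheGroup = cacheGroup;
    }
}

/** Hypothetical named list of rows describing Ignite objects. */
final class SystemViewSketch<R> implements Iterable<R> {
    private final String name;
    private final List<R> rows = new CopyOnWriteArrayList<>();

    SystemViewSketch(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    /** Called by the owning subsystem when an object is created. */
    void add(R row) {
        rows.add(row);
    }

    /** Called by the owning subsystem when an object is destroyed. */
    void remove(R row) {
        rows.remove(row);
    }

    int size() {
        return rows.size();
    }

    /** Snapshot iterator: readers never observe a half-applied update. */
    @Override public Iterator<R> iterator() {
        return rows.iterator();
    }
}
```

A copy-on-write list is used here because views are assumed to be read far more often than they change; a different structure may fit views with high churn, such as running queries.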

MonitoringEvent - generated when some user-defined code violates a configured threshold.

class MonitoringEvent<T> {
    MonitoringEventType type; // Event type.
    T info; // Event info. The type of info differs between event types.
}
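
A minimal sketch of how a threshold violation might be turned into a MonitoringEvent and passed to a handler is given below. Apart from the general MonitoringEvent shape, everything here is assumed for illustration: the `SLOW_TASK` event type, the `SlowTaskInfo` payload, and the `ThresholdWatcher` helper are hypothetical, and the default handler only mimics the WARN log message described in the motivation.

```java
import java.util.function.Consumer;
import java.util.logging.Logger;

/** Hypothetical event types; only one is needed for this sketch. */
enum MonitoringEventType { SLOW_TASK }

class MonitoringEvent<T> {
    final MonitoringEventType type;
    final T info;

    MonitoringEvent(MonitoringEventType type, T info) {
        this.type = type;
        this.info = info;
    }
}

/** Hypothetical payload for SLOW_TASK events. */
final class SlowTaskInfo {
    final String taskName;
    final long durationMs;

    SlowTaskInfo(String taskName, long durationMs) {
        this.taskName = taskName;
        this.durationMs = durationMs;
    }
}

/** Hypothetical helper that fires an event when a task exceeds the threshold. */
final class ThresholdWatcher {
    private static final Logger LOG = Logger.getLogger(ThresholdWatcher.class.getName());

    private final long thresholdMs;
    private final Consumer<MonitoringEvent<SlowTaskInfo>> handler;

    ThresholdWatcher(long thresholdMs, Consumer<MonitoringEvent<SlowTaskInfo>> handler) {
        this.thresholdMs = thresholdMs;
        this.handler = handler;
    }

    /** Default behavior: print a WARN message with the available info. */
    static ThresholdWatcher withDefaultHandler(long thresholdMs) {
        return new ThresholdWatcher(thresholdMs, evt ->
            LOG.warning("Slow task [name=" + evt.info.taskName
                + ", durationMs=" + evt.info.durationMs + ']'));
    }

    /** Returns true if the duration exceeded the threshold and the handler was invoked. */
    boolean onTaskFinished(String taskName, long durationMs) {
        if (durationMs <= thresholdMs)
            return false;

        handler.accept(new MonitoringEvent<>(
            MonitoringEventType.SLOW_TASK, new SlowTaskInfo(taskName, durationMs)));

        return true;
    }
}
```

Passing the handler as a `Consumer` keeps the check itself free of logging concerns, so a deployment could swap the WARN message for any other reaction.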

2. GridMetricManager, GridSystemViewManager:

GridMetricManager - should be able to store and query Ignite metrics.
GridSystemViewManager - should be able to store and export SystemViews.
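
To make "store and query" concrete, here is a rough sketch of a metric store keyed by metric name. It is an assumption about one possible shape, not the GridMetricManager design: the class name `MetricStoreSketch`, the dotted naming scheme, and the prefix query are all hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical store of long-valued metrics, sorted by name. */
final class MetricStoreSketch {
    private final ConcurrentSkipListMap<String, LongAdder> metrics = new ConcurrentSkipListMap<>();

    /** Registers a metric, or returns the existing one with the same name. */
    LongAdder register(String name) {
        return metrics.computeIfAbsent(name, k -> new LongAdder());
    }

    /** Returns the current value, or null if no such metric is registered. */
    Long value(String name) {
        LongAdder m = metrics.get(name);

        return m == null ? null : m.sum();
    }

    /** Returns a snapshot of all metrics whose names start with the given prefix. */
    Map<String, Long> query(String prefix) {
        Map<String, Long> res = new TreeMap<>();

        // Names sharing a prefix are contiguous in a sorted map, so stop at the first mismatch.
        for (Map.Entry<String, LongAdder> e : metrics.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix))
                break;

            res.put(e.getKey(), e.getValue().sum());
        }

        return res;
    }
}
```

For example, after registering `cache.orders.puts`, `cache.orders.gets`, and `compute.jobs.active`, a call to `query("cache.")` returns only the two cache metrics.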

3. Exporters:

Specific interfaces will be supported through exporters.
Exporters should work only with a read-only version of GridMetricManager and must not rely on any other knowledge of Ignite internals.

Examples of exporters:

JMX
HTTP
SQL System View
Log
etc.
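
The read-only contract for exporters might look like the sketch below. This is a hedged illustration, not the proposed API: `ReadOnlyMetricsSketch`, `MetricExporterSketch`, and `LogLineExporterSketch` are hypothetical names, and the `name=value` line format is an assumption made for the example.

```java
import java.util.Map;

/** Hypothetical read-only view: exporters can read values but cannot register or change metrics. */
interface ReadOnlyMetricsSketch {
    Map<String, Long> snapshot();
}

/** Hypothetical exporter contract: its only input is the read-only view. */
interface MetricExporterSketch {
    String export(ReadOnlyMetricsSketch metrics);
}

/** Renders one "name=value" line per metric, as a log exporter might. */
final class LogLineExporterSketch implements MetricExporterSketch {
    @Override public String export(ReadOnlyMetricsSketch metrics) {
        StringBuilder sb = new StringBuilder();

        for (Map.Entry<String, Long> e : metrics.snapshot().entrySet())
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');

        return sb.toString();
    }
}
```

Because the exporter receives nothing but the read-only view, a JMX, HTTP, or SQL exporter could be written against the same interface without touching other internals.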

Ignite objects and entities that should be listed in Phase 2

A list of compute tasks:
1. Closures
2. Map-reduce jobs
3. ComputeJob
4. Scheduled tasks
Service grid:
1. A list of services with deployment status
Caches
Cache groups
Cluster nodes
SQL objects
1. Schemas
2. Tables
3. Views
4. Table columns
5. View columns
6. Indexes
Queries:
1. SQL
2. Scan
3. Text
4. ContinuousQuery
IgniteCache#invoke
put, get, remove, replace, clear operations
Transactions with lock list
DataStreamers
Explicit locks (IgniteCache#lock)
DataStructures
1. Queue
2. Set
3. AtomicLong
4. AtomicReference
5. CountDownLatch
6. Sequence
7. Semaphore
Message topics (IgniteMessaging)
Thin client connections.
Machine Learning - ???

Internal Data Structures and Processes we should provide info for

PME queue
Service exchange queue
Security events

Risks and Assumptions

These changes put backward compatibility at risk.

We should consider implementing this IEP in Ignite 3.

Discussion Links

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Proof-of-concept-td41904.html

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-Monitoring-amp-Profiling-Current-API-Analysis-td41823.html

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Metrics-configuration-td42478.html

http://apache-ignite-developers.2346864.n4.nabble.com/IEP-35-GridJobProcessorMetrics-migration-td42415.html#a42441

http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-IEP-35-Replace-RunningQueryManager-with-GridSystemViewManager-td43794.html

Gap analysis

Current monitoring APIs availability:

Monitoring completely unavailable:

Compute Grid
1. Some basic numbers are available in ClusterMetrics (getMaximumActiveJobs, getCurrentActiveJobs, etc.)
Service Grid
Data streamers
Distributed Data Structures
Ignite messaging (Ignite#message)
3rd-party storage
ContinuousQuery
MVCC transactions
ML - What should be available?
Explicit locks

Monitoring API available:

Cache
1. PDS + offheap memory
  1. Ignite#dataRegionMetrics
  2. Ignite#dataStorageMetrics
  3. Ignite#persistentStoreMetrics
2. Queries
  1. IgniteCache#queryMetrics
  2. IgniteCache#queryDetailMetrics
  3. QueryHistoryMetrics
3. IgniteCache#mxBean
4. IgniteCache#localMxBean
SQL
1. LOCAL_SQL_RUNNING_QUERIES
2. INDEXES
Transactions
1. JMX - TransactionMetricsMxBean
2. JMX - TransactionMXBean
ThinClients
1. JMX - ClientProcessorMXBean
IoStatisticsManager, IoStatisticsHolder
GridJobMetricsProcessor
IgniteMBeansManager
IgniteSpiManagementMBean

Design Principles

Sensors should contain only raw values. No aggregation of numeric metrics on the Ignite side.
Min, max, avg and other functions are the responsibility of the external monitoring system.
Every user task should have an ID or name, provided by the user at start time, that allows monitoring info to be associated with user code.
Users should be able to find their code reflected in monitoring.
Every user task should carry a "connectionID" ("sessionID", "clientID") or similar identifier.
Users should be able to tell that a specific task was triggered by a specific connection (session, client).
No computation to get current values. We should change sensor and list values when specific events occur.
When a sensor is queried, we should only read its value from internal storage. No additional computation is involved.
Users should be able to enable or disable any sensor group or list at runtime. Ignite should provide administrator interface(s) to enable or disable each sensor group or list separately.
No performance penalty for disabled sensors and lists.
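
Several of these principles (raw values only, updates driven by events, runtime enable/disable) can be shown together in a small sketch. The class `ToggleableCounterSketch` is hypothetical and exists only for this illustration. Note the assumption it makes: a disabled sensor still pays for one volatile read per event, which is small but not literally zero.

```java
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical sensor: updated when events occur, read without extra computation. */
final class ToggleableCounterSketch {
    private final LongAdder cnt = new LongAdder();

    /** Sensors start disabled in this sketch and are switched on at runtime. */
    private volatile boolean enabled;

    void enable() {
        enabled = true;
    }

    void disable() {
        enabled = false;
    }

    /** Called at the place where the event occurs; does nothing while disabled. */
    void onEvent() {
        if (enabled)
            cnt.increment();
    }

    /** Returns the raw count; min, max, and avg are left to the monitoring system. */
    long value() {
        return cnt.sum();
    }
}
```

In this sketch, events that occur while the sensor is disabled are simply not counted, and re-enabling continues from the previously stored value.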

Reference Links

https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF027

https://www.oracle.com/technetwork/database/manageability/diag-pack-ow09-133950.pdf

https://github.com/darold/pgbadger

