...

  • Enable profiling mode.
  • Execute an arbitrary workload.
  • Collect the profiling info.
  • Run the tool that creates a report containing the workload statistics.

Proposed Changes

Ignite will provide a public facade to manage profiling mode:

ignite.profiling().enable(); // Turns on profiling mode.
ignite.profiling().disable(); // Turns off profiling mode.

ignite.profiling().isEnabled(); // Returns whether profiling mode is turned on.

Profiling mode can also be managed from the CLI (and via JMX):

control.sh --profiling  // Prints current profiling mode status.
control.sh --profiling enable // Turns on profiling mode.
control.sh --profiling disable // Turns off profiling mode.

Ignite will provide a public SPI interface (ProfilingSpi) to log statistics. It can be configured via IgniteConfiguration. It declares the following methods:

  • startProfiling(); // Starts profiling.
  • stopProfiling(); // Stops profiling.
  • log(String info); // Logs operation statistics.
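The three methods above could be captured in an interface along the following lines. This is a minimal sketch rather than the proposed Ignite code: the wrapper class `ProfilingSpiSketch`, the in-memory implementation `InMemoryProfilingSpi`, and the choice to drop records logged while profiling is stopped are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ProfilingSpiSketch {
    /** Hypothetical shape of the SPI; method names follow the list above. */
    public interface ProfilingSpi {
        /** Starts profiling. */
        void startProfiling();

        /** Stops profiling. */
        void stopProfiling();

        /** Logs operation statistics. */
        void log(String info);
    }

    /** Illustrative implementation that keeps records in memory (not part of the proposal). */
    public static class InMemoryProfilingSpi implements ProfilingSpi {
        private final List<String> records = new ArrayList<>();
        private boolean started;

        @Override public synchronized void startProfiling() {
            started = true;
        }

        @Override public synchronized void stopProfiling() {
            started = false;
        }

        @Override public synchronized void log(String info) {
            // Assumption: records logged while profiling is off are dropped.
            if (started)
                records.add(info);
        }

        /** Returns a copy of the records collected so far. */
        public synchronized List<String> records() {
            return new ArrayList<>(records);
        }
    }

    public static void main(String[] args) {
        InMemoryProfilingSpi spi = new InMemoryProfilingSpi();

        spi.log("dropped, profiling is off");
        spi.startProfiling();
        spi.log("cache.put duration=120us");
        spi.stopProfiling();

        System.out.println(spi.records());
    }
}
```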

An internal processor (ProfilingProcessor) will be used to manage profiling across the whole cluster. It will be available from KernalContext.

The new ignite-profiling module will contain:

  • A default implementation (LogProfilingSpiImpl) based on asynchronous logging to a configured file.
  • A script to collect logs from nodes and build the report: report.sh(bat)

Performance report

The performance report will be in a human-readable text format (and later as an HTML page) and should contain:

  • Ignite and plugins versions, topology changes, profiling start/end time
  • Queries (SQL, scan, ...) timings, resources:
    • Queries that took up the most time
    • Slowest queries
    • Most frequent queries
    • Failing queries
    • Queries count by type
    • Queries that took up the most CPU/IO/Disk
  • User tasks statistics (similar to queries)
  • User tasks timings, resources
    • Jobs of slowest tasks
  • Caches and cache operations statistics:
    • Get/Put/Remove
    • Transactions
  • Cache operations statistics (similar to queries):
    • Get
    • Put
    • Remove
    • RemoveAndGet
    • PutAndGet
    • Invoke
    • Lock
    • Create/destroy caches
  • Transactions commit/rollback timings
  • Workload by nodes
    • CPU/IO/Disk resources
  • Checkpoints statistics
  • PME WAL statistics
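The query rankings listed above (most total time, slowest, most frequent, failing) are plain aggregations over per-execution records. The sketch below shows one way a report tool might compute them. The `QueryRecord` type and all method names are hypothetical, and the proposal does not prescribe this approach.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QueryReportSketch {
    /** Hypothetical record of a single query execution. */
    public static final class QueryRecord {
        public final String text;
        public final long durationNanos;
        public final boolean success;

        public QueryRecord(String text, long durationNanos, boolean success) {
            this.text = text;
            this.durationNanos = durationNanos;
            this.success = success;
        }
    }

    /** Queries that took up the most time: summed duration per query text, largest first. */
    public static Map<String, Long> topByTotalTime(List<QueryRecord> recs, int topN) {
        Map<String, Long> totals = recs.stream()
            .collect(Collectors.groupingBy(r -> r.text, Collectors.summingLong(r -> r.durationNanos)));

        return topEntries(totals, topN);
    }

    /** Most frequent queries: execution count per query text, largest first. */
    public static Map<String, Long> mostFrequent(List<QueryRecord> recs, int topN) {
        Map<String, Long> counts = recs.stream()
            .collect(Collectors.groupingBy(r -> r.text, Collectors.counting()));

        return topEntries(counts, topN);
    }

    /** Slowest queries: individual executions ordered by duration, longest first. */
    public static List<QueryRecord> slowest(List<QueryRecord> recs, int topN) {
        return recs.stream()
            .sorted(Comparator.comparingLong((QueryRecord r) -> r.durationNanos).reversed())
            .limit(topN)
            .collect(Collectors.toList());
    }

    /** Failing queries: executions that did not complete successfully. */
    public static List<QueryRecord> failing(List<QueryRecord> recs) {
        return recs.stream().filter(r -> !r.success).collect(Collectors.toList());
    }

    /** Keeps the topN entries with the largest values, preserving descending order. */
    private static Map<String, Long> topEntries(Map<String, Long> src, int topN) {
        return src.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(topN)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, (a, b) -> a, LinkedHashMap::new));
    }
}
```

Note that "took up the most time" and "slowest" differ: the first sums all executions of the same query text, while the second ranks single executions.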

...

  • PME statistics

Additional investigation is required to gather the following statistics:

  • Query parse time
  • Lock waiting time
  • User time
  • Messages process timings

These statistics will provide:

  • Top queries/operations by CPU time
  • Top queries/operations by IO time
  • The operations that use the most resources

Phase 1

The first phase will implement:

  • Profiling public API and default implementation
  • Java API, CLI, JMX process management
  • Gathering overall and time statistics of queries, tasks, cache operations, checkpoints and PMEs.
  • Tool to create the report

Phase 2

The second phase will investigate and implement:

  • Gathering CPU time per operation
  • Gathering I/O wait time, read/write counting
  • Lock time per operation
  • Display of these statistics in the report

Public API changes

A new interface will be added: ProfilingSpi.

A new Ignite facade will be added: ignite.profiling().

A new module will be created: ignite-profiling.

Corner cases

Node left during profiling

A node leaving the cluster will not affect the cluster profiling mode.

Node join during profiling

...

Proposed Changes

Ignite will log additional internal performance statistics to profiling files. The format is similar to WAL logging.

A single disk-writer thread and an off-heap memory buffer will be used to minimize the effect on performance. The maximum file size and the buffer size can be configured at startup.
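To illustrate the idea, the sketch below has worker threads append records to an off-heap (direct) buffer while a single background thread writes filled buffers to a file. It is an assumption-laden illustration, not the Ignite implementation: the class `BufferedStatWriterSketch`, its constructor parameters, the two-buffer swap, and the policy of dropping records when the buffer is full are all hypothetical choices.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BufferedStatWriterSketch implements AutoCloseable {
    private final FileChannel ch;
    private final int flushSize;
    private final long maxFileSize;
    private final Thread writer;

    private ByteBuffer active;     // Receives new records; guarded by this.
    private ByteBuffer spare;      // Owned by the writer thread between swaps.
    private boolean stopped;       // Guarded by this.
    private volatile long written; // Updated by the writer thread only.

    public BufferedStatWriterSketch(Path file, int bufSize, int flushSize, long maxFileSize) throws IOException {
        this.ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE,
            StandardOpenOption.TRUNCATE_EXISTING);
        this.active = ByteBuffer.allocateDirect(bufSize);
        this.spare = ByteBuffer.allocateDirect(bufSize);
        this.flushSize = flushSize;
        this.maxFileSize = maxFileSize;
        this.writer = new Thread(this::writeLoop, "perf-stat-writer");
        this.writer.setDaemon(true);
        this.writer.start();
    }

    /** Appends a record without touching the disk; returns false if the record was dropped. */
    public synchronized boolean log(byte[] rec) {
        if (stopped || rec.length > active.remaining())
            return false;

        active.put(rec);

        if (active.position() >= flushSize)
            notifyAll(); // Enough data for a batch: wake the writer thread.

        return true;
    }

    private void writeLoop() {
        try {
            while (true) {
                ByteBuffer full;
                boolean last;

                synchronized (this) {
                    while (!stopped && active.position() < flushSize)
                        wait();

                    // Swap buffers so producers keep logging while this thread writes.
                    full = active;
                    active = spare;
                    spare = full;
                    last = stopped;
                }

                full.flip();
                while (full.hasRemaining())
                    written += ch.write(full);
                full.clear();

                if (last)
                    return;

                if (written >= maxFileSize) {
                    // Assumption: collection simply stops once the size limit is exceeded.
                    synchronized (this) {
                        stopped = true;
                    }
                    return;
                }
            }
        }
        catch (IOException | InterruptedException e) {
            synchronized (this) {
                stopped = true;
            }
        }
    }

    /** Number of bytes written to the file so far. */
    public long bytesWritten() {
        return written;
    }

    /** Flushes the remaining records (unless already stopped) and closes the file. */
    @Override public void close() throws IOException {
        synchronized (this) {
            stopped = true;
            notifyAll();
        }

        try {
            writer.join();
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        ch.close();
    }
}
```

The point of the design is that `log` only copies bytes into memory under a short lock, so the cost of disk I/O is paid by the dedicated writer thread rather than by the threads executing the workload.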

A new extension module, performance-statistics-ext, will be introduced. It will contain the tool to build the report: build-report.sh(bat). Aggregated statistics are stored in JSON format and then rendered in the report.

The report is based on the Bootstrap library and can be viewed in a browser offline.

Management

1) JMX: 

PerformanceStatisticsMBean
  • void start() // Start collecting performance statistics in the cluster.
  • void stop() // Stop collecting performance statistics in the cluster.
  • boolean enabled() // True if collecting performance statistics is enabled.
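As an illustration of how such a bean is exposed and invoked, the sketch below registers a stand-in implementation with the platform MBean server using the standard `javax.management` API. Only the three method signatures come from the list above. The implementation class `LocalPerformanceStatistics`, the wrapper class, and the object name are hypothetical, and the stand-in only toggles a local flag instead of managing a cluster.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

public class PerfStatJmxSketch {
    /** Management interface with the operations listed above. */
    public interface PerformanceStatisticsMBean {
        void start();

        void stop();

        boolean enabled();
    }

    /** Stand-in implementation (hypothetical): toggles a local flag only. */
    public static class LocalPerformanceStatistics implements PerformanceStatisticsMBean {
        private volatile boolean enabled;

        @Override public void start() {
            enabled = true;
        }

        @Override public void stop() {
            enabled = false;
        }

        @Override public boolean enabled() {
            return enabled;
        }
    }

    /** Registers the bean under the given name and returns the parsed object name. */
    public static ObjectName register(MBeanServer srv, PerformanceStatisticsMBean bean, String name)
        throws Exception {
        ObjectName objName = new ObjectName(name);

        // StandardMBean exposes the bean through the explicitly given management interface.
        srv.registerMBean(new StandardMBean(bean, PerformanceStatisticsMBean.class), objName);

        return objName;
    }

    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        // The object name is an example value, not the name Ignite would use.
        ObjectName name = register(srv, new LocalPerformanceStatistics(), "org.example:type=PerformanceStatistics");

        // All three methods are JMX operations (none follows the getter naming pattern).
        srv.invoke(name, "start", new Object[0], new String[0]);
        System.out.println(srv.invoke(name, "enabled", new Object[0], new String[0]));

        srv.unregisterMBean(name);
    }
}
```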

2) The control.sh utility. Its functionality mirrors the JMX interface.

3) System properties:

  • IGNITE_PERF_STAT_FILE_MAX_SIZE - Performance statistics maximum file size in bytes. Collection of performance statistics stops when this size is exceeded.
  • IGNITE_PERF_STAT_BUFFER_SIZE - Performance statistics offheap buffer size in bytes.
  • IGNITE_PERF_STAT_FLUSH_SIZE - Performance statistics minimal batch size to flush in bytes.
  • IGNITE_PERF_STAT_CACHED_STRINGS_THRESHOLD - Performance statistics maximum cached strings threshold. String caching stops once the threshold is exceeded.
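A minimal sketch of how these settings might be read from JVM system properties (for example, passed as `-DIGNITE_PERF_STAT_BUFFER_SIZE=...`) is shown below. The class name and the default values are placeholders chosen for illustration. They are not taken from the proposal and may differ from the defaults Ignite uses.

```java
public final class PerfStatSettingsSketch {
    // Placeholder defaults for illustration only (assumptions, not Ignite's values).
    private static final long DFLT_FILE_MAX_SIZE = 1024L * 1024 * 1024;
    private static final int DFLT_BUFFER_SIZE = 16 * 1024 * 1024;
    private static final int DFLT_FLUSH_SIZE = 4 * 1024 * 1024;
    private static final int DFLT_CACHED_STRINGS_THRESHOLD = 1024;

    public final long fileMaxSize;
    public final int bufferSize;
    public final int flushSize;
    public final int cachedStringsThreshold;

    private PerfStatSettingsSketch(long fileMaxSize, int bufferSize, int flushSize, int cachedStringsThreshold) {
        this.fileMaxSize = fileMaxSize;
        this.bufferSize = bufferSize;
        this.flushSize = flushSize;
        this.cachedStringsThreshold = cachedStringsThreshold;
    }

    /** Reads the four properties, falling back to the placeholder default when one is unset. */
    public static PerfStatSettingsSketch fromSystemProperties() {
        return new PerfStatSettingsSketch(
            Long.getLong("IGNITE_PERF_STAT_FILE_MAX_SIZE", DFLT_FILE_MAX_SIZE),
            Integer.getInteger("IGNITE_PERF_STAT_BUFFER_SIZE", DFLT_BUFFER_SIZE),
            Integer.getInteger("IGNITE_PERF_STAT_FLUSH_SIZE", DFLT_FLUSH_SIZE),
            Integer.getInteger("IGNITE_PERF_STAT_CACHED_STRINGS_THRESHOLD", DFLT_CACHED_STRINGS_THRESHOLD));
    }

    public static void main(String[] args) {
        PerfStatSettingsSketch s = fromSystemProperties();

        System.out.println("fileMaxSize=" + s.fileMaxSize + ", bufferSize=" + s.bufferSize
            + ", flushSize=" + s.flushSize + ", cachedStringsThreshold=" + s.cachedStringsThreshold);
    }
}
```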

Risks and Assumptions

Enabling profiling mode will cause performance degradation.

Discussion Links

Dev-list discussion.

Report example

[Images: five screenshots of an example report]

Reference Links


  1. https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF94176
  2. http://www.dba-oracle.com/t_sample_awr_report.htm
  3. http://expertoracle.com/2018/02/06/performance-tuning-basics-15-awr-report-analysis/
  4. https://github.com/darold/pgbadger
  5. https://pgmetrics.io/docs/index.html#example
  6. https://powa.readthedocs.io/en/latest/

Tickets

ASF JIRA: IGNITE-12666