Motivation

Ignite currently has no built-in profiling tool for user operations and internal processes. Such a tool would collect performance statistics and produce a human-readable report, which would help to analyze the workload and to tune configuration and applications.

Examples of similar tools in other products: AWR [1] [2] [3] (Oracle); pgbadger [4], pgmetrics [5], powa [6] (PostgreSQL).

Description

We should provide a way to execute cluster profiling. Consider the following scenario:

  • Enable profiling mode.
  • Execute some arbitrary workload.
  • Collect profiling info.
  • Run the tool that creates a report containing the workload statistics.

Proposed Changes

Ignite will provide a public facade to manage profiling mode:

ignite.profiling().enable(); // Turns on profiling mode.
ignite.profiling().disable(); // Turns off profiling mode.

ignite.profiling().isEnabled(); // Returns whether profiling mode is turned on.
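
The scenario from the Description section could be driven through this facade roughly as follows. This is a minimal sketch, not the proposed implementation: the `Profiling` interface name and the in-memory `InMemoryProfiling` stand-in are hypothetical and exist only so the example runs without a cluster. Only the three method names come from the proposal.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical shape of the object returned by ignite.profiling(). */
interface Profiling {
    void enable();
    void disable();
    boolean isEnabled();
}

/** In-memory stand-in (an assumption for illustration only). */
class InMemoryProfiling implements Profiling {
    private final AtomicBoolean enabled = new AtomicBoolean();

    @Override public void enable() { enabled.set(true); }
    @Override public void disable() { enabled.set(false); }
    @Override public boolean isEnabled() { return enabled.get(); }
}

class ProfilingFacadeSketch {
    /** Runs the workload with profiling on and always turns it off afterwards. */
    static void profile(Profiling profiling, Runnable workload) {
        profiling.enable();

        try {
            workload.run();
        }
        finally {
            profiling.disable();
        }
    }

    public static void main(String[] args) {
        Profiling profiling = new InMemoryProfiling();

        profile(profiling, () -> System.out.println("during workload: " + profiling.isEnabled()));

        System.out.println("after workload: " + profiling.isEnabled());
    }
}
```

Disabling in a `finally` block keeps a failed workload from leaving the cluster in profiling mode, which matters given the performance cost noted under Risks and Assumptions.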

Profiling mode can also be managed from the CLI (and JMX):

control.sh --profiling  // Prints current profiling mode status.
control.sh --profiling enable // Turns on profiling mode.
control.sh --profiling disable // Turns off profiling mode.

Ignite will provide a public SPI interface (ProfilingSpi) to log statistics. It can be configured via IgniteConfiguration. It defines the following methods:

  • startProfiling(); // Starts profiling.
  • stopProfiling(); // Stops profiling.
  • log(String info); // Logs operation statistics.
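
As an illustration of the contract these three methods imply, here is a minimal sketch. The interface body is reconstructed from the method list above. The `InMemoryProfilingSpi` class is hypothetical, and the choice to drop records while profiling is stopped is an assumption rather than something the proposal specifies. An actual SPI would presumably also extend Ignite's base SPI interface, which is omitted here to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch of the proposed SPI, reduced to the three listed methods. */
interface ProfilingSpi {
    void startProfiling();
    void stopProfiling();
    void log(String info);
}

/** Hypothetical implementation that keeps records in memory. */
class InMemoryProfilingSpi implements ProfilingSpi {
    private final List<String> records = Collections.synchronizedList(new ArrayList<>());

    private volatile boolean started;

    @Override public void startProfiling() { started = true; }

    @Override public void stopProfiling() { started = false; }

    /** Assumption: records logged while profiling is off are silently dropped. */
    @Override public void log(String info) {
        if (started)
            records.add(info);
    }

    /** Returns a snapshot of the collected records. */
    List<String> records() {
        synchronized (records) {
            return new ArrayList<>(records);
        }
    }
}
```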

The internal processor (ProfilingProcessor) will be used to manage profiling across the whole cluster. It will be available from KernalContext.

The new ignite-profiling module will contain:

  • A default implementation (LogProfilingSpiImpl) based on asynchronous logging to a configured file.
  • A script that collects logs from the nodes and builds the report: report.sh (report.bat)
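
One way asynchronous logging to a file could work is to hand each record to a single background writer thread, so that the profiled operation never waits for disk I/O. The sketch below illustrates that idea only. `AsyncFileProfilingLog` is a hypothetical class, not the actual LogProfilingSpiImpl, and the one-record-per-line format and the 10-second drain timeout are assumptions.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical async file logger exposing the three SPI methods. */
class AsyncFileProfilingLog {
    private final Path file;

    private ExecutorService writerThread;

    private BufferedWriter writer;

    AsyncFileProfilingLog(Path file) {
        this.file = file;
    }

    /** Opens (and truncates) the file and starts the single writer thread. */
    synchronized void startProfiling() throws IOException {
        if (writerThread != null)
            return;

        writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        writerThread = Executors.newSingleThreadExecutor();
    }

    /** Queues the record; the write itself happens on the writer thread. */
    synchronized void log(String info) {
        if (writerThread == null)
            return; // Profiling is off.

        BufferedWriter w = writer;

        writerThread.execute(() -> {
            try {
                w.write(info);
                w.newLine();
            }
            catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    /** Waits for queued records to be written, then closes the file. */
    synchronized void stopProfiling() throws IOException, InterruptedException {
        if (writerThread == null)
            return;

        writerThread.shutdown();
        writerThread.awaitTermination(10, TimeUnit.SECONDS);

        writer.close();

        writerThread = null;
        writer = null;
    }
}
```

A single-threaded executor preserves the order in which records were submitted, which keeps the resulting file easy to process by a report script.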

Performance report

The performance report will be in a human-readable text format (and later in HTML) and should contain:

  • Ignite and plugins versions, profiling start/end time
  • Query (SQL, scan, ...) timings:
    • Queries that took up the most time
    • Slowest queries
    • Most frequent queries
    • Failing queries
  • User tasks timings (similar to queries timings)
  • Cache operations timings:
    • Get
    • Put
    • Remove
    • RemoveAndGet
    • PutAndGet
    • Invoke
    • Lock
    • Create/destroy caches
  • Transactions commit/rollback timings
  • Checkpoints statistics
  • PME statistics

Statistics will also be aggregated per node.
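
The four query breakdowns listed above (most total time, slowest, most frequent, failing) can all be derived from the same per-query totals. The following sketch shows one possible aggregation. The class `QueryReportSketch` and the record layout (query text, duration in nanoseconds, success flag) are assumptions made for illustration; the proposal does not define the log format.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical report aggregation over logged query executions. */
class QueryReportSketch {
    /** Totals for one query text. */
    static final class Stats {
        final String query;

        long count, failures, totalNanos, maxNanos;

        Stats(String query) {
            this.query = query;
        }
    }

    private final Map<String, Stats> byQuery = new HashMap<>();

    /** Folds one executed query into the totals for its query text. */
    void record(String query, long durationNanos, boolean success) {
        Stats s = byQuery.computeIfAbsent(query, Stats::new);

        s.count++;
        s.totalNanos += durationNanos;
        s.maxNanos = Math.max(s.maxNanos, durationNanos);

        if (!success)
            s.failures++;
    }

    /** Top n query texts by the given metric, descending; ties are ordered by query text. */
    private List<String> top(int n, Comparator<Stats> byMetric) {
        return byQuery.values().stream()
            .sorted(byMetric.reversed().thenComparing(s -> s.query))
            .limit(n)
            .map(s -> s.query)
            .collect(Collectors.toList());
    }

    /** Queries that took up the most time in total. */
    List<String> mostTimeConsuming(int n) {
        return top(n, Comparator.comparingLong(s -> s.totalNanos));
    }

    /** Queries with the longest single execution. */
    List<String> slowest(int n) {
        return top(n, Comparator.comparingLong(s -> s.maxNanos));
    }

    /** Queries executed most often. */
    List<String> mostFrequent(int n) {
        return top(n, Comparator.comparingLong(s -> s.count));
    }

    /** Queries that failed at least once, ordered by query text. */
    List<String> failing() {
        return byQuery.values().stream()
            .filter(s -> s.failures > 0)
            .map(s -> s.query)
            .sorted()
            .collect(Collectors.toList());
    }
}
```

Keeping one `Stats` object per query text means the report tool only needs a single pass over the collected logs, and the same structure could be keyed by node ID to produce the per-node view.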

Additional investigation is required to gather the following statistics:

  • CPU time per query/task/cache operations
  • Disk, I/O wait/read/write per query/task/cache operations
  • Query parse time
  • Lock time
  • User time
  • Messages process timings

These statistics will provide:

  • Top queries/operations by CPU time
  • Top queries/operations by I/O time
  • The operations that use the most resources

Phase 1

The first phase will implement:

  • Profiling public API and default implementation
  • Profiling management through the Java API, CLI and JMX
  • Gathering overall and timing statistics for queries, tasks, cache operations, checkpoints and PMEs
  • Tool to create the report

Phase 2

The second phase will investigate and implement:

  • Gathering CPU time per operation
  • Gathering I/O wait time and read/write counts
  • Lock time per operation
  • Display of these statistics in the report

Public API changes

A new interface will be added: ProfilingSpi.

A new Ignite facade will be added: ignite.profiling().

A new module will be created: ignite-profiling.

Corner cases

Node left during profiling

A node leaving the cluster will not affect the cluster profiling mode.

Node join during profiling

A joining node will set up its profiling mode from the DiscoveryDataBag provided by the cluster.

Risks and Assumptions

Enabling profiling mode will cause performance degradation.

Discussion Links

// Links to discussions on the devlist, if applicable.

Reference Links

  1. https://docs.oracle.com/cd/E11882_01/server.112/e41573/autostat.htm#PFGRF94176
  2. http://www.dba-oracle.com/t_sample_awr_report.htm
  3. http://expertoracle.com/2018/02/06/performance-tuning-basics-15-awr-report-analysis/
  4. https://github.com/darold/pgbadger
  5. https://pgmetrics.io/docs/index.html#example
  6. https://powa.readthedocs.io/en/latest/

Tickets

// Links or report with relevant JIRA tickets.
