ID | IEP-58 |
Author | Alexander Belyak |
Sponsor | |
Created | |
Status | DRAFT |
The main goal is to speed up SQL query execution.
Currently, we have only row count statistics to choose optimal query plans.
Cardinality - we always operate with cardinality (not selectivity). In this document, cardinality means the number of distinct values in some set (column values in a partition, on a node, or in the whole table). Note that there are other definitions: http://dbaparadise.com/2017/08/5-things-to-better-understand-selectivity-and-cardinality/
HLL - the basic algorithm to calculate cardinality in a distributed way. In a few words: we take a hash of each value and count the maximum number of leading zeros over all those hashes. Intuitively, to get a hash with 10 leading zeros we would need about 2^10 attempts, so the cardinality of our set is roughly 2^10. See https://github.com/aggregateknowledge/java-hll for more details.
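As a rough illustration of that leading-zeros intuition (not the real HLL, which keeps many registers and averages them to reduce variance), a single-register sketch might look like:

```java
import java.util.List;

// Simplified illustration of the leading-zeros idea behind HLL.
// A real HLL implementation (e.g. java-hll) keeps many registers and
// averages them; this single-register sketch is very noisy and only
// shows the intuition described above.
public class LeadingZeroSketch {
    public static long estimateCardinality(List<String> values) {
        int maxLeadingZeros = 0;
        for (String v : values) {
            int h = v.hashCode();
            if (h == 0)
                continue; // degenerate hash: all bits zero, skip it
            maxLeadingZeros = Math.max(maxLeadingZeros, Integer.numberOfLeadingZeros(h));
        }
        // If the "luckiest" hash has k leading zeros, we have probably
        // seen on the order of 2^k distinct values.
        return 1L << maxLeadingZeros;
    }
}
```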
Heavy Hitters - a set of the most frequently used values. Useful when, for example, 99% of a table contains only 5 different values, while the last percent contains another 10^5 values. With HH(64) we can try to track those 5 values and build more optimal query plans.
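One classic space-bounded way to track such frequent values is the Misra-Gries summary; the sketch below assumes HH(64) means "keep at most 64 candidate counters" (the capacity is a parameter here), and is illustrative rather than Ignite's actual structure:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Space-bounded frequent-values tracker (Misra-Gries summary).
// Keeps at most `capacity` counters; values that dominate the stream
// (like the 5 values covering 99% of a table) survive the decrements.
public class HeavyHitters<T> {
    private final int capacity;
    private final Map<T, Long> counters = new HashMap<>();

    public HeavyHitters(int capacity) {
        this.capacity = capacity;
    }

    public void add(T value) {
        Long cnt = counters.get(value);
        if (cnt != null)
            counters.put(value, cnt + 1);
        else if (counters.size() < capacity)
            counters.put(value, 1L);
        else {
            // No free counter: decrement all, dropping those that hit zero.
            Iterator<Map.Entry<T, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<T, Long> e = it.next();
                if (e.getValue() == 1)
                    it.remove();
                else
                    e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Candidate heavy hitters with their (under)estimated counts. */
    public Map<T, Long> candidates() {
        return counters;
    }
}
```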
Histograms - row counts over subsets (usually between a min border and a max border) of a table's rows. They help to better estimate the number of rows returned by a query expression. There is a good survey of different types of histograms at https://docs.google.com/document/d/1JyPodTTwCtXBQWV-A_32Bfa8k0lfV7IX6KbKi0IDVXY/edit .
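To show how a histogram feeds a row-count estimate, here is a minimal equi-width sketch; the bucket layout and the uniform-within-bucket assumption are illustrative, not a format this document commits to:

```java
// Minimal equi-width histogram: row counts per bucket over [min, max).
// Used to estimate how many rows a range predicate will return,
// assuming values are uniform within each bucket.
public class EquiWidthHistogram {
    private final double min;
    private final double width;
    private final long[] buckets;

    public EquiWidthHistogram(double min, double max, int bucketCnt) {
        this.min = min;
        this.width = (max - min) / bucketCnt;
        this.buckets = new long[bucketCnt];
    }

    public void add(double v) {
        int idx = (int)((v - min) / width);
        if (idx < 0) idx = 0;
        if (idx >= buckets.length) idx = buckets.length - 1; // clamp v == max
        buckets[idx]++;
    }

    /** Rough estimate of rows with value < bound. */
    public long estimateLessThan(double bound) {
        long rows = 0;
        for (int i = 0; i < buckets.length; i++) {
            double bucketHi = min + (i + 1) * width;
            if (bucketHi <= bound)
                rows += buckets[i]; // whole bucket qualifies
            else {
                double bucketLo = min + i * width;
                if (bound > bucketLo) // partial bucket: assume uniform spread
                    rows += (long)(buckets[i] * (bound - bucketLo) / width);
                break;
            }
        }
        return rows;
    }
}
```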
Raw statistics - statistics containing raw data that may require further aggregation. For example, an HLL table. They can be used to update local statistics in case of rebalance.
OSC – Object Statistics Coordinator, the node responsible for collecting statistics for a particular object. It can be derived from the object name; we need such a dedicated node just to coordinate statistics collection in one place.
Stage 1 consists of:
1) Local statistics collection to speed up H2 query execution.
2) Statistics persistence to store statistics between restarts.
3) API to collect statistics manually (by SQL or Java API)
4) View to show collected statistics
This part consists of two processes: the statistics collection process itself and acquiring statistics by the client.
Acquiring statistics by client process:
To request statistics following messages are used:
StatsRequestMessage – a request to get statistics information. Contains a request id; a collection of objects to collect statistics for; a type.
StatsResponseMessage – a response with the latest statistics data. Contains the request id and a collection of statistics.
Statistics collection process:
Message graph will be:
Messages
To collect statistics following messages are used:
StatsCollectionRequestMessage – a request with an id; a collection of objects (schema, object name, optionally – column names) to collect statistics for; a type (local or global request).
StatsCollectionCancelRequestMessage – a message to cancel statistics collection. Contains the request id only.
StatsPropagationRequest – a request to write some statistics to the recipient's local store. Used to propagate partition statistics to the backups' stores.
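A plain-Java sketch of the message shapes listed above; the field names and types are assumptions, and real Ignite messages would also implement the Message interface and serialization logic:

```java
import java.util.List;
import java.util.UUID;

// Illustrative POJO shapes for the statistics messages described in
// the text. Field names/types are assumptions, not a committed format.
public class StatsMessages {
    /** Table (or other object) to collect statistics for. */
    public static class StatsKey {
        public final String schema;
        public final String obj;
        public final List<String> cols; // null means "all columns"

        public StatsKey(String schema, String obj, List<String> cols) {
            this.schema = schema;
            this.obj = obj;
            this.cols = cols;
        }
    }

    public enum StatsType { LOCAL, GLOBAL }

    public static class StatsCollectionRequestMessage {
        public final UUID reqId;
        public final List<StatsKey> keys;
        public final StatsType type;

        public StatsCollectionRequestMessage(UUID reqId, List<StatsKey> keys, StatsType type) {
            this.reqId = reqId;
            this.keys = keys;
            this.type = type;
        }
    }

    /** Cancellation carries only the id of the request to cancel. */
    public static class StatsCollectionCancelRequestMessage {
        public final UUID reqId;

        public StatsCollectionCancelRequestMessage(UUID reqId) {
            this.reqId = reqId;
        }
    }
}
```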
The collector can cache local node statistics or request them again later to calculate global statistics. See (Partial statistics update).
As shown in the pictures above, there are three types of statistics:
1) Partition statistics
2) Local statistics
3) Global statistics
with different collection areas. They can all contain the same data (min/max/cardinality/nulls and so on). See (Pluggable statistics).
To save collected statistics, it is proposed to use the local metastorage. A server node will store statistics for each partition of potentially each object. A client node (or a server node planning a query execution plan) will request global statistics lazily from all server nodes.
After start, each server node will read nothing until local statistics are requested.
On getting such a request, the node will aggregate local statistics from the partition-level ones or collect new ones.
On getting a global statistics request, a server node won't do anything until a statistics collection request is received. On getting such a request, the node will renew statistics only if there is no previously received request for the same objects with a smaller unique reqId.
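The "renew only once" rule above can be sketched as follows; reqIds are assumed here to be monotonically increasing longs, while the real ids may well be UUID-based:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the dedup rule: a collection request triggers re-collection
// for an object only if no earlier request (smaller reqId) for the same
// object has already been seen. Monotonic long reqIds are an assumption.
public class StatsRenewTracker {
    private final Map<String, Long> minSeenReqId = new HashMap<>();

    /** @return true if statistics for {@code obj} should be renewed. */
    public synchronized boolean shouldRenew(String obj, long reqId) {
        Long prev = minSeenReqId.get(obj);
        if (prev != null && prev < reqId)
            return false; // an older request already triggered collection
        minSeenReqId.put(obj, reqId);
        return true;
    }
}
```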
To manually control statistics collection, there should be the following API methods:
View
To show collected statistics we can add three views:
but each node will show only the local statistics it has.
The partition statistics view will have two additional columns:
The global and local statistics views will have the same column set:
Statistics will be collected for table columns (St1). Possible objects to collect statistics for later are: combined columns (for a combined index, for example) and conditional indexes (an index with a where clause).
Stage 2 consists of:
1) Add metrics to show: collection time, last collection time/update counter, and memory consumption to cache and store statistics.
2) Implement sampling to speed up statistics collection (scan only part of the table rows).
3) Add automatic statistics collection to keep statistics up to date without user attention.
4) Partial statistics update to speed up statistics updates (scan only partitions/nodes with a lot of changes).
To make statistics collection more controlled and visible, we can expose through JMX metrics about memory, disk space and CPU time consumed by the statistics subsystem on each (client and server) node. It can even mirror the statistics views and contain an additional API to collect/drop statistics. These can be shown per object for the partition/local/global levels.
To speed up statistics collection we can use sampling. The main idea is to scan not all rows in a table, but some subset of them. To estimate the optimal row quantity, the statistics subsystem should take into account ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.1734&rep=rep1&type=pdf ) and reservoir sampling ( https://ru.wikipedia.org/wiki/Reservoir_sampling ).
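The reservoir sampling idea referenced above (Algorithm R) picks k rows uniformly at random from a stream of unknown length in one pass; this is the textbook algorithm, not Ignite's actual sampler:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): keep the first k rows, then replace
// a kept row with probability k / rowsSeen for each subsequent row.
public class ReservoirSampler {
    public static <T> List<T> sample(Iterator<T> rows, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < k)
                reservoir.add(row);
            else {
                // Pick a random position in [0, seen); replace if it lands
                // inside the reservoir, i.e. with probability k / seen.
                long j = (long)(rnd.nextDouble() * seen);
                if (j < k)
                    reservoir.set((int)j, row);
            }
        }
        return reservoir;
    }
}
```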
There are two similar processes:
From the user's point of view, both processes should work in the same way. To track the number of changes in objects we can use the following approach: add one more HLL structure to track the count of inserted, modified or removed keys and update statistics when this amount grows past some threshold. It means that every Nth insert, update or delete should be put into the local cardinality-tracking structure by taking a hash of its row key. And we should save that structure with the statistics in the local metastore.
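The threshold trigger described above can be sketched as follows; for brevity this uses a plain counter where the text proposes an HLL of changed row-key hashes (which would avoid double-counting repeated updates of the same key), and the sampling step and renew fraction are illustrative parameters:

```java
// Sketch of the obsolescence trigger: every Nth insert/update/delete is
// counted, and statistics are re-collected once the scaled change count
// exceeds a fraction of the table size. A real implementation would feed
// row-key hashes into an HLL instead of a plain counter.
public class ChangeTracker {
    private final int samplingStep;     // track every Nth modification
    private final double renewFraction; // e.g. 0.1 = renew after ~10% changed
    private long ops;
    private long trackedChanges;

    public ChangeTracker(int samplingStep, double renewFraction) {
        this.samplingStep = samplingStep;
        this.renewFraction = renewFraction;
    }

    /** Called on each insert/update/delete of a row key. */
    public void onModify(Object rowKey) {
        if (++ops % samplingStep == 0)
            trackedChanges++; // real code: hllOfChangedKeys.add(hash(rowKey))
    }

    /** @return true if enough rows changed to warrant re-collection. */
    public boolean needRenew(long tableRows) {
        return trackedChanges * samplingStep >= tableRows * renewFraction;
    }
}
```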
In addition -
Partial statistics update is another way to speed up statistics updates. If a lot of updates are detected on only some node, that node can update its own local statistics and send them to the object statistics coordinator for later aggregation. Thus, the whole cluster will use sufficiently up-to-date statistics without unnecessary scanning of old data.
Stage 3 consists of:
H2-independent statistics (its own types and comparators). The implementation will be connected with the schema-first approach.
Type-dependent statistics (for boolean, byte and tiny int we can track each value natively).
Advanced statistics (heavy hitters, histograms and so on).
To free the statistics collection engine from H2 we need some schema, so this part is blocked by the schema-first approach. But the main idea is to collect statistics (get and compare values) without H2 code.
The main idea is to collect type-optimized statistics. We can split data types into: thin, fixed-length, variable-length, and complex types (like Geometry or JSON in the future).
Thin range values - we just store a count for each possible value and a NULLs counter.
Fixed-length - they are all comparable, so we can store the same statistics for all of them: MIN/MAX, CARD, NULLs counter.
We can avoid storing their size because it can be extracted from the column type.
Variable-length - they can be split into comparable and incomparable ones:
Variable-length comparable - MIN/MAX, CARD, NULLs counter + average length.
Variable-length incomparable - we should only store a NULLs counter and the average length.
TBD: what about histograms and heavy hitters for such types? We can choose some comparison and use a histogram to estimate the result count of "like" queries.
Complex types – for now it is enough to treat them like byte[], but in the future we can calculate some type-specific statistics metrics.
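For thin types the "count each value natively" idea is straightforward; here is an illustrative per-value tracker for a byte column (the same shape works for boolean and tiny int):

```java
// For "thin" types (boolean, byte, tiny int) the whole value range fits
// in a small array, so we can track an exact count per value plus NULLs
// instead of MIN/MAX and an approximate cardinality. Illustrative only.
public class ByteColumnStats {
    private final long[] valueCnt = new long[256]; // one slot per byte value
    private long nulls;

    public void add(Byte v) {
        if (v == null)
            nulls++;
        else
            valueCnt[v & 0xFF]++;
    }

    public long nulls() {
        return nulls;
    }

    /** Exact cardinality: number of byte values seen at least once. */
    public int cardinality() {
        int card = 0;
        for (long cnt : valueCnt)
            if (cnt > 0)
                card++;
        return card;
    }
}
```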
Advanced statistics means collecting histograms, heavy hitters and any other complex metrics designed to improve the quality of query plans.
Stage 4 consists of:
To collect any kind of statistics for:
one should implement its own statistics collector.
Statistics are collected per partition, then aggregated locally, then aggregated globally, so each additional statistic should support aggregation.
All you need to do to support new statistics is to implement your own statistics collector and pluggable statistics classes. A PluggableStatisticsCollector should be able to:
TBD: SamplingSetup (should it return or consume some sampling settings to correct its behaviour with sampling?)
TBD: human-readable representation to show in views.
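An assumed shape of such a pluggable collector, with a trivial NULLs-counter implementation to show the partition-then-local-then-global aggregation; the method names are illustrative, not a committed API:

```java
import java.util.List;

// Assumed shape of a pluggable collector: consume rows while scanning a
// partition, then merge partition results into local statistics and
// local results into global ones. Names are illustrative only.
public interface PluggableStatisticsCollector<S> {
    /** Consume one row value while scanning a partition. */
    void collect(Object value);

    /** Statistics accumulated so far (partition level). */
    S statistics();

    /** Merge partition/local statistics into the next level. */
    S aggregate(List<S> parts);
}

// Minimal example implementation: a NULLs counter.
class NullsCollector implements PluggableStatisticsCollector<Long> {
    private long nulls;

    @Override public void collect(Object value) {
        if (value == null)
            nulls++;
    }

    @Override public Long statistics() {
        return nulls;
    }

    @Override public Long aggregate(List<Long> parts) {
        long sum = 0;
        for (Long p : parts)
            sum += p;
        return sum;
    }
}
```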
Just as an idea: a node could support some API to export/import statistics. It can be useful for tests, benchmarks and other purposes.
Just as an idea: there can be objects whose statistics are more accurate than statistics over the data tables alone. For example, conditional indexes (a normal index with a where clause). That's why we collect "object" rather than "table" statistics.