ID | IEP-58 |
Author | Alexander Belyak |
Sponsor | |
Created | |
Status | DRAFT |
The main goal is to speed up SQL query execution.
Currently, we have only row count statistics to choose optimal query plans.
Cardinality - we always operate with cardinality (not selectivity). In this document, cardinality means the number of distinct values in some set (column values in a partition, on a node, or in the whole table). Note that there are other definitions: http://dbaparadise.com/2017/08/5-things-to-better-understand-selectivity-and-cardinality/
HLL - the basic algorithm to calculate cardinality in a distributed way. In a few words: we take a hash of each value and count the maximum number of leading zeros over all those hashes. Intuitively, to get a hash with 10 leading zeros we would need about 2^10 attempts, so the cardinality of our set is roughly 2^10. See https://github.com/aggregateknowledge/java-hll for more details.
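As a rough illustration of that leading-zeros intuition (not the real HLL, which keeps many registers and averages them to reduce variance), a single-register sketch might look like:

```java
import java.util.List;

// Simplified illustration of the leading-zeros idea behind HLL.
// A real HLL implementation (e.g. java-hll) keeps many registers and
// averages them; this single-register sketch is very noisy and only
// shows the intuition described above.
public class LeadingZeroSketch {
    public static long estimateCardinality(List<String> values) {
        int maxLeadingZeros = 0;
        for (String v : values) {
            int h = v.hashCode();
            if (h == 0)
                continue; // degenerate hash: all bits zero, skip it
            maxLeadingZeros = Math.max(maxLeadingZeros, Integer.numberOfLeadingZeros(h));
        }
        // If the "luckiest" hash has k leading zeros, we have probably
        // seen on the order of 2^k distinct values.
        return 1L << maxLeadingZeros;
    }
}
```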
Heavy Hitters - a set of the most frequently used values. Useful when, for example, 99% of a table contains only 5 different values, while the last percent contains another 10^5 values. With HH(64) we can try to track those 5 values and build more optimal query plans.
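One classic space-bounded way to track such frequent values is the Misra-Gries summary; the sketch below assumes HH(64) means "keep at most 64 candidate counters" (the capacity is a parameter here), and is illustrative rather than Ignite's actual structure:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Space-bounded frequent-values tracker (Misra-Gries summary).
// Keeps at most `capacity` counters; values that dominate the stream
// (like the 5 values covering 99% of a table) survive the decrements.
public class HeavyHitters<T> {
    private final int capacity;
    private final Map<T, Long> counters = new HashMap<>();

    public HeavyHitters(int capacity) {
        this.capacity = capacity;
    }

    public void add(T value) {
        Long cnt = counters.get(value);
        if (cnt != null)
            counters.put(value, cnt + 1);
        else if (counters.size() < capacity)
            counters.put(value, 1L);
        else {
            // No free counter: decrement all, dropping those that hit zero.
            Iterator<Map.Entry<T, Long>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<T, Long> e = it.next();
                if (e.getValue() == 1)
                    it.remove();
                else
                    e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Candidate heavy hitters with their (under)estimated counts. */
    public Map<T, Long> candidates() {
        return counters;
    }
}
```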
Histograms - row counts over subsets (usually between a min border and a max border) of a table's rows. They help to better estimate the number of rows returned by a query expression. There is a good survey of different types of histograms at https://docs.google.com/document/d/1JyPodTTwCtXBQWV-A_32Bfa8k0lfV7IX6KbKi0IDVXY/edit .
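To show how a histogram feeds a row-count estimate, here is a minimal equi-width sketch; the bucket layout and the uniform-within-bucket assumption are illustrative, not a format this document commits to:

```java
// Minimal equi-width histogram: row counts per bucket over [min, max).
// Used to estimate how many rows a range predicate will return,
// assuming values are uniform within each bucket.
public class EquiWidthHistogram {
    private final double min;
    private final double width;
    private final long[] buckets;

    public EquiWidthHistogram(double min, double max, int bucketCnt) {
        this.min = min;
        this.width = (max - min) / bucketCnt;
        this.buckets = new long[bucketCnt];
    }

    public void add(double v) {
        int idx = (int)((v - min) / width);
        if (idx < 0) idx = 0;
        if (idx >= buckets.length) idx = buckets.length - 1; // clamp v == max
        buckets[idx]++;
    }

    /** Rough estimate of rows with value < bound. */
    public long estimateLessThan(double bound) {
        long rows = 0;
        for (int i = 0; i < buckets.length; i++) {
            double bucketHi = min + (i + 1) * width;
            if (bucketHi <= bound)
                rows += buckets[i]; // whole bucket qualifies
            else {
                double bucketLo = min + i * width;
                if (bound > bucketLo) // partial bucket: assume uniform spread
                    rows += (long)(buckets[i] * (bound - bucketLo) / width);
                break;
            }
        }
        return rows;
    }
}
```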
Raw statistics - statistics containing raw data that may require further aggregation. For example, an HLL table. They can be used to update local statistics in case of rebalance.
OSC – Object Statistics Coordinator, the node responsible for collecting statistics for a particular object. It can be derived from the object name; we need such a dedicated node just to coordinate statistics collection in one place.
Stage 1 consists of:
1) Local statistics collection to speed up H2 query execution.
2) Statistics persistence to store statistics between restarts.
3) API to collect statistics manually (by SQL or Java API)
4) View to show collected statistics
This part consists of two processes: the statistics collection process itself and acquiring statistics by the client.
Acquiring statistics by client process:
To request statistics following messages are used:
StatsRequestMessage – a request to get statistics information. Contains a request id; a collection of objects to collect statistics for; a type.
StatsResponseMessage – a response with the latest statistics data. Contains the request id and a collection of statistics.
Statistics collection process:
Message graph will be:
Messages
To collect statistics following messages are used:
StatsCollectionRequestMessage – a request with an id; a collection of objects (schema, object name, optionally – column names) to collect statistics for; a type (local or global request).
StatsCollectionCancelRequestMessage – a message to cancel statistics collection. Contains the request id only.
StatsPropagationRequest – a request to write some statistics to the recipient's local store. Used to propagate partition statistics to the backups' stores.
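A plain-Java sketch of the message shapes listed above; the field names and types are assumptions, and real Ignite messages would also implement the Message interface and serialization logic:

```java
import java.util.List;
import java.util.UUID;

// Illustrative POJO shapes for the statistics messages described in
// the text. Field names/types are assumptions, not a committed format.
public class StatsMessages {
    /** Table (or other object) to collect statistics for. */
    public static class StatsKey {
        public final String schema;
        public final String obj;
        public final List<String> cols; // null means "all columns"

        public StatsKey(String schema, String obj, List<String> cols) {
            this.schema = schema;
            this.obj = obj;
            this.cols = cols;
        }
    }

    public enum StatsType { LOCAL, GLOBAL }

    public static class StatsCollectionRequestMessage {
        public final UUID reqId;
        public final List<StatsKey> keys;
        public final StatsType type;

        public StatsCollectionRequestMessage(UUID reqId, List<StatsKey> keys, StatsType type) {
            this.reqId = reqId;
            this.keys = keys;
            this.type = type;
        }
    }

    /** Cancellation carries only the id of the request to cancel. */
    public static class StatsCollectionCancelRequestMessage {
        public final UUID reqId;

        public StatsCollectionCancelRequestMessage(UUID reqId) {
            this.reqId = reqId;
        }
    }
}
```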
The collector can cache local node statistics or request them again later to calculate global statistics. See (Partial statistics update).
As shown in the pictures above, there are three types of statistics:
1) Partition statistics
2) Local statistics
3) Global statistics
with different collection areas. They can all contain the same data (min/max/cardinality/nulls and so on). See (Pluggable statistics).
To save collected statistics, it is proposed to use the local metastorage. A server node will store statistics for each partition of potentially each object. A client node (or a server node planning a query execution plan) will request global statistics lazily from all server nodes.
After start, each server node will read nothing until local statistics are requested.
On getting such a request, the node will aggregate local statistics from the partition-level ones or collect new ones.
On getting a global statistics request, a server node won't do anything until a statistics collection request is received. On getting such a request, the node will renew statistics only if there is no previously received request for the same objects with a smaller unique reqId.
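The "renew only once" rule above can be sketched as follows; reqIds are assumed here to be monotonically increasing longs, while the real ids may well be UUID-based:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the dedup rule: a collection request triggers re-collection
// for an object only if no earlier request (smaller reqId) for the same
// object has already been seen. Monotonic long reqIds are an assumption.
public class StatsRenewTracker {
    private final Map<String, Long> minSeenReqId = new HashMap<>();

    /** @return true if statistics for {@code obj} should be renewed. */
    public synchronized boolean shouldRenew(String obj, long reqId) {
        Long prev = minSeenReqId.get(obj);
        if (prev != null && prev < reqId)
            return false; // an older request already triggered collection
        minSeenReqId.put(obj, reqId);
        return true;
    }
}
```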
To manually control statistics collection, there should be the following API methods:
View
To show collected statistics we can add three views:
but each node will show only the local statistics it has.
The partition statistics view will have two additional columns:
The global and local statistics views will have the same column set:
Statistics will be collected for table columns (St1). Possible objects to collect statistics for later are: combined columns (for a combined index, for example) and conditional indexes (an index with a where clause).
Stage 2 consists of:
1) Add metrics to show: collection time, last collection time/update counter, and memory consumption to cache and store statistics.
2) Implement sampling to speed up statistics collection (scan only part of the table rows).
3) Add automatic statistics collection to keep statistics up to date without user attention.
4) Partial statistics update to speed up statistics updates (scan only partitions/nodes with a lot of changes).
To make statistics collection more controlled and visible, we can expose through JMX metrics about memory, disk space and CPU time consumed by the statistics subsystem on each (client and server) node. It can even mirror the statistics views and contain an additional API to collect/drop statistics. These can be shown per object for the partition/local/global levels.
To speed up statistics collection we can use sampling. The main idea is to scan not all rows in a table, but some subset of them. To estimate the optimal row quantity, the statistics subsystem should take into account ( http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.1734&rep=rep1&type=pdf ) and reservoir sampling ( https://ru.wikipedia.org/wiki/Reservoir_sampling ).
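The reservoir sampling idea referenced above (Algorithm R) picks k rows uniformly at random from a stream of unknown length in one pass; this is the textbook algorithm, not Ignite's actual sampler:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Reservoir sampling (Algorithm R): keep the first k rows, then replace
// a kept row with probability k / rowsSeen for each subsequent row.
public class ReservoirSampler {
    public static <T> List<T> sample(Iterator<T> rows, int k, Random rnd) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (rows.hasNext()) {
            T row = rows.next();
            seen++;
            if (reservoir.size() < k)
                reservoir.add(row);
            else {
                // Pick a random position in [0, seen); replace if it lands
                // inside the reservoir, i.e. with probability k / seen.
                long j = (long)(rnd.nextDouble() * seen);
                if (j < k)
                    reservoir.set((int)j, row);
            }
        }
        return reservoir;
    }
}
```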
There are two similar processes:
From the user's point of view, both processes should work in the same way. To track the number of changes in objects we can use the following approach: add one more HLL structure to track the count of inserted, modified or removed keys and update statistics when this amount grows past some threshold. It means that every Nth insert, update or delete should be put into the local cardinality-tracking structure by taking a hash of its row key. And we should save that structure with the statistics in the local metastore.
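The threshold trigger described above can be sketched as follows; for brevity this uses a plain counter where the text proposes an HLL of changed row-key hashes (which would avoid double-counting repeated updates of the same key), and the sampling step and renew fraction are illustrative parameters:

```java
// Sketch of the obsolescence trigger: every Nth insert/update/delete is
// counted, and statistics are re-collected once the scaled change count
// exceeds a fraction of the table size. A real implementation would feed
// row-key hashes into an HLL instead of a plain counter.
public class ChangeTracker {
    private final int samplingStep;     // track every Nth modification
    private final double renewFraction; // e.g. 0.1 = renew after ~10% changed
    private long ops;
    private long trackedChanges;

    public ChangeTracker(int samplingStep, double renewFraction) {
        this.samplingStep = samplingStep;
        this.renewFraction = renewFraction;
    }

    /** Called on each insert/update/delete of a row key. */
    public void onModify(Object rowKey) {
        if (++ops % samplingStep == 0)
            trackedChanges++; // real code: hllOfChangedKeys.add(hash(rowKey))
    }

    /** @return true if enough rows changed to warrant re-collection. */
    public boolean needRenew(long tableRows) {
        return trackedChanges * samplingStep >= tableRows * renewFraction;
    }
}
```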
In addition -
Partial statistics update is another way to speed up statistics updates. If a lot of updates are detected on only some node, that node can update its own local statistics and send them to the object statistics coordinator for later aggregation. Thus, the whole cluster will use sufficiently up-to-date statistics without unnecessary scanning of old data.
Stage 3 consists of:
H2-independent statistics (its own types and comparators). The implementation will be connected with the schema-first approach.
Type-dependent statistics (for boolean, byte and tiny int we can track each value natively).
Advanced statistics (heavy hitters, histograms and so on).
To free the statistics collection engine from H2 we need some schema, so this part is blocked by the schema-first approach. But the main idea is to collect statistics (get and compare values) without H2 code.
The main idea is to collect type-optimized statistics. We can split data types into: thin, fixed-length, variable-length, and complex types (like Geometry or JSON in the future).
Thin range values - we just store a count for each possible value and a NULLs counter.
Fixed-length - they are all comparable, so we can store the same statistics for all of them: MIN/MAX, CARD, NULLs counter.
We can avoid storing their size because it can be extracted from the column type.
Variable-length - they can be split into comparable and incomparable ones:
Variable-length comparable - MIN/MAX, CARD, NULLs counter + average length.
Variable-length incomparable - we should only store a NULLs counter and the average length.
TBD: what about histograms and heavy hitters for such types? We can choose some comparison and use a histogram to estimate the result count of "like" queries.
Complex types – for now it is enough to treat them like byte[], but in the future we can calculate some type-specific statistics metrics.
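For thin types the "count each value natively" idea is straightforward; here is an illustrative per-value tracker for a byte column (the same shape works for boolean and tiny int):

```java
// For "thin" types (boolean, byte, tiny int) the whole value range fits
// in a small array, so we can track an exact count per value plus NULLs
// instead of MIN/MAX and an approximate cardinality. Illustrative only.
public class ByteColumnStats {
    private final long[] valueCnt = new long[256]; // one slot per byte value
    private long nulls;

    public void add(Byte v) {
        if (v == null)
            nulls++;
        else
            valueCnt[v & 0xFF]++;
    }

    public long nulls() {
        return nulls;
    }

    /** Exact cardinality: number of byte values seen at least once. */
    public int cardinality() {
        int card = 0;
        for (long cnt : valueCnt)
            if (cnt > 0)
                card++;
        return card;
    }
}
```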
Advanced statistics means collecting histograms, heavy hitters and any other complex metrics designed to improve the quality of query plans.
Stage 4 consists of:
To collect any kind of statistics for:
one should implement its own statistics collector.
Statistics are collected per partition, then aggregated locally, then aggregated globally, so each additional statistic should support aggregation.
All you need to do to support new statistics is to implement your own statistics collector and pluggable statistics classes. A PluggableStatisticsCollector should be able to:
TBD: SamplingSetup (should it return or consume some sampling settings to correct its behaviour with sampling?)
TBD: human-readable representation to show in views.
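An assumed shape of such a pluggable collector, with a trivial NULLs-counter implementation to show the partition-then-local-then-global aggregation; the method names are illustrative, not a committed API:

```java
import java.util.List;

// Assumed shape of a pluggable collector: consume rows while scanning a
// partition, then merge partition results into local statistics and
// local results into global ones. Names are illustrative only.
public interface PluggableStatisticsCollector<S> {
    /** Consume one row value while scanning a partition. */
    void collect(Object value);

    /** Statistics accumulated so far (partition level). */
    S statistics();

    /** Merge partition/local statistics into the next level. */
    S aggregate(List<S> parts);
}

// Minimal example implementation: a NULLs counter.
class NullsCollector implements PluggableStatisticsCollector<Long> {
    private long nulls;

    @Override public void collect(Object value) {
        if (value == null)
            nulls++;
    }

    @Override public Long statistics() {
        return nulls;
    }

    @Override public Long aggregate(List<Long> parts) {
        long sum = 0;
        for (Long p : parts)
            sum += p;
        return sum;
    }
}
```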
Just as an idea: a node could support some API to export/import statistics. It can be useful for tests, benchmarks and other purposes.
Just as an idea: there can be objects whose statistics are more accurate than statistics over the data tables alone. For example, conditional indexes (a normal index with a where clause). That's why we collect "object" rather than "table" statistics.