Discussion thread | |
---|---|
Vote thread | |
ISSUE | |
Release | TBD |
Motivation
Table statistics describe the data distribution characteristics of a table. Common statistics include the number of rows, table size, column statistics and more. They are very important for DBMS, especially when executing query plans and optimizing query performance.
Currently, Paimon's statistical information is mainly of two types: one stored in snapshot, and the other in dataFileMeta in the manifest, which have fewer types of statistics.
This PIP will further expand on the existing statistics of Paimon to support more statistical information.
Proposed Changes
What is stats
The statistics are mainly categorized into two types: global level stats and col level stats.
global level stats:
- mergedRecordCount: The total number of records after merge (similar to count(*) semantics). For example, in a primary key (PK) table, if there were originally 10 records and a new record with an existing primary key is inserted, the mergedRecordCount would still be 10.
- mergedRecordSize: The size of the mergedRecordCount in bytes.
col level stats:
- distinctCount: The number of distinct values.
- min: The minimum value of the column.
- max: The maximum value of the column.
- nullCount: The number of nulls.
- avgLen: Average column length.
- maxLen: Maximum column length.
Storage
snapshot
Add a new field `statistics` in snapshot to represent the stats file name which store the stats:
{ ... "statistics" : "stats-87effd5d-48fd-4aab-81fe-4222b847d247-0" }
statistics
dir
Store statistics in a stats file in a separate statistics directory:
table_path
└── statistics
└── stats-87effd5d-48fd-4aab-81fe-4222b847d247-0
A stats file like (in json):
{ "snapshotId": "10", "mergedRecordCount" : 10, "mergedRecordSize" : 1000, "colStats" : { "orderId" : { "distinctCount" : 10, "min" : "1", "max" : "10", "nullCount" : 0, "avgLen" : 8, "maxLen" : 8 } } }
Note:
- SnapshotId indicates which snapshot the statistics were gathered from. It can be used for information display, as well as for other possible strategies, such as "not using the statistics which are n times out of the latest snapshot".
- the next snapshot will automatically inherit the stats from the previous snapshot (even outdated statistics can be useful when performing query optimization), but it will not inherit them when the schema changes.
How to collect
spark
Stats can be collected through the following spark-sql statement (introduce a new commitKind:ANALYZE)
ANALYZE TABLE T COMPUTE STATISTICS
: collect table level statsANALYZE TABLE T COMPUTE STATISTICS FOR COLUMNS
: collect table level stats + col level stats
To differentiate from existing commitKind, propose a new commitKind: ANALYZE
How to use
query with stats
spark
By implementing the SupportsReportStatistics
interface, Spark will utilize the stats provided by table for CBO (Cost-Based Optimization) query optimization automatically.
Currently, the interface only provides numRows and totalSize. After this PIP, we can expand it to further provide column stats. And the priority of stats is as follows:
Statistics in snapshot (global stats + col stats) with some prunning strategies > stats calculated in real-time from splits (only including numRows and totalSize). The reason is that Spark's current CBO mainly relies on distinctCount.
system table
The saved stats can be obtained by querying the system table.
Expire
The lifecycle of stats is managed alongside snapshots, stats can be cleaned up when no snapshot references them. (Similar to the existing expire statement)
New Class and Interfaces
public class Stats { private final long snapshotId; private final @Nullable Long mergedRecordCount; private final @Nullable Long mergedRecordSize; private final @Nullable Map<String, ColStats> colStats; }
public class ColStats { private final @Nullable Long distinctCount; private @Nullable String serializedMin; private @Nullable Object min; private @Nullable String serializedMax; private @Nullable Object max; private final @Nullable Long nullCount; private final @Nullable Long avgLen; private final @Nullable Long maxLen; }
public class StatsFileHandler { /* Write stats to a stats file, return the written filename. */ public String writeStats(Stats stats) /* Read stats of the specified snapshot. */ public Optional<Stats> readStats(long snapshotId) /* Delete stats of the specified snapshot. */ public void deleteStats(long snapshotId) }
public interface FileStoreCommit { ... /** * Commit new statistics. The {@link Snapshot.CommitKind} of generated snapshot is {@link * Snapshot.CommitKind#ANALYZE}. */ void commitStatistics(Stats stats, long commitIdentifier); ... }
Compatibility, Deprecation, and Migration Plan
No
Test Plan
UT and IT
Future Work
- Support Flink analyze
- More query optimizations based on statistics: Paimon itself, engine integration...
- Partition statistics
Rejected Alternatives
Store statistics directly in snapshot
Considering the scalability and avoid bloating the snapshot's size, we separate out a directory specifically for storing statistics.
Name statistics as stats-snapshotId
and not store statistics link in snapshot
Store the statistics link in snapshot to allow for faster access of statistics and to better to implement lifecycle management of statistics.