Discussion thread
Vote thread
ISSUE
Release	TBD

Motivation

Table statistics describe the data distribution characteristics of a table. Common statistics include the number of rows, table size, column statistics and more. They are very important for DBMS, especially when executing query plans and optimizing query performance.

Currently, Paimon's statistical information is mainly of two types: one stored in snapshot, and the other in dataFileMeta in the manifest, which have fewer types of statistics.

This PIP will further expand on the existing statistics of Paimon to support more statistical information.

Proposed Changes

What is stats

The statistics are mainly categorized into two types: global level stats and col level stats.

global level stats:

mergedRecordCount: The total number of records after merge (similar to count(*) semantics). For example, in a primary key (PK) table, if there were originally 10 records and a new record with an existing primary key is inserted, the mergedRecordCount would still be 10.
mergedRecordSize: The size of the mergedRecordCount in bytes.

col level stats:

distinctCount: The number of distinct values.
min: The minimum value of the column.
max: The maximum value of the column.
nullCount: The number of nulls.
avgLen: Average column length.
maxLen: Maximum column length.

Storage

snapshot

Add a new field `statistics` in snapshot to represent the stats file name which store the stats:

snapshot-x

{ 
  ... 
  "statistics" : "stats-87effd5d-48fd-4aab-81fe-4222b847d247-0"
}

`statistics` dir

Store statistics in a stats file in a separate statistics directory:

table_path
└── statistics
    └── stats-87effd5d-48fd-4aab-81fe-4222b847d247-0

A stats file like (in json):

stats-87effd5d-48fd-4aab-81fe-4222b847d247-0

{
  "snapshotId": "10",
  "mergedRecordCount" : 10,
  "mergedRecordSize" : 1000,
  "colStats" : {
    "orderId" : {
      "distinctCount" : 10,
      "min" : "1",
      "max" : "10",
      "nullCount" : 0,
      "avgLen" : 8,
      "maxLen" : 8
    }
  }
}

Note:

SnapshotId indicates which snapshot the statistics were gathered from. It can be used for information display, as well as for other possible strategies, such as "not using the statistics which are n times out of the latest snapshot".
the next snapshot will automatically inherit the stats from the previous snapshot (even outdated statistics can be useful when performing query optimization), but it will not inherit them when the schema changes.

How to collect

spark

Stats can be collected through the following spark-sql statement (introduce a new commitKind：ANALYZE)

ANALYZE TABLE T COMPUTE STATISTICS: collect table level stats
ANALYZE TABLE T COMPUTE STATISTICS FOR COLUMNS: collect table level stats + col level stats

To differentiate from existing commitKind, propose a new commitKind: ANALYZE

How to use

query with stats

spark

By implementing the SupportsReportStatistics interface, Spark will utilize the stats provided by table for CBO (Cost-Based Optimization) query optimization automatically.

Currently, the interface only provides numRows and totalSize. After this PIP, we can expand it to further provide column stats. And the priority of stats is as follows:

Statistics in snapshot (global stats + col stats) with some prunning strategies > stats calculated in real-time from splits (only including numRows and totalSize). The reason is that Spark's current CBO mainly relies on distinctCount.

system table

The saved stats can be obtained by querying the system table.

Expire

The lifecycle of stats is managed alongside snapshots, stats can be cleaned up when no snapshot references them. (Similar to the existing expire statement)

New Class and Interfaces

Stats.class

public class Stats {
    private final long snapshotId;
    private final @Nullable Long mergedRecordCount;
    private final @Nullable Long mergedRecordSize;
    private final @Nullable Map<String, ColStats> colStats;
}

ColStats.class

public class ColStats {
    private final @Nullable Long distinctCount;
    private @Nullable String serializedMin;
    private @Nullable Object min;
    private @Nullable String serializedMax;
    private @Nullable Object max;
    private final @Nullable Long nullCount;
    private final @Nullable Long avgLen;
    private final @Nullable Long maxLen;
}

StatsFileHandler.class

public class StatsFileHandler {
    /* Write stats to a stats file, return the written filename. */
    public String writeStats(Stats stats)

    /* Read stats of the specified snapshot. */
    public Optional<Stats> readStats(long snapshotId) 

    /* Delete stats of the specified snapshot. */
    public void deleteStats(long snapshotId)    
}

FileStoreCommit.class

public interface FileStoreCommit {
    ...
    
    /**
     * Commit new statistics. The {@link Snapshot.CommitKind} of generated snapshot is {@link
     * Snapshot.CommitKind#ANALYZE}.
     */
    void commitStatistics(Stats stats, long commitIdentifier);

    ...
}

Compatibility, Deprecation, and Migration Plan

No

Test Plan

UT and IT

Future Work

Support Flink analyze
More query optimizations based on statistics: Paimon itself, engine integration...
Partition statistics

Rejected Alternatives

Store statistics directly in snapshot

Considering the scalability and avoid bloating the snapshot's size, we separate out a directory specifically for storing statistics.

Name statistics as `stats-snapshotId` and not store statistics link in snapshot

Store the statistics link in snapshot to allow for faster access of statistics and to better to implement lifecycle management of statistics.

Page tree

PIP-14: Paimon statistics

Motivation

Proposed Changes

What is stats

Storage

snapshot

statistics dir

How to collect

spark

How to use

query with stats

spark

system table

Expire

New Class and Interfaces

Compatibility, Deprecation, and Migration Plan

Test Plan

Future Work

Rejected Alternatives

Store statistics directly in snapshot

Name statistics as stats-snapshotId and not store statistics link in snapshot

`statistics` dir

Name statistics as `stats-snapshotId` and not store statistics link in snapshot