Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Number of Rows
  • Number of files
  • Size in Bytes.
    For tables, the same statistics are supported with the addition of the number of partitions of the table.

Column level top K values can also be gathered a long with partition level statistics. See Top K Statistics.

Implementation

The way the statistics are calculated is similar for both newly created and existing tables.
For newly created tables, the job that creates a new table is a MapReduce job. During the creation, every mapper while copying the rows from the source table in the FileSink operator, gathers statistics for the rows it encounters and publishes them into a Database (possibly MySQL). At the end of the MapReduce job, published statistics are aggregated and stored in the MetaStore.
A similar process happens in the case of already existing tables, where a Map-only job is created and every mapper while processing the table in the TableScan operator, gathers statistics for the rows it encounters and the same process continues.

...