Status

Discussion thread: https://lists.apache.org/thread/sgd36d8s8crc822xt57jxvb6m1k6t07o
Vote thread: https://lists.apache.org/thread/o54ljccbjl4c2vpykpszk5yo825gvjmg

Release: 1.16

Please keep the discussion on the mailing list rather than commenting on the wiki (wiki discussions get unwieldy fast).

Motivation

Currently, table statistics can only be fetched from the catalog one partition at a time. Since fetching statistics from a catalog is a very heavy operation, query performance would improve if the statistics of a table could be fetched for all given partitions in one shot.

Based on a manual performance test with 2000 partitions, the cost drops from 10s to 2s, i.e. roughly a 5x speedup.

Public Interfaces

Two methods will be added to the Catalog interface to enable fetching table and column statistics for given partitions in bulk mode.

/**
 * Get a list of statistics of given partitions.
 *
 * @param tablePath path of the table
 * @param partitionSpecs partition specs of partitions that will be used to filter out all other
 *     unrelated statistics, i.e. the statistics fetch will be limited within the given partitions
 * @return list of statistics of given partitions
 * @throws PartitionNotExistException if one partition does not exist
 * @throws CatalogException in case of any runtime exception
 */
default List<CatalogTableStatistics> bulkGetPartitionStatistics(
        ObjectPath tablePath, List<CatalogPartitionSpec> partitionSpecs)
        throws PartitionNotExistException, CatalogException {
    // default implementation calls getPartitionStatistics(...) iteratively for the given CatalogPartitionSpec list.
}


/**
 * Get a list of column statistics for given partitions.
 *
 * @param tablePath path of the table
 * @param partitionSpecs partition specs of partitions that will be used to filter out all other
 *     unrelated statistics, i.e. the statistics fetch will be limited within the given partitions
 * @return list of column statistics for given partitions
 * @throws PartitionNotExistException if one partition does not exist
 * @throws CatalogException in case of any runtime exception
 */
default List<CatalogColumnStatistics> bulkGetPartitionColumnStatistics(
        ObjectPath tablePath, List<CatalogPartitionSpec> partitionSpecs)
        throws PartitionNotExistException, CatalogException {
    // default implementation calls getPartitionColumnStatistics(...) iteratively for the given CatalogPartitionSpec list.
}
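
The default behaviour described in the code comments above could look roughly like the following. This is only a minimal sketch: PartitionSpecStub, TableStatsStub, PartitionMissingStub, CatalogSketch and DefaultBulkDemo are hypothetical stand-ins introduced for illustration (the real signatures use ObjectPath, CatalogPartitionSpec, CatalogTableStatistics and PartitionNotExistException), and the map-backed catalog is an assumption made so the example runs without Flink on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the Flink types; they are NOT the real classes.
final class PartitionSpecStub {
    final String spec; // e.g. "dt=1"
    PartitionSpecStub(String spec) { this.spec = spec; }
}

final class TableStatsStub {
    final long rowCount;
    TableStatsStub(long rowCount) { this.rowCount = rowCount; }
}

final class PartitionMissingStub extends Exception {
    PartitionMissingStub(String spec) { super("Partition does not exist: " + spec); }
}

interface CatalogSketch {
    // Existing single-partition lookup: one catalog call per partition.
    TableStatsStub getPartitionStatistics(String tablePath, PartitionSpecStub spec)
            throws PartitionMissingStub;

    // Default bulk method: falls back to the single-partition method, one call per spec.
    // A missing partition aborts the whole call with the exception of the first failure.
    default List<TableStatsStub> bulkGetPartitionStatistics(
            String tablePath, List<PartitionSpecStub> specs) throws PartitionMissingStub {
        List<TableStatsStub> result = new ArrayList<>(specs.size());
        for (PartitionSpecStub spec : specs) {
            result.add(getPartitionStatistics(tablePath, spec));
        }
        return result;
    }
}

final class DefaultBulkDemo {
    // Builds a catalog backed by an in-memory map (partition spec -> row count).
    static CatalogSketch catalogOf(Map<String, Long> rowCounts) {
        return (path, spec) -> {
            Long rows = rowCounts.get(spec.spec);
            if (rows == null) {
                throw new PartitionMissingStub(spec.spec);
            }
            return new TableStatsStub(rows);
        };
    }

    public static void main(String[] args) throws PartitionMissingStub {
        Map<String, Long> data = new HashMap<>();
        data.put("dt=1", 10L);
        data.put("dt=2", 20L);
        List<TableStatsStub> stats = catalogOf(data).bulkGetPartitionStatistics(
                "db.tbl",
                Arrays.asList(new PartitionSpecStub("dt=1"), new PartitionSpecStub("dt=2")));
        for (TableStatsStub s : stats) {
            System.out.println(s.rowCount);
        }
    }
}
```

Because the fallback simply loops, the returned list has one entry per requested spec, in request order. A catalog that overrides the method for a real bulk fetch would presumably want to preserve that property.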


The method names have been kept as consistent as possible with the existing Catalog API, based on the following considerations:

For the first method, List<CatalogTableStatistics> bulkGetPartitionStatistics(...):

  • It follows the naming rule used by the existing method CatalogTableStatistics getPartitionStatistics(...).
  • Overloading getPartitionStatistics(...) has been rejected, because that name semantically implies a single returned instance, not a collection of instances.
  • bulkGetPartitionStatistics(...) has been chosen over listPartitionStatistics(...), because, compared to databases and partitions, which are static and can be listed, statistics are more dynamic and need more computation to produce, so "get" fits better semantically than "list". The "bulk" prefix hints that the method works in bulk mode and returns a collection of instances.
  • Returning one single accumulated CatalogTableStatistics has been rejected, because we want to provide statistics at a useful granularity, so that diverse statistical analyses can be performed without modifying the fetch logic.

For the second method, List<CatalogColumnStatistics> bulkGetPartitionColumnStatistics(...):

  • It follows the naming rule used by the existing method CatalogColumnStatistics getPartitionColumnStatistics(...).
  • Overloading getPartitionColumnStatistics(...) has been rejected, because a method with the same name in IMetaStoreClient returns a map keyed by partition name with List<ColumnStatisticsObj> values. In our case the return type carries no partition information, so reusing that name would suggest semantics the method does not have.
  • bulkGetPartitionColumnStatistics(...) has been chosen over listPartitionColumnStatistics(...), for the same reason as mentioned above.
  • Returning Map<String, List<CatalogColumnStatistics>> has been rejected, because only List<CatalogColumnStatistics> is required based on the known requirements. Returning the Map while only the List is used would be over-engineered. From an API design perspective, it is also easier to add information later than to remove it. If the Map is indeed needed in the future, the API can still be upgraded with acceptable effort.
  • Returning one single accumulated CatalogColumnStatistics has been rejected, because we want to provide statistics at a useful granularity, so that diverse statistical analyses can be performed without modifying the fetch logic.
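
To illustrate why a plain List can be enough, a caller that does need a per-partition view could rebuild one on its side. The sketch below assumes that the bulk methods return their results in the same order as the requested partition specs, which this proposal does not state explicitly. StatsByPartition and zip are hypothetical names, not part of the Catalog API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class StatsByPartition {
    // Pairs each partition name with the statistics entry at the same index.
    // ASSUMPTION: stats.get(i) belongs to partitionNames.get(i).
    static <S> Map<String, S> zip(List<String> partitionNames, List<S> stats) {
        if (partitionNames.size() != stats.size()) {
            throw new IllegalArgumentException("expected one statistics entry per partition");
        }
        Map<String, S> byPartition = new LinkedHashMap<>();
        for (int i = 0; i < partitionNames.size(); i++) {
            byPartition.put(partitionNames.get(i), stats.get(i));
        }
        return byPartition;
    }

    public static void main(String[] args) {
        // Row counts stand in for full statistics objects.
        Map<String, Long> rows = zip(Arrays.asList("dt=1", "dt=2"), Arrays.asList(10L, 20L));
        System.out.println(rows);
    }
}
```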

Proposed Changes


  • The newly added methods will be implemented in all classes that implement the Catalog interface.
  • For the concrete HiveCatalog, __HIVE_DEFAULT_PARTITION__ will be taken care of while fetching the column statistics. All currently supported partition column types will be taken into consideration.
  • The logic of getting TableStats in FlinkRecomputeStatisticsProgram will be optimized from calling the catalog iteratively to a bulk fetch.
  • The TableStats conversion logic will be consolidated into the CatalogTableStatisticsConverter class to make the domain design a little cleaner.
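
Once the per-partition statistics have been fetched in bulk, the planner side still has to combine them into table-level statistics. The following is a minimal sketch of that merge step for row counts only. RowCountMerger is a hypothetical helper, not the actual FlinkRecomputeStatisticsProgram logic, and the use of -1 as the "unknown" marker is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.List;

final class RowCountMerger {
    // ASSUMPTION: a negative row count means "statistics not available".
    static final long UNKNOWN = -1L;

    // Sums per-partition row counts. If any partition is unknown, the total is unknown,
    // because a partial sum would understate the table size.
    static long mergeRowCounts(List<Long> partitionRowCounts) {
        long total = 0L;
        for (long rowCount : partitionRowCounts) {
            if (rowCount < 0) {
                return UNKNOWN;
            }
            total += rowCount;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(mergeRowCounts(Arrays.asList(100L, 250L)));
        System.out.println(mergeRowCounts(Arrays.asList(100L, UNKNOWN)));
    }
}
```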

Compatibility, Deprecation, and Migration Plan


This feature introduces two new methods into the Catalog interface, which is also implemented by third parties. To avoid breaking compatibility, both methods will be provided as default interface methods whose default logic calls the corresponding single-partition method iteratively. For details, please refer to the comments in the code snippet.
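
The compatibility argument can be sketched as follows: an implementation written before the bulk method existed keeps compiling and transparently uses the iterative fallback, while an updated implementation can override the method to answer with a single backend request. StatsCatalog, LegacyCatalog and BulkAwareCatalog are hypothetical names, row counts stand in for full statistics objects, and the roundTrips counter only simulates backend calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

interface StatsCatalog {
    long getPartitionRowCount(String partition);

    // Added later as a default method, so existing implementations are not broken.
    default List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
        List<Long> result = new ArrayList<>(partitions.size());
        for (String partition : partitions) {
            result.add(getPartitionRowCount(partition));
        }
        return result;
    }
}

// Written before the bulk method existed; compiles unchanged and uses the fallback.
final class LegacyCatalog implements StatsCatalog {
    int roundTrips = 0; // simulated backend calls

    @Override
    public long getPartitionRowCount(String partition) {
        roundTrips++;
        return partition.length(); // dummy value for illustration
    }
}

// Updated implementation that serves all partitions with one simulated backend call.
final class BulkAwareCatalog implements StatsCatalog {
    int roundTrips = 0;

    @Override
    public long getPartitionRowCount(String partition) {
        roundTrips++;
        return partition.length();
    }

    @Override
    public List<Long> bulkGetPartitionRowCounts(List<String> partitions) {
        roundTrips++; // one request covering every partition
        List<Long> result = new ArrayList<>(partitions.size());
        for (String partition : partitions) {
            result.add((long) partition.length());
        }
        return result;
    }
}

final class CompatibilityDemo {
    public static void main(String[] args) {
        List<String> partitions = Arrays.asList("dt=1", "dt=2", "dt=3");
        LegacyCatalog legacy = new LegacyCatalog();
        BulkAwareCatalog bulk = new BulkAwareCatalog();
        legacy.bulkGetPartitionRowCounts(partitions);
        bulk.bulkGetPartitionRowCounts(partitions);
        System.out.println(legacy.roundTrips + " vs " + bulk.roundTrips);
    }
}
```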

 

Test Plan

  • Unit tests will be added to verify the behavior of the new methods.
  • A performance test will be created.

Rejected Alternatives


None