...

  • This means a new parquet file can be smaller than the previous version of the same file. We saw this happen earlier because of a parquet version incompatibility, and we added guard checks against it, so we may have to add exceptions to those checks.
  • In some cases, we create empty parquet files. In my testing, the smallest file we could create was 400KB, because an empty parquet file still stores metadata, including the schema. These empty parquet files can be reused if the partition grows in subsequent writes. Otherwise, we need a strategy to delete these empty file groups cleanly and reclaim the space.
    • One option to reclaim this space is to extend BaseFileOnlyView to inspect small parquet files and ignore them when listing splits. (I would appreciate any other suggestions here, as I do not have much experience with this code.)
    • Another option is to extend the metadata to mark these file groups as invalid, and to update all operations to read the metadata first and discard the empty files. This can introduce race conditions if it is not done carefully. (More details can be discussed after we finalize an approach.)
  • For MOR tables, this is somewhat awkward.
    • We have to schedule compaction before writing new data.
    • For ‘insert overwrite’, we are forced to write parquet files, as opposed to also writing inserts into log files for performance reasons. Additional work is required to support inserts going into log files.
  • An external index, such as the HBase index, needs to be updated to ‘delete’ keys that are no longer present in the partitions. Otherwise, the index can become inconsistent and may affect record distribution.
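
To make the first reclamation option above more concrete, here is a minimal sketch of ignoring small, empty base files when listing splits. It is not Hudi code: `BaseFile`, `list_splits`, the `row_count` callback, and the 1 MiB cutoff are all hypothetical names and values chosen for illustration. The only figure taken from the discussion is that an empty file was about 400KB in testing, so the cutoff is assumed to sit somewhat above that.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Assumed cutoff: only files at or below this size are inspected.
# It is set above the ~400KB empty-file size reported in testing.
SMALL_FILE_THRESHOLD_BYTES = 1024 * 1024


@dataclass(frozen=True)
class BaseFile:
    """Hypothetical stand-in for a base (parquet) file in a file group."""
    path: str
    size_bytes: int


def list_splits(
    files: Iterable[BaseFile],
    row_count: Callable[[BaseFile], int],
    threshold: int = SMALL_FILE_THRESHOLD_BYTES,
) -> List[BaseFile]:
    """Return the files to read, skipping small files that hold no rows.

    `row_count` is a hypothetical callback (for example, one that reads
    the parquet footer). It is only called for small files, so large
    files never pay the inspection cost.
    """
    splits = []
    for f in files:
        if f.size_bytes <= threshold and row_count(f) == 0:
            continue  # small and empty: ignore when listing splits
        splits.append(f)
    return splits
```

Checking the row count rather than the size alone avoids dropping a small file that does contain records. The race conditions mentioned for the metadata-based option are not addressed by this sketch.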

...