...

A key design decision in Hudi was to avoid creating small files in the first place and always write properly sized files, trading more time on ingest/write for queries that stay efficient. The common alternative of writing many small files and stitching them together later only addresses the system scalability problems small files pose; queries still slow down in the meantime, because the small files are exposed to them anyway.


Hudi can maintain a configured target file size when performing upsert/insert operations. (Note: the bulk_insert operation does not provide this functionality; it is designed as a simpler replacement for a plain `spark.write.parquet`.)
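As a rough sketch of how this plays out with the Spark datasource (the DataFrame `df`, column names, table name, and path below are hypothetical placeholders, not from the original text), the write operation is selected via the `hoodie.datasource.write.operation` option; `upsert` and `insert` go through file sizing, while `bulk_insert` skips it:

```scala
// Sketch only: `df`, the key/precombine columns, table name, and path
// are illustrative placeholders.
df.write.format("hudi").
  // "upsert" and "insert" honor the configured target file size;
  // "bulk_insert" bypasses this sizing logic by design.
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.table.name", "my_table").
  mode("append").
  save("/tmp/hudi/my_table")
```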


For copy-on-write, this is as simple as configuring the maximum size for a base/parquet file and the soft limit below which a file is considered small. Hudi will try to add enough records to a small file at write time to bring it up to the configured maximum. For example, with `compactionSmallFileSize=100MB` and `limitFileSize=120MB`, Hudi will pick all files < 100MB and try to get them up to 120MB.
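As a minimal sketch of these two knobs set through the Spark datasource, assuming the config keys `hoodie.parquet.small.file.limit` and `hoodie.parquet.max.file.size` correspond to the `compactionSmallFileSize` and `limitFileSize` settings named above:

```scala
// Sketch only: the 100MB/120MB values mirror the example in the text;
// `df`, the table name, and the path remain illustrative placeholders.
val smallFileLimit = 100L * 1024 * 1024 // files below this are "small" candidates
val maxFileSize    = 120L * 1024 * 1024 // target/maximum base file size

df.write.format("hudi").
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.parquet.small.file.limit", smallFileLimit.toString).
  option("hoodie.parquet.max.file.size", maxFileSize.toString).
  option("hoodie.table.name", "my_table").
  mode("append").
  save("/tmp/hudi/my_table")
```

With these settings, each upsert routes some incoming records to existing files under 100MB, growing them toward 120MB instead of creating new small files.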

...