...

A key design decision in Hudi was to avoid creating small files and always write properly sized files, spending more time on ingest/writing so that queries stay efficient. The common approach of writing very small files and stitching them together later only addresses the system scalability issues posed by small files; queries still slow down, because the small files are exposed to them in the meantime.

...

HUDI-26 (ASF JIRA) will take this to the next level by also collapsing smaller file groups into larger ones.

How do I use DeltaStreamer or the Spark DataSource API to write to a non-partitioned Hudi dataset?

To write to a non-partitioned Hudi dataset and sync it to a Hive table, set the following configurations:

hoodie.datasource.write.keygenerator.class=org.apache.hudi.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
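
As a rough illustration of how the two settings above might be passed through the Spark DataSource API, here is a minimal PySpark-style sketch. The helper name `non_partitioned_hudi_options` and its parameters are hypothetical. The additional option keys (table name, record key, precombine field, Hive sync enable/table) are assumptions about what a typical write needs, not something this FAQ entry specifies. The helper only builds the options dictionary, so it runs without Spark; the actual write call is shown in comments.

```python
def non_partitioned_hudi_options(table_name, record_key_field, precombine_field):
    """Build writer options for a non-partitioned Hudi dataset with Hive sync.

    Hypothetical helper for illustration; only the two class settings below
    come from this FAQ entry, the remaining keys are assumptions.
    """
    return {
        # Assumed: basic table and key settings a write usually needs.
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.recordkey.field": record_key_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # From the FAQ: key generator that produces no partition path.
        "hoodie.datasource.write.keygenerator.class":
            "org.apache.hudi.NonpartitionedKeyGenerator",
        # Assumed: turn on Hive sync and name the Hive table.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.table": table_name,
        # From the FAQ: extractor that yields no partition values for Hive.
        "hoodie.datasource.hive_sync.partition_extractor_class":
            "org.apache.hudi.hive.NonPartitionedExtractor",
    }


opts = non_partitioned_hudi_options("trips", "uuid", "ts")

# With an existing Spark DataFrame `df` and a base path of your choosing,
# the write would look roughly like this (format name may differ by version):
#
#   df.write.format("org.apache.hudi") \
#       .options(**opts) \
#       .mode("append") \
#       .save(base_path)
```

With DeltaStreamer, the same key/value pairs would instead go into the properties file supplied to the job.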

Contributing to FAQ 

A good and usable FAQ should be community-driven, crowdsourcing questions and thoughts from everyone.

...