...

Generally, the more records are included in each transaction, the more throughput can be achieved.  It's common to commit either after a certain number of records or after a certain time interval, whichever comes first.  The latter ensures that when the event flow rate is variable, transactions don't stay open too long.  There is no practical limit on how much data can be included in a single transaction.  The only concern is the amount of data that will need to be replayed if the transaction fails.

The concept of a TransactionBatch serves to reduce the number of files (and delta directories) created by the HiveStreamingConnection API in the filesystem.  Since all transactions in a given transaction batch write to the same physical file (per bucket), a partition can only be compacted up to the level of the earliest transaction of any batch which contains an open transaction.  Thus TransactionBatches should not be made excessively large.  It makes sense to include a timer to close a TransactionBatch (even if it has unused transactions) after some amount of time.
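The "commit after N records or after T milliseconds, whichever comes first" rule can be sketched as a small helper that the writing loop consults after each record.  This is a minimal sketch only: CommitPolicy, its methods, and the injected clock are hypothetical names introduced for illustration and are not part of the Hive streaming API.

```java
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not part of Hive): decides when the current
 * transaction should be committed, based on a record count and a
 * maximum time the transaction may stay open.
 */
final class CommitPolicy {
    private final int maxRecords;
    private final long maxOpenMillis;
    private final LongSupplier clock;   // injected so the policy can be tested
    private int pending;                // records written since the last reset
    private long openedAt;              // clock reading when the transaction began

    CommitPolicy(int maxRecords, long maxOpenMillis, LongSupplier clock) {
        if (maxRecords <= 0 || maxOpenMillis <= 0) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxRecords = maxRecords;
        this.maxOpenMillis = maxOpenMillis;
        this.clock = clock;
        this.openedAt = clock.getAsLong();
    }

    /** Call after each record is written; returns true when a commit is due. */
    boolean recordWritten() {
        pending++;
        return shouldCommit();
    }

    /** True if there is pending data and either limit has been reached. */
    boolean shouldCommit() {
        if (pending == 0) {
            return false;               // nothing to commit yet
        }
        long openFor = clock.getAsLong() - openedAt;
        return pending >= maxRecords || openFor >= maxOpenMillis;
    }

    /** Call after committing and beginning the next transaction. */
    void reset() {
        pending = 0;
        openedAt = clock.getAsLong();
    }
}
```

In a real writer the clock would typically be System::currentTimeMillis, and a periodic check of shouldCommit() would cover the case where no new records arrive while a transaction is open.  The same pattern could be applied to closing a TransactionBatch after some amount of time.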

The HiveStreamingConnection is highly optimized for write throughput (Delta Streaming Optimizations). As a result, the delta files generated by Hive streaming ingest have many ORC features disabled (dictionary encoding, indexes, compression, etc.) to facilitate high-throughput writes. When the compactor runs, these delta files are rewritten into a read- and storage-optimized ORC format (with dictionary encoding, indexes, and compression enabled). It is therefore recommended to configure the compactor to run more aggressively/frequently (refer to Compactor) so that compacted, optimized ORC files are generated.

Notes about the HiveConf Object

...

  • hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  • hive.support.concurrency = true
  • hive.metastore.execute.setugi = true
  • hive.execution.engine = mr
  • hive.exec.dynamic.partition.mode = nonstrict
  • hive.exec.orc.delta.streaming.optimizations.enabled = true
  • hive.metastore.client.cache.enabled = false

...