
ACID and Transactions in Hive

Table of Contents

What is ACID and why should you use it?

...

Basic Design

HDFS does not support in-place changes to files.  It also does not offer read consistency in the face of writers appending to files being read by a user.  In order to provide these features on top of HDFS we have followed the standard approach used in other data warehousing tools.  Data for the table or partition is stored in a set of base files.  New records, updates, and deletes are stored in delta files.  A new set of delta files is created for each transaction (or in the case of streaming agents such as Flume or Storm, each batch of transactions) that alters a table or partition.  At read time the reader merges the base and delta files, applying any updates and deletes as it reads.
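To make the read-time merge concrete, here is a minimal sketch (not Hive's actual code; all names are hypothetical) of how a reader can apply a sequence of delta files on top of a base, treating each delta as a map from row id to a new value, with None marking a delete:

```python
def merge_base_and_deltas(base, deltas):
    """Apply delta files to base rows, in transaction order.

    base:   dict mapping row_id -> row value
    deltas: list of dicts mapping row_id -> new value, or None for a delete
    """
    merged = dict(base)
    for delta in deltas:                # deltas are ordered by transaction id
        for row_id, value in delta.items():
            if value is None:           # delete
                merged.pop(row_id, None)
            else:                       # insert or update
                merged[row_id] = value
    return merged

base = {1: "a", 2: "b", 3: "c"}
deltas = [
    {2: "B"},           # txn 1: update row 2
    {3: None, 4: "d"},  # txn 2: delete row 3, insert row 4
]
print(merge_base_and_deltas(base, deltas))  # {1: 'a', 2: 'B', 4: 'd'}
```

Because deltas are applied in transaction order, a later transaction's update or delete wins over an earlier one, which mirrors the merge semantics described above.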

Delta File Compaction

Occasionally these changes need to be merged into the base files.  To do this a set of threads has been added to the Hive metastore.  They determine when this compaction needs to be done, execute the compaction, and then clean up afterwards.  There are two types of compactions, minor and major.

  • Minor compaction takes a set of existing delta files and rewrites them to a single delta file per bucket.

...

  • Major compaction takes one or more delta files and the base file for the bucket and rewrites them into a new base file per bucket.

All compactions are done in the background and do not prevent concurrent reads and writes of the data.  After a compaction the system waits until all readers of the old files have finished and then removes the old files.
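The decision between the two compaction types can be sketched as follows.  This is an illustrative simplification, not the initiator's actual logic; it uses the documented defaults of hive.compactor.delta.num.threshold (10) and hive.compactor.delta.pct.threshold (0.1), and the real initiator considers more state than this:

```python
DELTA_NUM_THRESHOLD = 10   # delta dirs per partition before a minor compaction
DELTA_PCT_THRESHOLD = 0.1  # total delta size / base size before a major compaction

def choose_compaction(num_deltas, delta_bytes, base_bytes):
    """Pick a compaction type for a partition, or None if none is needed."""
    if base_bytes > 0 and delta_bytes / base_bytes >= DELTA_PCT_THRESHOLD:
        return "major"      # deltas have grown large relative to the base
    if num_deltas >= DELTA_NUM_THRESHOLD:
        return "minor"      # too many small delta files
    return None

print(choose_compaction(3, 50, 1000))   # deltas are 5% of base -> None
print(choose_compaction(12, 50, 1000))  # too many delta files  -> 'minor'
print(choose_compaction(4, 200, 1000))  # deltas are 20% of base -> 'major'
```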

Base and Delta Directories

Previously all files for a partition (or a table if the table is not partitioned) lived in a single directory.  With these changes, any partitions (or tables) written with an ACID aware writer will have a directory for the base files and a directory for each set of delta files.
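As an illustration, a transactional table's directory might look roughly like the listing below.  The exact directory names depend on the Hive version, and the transaction ids shown here are made up; the point is that each base or delta directory encodes the transactions it covers, with one bucket file per bucket:

```
warehouse/t/base_0000022/bucket_00000          <- base files from the last major compaction
warehouse/t/delta_0000023_0000023/bucket_00000 <- delta files from one transaction
warehouse/t/delta_0000024_0000024/bucket_00000
```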

Lock Manager

A new lock manager has also been added to Hive, the DbLockManager.  This lock manager stores all lock information in the metastore.  In addition, all transactions are stored in the metastore.  This means that transactions and locks are durable in the face of server failure.  To avoid clients dying and leaving transactions or locks dangling, a heartbeat is sent from lock holders and transaction initiators to the metastore on a regular basis.  If a heartbeat is not received in the configured amount of time, the lock or transaction will be aborted.
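The heartbeat rule can be sketched as below.  This is a hypothetical illustration of the timeout behavior, not the DbLockManager's actual code; it uses the documented default of hive.txn.timeout (300 seconds):

```python
TXN_TIMEOUT_S = 300  # default hive.txn.timeout, in seconds

class TrackedTxn:
    """Toy model of a transaction the metastore is tracking."""

    def __init__(self, txn_id, now):
        self.txn_id = txn_id
        self.last_heartbeat = now

    def heartbeat(self, now):
        # The client (lock holder or transaction initiator) checks in.
        self.last_heartbeat = now

    def is_expired(self, now, timeout=TXN_TIMEOUT_S):
        # If no heartbeat arrived within the timeout, abort the transaction.
        return now - self.last_heartbeat > timeout

txn = TrackedTxn(42, now=0)
txn.heartbeat(now=200)          # client checked in at t=200
print(txn.is_expired(now=400))  # False: last heartbeat was only 200s ago
print(txn.is_expired(now=600))  # True: 400s without a heartbeat -> abort
```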

...

hive.txn.manager
  Default: org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager
  Value to turn on transactions: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
  Notes: DummyTxnManager replicates pre-Hive-0.13 behavior and provides no transactions.

hive.txn.timeout
  Default: 300
  Notes: Time after which transactions are declared aborted if the client has not sent a heartbeat, in seconds.

hive.txn.max.open.batch
  Default: 1000
  Notes: Maximum number of transactions that can be fetched in one call to open_txns().*

hive.compactor.initiator.on
  Default: false
  Value to turn on transactions: true (for exactly one instance of the Thrift metastore service)
  Notes: Whether to run the initiator and cleaner threads on this metastore instance.

hive.compactor.worker.threads
  Default: 0
  Value to turn on transactions: > 0 on at least one instance of the Thrift metastore service
  Notes: How many worker threads to run on this metastore instance.**

hive.compactor.worker.timeout
  Default: 86400
  Notes: Time in seconds after which a compaction job will be declared failed and the compaction re-queued.

hive.compactor.check.interval
  Default: 300
  Notes: Time in seconds between checks to see if any partitions need to be compacted.***

hive.compactor.delta.num.threshold
  Default: 10
  Notes: Number of delta directories in a partition that will trigger a minor compaction.

hive.compactor.delta.pct.threshold
  Default: 0.1
  Notes: Percentage size of the deltas relative to the base that will trigger a major compaction. 1 = 100%.

hive.compactor.abortedtxn.threshold
  Default: 1000
  Notes: Number of aborted transactions on a given partition that will trigger a major compaction.

* hive.txn.max.open.batch controls how many transactions streaming agents such as Flume or Storm open simultaneously.  The streaming agent then writes that number of entries into a single file (per Flume agent or Storm bolt).  Increasing this value therefore decreases the number of files created by streaming agents, but it also increases the number of open transactions that Hive has to track, which may negatively affect read performance.
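Putting the parameters above together, a minimal hive-site.xml fragment to turn on transactions might look like the following.  This is a sketch based on the "Value to turn on transactions" column: hive.compactor.initiator.on should be true on exactly one Thrift metastore instance, and hive.compactor.worker.threads greater than 0 on at least one; the worker-thread count of 1 here is just an example:

```xml
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
```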

...