...

| Attribute | Streaming API | Mutation API |
| --- | --- | --- |
| Ingest type | Data arrives continuously. | Ingests are performed periodically and the mutations are applied in a single batch. |
| Transaction scope | Transactions are created for small batches of writes. | The entire set of mutations should be applied within a single transaction. |
| Data availability | Surfaces new data to users frequently and quickly. | Change sets should be applied atomically: either the effect of the delta is visible or it is not. |
| Sensitive to record order | No. Records do not have pre-existing lastTxnIds or bucketIds, and are likely being written into a single partition (today's date, for example). | Yes. All mutated records have existing RecordIdentifiers and must be grouped by (partitionValues, bucketId) and sorted by lastTxnId. These record coordinates initially arrive in an effectively random order. |
| Impact of a write failure | The transaction can be aborted and the producer can choose to resubmit the failed records, as ordering is not important. | Ingest for the affected group (partitionValues + bucketId) must be halted and the failed records resubmitted, to preserve the sequence. |
| User perception of missing data | "The data has not arrived yet" → perceived as latency. | "This data is inconsistent: some records have been updated, but other related records have not." Consider the classic transfer between bank accounts scenario. |
| API end point scope | A given HiveEndPoint instance submits many transactions to a specific bucket, in a specific partition, of a specific table. | A set of MutationCoordinators writes changes to an unknown set of buckets, of an unknown set of partitions, of specific tables (possibly more than one), within a single transaction. |
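
To make the contrast above concrete, here is a minimal sketch of the Streaming API flow. It is illustrative only: the metastore URI, database, table, partition value, column names, and sample records are hypothetical, and error handling is omitted.

```java
import java.util.Arrays;

import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

public class StreamingSketch {
  public static void main(String[] args) throws Exception {
    // One end point targets a specific partition of a specific table.
    HiveEndPoint endPt = new HiveEndPoint("thrift://metastore:9083",
        "db_name", "alerts", Arrays.asList("2016-01-01"));
    StreamingConnection connection = endPt.newConnection(true); // create partition if absent
    DelimitedInputWriter writer =
        new DelimitedInputWriter(new String[] {"id", "msg"}, ",", endPt);

    // Transactions are created for small batches of writes, so new data
    // surfaces to readers frequently.
    TransactionBatch txnBatch = connection.fetchTransactionBatch(10, writer);
    while (txnBatch.remainingTransactions() > 0) {
      txnBatch.beginNextTransaction();
      txnBatch.write("1,hello".getBytes());
      txnBatch.write("2,world".getBytes());
      txnBatch.commit(); // records become visible; ordering was never important
    }
    txnBatch.close();
    connection.close();
  }
}
```

By contrast, a Mutation API client applies a whole change set within one transaction, with coordinators fanning out across partitions and buckets. The structural sketch below assumes a user-supplied MutatorFactory (the component that tells a coordinator how to extract RecordIdentifiers and bucket ids from your records) and input already grouped by (partitionValues, bucketId) and sorted by lastTxnId; createMyMutatorFactory and the record variables are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hive.hcatalog.streaming.mutate.client.AcidTable;
import org.apache.hive.hcatalog.streaming.mutate.client.MutatorClient;
import org.apache.hive.hcatalog.streaming.mutate.client.MutatorClientBuilder;
import org.apache.hive.hcatalog.streaming.mutate.client.Transaction;
import org.apache.hive.hcatalog.streaming.mutate.worker.MutatorCoordinator;
import org.apache.hive.hcatalog.streaming.mutate.worker.MutatorCoordinatorBuilder;
import org.apache.hive.hcatalog.streaming.mutate.worker.MutatorFactory;

public class MutationSketch {
  public static void main(String[] args) throws Exception {
    MutatorClient client = new MutatorClientBuilder()
        .addSinkTable("db_name", "alerts", true) // true: partitions may be created
        .metaStoreUri("thrift://metastore:9083")
        .build();
    client.connect();

    // The entire change set is applied within this single transaction.
    Transaction transaction = client.newTransaction();
    transaction.begin();
    List<AcidTable> tables = client.getTables(); // ACID metadata for the coordinators

    MutatorFactory mutatorFactory = createMyMutatorFactory(); // hypothetical helper
    MutatorCoordinator coordinator = new MutatorCoordinatorBuilder()
        .metaStoreUri("thrift://metastore:9083")
        .table(tables.get(0))
        .mutatorFactory(mutatorFactory)
        .build();

    // Records must already be grouped by (partitionValues, bucketId)
    // and sorted by lastTxnId before reaching the coordinator.
    List<String> partitionValues = Arrays.asList("2016-01-01");
    Object newRecord = null, updatedRecord = null, deletedRecord = null; // placeholders
    coordinator.insert(partitionValues, newRecord);
    coordinator.update(partitionValues, updatedRecord);
    coordinator.delete(partitionValues, deletedRecord);
    coordinator.close();

    transaction.commit(); // the delta becomes visible atomically, or not at all
    client.close();
  }

  // Placeholder: a real implementation wires up the record inspection and
  // bucket id resolution for your concrete record type.
  private static MutatorFactory createMyMutatorFactory() {
    throw new UnsupportedOperationException("table-specific factory goes here");
  }
}
```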

...

A few things are currently required to use streaming. 

  1. Currently, only the ORC storage format is supported, so 'stored as orc' must be specified during table creation.
  2. The Hive table must be bucketed, but not sorted, so something like 'clustered by (colName) into 10 buckets' must be specified during table creation. See Bucketed Tables for a detailed example, and the DDL sketch after this list.
  3. The user of the client streaming process must have the necessary permissions to write to the table or partition, and to create partitions in the table.
  4. The settings required for Hive transactions must be configured for each table (see Hive Transactions – Table Properties) as well as in hive-site.xml for the MetaStore (see Hive Transactions – Configuration):
       • hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
       • hive.support.concurrency = true
       • hive.compactor.initiator.on = true
       • hive.compactor.worker.threads > 0
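
For illustration, here is a minimal DDL sketch of a table that satisfies the storage-format, bucketing, and table-property requirements above; the table, column, and partition names are hypothetical.

```sql
-- Hypothetical table meeting the streaming requirements:
-- ORC storage, bucketed (but not sorted), and marked transactional.
CREATE TABLE alerts (id INT, msg STRING)
PARTITIONED BY (continent STRING, country STRING)
CLUSTERED BY (id) INTO 5 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```

The 'transactional' = 'true' table property is the per-table setting referenced in item 4; the remaining hive.* settings belong in hive-site.xml.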

Note: Hive also supports streaming mutations to unpartitioned tables.

...