...
Attribute | Streaming API | Mutation API |
---|---|---|
Ingest type | Data arrives continuously. | Ingests are performed periodically and the mutations are applied in a single batch. |
Transaction scope | Transactions are created for small batches of writes. | The entire set of mutations should be applied within a single transaction. |
Data availability | Surfaces new data to users frequently and quickly. | Change sets should be applied atomically: either the effect of the delta is visible or it is not. |
Sensitive to record order | No, records do not have pre-existing lastTxnIds or bucketIds. Records are likely being written into a single partition (today's date for example). | Yes, all mutated records have existing RecordIdentifiers and must be grouped by [partitionValues, bucketId] and sorted by lastTxnId. These record coordinates initially arrive in an order that is effectively random. |
Impact of a write failure | Transaction can be aborted and the producer can choose to resubmit failed records, as ordering is not important. | Ingest for the affected group (partitionValues + bucketId) must be halted and the failed records resubmitted to preserve the sequence. |
User perception of missing data | Data has not arrived yet → "latency?" | "This data is inconsistent, some records have been updated, but other related records have not" – consider here the classic transfer between bank accounts scenario. |
API end point scope | A given HiveEndPoint instance submits many transactions to a specific bucket, in a specific partition, of a specific table. | A set of MutationCoordinators writes changes to an unknown set of buckets, in an unknown set of partitions, of specific tables (possibly more than one), within a single transaction. |
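The grouping and ordering requirement in the table above can be sketched as follows. This is a minimal illustration, not part of the Hive API: `MutationRecord` is a hypothetical stand-in for the client's record type, whose coordinates would in practice be taken from each record's existing RecordIdentifier (ROW__ID).

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MutationOrdering {

    // Hypothetical client-side record; in the real API the coordinates
    // (partition, bucket, transaction id) come from the RecordIdentifier.
    static final class MutationRecord {
        final List<String> partitionValues;
        final int bucketId;
        final long lastTxnId;

        MutationRecord(List<String> partitionValues, int bucketId, long lastTxnId) {
            this.partitionValues = partitionValues;
            this.bucketId = bucketId;
            this.lastTxnId = lastTxnId;
        }
    }

    // Group records by [partitionValues, bucketId], then sort each group by
    // lastTxnId, as the Mutation API requires before mutations are applied.
    static Map<List<Object>, List<MutationRecord>> groupAndSort(List<MutationRecord> records) {
        return records.stream()
            .collect(Collectors.groupingBy(
                r -> List.of((Object) r.partitionValues, r.bucketId),
                LinkedHashMap::new,
                Collectors.collectingAndThen(
                    Collectors.toList(),
                    group -> {
                        group.sort(Comparator.comparingLong((MutationRecord r) -> r.lastTxnId));
                        return group;
                    })));
    }

    public static void main(String[] args) {
        List<MutationRecord> records = List.of(
            new MutationRecord(List.of("2016-01-01"), 0, 7),
            new MutationRecord(List.of("2016-01-01"), 0, 3),
            new MutationRecord(List.of("2016-01-02"), 1, 5));
        groupAndSort(records).forEach((key, group) ->
            System.out.println(key + " -> txnIds " + group.stream()
                .map(r -> Long.toString(r.lastTxnId))
                .collect(Collectors.joining(","))));
    }
}
```

Because the record coordinates arrive in an effectively random order, a client typically buffers the full change set, groups and sorts it as above, and only then dispatches the mutations.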
...
A few things are currently required to use streaming.

- Currently, only the ORC storage format is supported, so `stored as orc` must be specified during table creation.
- The Hive table must be bucketed, but not sorted, so something like `clustered by (colName) into 10 buckets` must be specified during table creation. See Bucketed Tables for a detailed example.
- The user of the client streaming process must have the necessary permissions to write to the table or partition and to create partitions in the table.
- Settings required for Hive transactions must be configured for each table (see Hive Transactions – Table Properties) as well as in `hive-site.xml` for the MetaStore (see Hive Transactions – Configuration):
  - `hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager`
  - `hive.support.concurrency = true`
  - `hive.compactor.initiator.on = true`
  - `hive.compactor.worker.threads > 0`
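Putting the requirements above together, a table suitable for these APIs might be declared as follows; the table, column, and partition names here are illustrative only:

```sql
CREATE TABLE customer_updates (
  id BIGINT,
  name STRING,
  updated TIMESTAMP
)
PARTITIONED BY (ingest_date STRING)      -- optional; unpartitioned tables also work
CLUSTERED BY (id) INTO 10 BUCKETS        -- bucketed, but not sorted
STORED AS ORC                            -- only ORC is currently supported
TBLPROPERTIES ('transactional' = 'true');
```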
Note: Hive also supports streaming mutations to unpartitioned tables.
...