...
Attribute | Streaming API | Mutation API |
---|---|---|
Ingest type | Data arrives continuously. | Ingests are performed periodically and the mutations are applied in a single batch. |
Transaction scope | Transactions are created for small batches of writes. | The entire set of mutations should be applied within a single transaction. |
Data availability | Surfaces new data to users frequently and quickly. | Change sets should be applied atomically: either the effect of the delta is visible or it is not. |
Sensitive to record order | No, records do not have pre-existing lastTxnIds or bucketIds. Records are likely being written into a single partition (today's date for example). | Yes, all mutated records have existing RecordIdentifiers and must be grouped by [partitionValues, bucketId] and sorted by lastTxnId. These record coordinates initially arrive in an order that is effectively random. |
Impact of a write failure | Transaction can be aborted and the producer can choose to resubmit failed records, as ordering is not important. | Ingest for the affected group (partitionValues + bucketId) must be halted and the failed records resubmitted to preserve the sequence. |
User perception of missing data | Data has not arrived yet → "latency?" | "This data is inconsistent, some records have been updated, but other related records have not" – consider here the classic transfer between bank accounts scenario. |
API end point scope | A given HiveEndPoint instance submits many transactions to a specific bucket, in a specific partition, of a specific table. | A set of MutationCoordinators writes changes to an unknown set of buckets, in an unknown set of partitions, of specific tables (possibly more than one), within a single transaction. |
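The grouping and ordering requirement in the table above can be sketched as follows. This is a minimal illustration, not part of the Hive API: `MutationRecord` is a hypothetical stand-in for the client's record type, whose coordinates would in practice be taken from each record's existing RecordIdentifier (ROW__ID).

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MutationOrdering {

    // Hypothetical client-side record; in the real API the coordinates
    // (partition, bucket, transaction id) come from the RecordIdentifier.
    static final class MutationRecord {
        final List<String> partitionValues;
        final int bucketId;
        final long lastTxnId;

        MutationRecord(List<String> partitionValues, int bucketId, long lastTxnId) {
            this.partitionValues = partitionValues;
            this.bucketId = bucketId;
            this.lastTxnId = lastTxnId;
        }
    }

    // Group records by [partitionValues, bucketId], then sort each group by
    // lastTxnId, as the Mutation API requires before mutations are applied.
    static Map<List<Object>, List<MutationRecord>> groupAndSort(List<MutationRecord> records) {
        return records.stream()
            .collect(Collectors.groupingBy(
                r -> List.of((Object) r.partitionValues, r.bucketId),
                LinkedHashMap::new,
                Collectors.collectingAndThen(
                    Collectors.toList(),
                    group -> {
                        group.sort(Comparator.comparingLong((MutationRecord r) -> r.lastTxnId));
                        return group;
                    })));
    }

    public static void main(String[] args) {
        List<MutationRecord> records = List.of(
            new MutationRecord(List.of("2016-01-01"), 0, 7),
            new MutationRecord(List.of("2016-01-01"), 0, 3),
            new MutationRecord(List.of("2016-01-02"), 1, 5));
        groupAndSort(records).forEach((key, group) ->
            System.out.println(key + " -> txnIds " + group.stream()
                .map(r -> Long.toString(r.lastTxnId))
                .collect(Collectors.joining(","))));
    }
}
```

Because the record coordinates arrive in an effectively random order, a client typically buffers the full change set, groups and sorts it as above, and only then dispatches the mutations.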
...
A few things are currently required to use streaming.

- Currently, only the ORC storage format is supported, so `stored as orc` must be specified during table creation.
- The Hive table must be bucketed, but not sorted, so something like `clustered by (colName) into 10 buckets` must be specified during table creation. See Bucketed Tables for a detailed example.
- The user of the client streaming process must have the necessary permissions to write to the table or partition and to create partitions in the table.
- Settings required for Hive transactions must be configured for each table (see Hive Transactions – Table Properties) as well as in `hive-site.xml` for the MetaStore (see Hive Transactions – Configuration):
  - `hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager`
  - `hive.support.concurrency = true`
  - `hive.compactor.initiator.on = true`
  - `hive.compactor.worker.threads > 0`
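Putting the requirements above together, a table suitable for these APIs might be declared as follows; the table, column, and partition names here are illustrative only:

```sql
CREATE TABLE customer_updates (
  id BIGINT,
  name STRING,
  updated TIMESTAMP
)
PARTITIONED BY (ingest_date STRING)      -- optional; unpartitioned tables also work
CLUSTERED BY (id) INTO 10 BUCKETS        -- bucketed, but not sorted
STORED AS ORC                            -- only ORC is currently supported
TBLPROPERTIES ('transactional' = 'true');
```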
Note: Hive also supports streaming mutations to unpartitioned tables.
...