Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: removing incorrect references to 'proxyUser'

...

It is very likely that in a setup where data is being streamed continuously the data is added into new partitions periodically. Either the Hive admin can pre-create the necessary partitions or the streaming clients can create them as needed. HiveEndPoint.newConnection() accepts a boolean argument to indicate whether the partition should be auto created. Partition creation being an atomic action, multiple clients can race to create the partition, but only one will succeed, so streaming clients do not have to synchronize when creating a partition. The ‘proxyUser’ argument indicates the user to be impersonated when talking to Hive and HDFS in all subsequent actions performed on the connection. If left null, impersonation will be disabled and all actions are performed as the user of the running process. For impersonation to work, hive-site.xml needs to be configured to allow the user of the streaming client to impersonate the indicated user.

Transactions are implemented slightly differently than traditional database systems. Each transaction has an id and multiple transactions are grouped into a “Transaction Batch”.  After connection, a streaming client first requests a new batch of transactions. In response it receives a set of Transaction Ids that are part of the transaction batch. Subsequently the client proceeds to consume one transaction id at a time by initiating new Transactions. The client will write() one or more records per transaction and either commits or aborts the current transaction before switching to the next one. Each TransactionBatch.write() invocation automatically associates the I/O attempt with the current Txn ID.  The user of the streaming client, or the proxy user if indicated, needs to have write permissions to the partition or table. Kerberos based authentication is required to acquire connections as a specific user. See secure streaming example below.

Concurrency Note: I/O can be performed on multiple TransactionBatches concurrently. However the transactions within a transaction batch must be consumed sequentially.

...