
...

  1. The following settings are required in hive-site.xml to enable ACID support for streaming:
    1. hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
    2. hive.compactor.initiator.on = true
    3. hive.compactor.worker.threads > 0 
  2. “stored as orc” must be specified during table creation. Only ORC storage format is supported currently.
  3. The Hive table must be bucketed, but not sorted. So something like “clustered by (colName) into 10 buckets” must be specified during table creation. The number of buckets is ideally the same as the number of streaming writers.
  4. The user running the client streaming process must have the necessary permissions to write to the table or partition and to create partitions in the table.
  5. When issuing MapReduce queries on streaming tables, the user must set hive.input.format to org.apache.hadoop.hive.ql.io.HiveInputFormat.
  6. (Temporary requirements) When issuing queries on streaming tables, the client needs to set:
    1. hive.vectorized.execution.enabled  to  false     (for Hive version < 0.14.0)
    2. hive.input.format  to  org.apache.hadoop.hive.ql.io.HiveInputFormat
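
The three hive-site.xml settings from item 1 can be sanity-checked before a streaming client starts. The following is a minimal sketch, assuming the settings have already been loaded into a plain Map of strings. The class StreamingAcidPreflight and its check method are hypothetical helpers written for illustration; they are not part of Hive.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical preflight check for the settings in item 1; not part of Hive. */
class StreamingAcidPreflight {
    static final String TXN_MANAGER = "org.apache.hadoop.hive.ql.lockmgr.DbTxnManager";

    /** Returns one message per violated requirement; an empty list means all three hold. */
    static List<String> check(Map<String, String> conf) {
        List<String> problems = new ArrayList<>();
        if (!TXN_MANAGER.equals(conf.get("hive.txn.manager"))) {
            problems.add("hive.txn.manager must be " + TXN_MANAGER);
        }
        if (!"true".equalsIgnoreCase(conf.get("hive.compactor.initiator.on"))) {
            problems.add("hive.compactor.initiator.on must be true");
        }
        int workers;
        try {
            workers = Integer.parseInt(
                    conf.getOrDefault("hive.compactor.worker.threads", "0").trim());
        } catch (NumberFormatException e) {
            workers = 0; // treat an unparsable value as "not configured"
        }
        if (workers <= 0) {
            problems.add("hive.compactor.worker.threads must be > 0");
        }
        return problems;
    }

    public static void main(String[] args) {
        // Example values only; a real client would read them from its configuration.
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("hive.txn.manager", TXN_MANAGER);
        conf.put("hive.compactor.initiator.on", "true");
        conf.put("hive.compactor.worker.threads", "0");
        System.out.println(check(conf));
    }
}
```

Returning a list of messages rather than failing on the first problem lets an operator fix all misconfigured settings in one pass.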

...

TransactionBatch is used to write a series of transactions. For each transaction, the application calls beginNextTransaction, write, and then commit or abort as appropriate. See the Javadoc for details.
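
The call order for a single transaction can be sketched without a Hive installation. The example below is an illustration only: TxnBatchLike is a stand-in interface that mirrors just the method names mentioned above, InMemoryBatch is an in-memory fake, and writeGroup is a hypothetical helper. None of these are the actual Hive classes, whose exact signatures are described in the Javadoc.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TxnBatchDemo {
    /** Writes one group of records as a single transaction; aborts if any write fails. */
    static boolean writeGroup(TxnBatchLike batch, List<String> records) {
        batch.beginNextTransaction();
        try {
            for (String r : records) {
                batch.write(r.getBytes(StandardCharsets.UTF_8));
            }
            batch.commit();
            return true;
        } catch (RuntimeException e) {
            batch.abort(); // discard everything written in this transaction
            return false;
        }
    }

    public static void main(String[] args) {
        InMemoryBatch batch = new InMemoryBatch();
        writeGroup(batch, Arrays.asList("1,Hello streaming", "2,Welcome to streaming"));
        System.out.println(batch.committed);
    }
}

/** Stand-in for the begin/write/commit-or-abort protocol; NOT the Hive TransactionBatch. */
interface TxnBatchLike {
    void beginNextTransaction();
    void write(byte[] record);
    void commit();
    void abort();
}

/** In-memory fake: records become visible only when their transaction commits. */
class InMemoryBatch implements TxnBatchLike {
    final List<String> committed = new ArrayList<>();
    private List<String> pending; // null when no transaction is open

    public void beginNextTransaction() {
        if (pending != null) throw new IllegalStateException("transaction already open");
        pending = new ArrayList<>();
    }

    public void write(byte[] record) {
        if (pending == null) throw new IllegalStateException("no open transaction");
        pending.add(new String(record, StandardCharsets.UTF_8));
    }

    public void commit() {
        if (pending == null) throw new IllegalStateException("no open transaction");
        committed.addAll(pending);
        pending = null;
    }

    public void abort() {
        pending = null; // uncommitted records are dropped
    }
}
```

Wrapping the three calls in one helper keeps the commit-or-abort decision in a single place, so a failed write cannot leave a transaction open.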

Notes about the HiveConf Object

HiveEndPoint.newConnection() accepts a HiveConf argument. This can either be set to null, or a pre-created HiveConf object can be provided. If it is null, a HiveConf object is created internally and used for the connection. When a HiveConf object is instantiated, it is initialized with values from hive-site.xml if the directory containing that file is on the Java classpath. If no hive-site.xml is found, the object is initialized with defaults. Pre-creating this object and reusing it across multiple connections may noticeably improve performance if connections are opened very frequently (for example, several times a second). A secure connection relies on 'hive.metastore.kerberos.principal' being set correctly in the HiveConf object.
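
The pre-create-and-reuse pattern can be sketched as a small holder that builds the configuration once and hands the same instance to every new connection. This is an illustration under stated assumptions: Shared is a hypothetical helper, and a plain Map stands in for the configuration object so that the example runs without Hive on the classpath.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class SharedConfDemo {
    /** Hypothetical holder: builds the value on first use, then returns the same instance. */
    static final class Shared<T> implements Supplier<T> {
        private final Supplier<T> factory;
        private T value;

        Shared(Supplier<T> factory) {
            this.factory = factory;
        }

        public synchronized T get() {
            if (value == null) {
                value = factory.get();
            }
            return value;
        }
    }

    public static void main(String[] args) {
        AtomicInteger built = new AtomicInteger();
        // The Map is a stand-in for a configuration object that is costly to construct.
        Shared<Map<String, String>> conf = new Shared<>(() -> {
            built.incrementAndGet();
            return new HashMap<>();
        });
        for (int i = 0; i < 1000; i++) {
            Map<String, String> c = conf.get(); // would be passed to each new connection
        }
        System.out.println("configuration objects built: " + built.get());
    }
}
```

With a holder like this, the cost of locating and parsing the configuration is paid once rather than on every connection.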

...