...

  1. Streaming ingest of data.  Many users have tools such as Apache Flume, Apache Storm, or Apache Kafka that they use to stream data into their Hadoop cluster.  While these tools can write data at rates of hundreds or more rows per second, Hive can only add partitions every fifteen minutes to an hour.  Adding partitions more often leads quickly to an overwhelming number of partitions in the table.  These tools could stream data into existing partitions, but this would cause readers to get dirty reads (that is, they would see data written after they had started their queries) and leave many small files in their directories that would put pressure on the NameNode.  With this new functionality this use case will be supported while allowing readers to get a consistent view of the data and avoiding too many files.
  2. Slowly changing dimensions.  In a typical star schema data warehouse, dimension tables change slowly over time.  For example, a retailer will open new stores, which need to be added to the stores table, or an existing store may change its square footage or some other tracked characteristic.  These changes lead to inserts of individual records or updates of records (depending on the strategy chosen).  Starting with Hive 0.14, these changes can be made via INSERT ... VALUES and UPDATE.
  3. Data restatement.  Sometimes collected data is found to be incorrect and needs correction.  Or the first instance of the data may be an approximation (90% of servers reporting) with the full data provided later.  Or business rules may require that certain transactions be restated due to subsequent transactions (e.g., after making a purchase a customer may purchase a membership and thus be entitled to discount prices, including on the previous purchase).  Or a user may be contractually required to remove their customer’s data upon termination of their relationship.  Starting with Hive 0.14, these use cases are supported via INSERT, UPDATE, and DELETE.

Limitations

  • In Hive 0.13, INSERT ... VALUES, UPDATE, and DELETE are not yet supported (they are added starting in Hive 0.14; see Grammar Changes below).  BEGIN, COMMIT, and ROLLBACK are not yet supported either; all language operations are auto-commit.  The plan is to support these in a future release.
  • Only the ORC file format is supported in this first release.  The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, any format that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
  • Use of the streaming ingest interface (see below) does not yet work with the existing INSERT INTO capability of Hive.  Once a table uses the streaming ingest interface, any data added to it via INSERT INTO will be lost.  INSERT OVERWRITE still works and can be used with streaming ingest.  Note that INSERT OVERWRITE will remove data inserted via streaming ingest just as it removes data written to the partition by other means.
  • By default, transactions are off in Hive 0.13.  See the Configuration section below for the values that must be set to enable them.
  • Tables must be bucketed to make use of these features (see the sketch after this list).  Tables in the same system not using transactions and ACID do not need to be bucketed.
  • At this time only snapshot-level isolation is supported.  When a given query starts it will be provided with a consistent snapshot of the data.  There is no support for dirty read, read committed, repeatable read, or serializable.  With the introduction of BEGIN the intention is to support snapshot isolation for the duration of a transaction rather than just a single query.  Other isolation levels may be added depending on user requests.
  • The existing ZooKeeper and in-memory lock managers are not compatible with transactions.  There is no intention to address this issue.  See Basic Design below for a discussion of how locks are stored for transactions.
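
For illustration, here is a minimal sketch of a table that satisfies the bucketing and ORC requirements above.  The table name, columns, and bucket count are hypothetical, and the 'transactional' table property is an assumption based on common Hive ACID setup rather than something shown in this section:

  -- Bucketed, ORC-backed table suitable for ACID operations
  -- (hypothetical schema; the bucket count is arbitrary).
  CREATE TABLE stores (
    store_id INT,
    city     STRING,
    sqft     INT
  )
  CLUSTERED BY (store_id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');  -- assumed property; verify for your release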

...

See the Streaming Data Ingest document for details on using streaming ingest.

Grammar Changes

INSERT...VALUES, UPDATE, and DELETE have been added to the SQL grammar, starting in Hive 0.14.  See LanguageManual DML for details.
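
As a rough sketch of the new statements (the stores table and its columns are hypothetical, not taken from this document):

  -- Add a single row, e.g. a newly opened store.
  INSERT INTO TABLE stores VALUES (101, 'Portland', 42000);

  -- Correct a tracked attribute of an existing row.
  UPDATE stores SET sqft = 45000 WHERE store_id = 101;

  -- Remove rows that must be restated or purged.
  DELETE FROM stores WHERE store_id = 101;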

Several new commands have been added to Hive's DDL in support of ACID and transactions, plus some existing DDL has been modified.  

...

***Decreasing this value will reduce the time it takes for compaction to be started for a table or partition that requires compaction.  However, checking if compaction is needed requires several calls to the NameNode for each table or partition that has had a transaction done on it since the last major compaction.  So decreasing this value will increase the load on the NameNode.

Configuration Values to Set for INSERT, UPDATE, DELETE

In addition to the new values listed above, some existing values need to be set to support INSERT, UPDATE, and DELETE.

Configuration Value                 Must be set to
hive.enforce.bucketing              true
hive.exec.dynamic.partition.mode    nonstrict
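
For example, these can be set per session from the Hive CLI (or globally in hive-site.xml):

  -- Enforce that inserts populate buckets correctly.
  SET hive.enforce.bucketing=true;

  -- Allow dynamic partition inserts without a static partition key.
  SET hive.exec.dynamic.partition.mode=nonstrict;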

Configuration Values to Set for Compaction

If the data in your system is not owned by the Hive user (i.e., the user that the Hive metastore runs as), then Hive will need permission to run as the user who owns the data in order to perform compactions.  If you have already set up HiveServer2 to impersonate users, then the only additional work is to ensure that Hive has the right to impersonate users from the host running the Hive metastore.  This is done by adding the hostname to hadoop.proxyuser.hive.hosts in Hadoop's core-site.xml file.  If you have not already done this, then you will need to configure Hive to act as a proxy user.  This requires you to set up keytabs for the user running the Hive metastore and add hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups to Hadoop's core-site.xml file.  See the Hadoop documentation on secure mode for your version of Hadoop (e.g., for Hadoop 2.5.1 it is at Hadoop Secure Mode).
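
As a sketch of the core-site.xml entries described above (the hostname and group list are hypothetical; choose values appropriate to your cluster):

  <property>
    <name>hadoop.proxyuser.hive.hosts</name>
    <!-- Host running the Hive metastore (hypothetical hostname) -->
    <value>metastore.example.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hive.groups</name>
    <!-- Groups whose members Hive may impersonate; "*" allows all -->
    <value>*</value>
  </property>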

Table Properties

If a table owner does not wish the system to automatically determine when to compact, then the table property NO_AUTO_COMPACTION can be set.  This will prevent all automatic compactions.  Manual compactions can still be done with Alter Table/Partition Compact statements.
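
A brief sketch of both options (the table name, the partition, and the 'true' property value are illustrative assumptions):

  -- Opt a table out of automatic compaction.
  ALTER TABLE web_logs SET TBLPROPERTIES ('NO_AUTO_COMPACTION'='true');

  -- Request a manual major compaction of a single partition.
  ALTER TABLE web_logs PARTITION (ds='2014-10-01') COMPACT 'major';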

...