...

As discussed previously, if the cache is stale, HMS serves the request from ObjectStore, and the cache then needs to catch up with the latest changes. This is done by the existing notification-log-based cache update mechanism: a thread in HMS constantly polls the notification log and updates the cache with its entries. The interesting entries in the notification log are table/partition writes and the corresponding commit transaction messages. When processing a table/partition write, HMS puts the table/partition entry in the cache. However, the entry is not immediately usable until the commit message of the corresponding write is processed, which marks the writeid of the corresponding table entry as committed in its ValidWriteIdList.

Here is the complete flow of a cache update when a write happens (also illustrated in the diagram):

  1. The ValidWriteIdList of the cached table is initially [11:7,8]
  2. HMS 1 gets an alter_table request and puts an alter_table message into the notification log
  3. The transaction on HMS 1 gets committed. HMS 1 puts a commit message into the notification log along with the writeid [12:7,8]
  4. The cache update thread in HMS 2 reads the alter_table event from the notification log and updates the cache with the new version. However, the entry is not yet available for read, as no writeid is associated with it yet
  5. A read of the entry on HMS 2 fetches from the db, since the entry is not available for read
  6. The cache update thread then reads the commit event from the notification log and marks writeid 12 as committed, so the writeid of the cached table entry changes to [12:7,8]
  7. The next read on HMS 2 is served from the cache
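The steps above can be sketched as a toy in-memory model. All class and method names here are hypothetical (the real HMS uses ValidReaderWriteIdList and related classes); write id lists use the simplified form "12:7,8", meaning high-watermark 12 with writeids 7 and 8 still open or aborted:

```java
import java.util.*;

// Toy model of the cache-update flow: a cached table entry is tagged with a
// ValidWriteIdList and only serves readers it is fresh enough for.
public class CacheUpdateFlow {

    // "12:7,8" -> high-watermark 12, writeids 7 and 8 still open/aborted.
    static final class WriteIds {
        final long hwm;
        final Set<Long> exceptions;
        WriteIds(long hwm, Long... exceptions) {
            this.hwm = hwm;
            this.exceptions = new HashSet<>(Arrays.asList(exceptions));
        }
        // The cache can serve the reader iff every writeid the reader
        // treats as committed is also committed in the cached snapshot.
        boolean covers(WriteIds reader) {
            if (hwm < reader.hwm) return false;
            for (long id = 1; id <= reader.hwm; id++) {
                if (!reader.exceptions.contains(id) && exceptions.contains(id)) return false;
            }
            return true;
        }
    }

    static final class CachedTable {
        String metadata;
        WriteIds tag;        // null until the commit message is processed
        boolean readableFor(WriteIds reader) {
            return tag != null && tag.covers(reader);
        }
    }

    public static void main(String[] args) {
        CachedTable t = new CachedTable();
        t.metadata = "v1";
        t.tag = new WriteIds(11, 7L, 8L);           // step 1: initial tag [11:7,8]

        // steps 2-4: alter_table event processed; new version cached but untagged
        t.metadata = "v2";
        t.tag = null;

        WriteIds reader = new WriteIds(12, 7L, 8L); // reader's snapshot [12:7,8]
        System.out.println(t.readableFor(reader));  // false -> step 5: read from db

        // step 6: commit event processed; entry tagged with [12:7,8]
        t.tag = new WriteIds(12, 7L, 8L);
        System.out.println(t.readableFor(reader));  // true -> step 7: serve from cache
    }
}
```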

...

The use cases discussed so far are driven by a query. However, during HMS startup there is a cache prewarm: HMS fetches everything from the db into the cache. No particular query drives this process, which means we don't have a query's ValidWriteIdList, so prewarm needs to generate a ValidWriteIdList by itself. To do that, for every table, HMS queries the current global transaction state ValidTxnList (HiveTxnManager.getValidTxns) and converts it to a table-specific ValidWriteIdList (HiveTxnManager.getValidWriteIds). As an optimization, we don't have to invoke HiveTxnManager.getValidTxns per table; we can invoke it every couple of minutes. If the ValidTxnList is outdated, we get an outdated ValidWriteIdList, and the next time Hive reads this entry, Metastore will fetch from the db even though the entry is in fact fresh. There is no correctness issue, only a performance impact in some cases. The other possibility is that the entry changes after we fetch the ValidWriteIdList. This is not unlikely, since fetching all partitions of a table may take some time. If that happens, the cached entry is actually newer than its ValidWriteIdList, and the next read triggers a db read even though it is not necessary. Again, there is no correctness issue, only a performance impact in some cases.
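The "invoke it every couple of minutes" optimization might look like the following sketch. The class name and wiring are hypothetical; the Supplier stands in for a call to HiveTxnManager.getValidTxns:

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical sketch of the prewarm optimization: refresh the global
// ValidTxnList at most every few minutes instead of once per table.
public class PrewarmTxnListCache {
    private final Supplier<String> getValidTxns;   // stands in for HiveTxnManager.getValidTxns
    private final long maxAgeNanos;
    private String cachedTxnList;
    private long fetchedAt;

    public PrewarmTxnListCache(Supplier<String> getValidTxns, long maxAgeMinutes) {
        this.getValidTxns = getValidTxns;
        this.maxAgeNanos = TimeUnit.MINUTES.toNanos(maxAgeMinutes);
    }

    // Returns a possibly slightly stale ValidTxnList; as argued above,
    // staleness only costs an extra db read later, never correctness.
    public synchronized String currentTxnList() {
        long now = System.nanoTime();
        if (cachedTxnList == null || now - fetchedAt > maxAgeNanos) {
            cachedTxnList = getValidTxns.get();
            fetchedAt = now;
        }
        return cachedTxnList;
    }
}
```

Prewarm would then call currentTxnList() once per table and convert the result to a table-specific ValidWriteIdList, hitting the transaction manager only on cache expiry.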

Maintain WriteId

HMS will maintain ValidWriteIdList for every cache entry when transactions are committed. The initial ValidWriteIdList is brought in by bootstrap. After that, for every commit message, HMS needs to:

  1. Find the table writeids associated with the txnid. This can be done by a db lookup on the TXN_TO_WRITE_ID table, or by processing ALLOC_WRITE_ID_EVENT messages in the notification log; I will explain this later in detail
  2. Mark the writeid as committed in the ValidWriteIdList associated with the cached tables

As an optimization, we can save the db lookup in #1 by caching the writeids of the tables modified by the transaction. Every modified table generates a corresponding ALLOC_WRITE_ID_EVENT associating the txnid with the generated table writeid. When we receive the commit message of the transaction, we already have the table writeids for that transaction and don't need a db lookup to find the same information. However, for the first commit messages after bootstrap, we might have missed some ALLOC_WRITE_ID_EVENT messages for the transaction. To address this, we only use the optimization if we have also seen the open transaction event; otherwise, HMS still goes to the db to fetch the writeids.
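A minimal sketch of this optimization, with hypothetical names (the real events are notification log entries, not direct method calls):

```java
import java.util.*;

// Hypothetical sketch of the commit-time optimization: remember the
// writeids allocated per transaction so commit processing can skip the
// TXN_TO_WRITE_ID lookup, but only for txns whose open event we saw.
public class TxnWriteIdCache {
    private final Set<Long> openedSeen = new HashSet<>();
    private final Map<Long, Map<String, Long>> txnToTableWriteIds = new HashMap<>();

    public void onOpenTxn(long txnId) { openedSeen.add(txnId); }

    public void onAllocWriteId(long txnId, String table, long writeId) {
        txnToTableWriteIds.computeIfAbsent(txnId, k -> new HashMap<>()).put(table, writeId);
    }

    // Returns the table->writeid map if we are sure it is complete,
    // otherwise empty, signalling the caller to fall back to a db lookup.
    public Optional<Map<String, Long>> onCommit(long txnId) {
        Map<String, Long> ids = txnToTableWriteIds.remove(txnId);
        if (openedSeen.remove(txnId) && ids != null) {
            return Optional.of(ids);
        }
        return Optional.empty();   // bootstrap gap: go to TXN_TO_WRITE_ID
    }
}
```

The key design point is the openedSeen check: having seen the open transaction event guarantees we observed every ALLOC_WRITE_ID_EVENT of that transaction, so the cached map is complete.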

External Table

Write ids are not valid for external tables, and Hive won't fetch a ValidWriteIdList if the query source is an external table. Without a ValidWriteIdList, HMS is not able to check the staleness of the cache. To solve this problem, we use the original eventually consistent model for external tables. That is, if the table is an external table, Hive passes a null ValidWriteIdList to the metastore API/CachedStore, and the metastore cache won't store a ValidWriteIdList alongside the entry. When reading, CachedStore always retrieves the current cached copy. The original notification-based update still updates the metadata of external tables, so we eventually pick up external changes.
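The two read paths can be sketched as below. Everything here is hypothetical and simplified: a null reader list marks the external-table path, and the managed-table staleness check is reduced to an exact string match as a stand-in for the real compatibility check:

```java
import java.util.*;
import java.util.function.Supplier;

// Hypothetical dispatch in a CachedStore-like reader: a null
// ValidWriteIdList marks an external-table read and always serves the
// current cached copy (eventual consistency).
public class ExternalTableRead {
    private final Map<String, String> cache = new HashMap<>();
    private final Map<String, String> cachedWriteIds = new HashMap<>();

    void put(String table, String metadata, String writeIds) {
        cache.put(table, metadata);
        if (writeIds != null) cachedWriteIds.put(table, writeIds);
    }

    // readerWriteIds == null: external table, eventual consistency.
    // Otherwise serve from cache only if the cached snapshot matches
    // (stand-in for the real covers/compatibility check).
    String read(String table, String readerWriteIds, Supplier<String> dbFetch) {
        if (readerWriteIds == null) return cache.get(table);
        if (readerWriteIds.equals(cachedWriteIds.get(table))) return cache.get(table);
        return dbFetch.get();
    }
}
```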

...

When reading, we need to pass a writeid list so HMS can compare it with the cached version. For every commit, we need to tell HMS the writeids of the tables changed during the transaction, so HMS can correctly tag the entries. We therefore need to add ValidWriteIdList to the metastore thrift API and RawStore interfaces, for all table/partition-related read calls and for commit transaction calls.

Hive_metastore.thrift and HiveMetaStoreClient.java

...

The HMS read API will remain backward compatible for external tables; that is, a new server can deal with an old client. If an old client issues a create_table call, the server side receives the create_table request with validWriteIdList=null and will cache or retrieve the entry regardless (with the eventual consistency model). For managed tables, validWriteIdList is required, and the HMS server will throw an exception if validWriteIdList=null.

In the commit_txn request, we need to pass a ValidTxnWriteIdList, which is a list of the writeids of all modified tables. HMS will use it to tag the cache entries.

...

hive_metastore.thrift

Old API

...

New API

...

commitTxn(long txnid)

...

commitTxn(long txnid, string validTxnWriteIdList)

RawStore

ObjectStore will use the additional validWriteIdList field in all read methods to compare with the cached ValidWriteIdList.

...

  1. Generate a new writeid for every write operation involving managed tables. Since DbTxnManager caches the writeid for every transaction, every query generates at most one new writeid for a single table, even if the query consists of multiple Hive.java write API calls
  2. Retrieve the table writeid from the config for every read operation, if it exists (for managed tables, it is guaranteed to be in the config), and pass the writeid to the HMS API
  3. Pass the writeids of all modified tables when committing the transaction
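The "at most one writeid per table per transaction" rule can be sketched as follows. The class is hypothetical; in the real code, DbTxnManager allocates writeids through the metastore rather than a local counter:

```java
import java.util.*;

// Hypothetical sketch of the client-side rule: within one transaction,
// at most one writeid is allocated per table, no matter how many
// Hive.java write API calls touch it.
public class PerTxnWriteIds {
    private final Map<String, Long> tableToWriteId = new HashMap<>();
    private long nextWriteId;      // stand-in for the db-backed allocator

    public PerTxnWriteIds(long firstWriteId) { this.nextWriteId = firstWriteId; }

    // Called by every write API call; allocates only on first use.
    public long getOrAllocate(String table) {
        return tableToWriteId.computeIfAbsent(table, t -> nextWriteId++);
    }

    // Passed along with commit_txn so HMS can tag the cached entries.
    public Map<String, Long> writeIdsForCommit() {
        return Collections.unmodifiableMap(tableToWriteId);
    }
}
```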

Changes in Other Components

...

Code Block (java)

AcidUtils.advanceWriteId(conf, tbl);

...