HCatalog 0.2 Architecture

Introduction

This document captures the architectural roadmap for the 0.2 (second) release of HCatalog.

Architecture

Deployment

The deployment change between versions 0.1 and 0.2 will be the addition of the ActiveMQ server.

The initial proposal is to place the ActiveMQ servers on the same machines as the Thrift servers. This avoids adding more hardware for messaging. However, we need to test that the Thrift server and ActiveMQ server can co-exist without overtaxing the machines. If that proves not to be the case, the ActiveMQ servers can be placed on separate hardware.

In version 0.1 there is only one Thrift server. In version 0.2 there is a proposal to use two for high availability (see High Availability for Thrift Server below). In highly available mode ActiveMQ uses two servers, one master and one slave.

Interfaces

HCatalog's interfaces to MapReduce, Pig, and the storage layer are not changing from 0.1 to 0.2.

One new interface is being added in 0.2: notification. This will enable other grid components, such as Oozie, to learn when new data is available in HCatalog. Initially Oozie is the only known user of this feature. In order to integrate as easily as possible with the different messaging services used in different installations, HCatalog will use a JMS-compliant messaging service.

HCatalog will publish its messages using a publish-subscribe model. Further investigation is needed to determine the mapping of tables to topics. One proposal is to have a topic for each table. Since some JMS implementations create a socket and a thread per topic that a listener subscribes to, this may overload listeners that are interested in many tables. However, we are concerned that having one topic for all HCatalog messages would force listeners to filter out the 99% of messages they are not interested in. If there is no way to pool sockets and threads between topics, we may instead need to have a topic per database. In the end, HCatalog is not concerned with how topics are laid out, as long as there is a simple way for it and its users to determine the proper topic to use for a given table.
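
As a rough illustration, a consumer such as Oozie would subscribe to a table's topic through standard JMS calls. The sketch below assumes a per-table topic named hcat.<database>.<table> and a particular ActiveMQ broker URL; neither naming is fixed by this design.

    import javax.jms.*;
    import org.apache.activemq.ActiveMQConnectionFactory;

    // Minimal sketch of a notification consumer. The broker URL and the per-table topic
    // name ("hcat.<database>.<table>") are assumptions, not part of this design.
    public class HCatNotificationConsumer {
        public static void main(String[] args) throws JMSException {
            ConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://hcat-broker.example.com:61616");
            Connection connection = factory.createConnection();
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Topic topic = session.createTopic("hcat.mydb.mytable"); // hypothetical topic name
            MessageConsumer consumer = session.createConsumer(topic);
            consumer.setMessageListener(new MessageListener() {
                public void onMessage(Message message) {
                    // react to "add partition" and "mark set done" messages for mydb.mytable
                }
            });
        }
    }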

Two types of messages will be published on these topics: one for each new partition that is created, and one that a writer can use to signal when it is done creating a set of partitions, referred to as the "mark set done" message. At some future point we may wish to publish other table events (dropping partitions, create table, drop table, etc.). However, currently no use cases have been identified for these messages.

The message announcing a new partition will be sent automatically by HCatalog with no intervention from the user. It will be published on the topic for that table. The message will contain the values of the partition keys that this partition instantiates.

Many data consumers do not want to begin processing until a set of partitions is complete. In HCatalog there is no notion of when a set of partitions is complete. To address this customer need, a second message type, "mark set done", is being added. Since HCatalog does not know when a set is done, the data producer will explicitly inform HCatalog that a set of partitions is done. Part of this notification will be a specification of the partitions that are in the set, expressed as a partial set of partition key values. All partitions matching the partial specification will be placed in the set that is being marked done. For example, if a data producer has a table partitioned on datestamp, region, and property and wishes to signal that all data production for a given day is done, it would send a mark set done message with datestamp=20110228. All partitions with a datestamp of 20110228 will then be taken to be done. Note that this can be done multiple times at various levels. Continuing the previous example, the data producer could also signal when all US processing for the day is done by sending a mark set done message with datestamp=20110228, region=us and then later still send the message for the whole day.
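
The matching rule can be summarized in a few lines. The helper below is purely illustrative and not part of any proposed API.

    import java.util.Map;

    // Purely illustrative helper: a partition belongs to the set being marked done when
    // every key/value pair of the partial specification also appears in that partition's
    // full specification.
    final class MarkSetDoneMatcher {
        static boolean inMarkedSet(Map<String, String> partialSpec,
                                   Map<String, String> partitionSpec) {
            return partitionSpec.entrySet().containsAll(partialSpec.entrySet());
        }
        // e.g. {datestamp=20110228} matches {datestamp=20110228, region=us, property=news},
        // while {datestamp=20110228, region=us} does not match a partition with region=eu.
    }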

The message will be sent on the topic for that table. It will contain the partition specification used to determine the set. This will allow data consumers to determine whether this is the message they are waiting for. Partition keys will be sorted lexically in these messages so that data consumers can do simple string comparisons to look for the message they are interested in.
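
For example, a consumer waiting on a particular set could compare the specification carried in the message against the string it expects. The message layout shown here (a text body holding the sorted specification) is an assumption; the design only states that the specification is carried in the message with keys sorted lexically.

    import javax.jms.JMSException;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    // Sketch of a listener waiting for one specific "mark set done" message.
    class MarkSetDoneListener implements MessageListener {
        private final String awaitedSpec = "datestamp=20110228,region=us";

        public void onMessage(Message message) {
            try {
                if (message instanceof TextMessage) {
                    String spec = ((TextMessage) message).getText(); // assumed message layout
                    // Keys are sorted lexically, so a plain string comparison suffices.
                    if (awaitedSpec.equals(spec)) {
                        // the set this consumer is waiting for is now complete
                    }
                }
            } catch (JMSException e) {
                // log and continue listening
            }
        }
    }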

HCatStorer and HCatOutputFormat will be modified to allow users to send a mark set done message when all partitions written by that write operation have succeeded. In HCatStorer this will be done via an additional constructor argument. In HCatOutputFormat it can be done via a mutator function sendMarkSetDone().
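
The exact signature has not been settled; sendMarkSetDone() is named in this design but nothing more. The fragment below assumes a static mutator on HCatOutputFormat, mirroring setOutput(), which is only one possible shape for the API.

    // Hedged sketch only: the signature shown here is an assumption. The partial
    // specification names the set being marked done once the job's partitions commit.
    HCatOutputFormat.sendMarkSetDone(job, "datestamp=20110228");   // hypothetical signature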

Since this can be done multiple times, it is not possible to store the results at a per-partition level in the HCatalog metastore. However, clients need to be able to check whether the messages they need have already been sent before they begin listening on the topic. It will not be possible to use JMS's durable subscriber feature because it requires the subscriber to be created beforehand; this will not work, as new users will come along and want to subscribe to notifications for a table. To address this issue, HCatalog will store "mark set done" messages in a special table for some amount of time, and clients will be responsible for checking that table after subscribing to the topic (the check has to come after subscribing to avoid missing messages sent between the check and the subscription). The amount of time these messages are stored will be configurable per table, since different data is read at different frequencies.
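
Continuing the consumer sketch above, the ordering a client must follow looks roughly like this. The markSetDoneStore helper is hypothetical and stands in for whatever mechanism is used to query the special table of stored messages.

    // 1. Subscribe first, so nothing sent after this point can be missed.
    MessageConsumer consumer = session.createConsumer(topic);
    consumer.setMessageListener(new MarkSetDoneListener());
    // 2. Then check the stored "mark set done" messages for sets completed before we subscribed.
    if (markSetDoneStore.contains("datestamp=20110228,region=us")) {   // hypothetical helper
        // the set was already marked done; proceed immediately
    }
    // 3. Otherwise, wait for the listener to receive the matching message.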

HCatalog will not store "add partition" messages. Subscribers can check for the existence of a partition directly.

In secure Hadoop installations, writing to this message bus will need to be restricted to HCatalog only. Initially all users with access to grid machines will be allowed to read messages on the message bus.

HCatalog will assume that once the publish call returns, the messaging service guarantees delivery of the message.

New Features

Dynamic Partitioning

A common pattern when users create data is to write more than one partition into a table simultaneously. That is, a single invocation of HCatStorer or HCatOutputFormat may spray data across multiple partitions. HCatalog 0.1 only supports writing to a single partition at a time. In version 0.2 support for this use case will be added.

Hive already supports dynamic partitioning, so few or no changes will be required on the metastore side. The work will be focused in HCatOutputFormat, which will have to manage spraying across multiple partitions.
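
As a rough sketch of what the write side could look like: the assumption here is that the 0.2 API keeps a setOutput()-style entry point and treats an absent static partition specification as a request for dynamic partitioning. The names and signature below are illustrative, not final.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hcatalog.mapreduce.HCatOutputFormat;

    // Hedged sketch: with no static partition specification, the partition key values are
    // taken from the records themselves and HCatOutputFormat sprays the output across the
    // matching partition directories. The setOutput() overload shown is an assumption.
    Job job = new Job(conf, "dynamic-partition-write");
    HCatOutputFormat.setOutput(job, "mydb", "mytable", null /* no static partition spec */);
    job.setOutputFormatClass(HCatOutputFormat.class);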

One concern expressed consistently by the HDFS team is that this feature will enable users to create hundreds or even thousands of partitions at a time. Since each partition is a directory with potentially many part files, this can cripple the NameNode very quickly. This risk needs to be mitigated as part of this feature. Two approaches are being explored.

The proposed solution is to write the data in partitions and then make a second pass over it to convert it to har format. This has a performance penalty on writes, since all data must be re-written before being made available to users. This option assumes that the performance penalty of reading from har is acceptable. Initial testing and prototyping have found that har as-is is too slow, but that with minor changes its read time can be brought to within a few seconds of the read time of regular HDFS files.

This automatic rewrite of the data would only be triggered if a certain number of the output files are smaller than one HDFS block; there is little point in combining large HDFS files. The rewrite would be triggered by HCatOutputFormat in the OutputCommitter task.
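
A minimal sketch of the size check the OutputCommitter could perform before deciding to rewrite, assuming a configurable small-file threshold (the path and configuration key shown are examples, not part of this design):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Count output files smaller than one HDFS block; only trigger the har rewrite when
    // the count exceeds a configurable threshold, since combining large files gains little.
    Configuration conf = new Configuration();
    Path partitionPath = new Path("/warehouse/mydb/mytable/datestamp=20110228"); // example path
    FileSystem fs = partitionPath.getFileSystem(conf);
    long blockSize = fs.getDefaultBlockSize();
    int smallFiles = 0;
    for (FileStatus stat : fs.listStatus(partitionPath)) {
        if (stat.getLen() < blockSize) {
            smallFiles++;
        }
    }
    boolean rewriteToHar =
        smallFiles > conf.getInt("hcat.har.small.file.threshold", 100); // hypothetical key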

Import/Export

Work to add import and export tools for HCatalog has been ongoing. Due to testing constraints and issues with getting the work checked into the Hive codebase in a timely manner, this feature was not part of 0.1. However, the development work has already been done. The design for the feature is available on Pig's wiki.

High Availability for Thrift Server

Users have expressed significant concern over the lack of high availability in the HCatalog Thrift server. HCatalog's database can be made highly available using standard procedures for MySQL databases, but currently there is no method for dealing with failure of the Thrift server used by HCatalog. Placing a VIP (virtual IP) in front of multiple HCatalog Thrift servers has been proposed as a way to avoid a single point of failure, and this is being investigated. In non-secure mode, all Thrift operations are stateless and thus will work with a VIP setup. In secure mode, delegation tokens are stored in the Thrift server, so we will need to determine how to handle failover in that case.
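
On the client side this would simply mean pointing the metastore URI at the VIP rather than at an individual server. A minimal sketch follows; the VIP hostname and port are of course hypothetical.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

    // Clients connect to the VIP; the VIP forwards to whichever Thrift server is healthy.
    HiveConf conf = new HiveConf();
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://hcat-vip.example.com:9083");
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);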
