This document describes the second version of Hive Replication. Please refer to the first version of Hive Replication for details on the prior implementation.

Issues with the Current Replication System

Some of the observed issues with the current replication implementation are as follows:

  1. Slowness
  2. Requiring staging dirs with full copies (4xcopy problem)
  3. Unsuitability for load-balancing use-cases
  4. Incompatibility with ACID
  5. Dependency on external tools to do a lot (staging dir mgmt, state info mgmt, etc.)

We will thus first try to understand why each of these occurs and what we can do about them.

Slowness

Why is the first version of Hive Replication slow?

The primary reason for its slowness is that it depends on state transfer rather than delta-replay. This means that the amount of data funneled across to the destination is much larger than it otherwise would be. This is especially a problem with frequent updates and inserts. (Creates cannot be delta-optimized, since the original state is null, and deletes are instantaneous.)

The secondary reason is that the original implementation was designed to ensure "correctness" in terms of resilience. We were planning optimizations that would drastically reduce the number of events processed, but these have not yet been implemented. The optimizations would have worked by processing a window of events at a time and skipping the processing of some of the events when a future event nullified the effect of processing the first event (as in cases where an insert was followed by an insert, or a drop followed a create, etc.). Thus, our current implementation can be seen as a naive implementation where the window size is 1.
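
To make the windowing idea concrete, here is a minimal sketch of such a coalescing pass. The Event type, its fields, and the single rule applied (skip any event on a table that a later DROP in the same window makes moot) are illustrative assumptions for this sketch, not Hive's actual metastore notification classes or the full rule set of the planned optimization.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical event model for illustration only; Hive's real metastore
// notification classes look different.
final class Event {
    enum Type { CREATE_TABLE, INSERT, DROP_TABLE }
    final Type type;
    final String table;
    Event(Type type, String table) { this.type = type; this.table = table; }
}

final class EventWindowCoalescer {
    // Walk a window of events backwards and skip any event on a table that a
    // later DROP_TABLE in the same window makes moot. A window size of 1
    // degenerates to the current behaviour: every event is processed.
    static List<Event> coalesce(List<Event> window) {
        Set<String> droppedLater = new HashSet<>();
        List<Event> kept = new ArrayList<>();
        for (int i = window.size() - 1; i >= 0; i--) {
            Event e = window.get(i);
            if (e.type == Event.Type.DROP_TABLE) {
                droppedLater.add(e.table);
                kept.add(e);               // the drop itself still has to be replayed
            } else if (!droppedLater.contains(e.table)) {
                kept.add(e);               // no later event nullifies this one
            }
            // otherwise: skip it; a later drop makes replaying it pointless
        }
        Collections.reverse(kept);
        return kept;
    }
}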

Requiring Staging Directories with Full Copies (4xcopy Problem)

Again, this problem comes down to our needing to do a state transfer, and using export and import to do it. The first copy is the source table itself, which is then exported to a staging directory. This is the second copy. That staging directory then has to be dist-cp-ed over to the destination cluster, which forms the third copy. Then, upon import, the data is impressed onto the destination table, becoming the fourth copy.

Now, two of these copies, the source table and the destination table, are necessary from the very nomenclature of replication: we want two copies. The chief issue is that two additional copies are required temporarily in staging directories. For clusters without much temporary overflow space, this becomes a major constraint.

...

On the destination side, optimizations are certainly possible, and are in fact done, so that we don't have an extra copy there. On import, the files are actually moved to the destination table; thus, our 4xcopy problem is really a 3xcopy problem. However, we've taken to calling this the 4xcopy problem, since that was the first problem we hit and then solved.

However, what this optimization does mean is that if we fail during import after moving the files, then the redo of that import will require a redo of the export side as well, and thus it is not trivially retriable. This was a conscious decision, as the likelihood of this happening is low in comparison to the other failure points we have. If we wanted to make the import resiliently retriable as well, we would have a 4xcopy problem.

Unsuitable for Load-Balancing Use Cases

By forcing an "export" early, we handle DR use cases, so that even if the source Hive warehouse were compromised, we would not suffer unduly, and could replay our exported commands on the destination and recover. However, in doing so, we treat each table and partition as an independent object, for which the only important consideration is that we save the latest state of each, without consideration of how it got there.

...

Essentially, the primary problem is that the state-transfer approach means that each object is considered independent, and each object can "rubber-band" to the latest version of itself. If all events have been processed, we will be in a stable state identical to the source. Thus, this can work for load balancing for users that have a pronounced "loading period" on their warehouse that is separate from their "reading period", which allows us time in the middle to catch up and process all events. This is also true at a table level. This can work for many traditional data warehousing use cases, but fails for many analytics-like expectations.

We will delve further into this rubber-banding problem in a separate section later, since it is a primary problem we attempt to solve in replv2.

...

Thus, it is ironic that in our implementation of replication, we do not support ACID tables. We've considered what we would need to do to replicate ACID tables, and in most discussions, a popular notion seems to be one of using streaming to send deltas over to the destination, rather than copying over the files and fudging around with the transaction metadata. This, however, will require quite a bit more work, and thus is not something we're planning on addressing in replv2 either. It is likely to be a major push/focus of the next batch of work we put into replication.

Dependency on External Tools To Do a Lot

Our current implementation assumes that we extend how EXPORT and IMPORT work and expose a Notification- and ReplicationTask/Command-based API that an external tool can use to implement replication on top of us. However, this means that the external tools are the ones that have to manage staging directories and, in addition, manage the notion of what state each of our destination tables/dbs is in, and over time there is a possibility of extensive Hive logic bleeding into them. Apache Falcon has a tool called HiveDR, which has implemented these interfaces, and they've expressed a desire that Hive take on some more of the management aspects for a cleaner interface.

To this end, one of the goals of replv2 would be that we manage our own staging directories, and instead of replication tools being the ones that move data over, we step in more proactively to pull the data from the source to the destination.

Support for a Hub-Spoke Model

One more piece of feedback we got was the desire to support a hub-spoke model for replication. While there is nothing in the current design of replication that prevents the deployment of a hub-spoke model, the current implementations by third-party tools on top of Hive Replication did not explicitly support 1:n replication, since they wound up needing to do far too much book-keeping. Now that we take more of the responsibilities of replication onto Hive, we should not introduce design artifacts that make hub-spoke replication harder.

...

State transfer has a few good things going for it, such as being resilient and idempotent, but it introduces the problem of temporary states on the destination that never existed in the source, and this is a big no-no for load-balancing use-cases where the destination db is not simply a cold backup but a db that is actively being used for reads.

Change Management

Let us now consider a base part of a replication workflow. It would need to have the following parts:

...

With this notion of EXPORT creating _files as indirections to the actual files, and IMPORT loading _files to locate the actual files that need copying, we solve the 4xcopy problem.
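
The sketch below illustrates that indirection under stated assumptions: it treats _files as a plain newline-separated list of source file URIs, which is a simplification of whatever format Hive actually writes, and it uses the standard Hadoop FileSystem API. The class and method names are invented for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Illustration of the _files indirection: the export side writes a small
// manifest listing the source data files instead of copying them, and the
// import side reads the manifest and copies the data only then.
public class FilesManifestSketch {

  // Export side: record where the data lives rather than duplicating it.
  public static void writeManifest(FileSystem fs, Path manifest, List<Path> dataFiles)
      throws Exception {
    try (Writer w = new OutputStreamWriter(fs.create(manifest), StandardCharsets.UTF_8)) {
      for (Path p : dataFiles) {
        w.write(p.toUri().toString());
        w.write('\n');
      }
    }
  }

  // Import side: resolve the manifest and copy the referenced files directly
  // into the destination table location, avoiding the extra staging copy.
  public static void applyManifest(FileSystem srcFs, FileSystem dstFs, Path manifest,
                                   Path destTableDir, Configuration conf) throws Exception {
    try (BufferedReader r = new BufferedReader(
             new InputStreamReader(srcFs.open(manifest), StandardCharsets.UTF_8))) {
      String line;
      while ((line = r.readLine()) != null) {
        Path src = new Path(line.trim());
        FileUtil.copy(srcFs, src, dstFs, new Path(destTableDir, src.getName()),
                      false /* deleteSource */, conf);
      }
    }
  }
}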

Solution for Rubber Banding

Here is a possible solution to the rubber-banding problem described earlier:
For each metastore event for which a notification is generated, store the metadata object (e.g. table, partition, etc.), the location of the files associated with the event, and the checksum of each affected file (the reason for storing the checksum is explained shortly). In the case of events which delete files (e.g. drop table/partition), move the deleted files to a configurable location on the file system (let's call it $cmroot for the purpose of this discussion) instead of deleting them.
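
As an illustration of this scheme, here is a rough sketch using the Hadoop FileSystem API. The $cmroot layout (parked files named by their checksum), the class and method names, and the fallback logic are assumptions of the sketch rather than a description of Hive's eventual implementation.

import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;

// Sketch of the change-management idea described above: instead of deleting
// files on drop, park them under $cmroot keyed by checksum so a later replay
// on the destination can still fetch the exact bytes the event referred to.
public class ChangeManagerSketch {

  private final FileSystem fs;
  private final Path cmroot;   // a configurable path, e.g. /user/hive/cmroot

  public ChangeManagerSketch(FileSystem fs, Path cmroot) {
    this.fs = fs;
    this.cmroot = cmroot;
  }

  // Called where a drop would normally delete the file.
  public Path retireInsteadOfDelete(Path file) throws Exception {
    FileChecksum cksum = fs.getFileChecksum(file);
    String key = StringUtils.byteToHexString(cksum.getBytes());
    Path parked = new Path(cmroot, key);
    if (!fs.exists(parked)) {
      fs.rename(file, parked);   // move, don't copy: the data is preserved once
    } else {
      fs.delete(file, false);    // identical content is already parked
    }
    return parked;
  }

  // Called at replay time: prefer the original path, fall back to $cmroot.
  // This is why the checksum is recorded with the event: it tells us whether
  // the file at the original location is still the one the event referred to.
  public Path resolve(Path original, String expectedChecksumHex) throws Exception {
    if (fs.exists(original)) {
      FileChecksum cksum = fs.getFileChecksum(original);
      if (expectedChecksumHex.equals(StringUtils.byteToHexString(cksum.getBytes()))) {
        return original;
      }
    }
    return new Path(cmroot, expectedChecksumHex);
  }
}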

...

This then allows us to generate the appropriate object on the destination at the time the destination needs updating to that state, and not earlier. This, in conjunction with the file-pseudo-snapshotting that we introduce, allows us to replay state on the destination for both metadata and data.

A Need for Bootstrap

One piece of feedback we got was that by offloading too much of the requirements of replication, we push too much "hive knowledge" over to the tools that integrate with us, asking them to essentially bootstrap the destination warehouse to a point where it is capable of receiving incremental updates. Currently, we recommend that users run a manual "EXPORT ... FOR REPLICATION" on all tables involved, set up any dbs needed, IMPORT these dumps as needed, etc., to prepare a destination for replicating into. We need to introduce a mechanism by which we can set up a replication dump at a larger scale than just tables, say at a DB level. For this purpose, the best fit seemed to be a new tool or command, similar to mysqldump.
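
For concreteness, the kind of manual bootstrap currently expected of users might look roughly like the following, driven over JDBC. The host names, database and table names, paths, and the replication id are placeholders; this is a sketch of the flow, not a prescribed procedure.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Rough sketch of the manual bootstrap flow described above, driven over JDBC.
public class ManualBootstrapSketch {
  public static void main(String[] args) throws Exception {
    // On the source: dump each table involved, tagging the dump for replication.
    try (Connection src = DriverManager.getConnection("jdbc:hive2://source-host:10000/default");
         Statement s = src.createStatement()) {
      s.execute("USE salesdb");
      s.execute("EXPORT TABLE orders TO '/staging/repl/orders' FOR REPLICATION('bootstrap-1')");
    }

    // ... the dump directory is then made available to the destination cluster,
    //     for example via dist-cp ...

    // On the destination: set up the db and import the dump.
    try (Connection dst = DriverManager.getConnection("jdbc:hive2://dest-host:10000/default");
         Statement s = dst.createStatement()) {
      s.execute("CREATE DATABASE IF NOT EXISTS salesdb");
      s.execute("USE salesdb");
      s.execute("IMPORT TABLE orders FROM '/staging/repl/orders'");
    }
  }
}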

...

The second part is a little more involved, and needs to do some consolidation during dump generation. We will discuss this in short order, after a brief detour through the new commands we introduce to manage the replication dump and reload.

New Commands

The current implementation of replication is built upon the existing EXPORT and IMPORT commands. These commands are semantically more suited to the task of exporting and importing than to a direct notion of an applicable event log. The notion of a lazy _files behaviour on EXPORT is not a good fit, since EXPORTs are done with the understanding that they need to be a stable copy irrespective of cleanup policies on the source. In addition, EXPORTing "events" is more tenuous: EXPORTing a CREATE event is easy enough, but it is a semantic stretch to export a DROP event. Thus, to fit our needs better, and to avoid making the existing EXPORT and IMPORT far more complex, we introduce a new REPL command with three modes of operation: REPL DUMP, REPL LOAD and REPL STATUS.

...

This will return the same output that REPL LOAD returns, allowing REPL LOAD to be run asynchronously. If no knowledge of a replication associated with that db / db.tbl is present, i.e., there are no known replications for it, we return an empty set. Note that for cases where a destination db or table exists but no known repl exists for it, this should be considered an error condition for tools calling REPL LOAD to pass on to the end user, to alert them that they may be overwriting an existing db/table.
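
As an illustration of how an external tool might drive these commands, including running REPL LOAD and then checking REPL STATUS, here is a small JDBC sketch. The host names, the assumption that the dump directory is reachable from the destination, and the assumption that the first result column of REPL DUMP is the dump location and that of REPL STATUS is the last replicated state are illustrative, not a specification of the output format.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch of a tool driving the new REPL commands over JDBC.
public class ReplDriverSketch {
  public static void main(String[] args) throws Exception {
    String db = "salesdb";

    // 1. On the source cluster: dump the database and note where the dump landed.
    String dumpDir;
    try (Connection src = DriverManager.getConnection("jdbc:hive2://source-host:10000/default");
         Statement s = src.createStatement();
         ResultSet rs = s.executeQuery("REPL DUMP " + db)) {
      rs.next();
      dumpDir = rs.getString(1);   // assumption: first column is the dump location
    }

    // 2. On the destination cluster: load the dump, then check REPL STATUS.
    //    An empty result means no replication is known for this db; if the db
    //    nevertheless already exists, a tool should surface an error to the
    //    end user rather than silently overwrite it.
    try (Connection dst = DriverManager.getConnection("jdbc:hive2://dest-host:10000/default");
         Statement s = dst.createStatement()) {
      s.execute("REPL LOAD " + db + " FROM '" + dumpDir + "'");
      try (ResultSet status = s.executeQuery("REPL STATUS " + db)) {
        if (status.next()) {
          System.out.println("Destination is at replication state " + status.getString(1));
        } else {
          System.out.println("No known replication state for " + db);
        }
      }
    }
  }
}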

Bootstrap, Revisited


When we introduced the notion of a need for bootstrap, we said that the problem of time passing during the bootstrap was something that needed solving separately.

...