
...

However, the idea behind the snapshot is still what we really want. If HDFS cannot support the number of snapshots we would create, it is possible for us to do a pseudo-snapshot: for every file that backs a hive object, if we detect that a hive operation would move or modify it, we retain the original in a separate directory, similar to how we manage Trash. This pseudo-trash-like capture behaviour is what we refer to as the "change-management" piece, and it is the main piece that needs to be in place to solve the rubberbanding problem as well as the 4x copy problem.
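As a purely illustrative sketch of that behaviour (the table, partition and paths below are hypothetical, not a prescribed layout), consider an operation that would normally remove files:

ALTER TABLE default.sales DROP PARTITION (dt='2016-01-01');
-- Instead of purging the partition's files, change management would retain the originals,
-- e.g. by moving /user/hive/warehouse/default.db/sales/dt=2016-01-01/000000_0 to a
-- location under the change-management root (much like a Trash move), so that a
-- pending or in-flight replication dump can still pull the original file contents.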

...

The intent of generating metadata for an event on the source is to store enough metadata in the notification object of that event to be able to replicate the exact event on the destination side. As an example, consider the following SQL issued at the source:

CREATE TABLE IF NOT EXISTS default.person (name string, age int);

In this case, a corresponding "CREATE TABLE" metastore event (with a unique id - assume 100 in this case) will be generated, and a new notification for it will be stored in the metastore DB. We plan to store the entire table object (corresponding to the newly created default.person) in the metastore DB as part of the notification event.
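Conceptually, the stored notification would then carry something like the following (the field names and layout here are purely illustrative, not the actual storage format):

event id   : 100
event type : CREATE_TABLE
database   : default
table      : person
message    : the serialized table object for default.person (columns, storage descriptor, location, table parameters, ...)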

Subsequent to the initial bootstrap dump/load, consider the following REPL DUMP command issued at the source:

REPL DUMP default.person FROM 100;

The dump command will read the table metadata for the event 100 and generate an appropriate _metadata file which can be replayed at the destination to reach the identical state.
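For illustration, the resulting dump directory might be laid out roughly as follows (this is only a sketch - the actual layout and naming are described with the replication commands later in this document):

<dump-root>/
    100/
        _metadata    (the default.person table object, taken from event 100's notification)

The destination would then apply this dump with the REPL LOAD command covered later in the document.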

_files

Currently, when we do an EXPORT of a table, the directory structure created in this dump has, at its root, a _metadata file that contains all the metadata state to be impressed, and then has directory structures for each partition to be impressed.

To populate each of the partition directories, it runs a CopyTask that copies the files of each of the partitions over. Now, to make sure that we do not do secondary copies, our design is very simple - instead of a CopyTask, we use a ReplCopyTask, which, instead of copying the files to the destination directory, creates a file called _files in the destination directory with a list of the filenames of the original files.
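As a purely illustrative example (the actual on-disk format may differ, and may carry additional information such as checksums), the _files written for one partition might contain:

hdfs://source-nn:8020/user/hive/warehouse/db1.db/tbl/p=1/000000_0
hdfs://source-nn:8020/user/hive/warehouse/db1.db/tbl/p=1/000001_0

The destination then pulls the actual file contents from these locations at load time, rather than the source copying them a second time into the dump.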

...

When loading the dump on the destination side (at a much later point), when the event 100 is replayed, the load task on the destination will try to pull the files from the <location> (the _files contains the path of <location>), which may contain new or different data. To replicate the exact state of the source at the time event 100 occurred at the source, we do the following (a hypothetical SQL sequence that could produce the events described below is sketched after the list):

  1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s).
  2. When Event 110 occurs at the source, we move the files of the dropped partition to $cmroot/database/tbl/p=1 instead of purging them.
  3. When Event 120 occurs at the source, in the notification event, we again store the checksum of the file(s) in the newly added partition along with the file path(s).
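For concreteness, a hypothetical sequence of source-side operations that could produce such a chain of events (the table name, paths and event ids are purely illustrative) is:

LOAD DATA INPATH '/tmp/batch1' INTO TABLE db1.tbl PARTITION (p=1);
-- Event 100: files land in p=1; their paths and checksums are recorded in the notification
ALTER TABLE db1.tbl DROP PARTITION (p=1);
-- Event 110: the files of p=1 are moved under $cmroot/db1/tbl/p=1 instead of being purged
LOAD DATA INPATH '/tmp/batch2' INTO TABLE db1.tbl PARTITION (p=1);
-- Event 120: new files land in p=1; their paths and checksums are recorded in the notification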

Now when Event 100 is replayed at the destination at a later point, the destination calculates the checksum for the file(s) in the partition path. If the checksum differs (for example, in this case, due to Event 110 and Event 120 that have occurred at the source), the destination looks for those files in $cmroot/database/table/p=1, and pulls them from this location to replicate the state of the source at the time Event 100 had occurred on the source.
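As an illustrative walkthrough of that fallback (paths hypothetical), the destination replaying Event 100 might check:

/user/hive/warehouse/db1.db/tbl/p=1/000000_0    (current file on the source - its checksum no longer matches the one recorded with Event 100)
$cmroot/db1/tbl/p=1/000000_0                    (retained copy - checksum matches, so this is the file that is pulled)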

_metadata

Currently, when we EXPORT a table or partition, we generate a _metadata file, which contains a snapshot of the metadata state of the object in question. This _metadata is generated at the point the EXPORT is done. For the purposes of solving rubberbanding, we now also need to be able to capture the metadata state of the object in question at the time an event happens. Thus, DbNotificationListener is being enhanced to also store a snapshot of the object itself, rather than just the name of the object, and at event-export time, the metadata is taken not from the metastore, but from the event data.

This then allows us to generate the appropriate object on the destination at the time the destination needs updating to that state, and not earlier. This, in conjunction with the file-pseudo-snapshotting that we introduce, allows us to replay state on the destination for both metadata and data.
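To make this concrete, consider a hypothetical continuation of the earlier example (the added column and the second event id are illustrative):

CREATE TABLE IF NOT EXISTS default.person (name string, age int);
-- Event 100: the full table object is captured in the notification
ALTER TABLE default.person ADD COLUMNS (city string);
-- Event 130 (illustrative id): a later metadata change on the source

When Event 100 is replayed on the destination, its _metadata is generated from the object snapshot stored with Event 100, so the destination first materializes the two-column table and only picks up the city column when the later event is replayed - rather than jumping straight to the source's current metastore state, which is exactly the rubberbanding we want to avoid.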

A need for bootstrap

One of the requests we got was that by offloading too many of the requirements of replication, we push too much "hive knowledge" over to the tools that integrate with us, asking them to essentially bootstrap the destination warehouse to a point where it is capable of receiving incremental updates. Currently, we recommend that users run a manual "EXPORT ... FOR REPLICATION" on all tables involved, set up any dbs needed, IMPORT these dumps as needed, etc., to prepare a destination for replicating into. We need to introduce a mechanism by which we can set up a replication dump at a larger scale than just tables, say, at a DB level. For this purpose, the best fit seemed to be a new tool or command, similar to mysqldump.
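For reference, the manual per-table bootstrap we currently recommend (and which such a tool would subsume) looks roughly like the following; the staging path and replication id are illustrative:

-- on the source, for each table involved
EXPORT TABLE default.person TO '/repl_staging/default/person' FOR REPLICATION('bootstrap-1');

-- on the destination, once the dump directory is accessible there and any dbs have been created
IMPORT FROM '/repl_staging/default/person';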

...