...

However, the idea behind the snapshot is still what we really want, and if HDFS cannot support the number of snapshots that we would create, we can implement a pseudo-snapshot instead: for every file backing a Hive object, if we detect that a Hive operation would move or modify it, we retain the original in a separate directory, similar to how we manage Trash. This pseudo-Trash-like capturing behaviour is what we refer to as the "change-management" piece, and it is the main piece that needs to be in place to solve both the rubberbanding problem and the 4x copy problem.
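The retention step described above can be sketched as follows. This is a minimal illustration of the pseudo-Trash idea, not Hive's actual implementation; the `CM_ROOT` location and function name are assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Illustrative "change-management" root where originals are retained,
# analogous to the Trash directory. The path is an assumption.
CM_ROOT = Path(tempfile.gettempdir()) / "cmroot"

def retain_before_modify(file_path: str) -> Path:
    """Before a Hive operation moves or modifies a file backing a Hive
    object, copy the original into CM_ROOT so replication can still read
    the old contents later."""
    src = Path(file_path)
    CM_ROOT.mkdir(parents=True, exist_ok=True)
    dest = CM_ROOT / src.name  # a real system would also checksum/dedup
    shutil.copy2(src, dest)
    return dest
```

A replication consumer can then read the retained copy even after the live file has been rewritten, which is what prevents rubberbanding.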

_metadata

The intent of generating metadata for an event on the source is to store enough metadata in the notification object of an event to be able to replicate the exact event on the destination side. As an example, consider the following SQL issued at the source:

CREATE TABLE IF NOT EXISTS default.person (name string, age int);

In this case, a corresponding "CREATE TABLE" metastore event (with a unique id - assume 100 in this case) will be generated, resulting in a new notification being stored in the metastore DB. We plan to store the entire table object (corresponding to the newly created default.person) in the metastore DB as part of the notification event.
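A sketch of what such a notification record might carry is shown below. The field names and the JSON payload are assumptions for illustration, not the exact metastore schema; the point is that the full table object travels with the event.

```python
import json
from dataclasses import dataclass

# Illustrative notification record stored in the metastore DB for an event.
@dataclass
class NotificationEvent:
    event_id: int      # unique, monotonically increasing id (100 here)
    event_type: str    # e.g. "CREATE_TABLE"
    db_name: str
    table_name: str
    message: str       # serialized table object (the full metadata payload)

# The entire table object for the newly created default.person is
# serialized into the event's message.
table_object = {
    "dbName": "default",
    "tableName": "person",
    "columns": [{"name": "name", "type": "string"},
                {"name": "age", "type": "int"}],
}

event = NotificationEvent(
    event_id=100,
    event_type="CREATE_TABLE",
    db_name="default",
    table_name="person",
    message=json.dumps(table_object),
)
```

Because the message contains the whole table object, the destination can recreate the table without consulting the source metastore again.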

Subsequent to the initial bootstrap dump/load, consider the following REPL DUMP command issued at the source:

REPL DUMP default.person FROM 100;

The dump command will read the table metadata for event 100 and generate an appropriate _metadata file that can be replayed at the destination to reach an identical state.
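The dump step can be sketched as iterating over stored events from the given id onward and writing out a replayable _metadata file per event. The in-memory event list, the id comparison, and the one-directory-per-event layout are assumptions for illustration, not the actual REPL DUMP implementation.

```python
import json
from pathlib import Path

def repl_dump(events, from_event_id, dump_dir):
    """Write a _metadata file for each stored event whose id is at or
    after from_event_id, into a per-event subdirectory of dump_dir."""
    out = Path(dump_dir)
    for ev in events:
        if ev["event_id"] < from_event_id:
            continue  # already covered by an earlier dump
        ev_dir = out / str(ev["event_id"])
        ev_dir.mkdir(parents=True, exist_ok=True)
        # the destination replays this metadata to reach the same state
        (ev_dir / "_metadata").write_text(json.dumps(ev["metadata"]))
    return out
```

The destination's load side would walk these directories in event-id order and apply each _metadata file in turn.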

_files

Currently, when we do an EXPORT of a table, the directory structure created in this dump has, at its root, a _metadata file that contains all the metadata state to be impressed, along with a directory structure for each partition to be impressed.
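The EXPORT layout described above can be sketched as follows; the partition names (`dt=...`) and the _metadata contents are illustrative assumptions, the shape is what matters: one _metadata file at the root, one subdirectory per partition.

```python
import json
import tempfile
from pathlib import Path

# Build an illustrative EXPORT dump layout for a partitioned table.
root = Path(tempfile.mkdtemp()) / "person_dump"
(root / "dt=2016-01-01").mkdir(parents=True)
(root / "dt=2016-01-02").mkdir(parents=True)
(root / "_metadata").write_text(json.dumps({"table": "default.person"}))

# Relative paths at the top level of the dump:
layout = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))
```

Here `layout` contains the root _metadata file followed by the per-partition directories.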

...