...

With EXPORT creating _files as indirections to the actual data files, and IMPORT reading _files to locate the actual files that need copying, we solve the 4x copy problem.
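
For illustration only (the exact on-disk format of _files is an implementation detail, and the paths below are hypothetical), a _files entry for a partition might simply list the URIs of the actual data files on the source cluster:

    hdfs://source-nn:8020/warehouse/repldb.db/tbl/p=1/000000_0
    hdfs://source-nn:8020/warehouse/repldb.db/tbl/p=1/000001_0

IMPORT reads these URIs and copies the data once, directly from the source location to the destination table, rather than materializing intermediate copies in the dump.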

Change management

Here is a possible solution to the rubberbanding problem described earlier:
For each metastore event for which a notification is generated, store the metadata object (table, partition, etc.), the location of the files associated with the event, and the checksum of each affected file (the reason for storing the checksum is explained shortly). For events that delete files (e.g. drop table/partition), move the deleted files to a configurable location on the file system (call it $cmroot for the purposes of this discussion) instead of deleting them.

Consider the following sequence of commands for illustration:

Event 100: ALTER TABLE tbl ADD PARTITION (p=1) LOCATION '<location>';
Event 110: ALTER TABLE tbl DROP PARTITION (p=1);
Event 120: ALTER TABLE tbl ADD PARTITION (p=1) LOCATION '<location>';

When the dump is loaded on the destination (possibly much later) and event 100 is replayed, the load task on the destination will try to pull the files from <location> (whose path is what _files contains), which by then may contain new or different data. To replicate the exact state of the source at the time event 100 occurred, we do the following:

  1. When Event 100 occurs at the source, in the notification event, we store the checksum of the file(s) in the newly added partition along with the file path(s).
  2. When Event 110 occurs at the source, we move the files of the dropped partition to $cmroot/database/tbl/p=1 instead of purging them (a sketch of this follows the list).
  3. When Event 120 occurs at the source, in the notification event, we again store the checksum of the file(s) in the newly added partition along with the file path(s).
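
The move-to-$cmroot behavior in step 2 might look like the following minimal sketch, assuming Hadoop's FileSystem API and a $cmroot/<db>/<table>/<partition> layout; the class and method names are hypothetical, not Hive's actual implementation:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.IOException;

    // On a drop-table/drop-partition event, move the affected files under
    // $cmroot instead of purging them, so that a later replay of an older
    // event can still retrieve the original file contents.
    public class ChangeManager {
      private final Path cmroot;
      private final Configuration conf;

      public ChangeManager(Path cmroot, Configuration conf) {
        this.cmroot = cmroot;
        this.conf = conf;
      }

      // e.g. recycle(partitionPath, "database", "tbl", "p=1") moves the
      // partition's files to $cmroot/database/tbl/p=1.
      public void recycle(Path dataLocation, String db, String table, String partition)
          throws IOException {
        FileSystem fs = dataLocation.getFileSystem(conf);
        Path target = new Path(cmroot, db + "/" + table + "/" + partition);
        fs.mkdirs(target);
        for (FileStatus file : fs.listStatus(dataLocation)) {
          // Rename (move), not copy: the original path disappears as a real
          // drop would, but the bytes stay reachable under $cmroot.
          fs.rename(file.getPath(), new Path(target, file.getPath().getName()));
        }
      }
    }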

Now when Event 100 is replayed at the destination at a later point, the destination calculates the checksum of the file(s) currently at the partition path. If the checksums differ (in this case because Events 110 and 120 occurred at the source after Event 100), the destination instead looks for those files under $cmroot/database/tbl/p=1 and pulls them from there, replicating the state the source was in when Event 100 occurred.
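
On the load side, the checksum check and the $cmroot fallback could be sketched as follows, assuming the notification event carried (file path, checksum) pairs and using HDFS's getFileChecksum; again, the names here are illustrative rather than Hive's actual code:

    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import java.io.IOException;

    public class ReplCopyHelper {
      // Decide where to copy a file from when replaying an event: the original
      // source path if its current checksum still matches what the event
      // recorded, otherwise the copy preserved under $cmroot.
      public static Path resolveSourcePath(FileSystem fs, Path sourceFile,
          FileChecksum recordedChecksum, Path cmrootFallback) throws IOException {
        if (fs.exists(sourceFile)) {
          FileChecksum current = fs.getFileChecksum(sourceFile);
          if (current != null && current.equals(recordedChecksum)) {
            return sourceFile; // unchanged since the event was generated
          }
        }
        // The file was deleted or overwritten after the event (rubberbanding):
        // pull the preserved copy from $cmroot/database/tbl/p=1 instead.
        return cmrootFallback;
      }
    }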

A need for bootstrap

One piece of feedback we received was that by offloading too many of the requirements of replication, we push too much "hive knowledge" over to the tools that integrate with us, essentially asking them to bootstrap the destination warehouse to a point where it is capable of receiving incremental updates. Currently, we recommend that users run a manual "EXPORT ... FOR REPLICATION" on all tables involved, set up any databases needed, IMPORT these dumps as needed, and so on, to prepare a destination for replicating into. We need to introduce a mechanism to set up a replication dump at a larger scale than individual tables, say, at a DB level. For this purpose, the best fit seemed to be a new tool or command, similar to mysqldump.
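
To make this concrete, such a command might be invoked roughly as follows; the syntax here is purely a hypothetical sketch of a mysqldump-style interface, not a committed design:

    REPL DUMP repldb;                          -- bootstrap dump of an entire database
    REPL LOAD repldb FROM '/tmp/dump/repldb';  -- replay that dump on the destination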

...