...

With this notion of EXPORT creating _files as indirections to the actual files, and IMPORT loading _files to locate the actual files needing copying, we solve the 4x copy problem.
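
To make this indirection concrete, here is a rough sketch of what an exported table directory might look like under this scheme; the layout and paths below are illustrative assumptions, not the actual on-disk format:

/repl_staging/sales/q3_data/
    _metadata                 -- serialized table (and partition) metadata, as with a regular EXPORT
    data/_files               -- list of pointers to the actual data files, one URI per line, e.g.:
        hdfs://source-nn:8020/warehouse/sales.db/q3_data/000000_0
        hdfs://source-nn:8020/warehouse/sales.db/q3_data/000001_0

EXPORT thus writes only metadata and this small _files listing, and IMPORT reads _files to locate and copy the listed files, instead of the data being physically copied into the dump directory up front.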

Bootstrap

One piece of feedback we received was that by offloading too many of the requirements of replication, we push too much "hive knowledge" over to the tools that integrate with us, asking them to essentially bootstrap the destination warehouse to a point where it is capable of receiving incremental updates. Currently, we recommend that users run a manual "EXPORT ... FOR REPLICATION" on all tables involved, set up any dbs needed, IMPORT these dumps as needed, etc., to prepare a destination for replicating into; a rough sketch of that flow follows below. We need to introduce a mechanism by which we can set up a replication dump at a larger scale than just tables, say, at a DB level. For this purpose, the best fit seemed to be a new tool or command, similar to mysqldump.
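
As a sketch of that current manual flow (the table name, staging path, and replication id string below are hypothetical, and the staging directory is assumed to be readable from the destination cluster, e.g. via distcp or a shared filesystem), a tool would do something like the following for every table involved:

-- on the source warehouse, once per table:
USE sales;
EXPORT TABLE q3_data TO '/repl_staging/sales/q3_data' FOR REPLICATION('bootstrap-1');

-- on the destination warehouse:
CREATE DATABASE IF NOT EXISTS sales;
USE sales;
IMPORT TABLE q3_data FROM '/repl_staging/sales/q3_data';

The REPL DUMP command introduced below is meant to subsume this per-table dance at a larger (e.g. db-level) granularity.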

(Note: in this section, I repeatedly refer to mysql and mysqldump, not because they are the only solution out there but because I'm somewhat familiar with them. Other dbs have equivalent tools.)

There are a couple of major differences, however, between expectations we have of something like mysqldump, and a command we implement:

  1. The scale of the data involved in an initial dump is orders of magnitude greater for a hive warehouse than for a typical mysql db.
  2. Transactional isolation & log-based approaches mean that mysqldump can take a stable snapshot of the entire db/metadata, and then dump out all dbs and tables from that snapshot. So, even if the dump takes a while, it need not worry about the objects changing while they are being dumped. We, on the other hand, need to handle that.

 


The first point can be solved by using the change-management semantics we're developing, and by using the lazy _files approach rather than a CopyTask.

The second part is a little more involved, and needs some consolidation during the dump generation. Let us say that we begin the dump at evid=170, and by the time we finish dumping all the objects it contains, we are at evid=230. For a consistent picture of the dump, we now also have to consolidate the information contained in events 170-230 into our dump before we can pass the baton over to incremental replication. We will discuss this shortly, after a brief detour through the new commands we introduce to manage the replication dump and reload.
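
To illustrate, the scenario above might play out roughly as follows; the sequencing is a sketch of the intent, not of the exact implementation:

evid=170      : dump begins; objects are written out lazily via the _files approach
evid=171..230 : further events accrue, some against objects that have already been dumped
evid=230      : the object dump finishes; events 170-230 relevant to this db are consolidated into the dump, which then reports 230 as its last event id, so incremental replication can take over from there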

New commands to help us

REPL DUMP

Syntax:

REPL DUMP <dbname>[.<tablename>] [FROM <init-evid> [TO <end-evid>] [BATCH <num-evids>] ];

This is best described through examples of each piece of the command syntax, as follows:


(a) REPL DUMP sales;

 Replicates out the sales database for bootstrap, from <init-evid>=0 (the bootstrap case) to <end-evid>=<CURR-EVID>, with a batch size of 0, i.e. no batching.

(b) REPL DUMP sales.Q3;

 Similar to case (a), but sets up table-level replication instead of db-level repl.

(c) REPL DUMP sales FROM 200 TO 1400;

 The presence of a FROM <init-evid> clause makes this dump not a bootstrap, but a dump that consults the event log to produce a delta dump.

 FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.

(d) REPL DUMP sales FROM 200;

 Similar to the above, but with <end-evid> implicitly assumed to be the current event id at the time the command is run.

(e) REPL DUMP sales FROM 200 TO 1400 BATCH 100;

(f) REPL DUMP sales FROM 200 BATCH 100;

Similar to cases (c) & (d), with the addition of a batch size of <num-evids>=100. This causes us to stop processing once we reach 100 events, and return at that point. Note that this does not mean that we stop processing at event id = 300 just because we began at 200; rather, we stop once we have processed 100 events that belong to this replication-definition, i.e. events of the relevant db or db.table, skipping over unrelated events in the event stream. (A sketch of driving such batched dumps in a loop follows the return values below.)

Return values:

  1. Errors are returned as error codes (and over JDBC if using HS2).
  2. Returns 2 columns in the ResultSet:
    1. <dir-name> - the directory to which it has dumped info.
    2. <last-evid> - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.

This call is intended to be synchronous, and expects the caller to wait for the result.
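
For instance, a caller driving the batched case described earlier might loop on these return values until it catches up; the directory names and event ids here are hypothetical:

REPL DUMP sales FROM 200 BATCH 100;   -- suppose this returns <dir-name>=/repl_dumps/sales/7 and <last-evid>=389
REPL DUMP sales FROM 389 BATCH 100;   -- the next iteration resumes from the returned <last-evid>
...                                   -- and so on, until <last-evid> catches up with the current event id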


Bootstrap note: The FROM clause means that we read the event log to determine what to dump. For bootstrapping, we would not use FROM.

REPL LOAD

REPL LOAD [<dbname>[.<tablename>]] FROM <dirname>;


This causes a repl dump present in <dirname> (which should be a fully qualified hdfs url) to be pulled and loaded. If <dbname> is specified, and the original dump was a db-level dump, this allows us to do a db-rename-mapping on import. If <dbname>.<tablename> was specified, and the original dump was a table-level dump, this allows us to do a table-rename-mapping on import. If neither dbname nor tablename is specified, the original dbname and tablename are used, as recorded in the dump.
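
A few illustrative invocations, with hypothetical directory and db/table names:

REPL LOAD FROM 'hdfs://dest-nn:8020/repl_dumps/sales/1';             -- keep the db/table names recorded in the dump
REPL LOAD sales_dr FROM 'hdfs://dest-nn:8020/repl_dumps/sales/1';    -- db-level dump loaded under the new name sales_dr
REPL LOAD sales_dr.Q3 FROM 'hdfs://dest-nn:8020/repl_dumps/q3/1';    -- table-level dump loaded with a db/table rename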

Return values:

  1. Error codes returned as normal.
  2. Does not return anything in the ResultSet; the user is expected to run REPL STATUS to check.

 

REPL STATUS

REPL STATUS <dbname>[.<tablename>];


Will return the same output that REPL LOAD returns, which allows REPL LOAD to be run asynchronously. If no knowledge of a replication associated with that db / db.tbl is present, i.e. there are no known replications for it, we return an empty set. Note that for cases where a destination db or table exists but no known repl exists for it, this should be considered an error condition, which tools calling REPL LOAD should pass on to the end-user to alert them that they may be overwriting an existing db/table with another.
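
Putting the three commands together, one full cycle might look roughly like this; the event ids, paths, and db names are again hypothetical:

-- on the source:
REPL DUMP sales;                      -- bootstrap; suppose it returns <dir-name>=/repl_dumps/sales/1, <last-evid>=230
-- on the destination:
REPL LOAD sales_dr FROM 'hdfs://dest-nn:8020/repl_dumps/sales/1';
REPL STATUS sales_dr;                 -- returns 230 once the load completes; an empty set if no repl is known
-- back on the source, for the next incremental cycle:
REPL DUMP sales FROM 230;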