...

Similar to case (a), but sets up db-level replication that excludes table/view 'Q4' and all table/view names that have the prefix 'T' and a numeric suffix of any length, for example 'T3', 'T400', 't255', etc. Table/view names are case-insensitive, and hence table/view names with the prefix 't' would also be excluded from the dump.
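A sketch of how such an exclusion could be expressed, assuming a <dbname>.['<included names/regex>'].['<excluded names/regex>'] policy form (only the include-list form, e.g. "sales.['[a-z]+']", appears elsewhere in this document, so the exclude list here is illustrative):

REPL DUMP sales.['.*?'].['t[0-9]+', 'Q4'];
-- the second list excludes 'Q4' and any name with prefix 't'/'T' followed by digits; names are matched case-insensitively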

...

The presence of a FROM <init-evid> clause makes this dump not a bootstrap, but a delta dump that reads the event log to determine what has changed. FROM 200 TO 1400 is self-evident in that it will go through event ids 200 to 1400 looking for events from the relevant db.
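For instance, an incremental dump over that event range would look like the following (the database name is illustrative):

REPL DUMP sales FROM 200 TO 1400;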

...

This is an example of changing the replication policy/scope dynamically during an incremental replication cycle.

In the first case, a full-DB replication policy "sales" is changed to a replication policy that includes only table/view names consisting solely of alphabetic characters, "sales.['[a-z]+']", such as "stores", "products", etc. A REPL LOAD using this dump intelligently drops the tables that are excluded under the new policy. For instance, a table named 'T5' would be automatically dropped during REPL LOAD if it already exists in the target cluster.

In the second case, the policy is changed again to include table/view 'Q5', and in this case Hive intelligently bootstraps the table/view 'Q5' within the current incremental dump. The same applies to table/view renames where the rename moves a table/view into or out of the policy's scope, as sketched below.
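A sketch of these two policy changes issued as consecutive incremental dumps, with hypothetical starting event ids:

REPL DUMP sales.['[a-z]+'] FROM 200;        -- tables/views excluded by the new policy are dropped by REPL LOAD on the target
REPL DUMP sales.['[a-z]+', 'Q5'] FROM 1400; -- 'Q5' is newly included, so it is bootstrapped within this incremental dump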

(i) REPL DUMP sales WITH ('hive.repl.include.external.tables'='false', 'hive.repl.dump.metadata.only'='true');

The REPL DUMP command has an optional WITH clause to set command-specific configurations to be used when performing the dump. These configurations are only used by the corresponding REPL DUMP command and won't be used for other queries running in the same session. In this example, the configurations exclude external tables and dump only metadata, not data.


Return values:

  1. Error codes returned as normal return codes (and over JDBC if using HS2)
  2. Returns 2 columns in the ResultSet:
    1. <dir-name> - the directory to which it has dumped info.
    2. <last-evid> - the last event-id associated with this dump, which might be the end-evid, or the curr-evid, as the case may be.
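For illustration, a dump invocation and the two-column result it returns might look like the following (the directory and event id values are hypothetical):

REPL DUMP sales;
-- <dir-name>  : hdfs://nn:8020/user/hive/repl/dump_1523
-- <last-evid> : 1400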

Note:

Now, the dump generated will be similar to the kind of dumps generated by EXPORTs, in that it will contain a _metadata file, but it will not contain the actual data files, instead using a _files file as an indirection to the actual files. Another aspect of REPL DUMP is that it does not take a directory argument specifying where to dump. Instead, it creates its own dump directory inside a root directory specified by a new HiveConf parameter, hive.repl.rootdir, which configures a root directory for dumps, and it returns the dump directory as part of its return value. It is also intended that we will introduce a replication dumpdir cleaner which will periodically clean it up.

This call is intended to be synchronous, and expects the caller to wait for the result.

If the HiveConf parameter hive.in.test is false, REPL DUMP will not use a new dump location, and thus it will garble an existing dump. Hence, before taking an incremental dump, clear the bootstrap dump location if hive.in.test is false.

Bootstrap note: The FROM clause means that we read the event log to determine what to dump. For bootstrapping, we would not use FROM.

...

This causes a REPL DUMP present in <dirname> (which must be a fully qualified HDFS URL) to be pulled and loaded. If <dbname> is specified, and the original dump was a database-level dump, this allows Hive to do a db-rename mapping on import. If <dbname>.<tablename> was specified, and the original dump was a table-level dump, this allows a table-rename mapping on import. If neither dbname nor tablename is specified, the original dbname and tablename, as recorded in the dump, are used.

The REPL LOAD command has an optional WITH clause to set command-specific configurations to be used when trying to copy from the source cluster. These configurations are only used by the corresponding REPL LOAD command and won't be used for other queries running in the same session.
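As an illustrative sketch (the target database name, dump path, and configuration value are hypothetical), such a load might look like:

REPL LOAD sales_replica FROM 'hdfs://nn:8020/user/hive/repl/dump_1523' WITH ('mapreduce.job.queuename'='repl');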

...

REPL STATUS

REPL STATUS <dbname>[.<tablename>];


Returns the same output that REPL LOAD returns, which allows REPL LOAD to be run asynchronously. If there is no knowledge of a replication associated with that db / db.tbl, i.e., there are no known replications for it, an empty set is returned. Note that when a destination db or table exists but no known replication exists for it, tools calling REPL LOAD should treat this as an error condition and pass it on to the end user, to alert them that they may be overwriting an existing db/table with another.

Return values:

  1. Error codes returned as normal.
  2. Returns the last replication state (event ID) for the given database.
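As an illustrative check (the database name and returned event id are hypothetical):

REPL STATUS sales_replica;
-- returns the last replicated event id for sales_replica, e.g. 1400; an empty set means no known replication for it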

Bootstrap, Revisited

When we introduced the notion of a need for bootstrap, we said that the time that passes during the bootstrap was a problem that needed to be solved separately.

...