...

  • With the MetaStore event configuration in place on the source cluster, the NOTIFICATION_LOG table in the MetaStore will be populated with events on the successful execution of metadata operations such as CREATE, ALTER, and DROP.
  • These events can be read and converted into ReplicationTasks using org.apache.hive.hcatalog.api.HCatClient.getReplicationTasks(long, int, String, String).
  • ReplicationTasks encapsulate a set of commands to execute on the source Hive instance (typically to export data) and another set to execute on the replica instance (typically to import data). The commands are provided as HQL strings.
  • The ReplicationTask is also where database and table name mappings can be declared, and where StagingDirectoryProvider implementations can be configured to resolve paths at both the source and destination:
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withDbNameMapping(Function<String, String>)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withTableNameMapping(Function<String, String>)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withSrcStagingDirProvider(StagingDirectoryProvider)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withDstStagingDirProvider(StagingDirectoryProvider)
  • The HQL commands provided by the tasks must then be executed against the source Hive and then the destination (aka the replica). One way of doing this is to open up a JDBC connection to the respective HiveServer and submit the task's HQL queries.
  • It is necessary to maintain the position within the notification log so that replication tasks are applied only once. This can be achieved by keeping a record of the last successfully executed event's id (task.getEvent().getEventId()) and providing it as the offset when sourcing the next batch of events.
  • To avoid missing events that require replication, it may be wise to poll for replication tasks at a frequency significantly greater than that implied by the hive.metastore.event.db.listener.timetolive property. If notifications are not consumed in a timely manner, they may be purged from the table before they can be actioned by the replication service.
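
The checkpointing and polling steps above can be sketched as follows. This is a minimal illustration, not the HCatalog API: Task, TaskSource, HqlExecutor, CheckpointStore, replicateBatch and pollInterval are all hypothetical names introduced here. In a real service, TaskSource would presumably wrap HCatClient.getReplicationTasks(long, int, String, String), each HqlExecutor would submit statements over a JDBC connection to the relevant HiveServer, and CheckpointStore would persist the event id somewhere durable. The safety factor used to derive a poll interval from the time-to-live is likewise an assumption, not a documented recommendation.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.List;

class ReplicationCheckpointSketch {

    /** Hypothetical stand-in for a ReplicationTask: an event id plus the HQL for each side. */
    public static final class Task {
        public final long eventId;
        public final List<String> sourceHql;
        public final List<String> replicaHql;

        public Task(long eventId, List<String> sourceHql, List<String> replicaHql) {
            this.eventId = eventId;
            this.sourceHql = sourceHql;
            this.replicaHql = replicaHql;
        }
    }

    /** Hypothetical wrapper around whatever call returns the tasks after a given event id. */
    public interface TaskSource {
        Iterator<Task> tasksAfter(long lastEventId, int maxEvents);
    }

    /** Hypothetical executor, e.g. backed by a JDBC connection to a HiveServer. */
    public interface HqlExecutor {
        void execute(String hql);
    }

    /** Hypothetical durable store for the id of the last successfully executed event. */
    public interface CheckpointStore {
        long load();
        void save(long eventId);
    }

    /**
     * Applies one batch of tasks in order: source commands first, then replica commands.
     * The checkpoint advances only after both sides of a task have run, so a failure
     * (an exception from an executor) leaves the offset at the last completed task.
     */
    public static int replicateBatch(TaskSource source, HqlExecutor sourceHive,
                                     HqlExecutor replicaHive, CheckpointStore checkpoint,
                                     int maxEvents) {
        long lastEventId = checkpoint.load();
        Iterator<Task> tasks = source.tasksAfter(lastEventId, maxEvents);
        int applied = 0;
        while (tasks.hasNext()) {
            Task task = tasks.next();
            for (String hql : task.sourceHql) {
                sourceHive.execute(hql);
            }
            for (String hql : task.replicaHql) {
                replicaHive.execute(hql);
            }
            checkpoint.save(task.eventId);
            applied++;
        }
        return applied;
    }

    /** Derives a poll interval well inside the time-to-live; the divisor is an assumed safety factor. */
    public static Duration pollInterval(Duration timeToLive, int safetyFactor) {
        if (safetyFactor < 1) {
            throw new IllegalArgumentException("safetyFactor must be at least 1");
        }
        return timeToLive.dividedBy(safetyFactor);
    }

    public static void main(String[] args) {
        // In-memory demonstration with placeholder statements instead of real HQL.
        List<Task> log = List.of(
                new Task(1L, List.of("-- source HQL for event 1"), List.of("-- replica HQL for event 1")));
        long[] saved = {0L};
        CheckpointStore store = new CheckpointStore() {
            public long load() { return saved[0]; }
            public void save(long eventId) { saved[0] = eventId; }
        };
        TaskSource source = (last, max) ->
                log.stream().filter(t -> t.eventId > last).limit(max).iterator();
        HqlExecutor printer = hql -> System.out.println(hql);
        int applied = replicateBatch(source, printer, printer, store, 100);
        System.out.println("applied=" + applied + ", checkpoint=" + saved[0]);
    }
}
```

Because the checkpoint is saved per task rather than per batch, a crash between the source and replica commands of one task means that task is retried in full on the next run. Whether the generated commands tolerate such a retry is not covered here and would need to be checked against the actual HQL.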

Replication to AWS/EMR/S3

...