
...

  • While the MetaStore events feature allows notifications to be delivered to anything implementing MetaStoreEventListener, the Replication feature can only source events from the MetaStore database, and hence the DbNotificationListener must be used.
  • Data appended to tables or partitions using the HCatalogWriters will not be automatically replicated as they do not currently generate MetaStore notifications (HIVE-9577). This is likely only a consideration if data is being written to tables by processes outside of Hive.
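
To satisfy the DbNotificationListener requirement, the listener can be declared on the source MetaStore. The snippet below is a minimal sketch of the relevant hive-site.xml entries; the time-to-live value is illustrative only, and the property names should be checked against the Hive version in use:

      <property>
        <name>hive.metastore.event.listeners</name>
        <value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
      </property>
      <property>
        <!-- How long events are retained in NOTIFICATION_LOG before being purged. -->
        <name>hive.metastore.event.db.listener.timetolive</name>
        <value>86400s</value>
      </property>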

...

  • With the MetaStore event configuration in place on the source cluster, the NOTIFICATION_LOG table in the MetaStore will be populated with events on the successful execution of metadata operations such as CREATE, ALTER, and DROP.
  • These events can be read and converted into ReplicationTasks using org.apache.hive.hcatalog.api.HCatClient.getReplicationTasks(long, int, String, String).
  • ReplicationTasks encapsulate a set of commands to execute on the source Hive instance (typically to export data) and another set to execute on the replica instance (typically to import data). The commands are provided as HQL strings.
  • The ReplicationTask also serves as a place where database and table name mappings can be declared and StagingDirectoryProvider implementations configured for the resolution of paths at both the source and destination:
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withDbNameMapping(Function<String, String>)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withTableNameMapping(Function<String, String>)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withSrcStagingDirProvider(StagingDirectoryProvider)
    • org.apache.hive.hcatalog.api.repl.ReplicationTask.withDstStagingDirProvider(StagingDirectoryProvider)
  • The HQL commands provided by the tasks must be executed first against the source Hive instance and then against the destination (aka the replica). One way of doing this is to open a JDBC connection to the respective HiveServer and submit the tasks' HQL queries, as in the sketch following this list.
  • It is necessary to maintain the position within the notification log so that replication tasks are applied only once. This can be achieved by maintaining a record of the last successfully executed event's id (task.getEvent().getEventId()) and providing this as an offset when sourcing the next batch of events.
  • To avoid losing or missing events that require replication, it may be wise to poll for replication tasks at a frequency significantly greater than that derived from the hive.metastore.event.db.listener.timetolive property. If notifications are not consumed in a timely manner they may eventually be purged from the table and thus no longer be available for consumption.
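
Putting these steps together, the following is a minimal sketch of a poller that reads ReplicationTasks from the source MetaStore and applies their HQL over JDBC. It is illustrative rather than definitive: the connection URLs, credentials, database filter, staging paths, and the offset-persistence idea are placeholders, and it assumes that StagingDirectoryProvider exposes a single getStagingDirectory(String) method.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;
    import java.util.Iterator;

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hive.hcatalog.api.HCatClient;
    import org.apache.hive.hcatalog.api.repl.Command;
    import org.apache.hive.hcatalog.api.repl.ReplicationTask;
    import org.apache.hive.hcatalog.api.repl.StagingDirectoryProvider;

    public class ReplicationTaskPoller {

      public static void main(String[] args) throws Exception {
        // hive.metastore.uris in the HiveConf must point at the source MetaStore.
        HCatClient client = HCatClient.create(new HiveConf());

        // The id of the last event successfully applied to the replica; in practice
        // this would be read from durable storage rather than starting at zero.
        long lastEventId = 0L;

        // Fetch the next batch of tasks (the database and table filters are placeholders).
        Iterator<ReplicationTask> tasks =
            client.getReplicationTasks(lastEventId, 100, "mydb", null);

        try (Connection source = DriverManager.getConnection(
                 "jdbc:hive2://source-hiveserver:10000/default", "hive", "");
             Connection replica = DriverManager.getConnection(
                 "jdbc:hive2://replica-hiveserver:10000/default", "hive", "")) {

          while (tasks.hasNext()) {
            ReplicationTask task = tasks.next();

            // Resolve export/import staging locations on each cluster (paths are illustrative).
            task.withSrcStagingDirProvider(stagingIn("hdfs://source-nn/apps/hive/repl/"));
            task.withDstStagingDirProvider(stagingIn("hdfs://replica-nn/apps/hive/repl/"));

            // Run the export-side HQL against the source, then the import-side HQL
            // against the replica.
            execute(source, task.getSrcWhCommands());
            execute(replica, task.getDstWhCommands());

            // Record progress so that the next poll starts after this event, e.g. via a
            // hypothetical persistOffset(lastEventId) helper backed by durable storage.
            lastEventId = task.getEvent().getEventId();
          }
        }
        client.close();
      }

      // Submit each command's HQL statements over the given JDBC connection.
      private static void execute(Connection connection, Iterable<? extends Command> commands)
          throws Exception {
        try (Statement statement = connection.createStatement()) {
          for (Command command : commands) {
            for (String hql : command.get()) {
              statement.execute(hql);
            }
          }
        }
      }

      // Maps an event key to a staging directory under the given base path.
      private static StagingDirectoryProvider stagingIn(final String base) {
        return new StagingDirectoryProvider() {
          @Override
          public String getStagingDirectory(String key) {
            return base + key;
          }
        };
      }
    }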

Replication to AWS/EMR/S3

...