...
- While the MetaStore events feature allows the sinking of notifications to anything implementing MetaStoreEventListener, the implementation of the Replication feature can only source events from the MetaStore database, and hence the DbNotificationListener must be used.
- Data appended to tables or partitions using the HCatalogWriters will not be automatically replicated, as they do not currently generate MetaStore notifications (HIVE-9577). This is likely only a consideration if data is being written to the table by processes outside of Hive.
...
- With the MetaStore event configuration in place on the source cluster, the NOTIFICATION_LOG table in the MetaStore will be populated with events on the successful execution of metadata operations such as CREATE, ALTER, and DROP.
- These events can be read and converted into ReplicationTasks using org.apache.hive.hcatalog.api.HCatClient.getReplicationTasks(long, int, String, String). ReplicationTasks encapsulate a set of commands to execute on the source Hive instance (typically to export data) and another set to execute on the replica instance (typically to import data). The commands are provided as HQL strings.
- The ReplicationTask also serves as a place where database and table name mappings can be declared and StagingDirectoryProvider implementations configured for the resolution of paths at both the source and destination:
  - org.apache.hive.hcatalog.api.repl.ReplicationTask.withDbNameMapping(Function<String, String>)
  - org.apache.hive.hcatalog.api.repl.ReplicationTask.withTableNameMapping(Function<String, String>)
  - org.apache.hive.hcatalog.api.repl.ReplicationTask.withSrcStagingDirProvider(StagingDirectoryProvider)
  - org.apache.hive.hcatalog.api.repl.ReplicationTask.withDstStagingDirProvider(StagingDirectoryProvider)
- The HQL commands provided by the tasks must then be executed against the source Hive instance and then against the destination (the replica). One way of doing this is to open a JDBC connection to the respective HiveServer and submit each task's HQL queries.
- It is necessary to maintain the position within the notification log so that replication tasks are applied only once. This can be achieved by maintaining a record of the last successfully executed event's id (task.getEvent().getEventId()) and providing this as an offset when sourcing the next batch of events.
- To avoid losing or missing events that require replication, it may be wise to poll for replication tasks at a frequency significantly greater than that derived from the hive.metastore.event.db.listener.timetolive property. If notifications are not consumed in a timely manner they may eventually be purged from the table and thus no longer be available for consumption.
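
The steps above can be sketched as a small apply-once loop. This is only an illustration under stated assumptions, not the Hive implementation: Task, TaskSource, and HqlRunner are hypothetical stand-ins for ReplicationTask, for a call to HCatClient.getReplicationTasks(long, int, String, String), and for submitting HQL over a JDBC connection. The safetyFactor parameter and the use of seconds for the time-to-live value are likewise assumptions made for the example.

```java
import java.util.List;

/** Sketch of an apply-once replication loop. Every type here is a hypothetical stand-in. */
class ReplicationLoopSketch {

    /** Hypothetical stand-in for a ReplicationTask: an event id plus the HQL for each side. */
    static final class Task {
        final long eventId;
        final List<String> srcCommands; // to run on the source, typically exports
        final List<String> dstCommands; // to run on the replica, typically imports

        Task(long eventId, List<String> srcCommands, List<String> dstCommands) {
            this.eventId = eventId;
            this.srcCommands = srcCommands;
            this.dstCommands = dstCommands;
        }
    }

    /** Hypothetical stand-in for fetching the tasks that follow a given event id. */
    interface TaskSource {
        List<Task> fetch(long lastEventId, int maxEvents) throws Exception;
    }

    /** Hypothetical stand-in for submitting one HQL string, for example via JDBC. */
    interface HqlRunner {
        void run(String hql) throws Exception;
    }

    /**
     * Applies one batch in order and returns the id of the last event that was
     * fully applied on both sides. The caller is expected to persist this value
     * and pass it back as lastEventId on the next poll.
     */
    static long applyBatch(TaskSource source, HqlRunner srcHive, HqlRunner dstHive,
                           long lastEventId, int maxEvents) {
        long position = lastEventId;
        List<Task> batch;
        try {
            batch = source.fetch(lastEventId, maxEvents);
        } catch (Exception e) {
            return position; // nothing was applied, so the position is unchanged
        }
        for (Task task : batch) {
            try {
                for (String hql : task.srcCommands) {
                    srcHive.run(hql); // source side first
                }
                for (String hql : task.dstCommands) {
                    dstHive.run(hql); // then the replica
                }
            } catch (Exception e) {
                return position; // stop here; this event is retried on the next poll
            }
            position = task.eventId; // advance only after both sides succeeded
        }
        return position;
    }

    /** Derives a poll interval that is a fraction (1 / safetyFactor) of the time-to-live. */
    static long pollIntervalSeconds(long timeToLiveSeconds, int safetyFactor) {
        if (timeToLiveSeconds <= 0 || safetyFactor < 1) {
            throw new IllegalArgumentException("timeToLiveSeconds must be > 0 and safetyFactor >= 1");
        }
        return Math.max(1L, timeToLiveSeconds / safetyFactor);
    }
}
```

Because the position advances only after both command sets succeed, a failed event is fetched again on the next poll. The sketch assumes that rerunning a partially applied task is acceptable; if it is not, the failure path needs its own handling.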
Replication to AWS/EMR/S3
...