...

Authors: Alberto Gomez (alberto.gomez@est.tech)

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

...

  • Backup/restore:
    - It only works for persistent regions.
    - It does not support the restoration of a backup taken from a system in a newer version of Geode into a system in an older version.
    - It is not intended to be used to restore a backup from one site to another but rather to restore a previous backup of one site into the same site.
    - It cannot guarantee consistency of data across regions.
    - If the data in the region was populated using transactions, there is no guarantee that those transactions are honored in the restored region unless no traffic was ongoing when the backup was taken (transaction events are not written atomically to disk).
    - Recovering WAN replication between the sites after the restore, without stopping the service, is cumbersome. It requires a procedure to clean the replication queues in the master site, restart replication right before the backup is taken, and have enough memory/disk resources available to hold replication events while the remote site is not yet restored.
  • Import/export:
    - As it is based on the Snapshot service, it does not guarantee consistency of the data unless it is run when there is no traffic running.
    - It cannot guarantee consistency of snapshots across regions.
    - It suffers the same problem as backup and restore with respect to consistency of transactional data.
    - It suffers the same problems as backup and restore with respect to the process to recover the WAN replication without stopping the service.
  • Gemtouch:
    - Gemtouch is an open-source tool that is not part of Geode and is no longer maintained.
    - As this tool replicates the entries of a region by means of generating a get and a put for every entry, it causes the undesired modification of the region entries (at least of the timestamp).
    - Also, even if the gets and puts for every entry are done inside a transaction, the process is subject to race conditions if it is run while non-transactional traffic is being sent to the Geode cluster. Such race conditions could cause traffic writes to be overwritten by Gemtouch writes if the former are not done inside transactions.
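
The race described in the last Gemtouch item can be illustrated with a toy model. The sketch below is not the Geode or Gemtouch API: the `region` dictionary, the `put` helper and the timestamps are assumptions made up for illustration. It only shows how a client write that lands between the tool's get and put is lost, and how the entry timestamp changes even though the value was only "touched".

```python
# Toy model of a region: key -> (value, last_modified_timestamp).
# Hypothetical stand-in, NOT the Geode API; it only shows the interleaving.
region = {"k": ("v1", 100)}


def put(region, key, value, now):
    """Store a value and record the (hypothetical) modification time."""
    region[key] = (value, now)


# Step 1: the touch tool reads the current entry.
touched_value, original_ts = region["k"]

# Step 2: a non-transactional client write lands between the get and the put.
put(region, "k", "v2", now=150)

# Step 3: the touch tool writes back the value it read in step 1.
put(region, "k", touched_value, now=200)

final_value, final_ts = region["k"]
lost_update = final_value != "v2"            # the client write was overwritten
timestamp_changed = final_ts != original_ts  # the entry was modified by the touch
print(final_value, lost_update, timestamp_changed)
```

In this interleaving the region ends up holding the stale value, which is the overwrite the item above warns about.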

This RFC tries to solve the specific problem of bringing into service, either again or for the first time, a Geode site that is to be part of a WAN-replicated Geode system. Such a site needs to load the region data from the other Geode sites that are already running and, at the same time, receive the new events generated in the source site via WAN replication.

The solution should be suitable for any WAN replication topology.

Anti-Goals

This RFC does not try to solve the problem of synchronizing on the fly two WAN-replicated sites that may have diverged. The targeted use case consists of one or more WAN sites that are online, with data and with clients connected to them, and another site with no clients connected that needs to get the data from the other sites.

Solution

To overcome the disadvantages of each of the mechanisms enumerated in the Problem section and to offer a solution that is better integrated into Geode, this RFC proposes a command that replicates the data of a region from one site to another. The command operates in a similar way to the Gemtouch tool but without the drawbacks described above.

...

It would be desirable to provide another command to stop an ongoing replication started by the replicate region command.

The command will assume that the destination site has the regions to be replicated from another site already created and that it has gateway receivers to get the data from the source site.

The command will also assume that no clients are connected to the destination site until the copy has finished. If, for any reason, the replication fails or does not finish, the command will have to be run again. The command should provide information about the result of the execution.

Events replicated by this command must not be notified to the clients in the remote site and should not be sent to other sites from the remote site.

Changes and Additions to Public Interfaces

...

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

...

  • Instead of putting events in the sender's queue, the command will put events directly in batches and pass them to the gateway sender to be sent to the remote site.
  • The name of the command has been changed to: wan-copy region
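
The batching approach mentioned in the first erratum can be sketched as follows. This is a minimal, language-neutral illustration and not the actual implementation: `make_batches`, `copy_region`, the `send_batch` callback and the batch size are hypothetical names chosen here, and `send_batch` merely stands in for handing a batch to a gateway sender.

```python
from itertools import islice


def make_batches(entries, batch_size):
    """Yield lists of at most batch_size (key, value) pairs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(entries)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def copy_region(entries, send_batch, batch_size=100):
    """Send every entry in batches; return the number of entries sent."""
    copied = 0
    for batch in make_batches(entries, batch_size):
        send_batch(batch)  # hypothetical stand-in for the gateway sender
        copied += len(batch)
    return copied


# Usage: collect the batches in a list instead of sending them anywhere.
sent = []
total = copy_region({"a": 1, "b": 2, "c": 3}.items(), sent.append, batch_size=2)
print(total, sent)
```

Returning the number of copied entries is one simple way for such a command to report the result of its execution, as the Solution section asks.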