Improve data inconsistency in replicated regions

To be Reviewed By: 1 March 2022

Authors: Mario Ivanac

Status: Draft | Discussion | Active | Dropped | Superseded

Superseded by: N/A

Related: N/A

Problem

For example, we have configured replicated region with 3 or more servers, and client is configured with read timeout set to value same or smaller than member timeout.

In case while client is putting data in region, one of replicated servers is shutdown, it is observed that we will have data inconsistency.

As a result, you will see that part of data is written in server connected with client, but in remaining replicated servers it is missing.

After analysis, it is observed, as client puts data in connected server, that server will replicate data to other servers. If any server in chain of replicated servers is restarted, replication will be halted until server is declared dead.

Since clients read-timeout is lower then member-timeout, client will continue with insertion of data. And again, replication will be halted to replicated servers. At some moment, restarted member will be declared dead, and now we will have halted events to put in replicated servers, and in parallel new events from client. If new event from client is replicated to other servers, before halted events, then in moment that halted events are notified to replicated servers, they will assume that these are duplicate events (compare sequence Id of received event to last stored sequence ID), and will not store this data.

In linked PR you can find distributed test for reproduction of problem.

Anti-Goals

Solution

When replicating to other servers, current logic would first create connections toward all servers, and then replicate data over those connections. Creation of connections will last, until all connections are created, or destination server/s is/are declared dead.

New proposal is, when replicating data, first try to create connections to all destinations in one attempt. For all created connections we will replicate data. After this step is completed, we will try to create remaining connections to all unreachable destinations, until connections are established, or destination is declared dead. If connections are established we will replicate data.

Linked PR: https://github.com/apache/geode/pull/7381

Other solution explored that have been discarded is:

Add logic, for storing of last N (configurable value) key - sequence Id pairs, instead of only last sequence Id.

User will configure size of last stored "key - sequence Id" Map.

Default value is 0, feature disabled

Value greater than 0, feature activated (calculate according to formula = 3 x member-timeout / read-timeout)

PR with proposed solution: https://github.com/apache/geode/pull/7214

Changes and Additions to Public Interfaces

None

Performance Impact

None

Backwards Compatibility and Upgrade Path

No upgrade or backwards compatibility issues.

Prior Art

What would be the alternatives to the proposed solution? What would happen if we don’t solve the problem? Why should this proposal be preferred?

FAQ

Answers to questions you’ve commonly been asked after requesting comments for this proposal.

Errata

What are minor adjustments that had to be made to the proposal since it was approved?

LikeBe the first to like this

Space shortcuts

Page tree