ID | IEP-12 |
Author | |
Sponsor | |
Created | |
Status | DRAFT |
Data consistency problems were detected in some failure scenarios for ATOMIC caches. This page describes the current behavior and suggests fixes.
As a first step, the documentation (Javadoc) has been updated in rev d7987e6d5f633d6d0e4dc8816387efcba7bafbdd.
Imagine a partitioned ATOMIC cache configured with 2 backups (a primary plus 2 backup copies of each partition). One possible scenario leading to divergence between the primary and its backups is the following: the update-initiating node sends an update operation to the primary node; the primary propagates the update to 1 of the 2 backups and then dies. If the initiating node crashes as well (and therefore cannot retry the operation), the current implementation leaves the cluster with 2 copies of the partition that may differ. Note that both outcomes are possible: the new primary contains the latest update while the backup does not, and vice versa. A new backup will be elected according to the configured affinity and will rebalance the partition from a random owner, so the copies may remain inconsistent for the reason described above.
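The scenario above can be illustrated with a minimal model (plain Java maps standing in for partition copies, not Ignite code): the update reaches the primary and one of the two backups before both the primary and the initiating node die, leaving the two surviving copies different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative model of the divergence scenario described above.
public class AtomicDivergenceSketch {
    public static void main(String[] args) {
        // Three copies of the same partition: primary + 2 backups.
        Map<String, String> primary = new HashMap<>();
        Map<String, String> backup1 = new HashMap<>();
        Map<String, String> backup2 = new HashMap<>();

        // Update arrives at the primary node...
        primary.put("key", "v2");
        // ...and is propagated to only 1 of the 2 backups before the primary dies.
        backup1.put("key", "v2");

        // The initiating node dies too, so the update is never retried.
        // The two surviving copies of the partition now differ:
        System.out.println("backup1: " + backup1.get("key"));
        System.out.println("backup2: " + backup2.get("key"));
        System.out.println("diverged: "
            + !Objects.equals(backup1.get("key"), backup2.get("key")));
    }
}
```

Whichever of the two survivors is elected primary, the other may rebalance from it or from another stale owner, so the inconsistency can persist either way.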
This problem does not affect TRANSACTIONAL caches, since the 2PC protocol handles scenarios of this kind well.
The following suggestions should fix this issue for ATOMIC caches:
Cache topology should be persisted for persistent caches to handle cases when the primary node is not the one suggested by the affinity function (e.g. because the node holding the copy with the highest partition update counter was chosen as primary).
The rebalancing procedure should be improved to support the situation when a node holds a copy of a partition that is several updates behind the primary copy (the one with the maximum update counter): instead of dropping its current copy, the node should merge state from the current primary. Probably cacheMapEntry.initialVersion() should examine cache versions and apply the new value only if the passed-in version is greater. This addresses the situation where the partition copy with the greater counter value contains keys that were already in the cache and have been updated but not yet propagated to all copies. This way there is no need to drop partition copies that are behind from the partition-counter standpoint.
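The merge rule suggested above can be sketched as follows. This is a simplified model, not Ignite internals: Entry and its single long version field stand in for a cache entry with its version, and merge stands in for the version check proposed for cacheMapEntry.initialVersion() — accept the incoming rebalanced value only when its version is newer than the local one.

```java
// Sketch of merge-on-rebalance instead of drop-and-refetch (illustrative names).
public class MergeOnRebalanceSketch {
    static final class Entry {
        final long version;  // stands in for the entry's cache version / counter
        final String value;

        Entry(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    // Apply the incoming entry only if it is strictly newer than the local copy;
    // otherwise keep the local copy, so a behind-but-not-empty copy is merged,
    // not dropped.
    static Entry merge(Entry local, Entry incoming) {
        if (local == null || incoming.version > local.version)
            return incoming;

        return local;
    }

    public static void main(String[] args) {
        Entry local = new Entry(5, "local");

        // A stale entry from the supplier must not overwrite a newer local one.
        System.out.println(merge(local, new Entry(3, "stale")).value);

        // A newer entry from the supplier replaces the local one.
        System.out.println(merge(local, new Entry(7, "fresh")).value);
    }
}
```

Per-entry version comparison is what makes it safe to keep a copy that is behind by counter: entries the copy already has at the latest version are left untouched, while only genuinely newer updates are applied.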
One more change can be considered: it would be possible to send entry processors to backup nodes instead of sending values when the backups hold the partition in OWNING state. The full value should still be sent to backups in MOVING state.
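The decision above can be sketched as a simple rule; PartitionState, Payload, and payloadFor are illustrative names, not Ignite API. An OWNING backup has a complete copy and can safely replay the entry processor, while a MOVING backup may not yet hold the base entry and therefore needs the computed value.

```java
// Sketch of choosing what to send to a backup based on its partition state.
public class BackupPayloadSketch {
    enum PartitionState { OWNING, MOVING }

    enum Payload { ENTRY_PROCESSOR, FULL_VALUE }

    // OWNING backups get the lightweight processor; MOVING backups get the value.
    static Payload payloadFor(PartitionState backupState) {
        return backupState == PartitionState.OWNING
            ? Payload.ENTRY_PROCESSOR
            : Payload.FULL_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(payloadFor(PartitionState.OWNING));
        System.out.println(payloadFor(PartitionState.MOVING));
    }
}
```

Sending the processor rather than the value can reduce network traffic when the computed value is large, but it requires the processor to be deterministic so that replaying it on an OWNING backup yields the same result as on the primary.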
The following risks are possible:
- TBD
- TBD
- TBD