ID | IEP-12 |
Author | |
Sponsor | |
Created | |
Status | DRAFT |
Data consistency problems were detected in some failure scenarios for ATOMIC caches. This page describes the current behavior and suggests fixes.
As a first step, the documentation (Javadoc) has been updated in rev d7987e6d5f633d6d0e4dc8816387efcba7bafbdd.
Imagine a partitioned ATOMIC cache configured with 2 backups (a primary plus 2 backup copies of each partition). One possible scenario leading to divergence between the primary and its backups is the following: the update-initiating node sends an update operation to the primary node; the primary propagates the update to 1 of the 2 backups and then dies. If the initiating node crashes as well (and therefore cannot retry the operation), the current implementation leaves the cluster with 2 copies of the partition that may differ. Note that both outcomes are possible: the new primary contains the latest update while the backup does not, and vice versa. A new backup will be elected according to the configured affinity and will rebalance the partition from a random owner, so the copies may remain inconsistent for the reason described above.
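The scenario above can be illustrated with a minimal model (plain Java maps standing in for partition copies, not Ignite code): the update reaches the primary and one of the two backups before both the primary and the initiating node die, leaving the two surviving copies different.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative model of the divergence scenario described above.
public class AtomicDivergenceSketch {
    public static void main(String[] args) {
        // Three copies of the same partition: primary + 2 backups.
        Map<String, String> primary = new HashMap<>();
        Map<String, String> backup1 = new HashMap<>();
        Map<String, String> backup2 = new HashMap<>();

        // Update arrives at the primary node...
        primary.put("key", "v2");
        // ...and is propagated to only 1 of the 2 backups before the primary dies.
        backup1.put("key", "v2");

        // The initiating node dies too, so the update is never retried.
        // The two surviving copies of the partition now differ:
        System.out.println("backup1: " + backup1.get("key"));
        System.out.println("backup2: " + backup2.get("key"));
        System.out.println("diverged: "
            + !Objects.equals(backup1.get("key"), backup2.get("key")));
    }
}
```

Whichever of the two survivors is elected primary, the other may rebalance from it or from another stale owner, so the inconsistency can persist either way.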
This problem does not affect TRANSACTIONAL caches, since the 2PC protocol handles scenarios of this kind well.
The following suggestions should fix this issue for ATOMIC caches:
Cache topology should be persisted for persistent caches to handle cases when the primary node is not the one suggested by the affinity function (e.g. because the node holding the copy with the highest partition update counter was chosen as primary).
The rebalancing procedure should be improved to support the situation when a node holds a copy of a partition that is several updates behind the primary copy (the one with the maximum update counter): instead of dropping its current copy, the node should merge state from the current primary. Probably cacheMapEntry.initialVersion() should examine cache versions and apply the new value only if the passed-in version is greater. This addresses the situation where the partition copy with the greater counter value contains keys that were already in the cache and have been updated but not yet propagated to all copies. This way there is no need to drop partition copies that are behind from the partition-counter standpoint.
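The merge rule suggested above can be sketched as follows. This is a simplified model, not Ignite internals: Entry and its single long version field stand in for a cache entry with its version, and merge stands in for the version check proposed for cacheMapEntry.initialVersion() — accept the incoming rebalanced value only when its version is newer than the local one.

```java
// Sketch of merge-on-rebalance instead of drop-and-refetch (illustrative names).
public class MergeOnRebalanceSketch {
    static final class Entry {
        final long version;  // stands in for the entry's cache version / counter
        final String value;

        Entry(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    // Apply the incoming entry only if it is strictly newer than the local copy;
    // otherwise keep the local copy, so a behind-but-not-empty copy is merged,
    // not dropped.
    static Entry merge(Entry local, Entry incoming) {
        if (local == null || incoming.version > local.version)
            return incoming;

        return local;
    }

    public static void main(String[] args) {
        Entry local = new Entry(5, "local");

        // A stale entry from the supplier must not overwrite a newer local one.
        System.out.println(merge(local, new Entry(3, "stale")).value);

        // A newer entry from the supplier replaces the local one.
        System.out.println(merge(local, new Entry(7, "fresh")).value);
    }
}
```

Per-entry version comparison is what makes it safe to keep a copy that is behind by counter: entries the copy already has at the latest version are left untouched, while only genuinely newer updates are applied.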
One more change can be considered: it would be possible to send entry processors to backup nodes instead of sending values when the backups hold the partition in OWNING state. The full value should still be sent to backups in MOVING state.
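The decision above can be sketched as a simple rule; PartitionState, Payload, and payloadFor are illustrative names, not Ignite API. An OWNING backup has a complete copy and can safely replay the entry processor, while a MOVING backup may not yet hold the base entry and therefore needs the computed value.

```java
// Sketch of choosing what to send to a backup based on its partition state.
public class BackupPayloadSketch {
    enum PartitionState { OWNING, MOVING }

    enum Payload { ENTRY_PROCESSOR, FULL_VALUE }

    // OWNING backups get the lightweight processor; MOVING backups get the value.
    static Payload payloadFor(PartitionState backupState) {
        return backupState == PartitionState.OWNING
            ? Payload.ENTRY_PROCESSOR
            : Payload.FULL_VALUE;
    }

    public static void main(String[] args) {
        System.out.println(payloadFor(PartitionState.OWNING));
        System.out.println(payloadFor(PartitionState.MOVING));
    }
}
```

Sending the processor rather than the value can reduce network traffic when the computed value is large, but it requires the processor to be deterministic so that replaying it on an OWNING backup yields the same result as on the primary.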
The following risks are possible:
- TBD
- TBD
- TBD