Notes:
...
The Partition Map Exchange process is as follows:
A change in cluster membership causes a major topology version update (e.g. 5.0→6.0); cache creation or destruction causes a minor version update (e.g. 6.0→6.1).
Initialization of the exchange means adding a GridDhtPartitionsExchangeFuture to the exchange queue on each node.
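Internally a topology version is essentially a (major, minor) pair; Ignite models it with the AffinityTopologyVersion class. A minimal sketch of the idea (not Ignite's actual class):

    // Minimal sketch: the major part is bumped when cluster membership changes,
    // the minor part when a cache is created or destroyed.
    final class TopologyVersion implements Comparable<TopologyVersion> {
        final long major;   // e.g. 5 -> 6 when a node joins or leaves
        final int minor;    // e.g. 0 -> 1 when a cache is started or stopped

        TopologyVersion(long major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override public int compareTo(TopologyVersion o) {
            int cmp = Long.compare(major, o.major);
            return cmp != 0 ? cmp : Integer.compare(minor, o.minor);
        }

        @Override public String toString() {
            return major + "." + minor;
        }
    }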
...
On receiving the full map for the current exchange, each node calls onDone() for this future. Listeners for the full map are then called (their order is not guaranteed).
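GridDhtPartitionsExchangeFuture is internal API; as a rough analogy only, the same "complete once, then run listeners in no particular order" semantics can be sketched with a plain JDK future (names here are illustrative, not Ignite's):

    import java.util.concurrent.CompletableFuture;

    // Rough analogy only (plain JDK, not Ignite's internal API): the exchange future
    // completes once the full partition map for the current exchange is received;
    // listeners registered on it then run, in no guaranteed order.
    class ExchangeFutureAnalogy {
        public static void main(String[] args) {
            CompletableFuture<String> exchangeFuture = new CompletableFuture<>();

            // Listeners registered before the full map arrives.
            exchangeFuture.thenAccept(map -> System.out.println("listener A saw: " + map));
            exchangeFuture.thenAccept(map -> System.out.println("listener B saw: " + map));

            // Receiving the full map corresponds to onDone(): the future completes exactly once.
            exchangeFuture.complete("full partition map");
        }
    }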
Example of a possible partition state for 3 nodes, 3 partitions, 1 backup per partition (a cache configuration that could produce such a layout is sketched after the list):
Node 1: Primary P1 - Owning
Backup P2 - Owning
Node 2: Primary P2 - Owning
Backup P3 - Owning
Node 3: Backup P1 - Owning
Primary P3 - Owning
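Such a layout could come from a cache configured with 3 partitions and 1 backup; a sketch (the cache name is illustrative):

    import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
    import org.apache.ignite.configuration.CacheConfiguration;

    class ExampleCacheConfig {
        // Sketch: a cache with 3 partitions and 1 backup, matching the layout above.
        // The cache name "exampleCache" is illustrative.
        static CacheConfiguration<Integer, String> cacheConfig() {
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("exampleCache");

            // RendezvousAffinityFunction(excludeNeighbors, partitions): 3 partitions in total.
            cfg.setAffinity(new RendezvousAffinityFunction(false, 3));

            // One backup copy for every primary partition.
            cfg.setBackups(1);

            return cfg;
        }
    }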
Each node knows the following:
The server node with the lowest order (the oldest alive node in the topology) is the coordinator (there is always exactly one such node), usually called 'crd' in code.
Let's suppose a new node is now joining (it will be Node 4).
All other nodes know the state of partitions that the coordinator observed some time ago.
The partition mapping is a mapping of the following: node UUID -> partition ID -> state.
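A simplified model of this structure (Ignite's actual internal classes are GridDhtPartitionFullMap and GridDhtPartitionState):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.UUID;

    // Simplified model of the partition map: node UUID -> partition ID -> state.
    class PartitionMapModel {
        enum State { MOVING, OWNING, RENTING, EVICTED }

        // Full map as known to the coordinator (and, with some lag, to the other nodes).
        final Map<UUID, Map<Integer, State>> fullMap = new HashMap<>();

        void put(UUID nodeId, int partId, State state) {
            fullMap.computeIfAbsent(nodeId, id -> new HashMap<>()).put(partId, state);
        }
    }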
Step 1. Discovery event triggers exchange
Topology version is incremented
Affinity generates new partition assignment
For example, the backup of Partition 3 is now assigned to the new Node 4 (to be moved to N4).
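On a running cluster, the new assignment of a partition can be inspected through the public Affinity API; a small sketch (the cache name is illustrative):

    import java.util.Collection;
    import org.apache.ignite.Ignite;
    import org.apache.ignite.cluster.ClusterNode;

    class AffinityInspection {
        // Sketch: print which nodes a partition maps to on the current topology version.
        // The cache name "exampleCache" is illustrative.
        static void printAssignment(Ignite ignite, int partition) {
            Collection<ClusterNode> nodes =
                ignite.affinity("exampleCache").mapPartitionToPrimaryAndBackups(partition);

            // The first node in the collection is the primary; the rest are backups.
            nodes.forEach(n -> System.out.println(n.id()));
        }
    }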
Step 2. Node 4: the backup of Partition 3 is created on the new node in the Moving state:
Node 2: Primary P2 - Owning
Backup P3 - Owning
...
Node 4:
Backup P3 - Moving
Node 2 does not discard its data and does not change the partition state, even though by affinity it no longer owns the partition on the new topology version.
Step 3. Node 4 issues demand requests for cache data to any node that has the partition data.
Step 4. When all data is loaded (the last key is mapped), Node 4 locally changes the partition state to Owning (other nodes still think it is Moving).
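How this demand/supply (rebalancing) behaves is tunable per cache through public configuration; a hedged sketch with illustrative values:

    import org.apache.ignite.cache.CacheRebalanceMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    class RebalanceConfigSketch {
        // Sketch of public rebalance-related settings (cache name and values illustrative).
        static CacheConfiguration<Integer, String> cacheConfig() {
            CacheConfiguration<Integer, String> cfg = new CacheConfiguration<>("exampleCache");

            // SYNC blocks cache operations until rebalancing finishes; ASYNC (the default) does not.
            cfg.setRebalanceMode(CacheRebalanceMode.ASYNC);

            // Size of a single supply (rebalance) message batch, in bytes.
            cfg.setRebalanceBatchSize(512 * 1024);

            return cfg;
        }
    }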
Step 5. After some delay (a timer), Node 4 sends a single message to the coordinator (this message may have an absent/'null' exchange version).
Step 6. The coordinator sends the updated full map to the other nodes.
Step 7. Node 2 observes that its backup of P3 can now be safely removed.
The partition state is changed to Renting locally (other nodes still think it is Owning).
Step 8. After some delay (a timer), Node 2 sends a single message to the coordinator (crd).
Step 9. The coordinator sends the updated full map to the other nodes.
All nodes eventually learn the Renting state of the P3 backup on Node 2.
If more topology updates occur in the meantime, even more cluster nodes may be responsible for the same partition at a particular moment.
Clearing the cache data immediately is not possible because of SQL indexes: these indexes cover all partitions globally. Elements are removed one by one; afterwards the partition state becomes Evicted.
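A purely conceptual sketch of why clearing is gradual (this is not Ignite's actual eviction code; all names are illustrative): every entry must also be unlinked from the global SQL index before the partition can be marked Evicted:

    import java.util.Iterator;
    import java.util.Map;

    // Conceptual sketch only: a Renting partition is drained entry by entry because each
    // entry must also be removed from the global SQL index, which spans all partitions.
    // Only after that is the partition marked Evicted.
    class PartitionEvictionSketch {
        enum State { OWNING, RENTING, EVICTED }

        static <K, V> State drain(Map<K, V> partitionData, Map<K, V> globalSqlIndex) {
            for (Iterator<Map.Entry<K, V>> it = partitionData.entrySet().iterator(); it.hasNext(); ) {
                Map.Entry<K, V> e = it.next();

                globalSqlIndex.remove(e.getKey()); // unlink from the global index first
                it.remove();                       // then drop the entry from the partition
            }

            return State.EVICTED; // partition is empty; state moves from Renting to Evicted
        }
    }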
The DiscoveryEvent is sent to the cluster over the ring (DiscoverySpi).
A synchronous exchange is required when rebalancing a primary partition to a new node.
If such an exchange were not synchronous, it would be possible for 2 primaries to exist in the Owning state: the new node has changed the state locally, while the old node has not received the update yet. In that case lock requests may be sent to different nodes (see also the Transactions section).
As a result, primary migration requires a global synchronous exchange. It must not run at the same time as transactions.
For protection there is the Topology RW Lock.
The RW lock gives the semantics of 'N concurrent transactions, or 1 exchange'.
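A conceptual analogy of these semantics using a plain ReentrantReadWriteLock (an illustration only, not Ignite's actual topology lock implementation):

    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Conceptual analogy for the topology RW lock: many transactions may hold the read
    // lock concurrently, while an exchange requires the exclusive write lock.
    class TopologyLockSketch {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        void runTransaction(Runnable tx) {
            lock.readLock().lock();     // N transactions can be in here at once
            try {
                tx.run();
            }
            finally {
                lock.readLock().unlock();
            }
        }

        void runExchange(Runnable exchange) {
            lock.writeLock().lock();    // exclusive: no transactions run during the exchange
            try {
                exchange.run();
            }
            finally {
                lock.writeLock().unlock();
            }
        }
    }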
Optimisation: centralized (late) affinity assignment, a flag that is true by default in 2.0+ and the only possible option since 2.1+.
Affinity assigns the primary partition (to be migrated) to the new node, but instead a temporary backup is created on this new node.
When the temporary backup has loaded all the actual data, it becomes the primary.
An additional synthetic exchange is then issued.
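Assuming 'centralized affinity' here corresponds to the lateAffinityAssignment flag on IgniteConfiguration (the setter is deprecated once the behaviour became mandatory in 2.1), a minimal sketch for 2.0:

    import org.apache.ignite.configuration.IgniteConfiguration;

    class LateAffinitySketch {
        // Sketch for Ignite 2.0 (assumption: the setLateAffinityAssignment setter; it is
        // deprecated later because the behaviour became the only option since 2.1).
        static IgniteConfiguration config() {
            IgniteConfiguration cfg = new IgniteConfiguration();

            cfg.setLateAffinityAssignment(true); // on by default in 2.0+

            return cfg;
        }
    }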