Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Authors: Yu Li, Pengfei Li

Introduction

CopyOnWriteSkipListStateMap is an implementation of skip list with copy-on-write support. This data structure is used to store state for off-heap/on-disk in SpillableHeapKeyedStateBackend. Like CopyOnWriteStateMap, copy-on-write support makes it possible to snapshot asynchronously.

...

A node can be removed physically or logically. Node can only be removed physically by main thread when there is no running snapshot, and spaces used by key and all values will be freed after remove. If there are running snapshots, the node may be used by some snapshots, and can only be removed logically , that’s just set (setting the status of node to removed). In the sync part of snapshot, number of logically removed nodes will be checked, and if the number exceeds the threshold, some nodes will be removed physically.

...

When older snapshots are completed, some older versions of value should be freed which is called prune. A snapshot will check whether older versions of value can still be used by some snapshots when iterating nodes, and prune useless values. Because there may be concurrent snapshots, so we should synchronize when pruning values in a node , that’s (a node can be pruned only by one snapshot at the mean timesimultaneously). For this purpose, CopyOnWriteSkipListStateMap will create a concurrent set pruningNodesSet,and pruningNodesSet, and a snapshot will check whether a node is in the set before it starts to prune the values. If the set already contains the node, that’s it means some other snapshot is pruning the its values , and the current snapshot should not prune it again, otherwise the snapshot will put the node to into the set and start to prune values.

Firstly, we let's introduce two rules:

  • highestFinishedSnapshotVersion: a variable maintained in CopyOnWriteSkipListStateMap to indicate that all snapshots whose versions are no more than this value have been completed
  • the value of a node belonged to a snapshot is the one whose version is maximum of all versions less than the snapshot version. For example, a node has version 8, version 4 and version 2, and current snapshot version is 5, so the value of version 4 belongs to this snapshot

When snapshot prunes values in a node, it will iterate values from the newest to the oldest, and if a value whose version is no more than highestFinishedSnapshotVersion is found, the older values than this value it will not be used by main thread and any snapshot, so they can be freed safely. But the value itself can not be freed for the following two reasons:

...

  • When highestFinishedSnapshotVersion is 5, and snapshot version is 9, the value of version 4 is found, and then the older values than version 4 should be freed. In this case, version 4 may be still used by snapshot 6, so it can not be freed

...