Authors: Yu Li, Pengfei Li

Introduction

CopyOnWriteSkipListStateMap is an implementation of skip list with copy-on-write support. This data structure is used to store state for off-heap/on-disk in SpillableHeapKeyedStateBackend. Like CopyOnWriteStateMap, copy-on-write support makes it possible to snapshot asynchronously.

Design Overview

                             Picture 1. Structure of CopyOnWriteSkipListStateMap

Structure of CopyOnWriteSkipListStateMap is illustrated in picture-1 which is an array-based implementation. Level 0 is a singly linked list, and other levels are doubly linked lists. Doubly linked list makes it convenient to update level index for remove.

                                                           Picture 2. Structure of Node

The structure of Node is illustrated in picture-2. A Node is logical, and consists of a physical space for key and multiple physical spaces for values. The key in the skip list is a composition of serialized key and serialized namespace, and the value in the skip list is the serialized state. Node may contain multiple versions of values to support copy-on-write, where running snapshots may refer to value with old version.

Fields in the space of key are as follows:

  • int: meta of node
    • byte 0: top level of this node
    • byte 1: status of node
    • byte 2 and byte 3: reserved
  • int: length of key
  • long: a pointer to the space of newest value
  • long: a pointer to the key space of next node on level 0
  • long[]: array of pointers to next node on different levels excluding level 0
  • long[]: array of pointers to previous node on different levels excluding level 0
  • byte[]: data of key

Fields in the space of value are as follows:

  • int: version of this value
  • long: a pointer to the key space of node
  • long: a pointer to the space of next older value
  • int: length of value
  • byte[]: data of value

Design Details

CopyOnWriteSkipListStateMap may be accessed in main thread and multiple snapshot threads concurrently. Main thread will read, insert, update and remove a node, and snapshot threads will read node and remove useless version of values. Main thread always read the latest value, and a snapshot will iterate the linked list on level 0, and read the state belonging to this snapshot. Because snapshot threads only access the level 0 of node, we only need to guarantee the thread safety on level 0.

Insert node

                                                                                                                               Picture 3. Insert node

As shown in picture 3, node3 is inserted between node1 and node2 in below way:

  • set the pointer of node3 to the key space of node2
  • set the pointer of node1 to the key space of node3

Update node

The node already has a value of version 4 and a value of version 2, and is updated with a new value.

                                                    Picture 4. Update node


  • Value of version 4 is not being used by any snapshot, and version 4 can be replaced by the new value
    • set the pointer of new value space to the value space of version 2
    • set the pointer of key space to the new value space
    • free the space of version 4

Picture 4.1 Update node when version 4 is not being used by any snapshot


  • Value of version 4 is being used by one or more snapshots, and version 4 can not be removed, so new value should be inserted before the value of version 4
    • set pointer of new value to the value space of version 4
    • set pointer of key space to the new value space

                                          Picture 4.1 Update node when version 4 is being used by on or more snapshots

Remove node

A node can be removed physically or logically. Node can only be removed physically by main thread when there is no running snapshot, and spaces used by key and all values will be freed after remove. If there are running snapshots, the node may be used by some snapshots, and can only be removed logically (setting the status of node to removed). In the sync part of snapshot, number of logically removed nodes will be checked, and if the number exceeds the threshold, some nodes will be removed physically.

Remove useless values by snapshot

When older snapshots are completed, some older versions of value should be freed which is called prune. A snapshot will check whether older versions of value can still be used by some snapshots when iterating nodes, and prune useless values. Because there may be concurrent snapshots, so we should synchronize when pruning values in a node (a node can be pruned only by one snapshot simultaneously). For this purpose, CopyOnWriteSkipListStateMap will create a concurrent set pruningNodesSet, and a snapshot will check whether a node is in the set before it starts to prune the values. If the set already contains the node, it means some other snapshot is pruning its values and the current snapshot should not prune it again, otherwise the snapshot will put the node into the set and start to prune.

Firstly, let's introduce two rules:

  • highestFinishedSnapshotVersion: a variable maintained in CopyOnWriteSkipListStateMap to indicate that all snapshots whose versions are no more than this value have been completed
  • the value of a node belonged to a snapshot is the one whose version is maximum of all versions less than the snapshot version. For example, a node has version 8, version 4 and version 2, and current snapshot version is 5, so the value of version 4 belongs to this snapshot

When snapshot prunes values in a node, it will iterate values from the newest to the oldest, and if a value whose version is no more than highestFinishedSnapshotVersion is found, the older values than it will not be used by main thread and any snapshot, so they can be freed safely. But the value itself can not be freed for the following two reasons:

  • When highestFinishedSnapshotVersion is 4, and snapshot version is 5, the value of version 3 is found, and the older values than version 3 should be freed. In this case, version 3 is still used by main thread and snapshot 5, so it can not be freed

          

  • When highestFinishedSnapshotVersion is 5, and snapshot version is 9, the value of version 4 is found, then the older values than version 4 should be freed. In this case, version 4 may be still used by snapshot 6, so it can not be freed

          

  • No labels