Status

Current state: ACCEPTED

Discussion thread: https://mail-archives.apache.org/mod_mbox/samza-dev/202106.mbox/%3cCAMja7KdMNU_Zk-vDnwcm4GSJs==126-mU6dJgTsoukzkxzf+qw@mail.gmail.com%3e

JIRA: SAMZA-2657

...

The motivation for introducing a Blob Store as a backend for backup and restore, in addition to Kafka changelogs, is to address the following issues:

  1. Long restore times: State restore from a Kafka-backed changelog topic can be slow because the restore proceeds one message at a time. With a Blob Store as the state restore backend, a state store can be restored in bulk, reducing restore times from hours to the order of minutes.
  2. Automate job and cluster maintenance: Slow restore times also hinder the ability to automate job and cluster maintenance. Because restoring state from the changelog topic can take a long time, tasks such as job auto-scaling, OS upgrades and hardware upgrades need to be planned more carefully and cannot be completely automated. With a fast Blob Store based restore, it is possible to reduce the duration and impact of such maintenance tasks.
  3. Message and store size limits: Moving to a Blob Store backed storage would allow us to write messages of arbitrary size and expand the store to any size.

...

A detailed description of the RocksDB checkpoint process can be found here: Checkpoints in RocksDB.

Proposed Changes

Design overview 

We propose Blob Store based state backup and restore for Samza. The following diagram illustrates a high-level view of the solution:

...

  • Step 1: A synchronous checkpoint call is made to RocksDB that creates an immutable checkpoint directory in the container filesystem (shown in step 1a). We also generate a checkpoint ID (last 6 digits of epoch timestamp in nanoseconds) and build an in-memory data structure called checkpoint to hold the checkpoint ID and relevant metadata.
  • Step 2: We write the checkpoint ID to the checkpoint directory. When the container restarts with host-affinity, we compare the checkpoint ID in the local checkpoint directory with the one in the checkpoint Kafka topic. If they match, the local directory is the latest checkpoint, so we can skip remote restoration and use the local checkpoint instead.
  • Step 3 to step 6: We calculate the delta of this checkpoint directory from the previous checkpoint and upload this delta to the blob store in parallel with processing. The blob store returns a blob ID for each upload. We then create a special blob in the blob store, called the Index blob, that holds these blob IDs (and other metadata). The schema of the Index blob is explained in the section Organization of snapshot in Blob Store. Notice the filesRemoved and subDirectories removed sections: they are maintained for garbage collection and error recovery, as explained in detail in the section Post-commit cleanup and Garbage Collection.
  • Step 6 and 7: We write the blob ID of the index blob to the checkpoint data structure, serialize this data structure and write the serialized checkpoint to the checkpoint topic. 
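The host-affinity check described in step 2 can be sketched as follows. This is only an illustration under stated assumptions: the class `LocalCheckpointCheck`, its methods, and the marker file name `CHECKPOINT_ID` are hypothetical and are not taken from the Samza code base.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

/** Hypothetical helper; class, method, and file names are illustrative, not Samza APIs. */
final class LocalCheckpointCheck {
  // Assumed name of the marker file written into the checkpoint directory in step 2.
  static final String ID_FILE = "CHECKPOINT_ID";

  private LocalCheckpointCheck() {}

  /** Step 2: persist the checkpoint ID inside the local checkpoint directory. */
  static void writeCheckpointId(Path checkpointDir, String checkpointId) {
    try {
      Files.write(checkpointDir.resolve(ID_FILE), checkpointId.getBytes(StandardCharsets.UTF_8));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads the ID back, or returns empty if the directory has no marker file. */
  static Optional<String> readCheckpointId(Path checkpointDir) {
    Path idFile = checkpointDir.resolve(ID_FILE);
    if (!Files.isRegularFile(idFile)) {
      return Optional.empty();
    }
    try {
      return Optional.of(new String(Files.readAllBytes(idFile), StandardCharsets.UTF_8).trim());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** True only when the local ID equals the ID read from the checkpoint topic. */
  static boolean canSkipRemoteRestore(Path checkpointDir, String idFromCheckpointTopic) {
    return readCheckpointId(checkpointDir)
        .map(localId -> localId.equals(idFromCheckpointTopic))
        .orElse(false);
  }
}
```

A missing marker file is treated the same as a mismatch, so in this sketch a partially written local directory always falls back to remote restoration.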

...

  1. All the files corresponding to a checkpoint are uploaded in parallel to the Blob Store. Blob Store returns a blob ID associated with every file. Note that large files (multiple GBs) can be chunked and uploaded, and can have one blob ID per chunk. Thus, a file has one or more blob IDs associated with it. 
  2. We organize and index the blobs associated with a checkpoint by creating an Index blob in the Blob Store that holds the file name, blob IDs and metadata of all the files associated with a checkpoint. The schema defined below shows the fields in the Index blob. Note: The filesRemoved section in the schema helps in garbage collection. Please see the section Post-commit cleanup and Garbage Collection for details.
  3. Retrieving an Index blob is necessary and sufficient to rebuild a snapshot, since the Index blob contains the blob IDs and metadata of every file associated with a checkpoint directory.

...

At the end, using the blob IDs of the newly uploaded and deleted files, we create a new Index blob, serialize it and upload it. This concludes the incremental backup.

Incremental restore is the reverse of the incremental backup outlined above. We treat the Checkpoint topic as the source of truth: we read the Checkpoint from the Checkpoint topic, take the Index blob ID from it, and retrieve the Index blob from the Blob Store. The Index blob contains lists called filesAdded and filesRemoved. We retrieve the blobs in the filesAdded list from the Blob Store and add them to the checkpoint directory. Similarly, we delete the files in the filesRemoved list from the local checkpoint directory. For files present in both the remote and local checkpoint directories, we compare their checksum, size and permission attributes to check whether they are the same file. If a file’s content has changed since the remote checkpoint was created, we delete the local file and restore the file from the remote checkpoint.
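The comparison between the local and remote checkpoint directories can be sketched as a small planning function. This is a simplification under stated assumptions: it models the remote checkpoint as a flat map from file name to metadata rather than the actual Index blob schema, and `RestorePlanner`, `FileMeta` and `Plan` are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

/** Hypothetical sketch; not the Samza implementation. */
final class RestorePlanner {
  /** Attributes compared between the local and remote copy of a file. */
  static final class FileMeta {
    final long size;
    final String checksum;
    final String permissions;

    FileMeta(long size, String checksum, String permissions) {
      this.size = size;
      this.checksum = checksum;
      this.permissions = permissions;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof FileMeta)) {
        return false;
      }
      FileMeta other = (FileMeta) o;
      return size == other.size
          && Objects.equals(checksum, other.checksum)
          && Objects.equals(permissions, other.permissions);
    }

    @Override
    public int hashCode() {
      return Objects.hash(size, checksum, permissions);
    }
  }

  /** File names to fetch from the Blob Store and to delete from the local directory. */
  static final class Plan {
    final List<String> toDownload = new ArrayList<>();
    final List<String> toDeleteLocally = new ArrayList<>();
  }

  private RestorePlanner() {}

  static Plan plan(Map<String, FileMeta> remote, Map<String, FileMeta> local) {
    Plan plan = new Plan();
    // Files in the remote checkpoint: download if missing locally, replace if changed.
    for (String name : new TreeSet<>(remote.keySet())) {
      FileMeta localMeta = local.get(name);
      if (localMeta == null) {
        plan.toDownload.add(name);
      } else if (!localMeta.equals(remote.get(name))) {
        plan.toDeleteLocally.add(name);
        plan.toDownload.add(name);
      }
    }
    // Files that exist only locally are not part of the remote checkpoint.
    for (String name : new TreeSet<>(local.keySet())) {
      if (!remote.containsKey(name)) {
        plan.toDeleteLocally.add(name);
      }
    }
    return plan;
  }
}
```

Files whose attributes match in both places appear in neither list, which is what makes the restore incremental: only missing or changed files are transferred.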

Post-Commit cleanup and Garbage Collection

...

Blob Stores like Azure and AWS S3 allow every blob to have a TTL associated with it. Blobs are garbage collected at the end of the TTL. Additionally, we expose an API in the BlobStoreManager interface called REMOVE_TTL that updates a blob’s TTL so that it never expires. A second call to REMOVE_TTL on the same blob is a no-op. We use these facts to ensure that garbage or untracked blobs are cleaned up in the Blob Store.

  1. As part of the commit, when we upload a file for checkpointing, we create a blob with a TTL of 30 days. This ensures that if the commit sequence fails at any step, the blobs are automatically garbage collected at the end of the TTL.
  2. Files are not deleted immediately; they are kept until the commit is complete in case they are needed for rollback or otherwise. Instead, any files to be deleted are added to the filesRemoved section of the Index blob schema, as explained in this section.
  3. The commit completes after the checkpoint ID is written to the checkpoint topic. At this point, we update all the blobs created in that commit sequence to never expire by sending a REMOVE_TTL request to the Blob Store. We also send the delete requests at the end of the commit phase.
  4. Whenever a job restores, we perform two operations as part of initialization:
    1. Send REMOVE_TTL for all the blobs in the Index blob so that they never expire.
    2. Send delete requests for all the blobs in the cleanup section of the Index blob.
  5. Step 4 ensures that if a job fails in the post-commit phase, the blobs it committed are retained rather than garbage collected, and that blobs marked for deletion are not left behind as garbage in the Blob Store. Both operations have no effect if they completed successfully earlier.
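The TTL-based cleanup above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the BlobStoreManager interface: `TtlBlobStoreSketch` and its methods are hypothetical, and a real Blob Store performs expiry on its own rather than through an explicit `expire` call.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of a Blob Store with per-blob TTLs. */
final class TtlBlobStoreSketch {
  static final long NEVER = Long.MAX_VALUE;

  // Blob ID mapped to its expiry time in epoch milliseconds.
  private final Map<String, Long> expiryByBlobId = new HashMap<>();
  private long nextId = 0;

  /** Uploads a blob with a TTL, as done during the commit, and returns its blob ID. */
  String put(long nowMillis, long ttlMillis) {
    String blobId = "blob-" + nextId++;
    expiryByBlobId.put(blobId, nowMillis + ttlMillis);
    return blobId;
  }

  /** Models REMOVE_TTL: the blob never expires. Repeating the call changes nothing. */
  void removeTtl(String blobId) {
    expiryByBlobId.computeIfPresent(blobId, (id, expiry) -> NEVER);
  }

  /** Deletes a blob. Deleting an already deleted blob is a no-op. */
  void delete(String blobId) {
    expiryByBlobId.remove(blobId);
  }

  /** Models the store's own garbage collection of blobs whose TTL has elapsed. */
  void expire(long nowMillis) {
    expiryByBlobId.values().removeIf(expiry -> expiry <= nowMillis);
  }

  boolean contains(String blobId) {
    return expiryByBlobId.containsKey(blobId);
  }
}
```

In this model, a blob uploaded by a commit that never completes still carries its TTL and disappears once the TTL elapses, while a blob whose commit completed survives because REMOVE_TTL was applied. Since both `removeTtl` and `delete` are idempotent, repeating them during restore initialization is safe.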

...