...

Page properties

Discussion thread: ...

Vote thread: ...

JIRA: ...

Release: 1.4


Motivation

The current architecture around the BLOB server and cache components seems rather patched up and has some issues regarding concurrency ([FLINK-6380]), cleanup, and API inconsistencies / currently unused API ([FLINK-6329], [FLINK-6008]). These issues make future integration with FLIP-6 or extensions such as offloading oversized RPC messages ([FLINK-6046]) difficult. We therefore propose an improvement to the current architecture, described below, which tackles these issues, provides some cleanup, and enables further BLOB server use cases.

Contents


Public Interfaces

The proposed changes mainly affect the back-end and are not user-facing.
Currently, we also do not plan any changes to the configuration or the monitoring information, except for:

...

[Gliffy diagram: blob-store-architecture]

BlobServer

  • offers file upload and download facilities based on jobId and BlobKey
  • local store (file system): read/write access, using "<path>/<jobId>/<BlobKey>"
  • HA store: read/write access for high availability, using "<path>/<jobId>/<BlobKey>" (see the path sketch after this list)
  • responsible for cleanup of local and HA storage
  • upload to local store, then to HA (possibly in parallel, but waiting for both to finish before acknowledging)
  • downloads will be served from local storage only
  • on recovery (HA): download needed files from the HA store to the local store and take cleanup responsibility for all other, i.e. orphaned, files on the path as well (see below)
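
As an illustration of this layout, the following is a minimal Java sketch of how a BLOB's path could be derived from a storage root, the jobId and the BlobKey. The class name, method and example paths are hypothetical and only mirror the "<path>/<jobId>/<BlobKey>" scheme above; they are not Flink's actual implementation.

import java.io.File;

/**
 * Hypothetical helper mirroring the layout above: both the local store and the
 * HA store place a BLOB under "<path>/<jobId>/<BlobKey>".
 */
public class BlobStorageLayoutSketch {

    /** Builds "<basePath>/<jobId>/<blobKey>". */
    public static File blobPath(File basePath, String jobId, String blobKey) {
        return new File(new File(basePath, jobId), blobKey);
    }

    public static void main(String[] args) {
        File localBase = new File("/tmp/blobStore-local");    // assumed local store root
        File haBase = new File("/recovery/blobStore-ha");     // assumed HA store root (e.g. on a DFS)
        String jobId = "1b5c9e6d2a";                          // placeholder job ID
        String blobKey = "blobKey-0af3";                      // placeholder BLOB key

        System.out.println("local: " + blobPath(localBase, jobId, blobKey));
        System.out.println("HA:    " + blobPath(haBase, jobId, blobKey));
    }
}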

...

Note that several tasks running on the same TaskManager may use BLOB files of the same job!

BlobServer

All unused BLOB files stored at the BlobServer should also be cleaned up periodically and not just when the BlobServer shuts down (as is the case in Flink 1.3).

  • all blobs are ref-counted, starting from the initial upload
  • the job-specific BLOB sub-directory ("<path>/<jobId>") is not ref-counted (it may be, but this is not necessary here)
  • two types of BLOB lifecycle guarantees: HA (retain for recovery) and non-HA (re-creatable files - not necessary for recovery)
  • if a job fails, all its BLOBs' references are counted down appropriately (if possible): non-HA files' refCounts are reset to 0; all HA files' refCounts remain and will not be increased again on recovery
  • if a job enters a final state, i.e. finished or cancelled, the job-specific BLOB subdirectory ("<path>/<jobId>") and all its BLOBs are deleted immediately and removed from ref-counting (regardless of their actual ref-count!)
  • if the reference count reaches 0, the BLOB enters the staged cleanup (see above; a ref-counting sketch follows this list)
  • all blobs should be deleted when the BlobServer exits
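
The following is a minimal sketch of these ref-counting rules; the class, its fields and the cleanup placeholders (moveToStagedCleanup, deleteJobDirectory) are made up for illustration and do not reflect the actual BlobServer code.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative ref-counting as described in the list above; not the actual BlobServer API. */
public class BlobRefCountingSketch {

    /** "<jobId>/<BlobKey>" -> reference count */
    private final Map<String, Integer> refCounts = new ConcurrentHashMap<>();

    /** Called for every new reference, starting with the initial upload. */
    public synchronized void increment(String jobIdAndKey) {
        refCounts.merge(jobIdAndKey, 1, Integer::sum);
    }

    /** Called when a reference is released; a count of 0 triggers the staged cleanup. */
    public synchronized void decrement(String jobIdAndKey) {
        Integer remaining = refCounts.computeIfPresent(jobIdAndKey, (key, count) -> count - 1);
        if (remaining != null && remaining <= 0) {
            refCounts.remove(jobIdAndKey);
            moveToStagedCleanup(jobIdAndKey);
        }
    }

    /** Final job state: delete "<path>/<jobId>" immediately, regardless of the actual counts. */
    public synchronized void onJobFinishedOrCancelled(String jobId) {
        refCounts.keySet().removeIf(key -> key.startsWith(jobId + "/"));
        deleteJobDirectory(jobId);
    }

    private void moveToStagedCleanup(String jobIdAndKey) {
        // placeholder: hand the BLOB over to the staged cleanup described earlier
    }

    private void deleteJobDirectory(String jobId) {
        // placeholder: remove "<path>/<jobId>" from local (and, for final states, HA) storage
    }
}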

...

BlobCache Download

When a file for a given jobId and BlobKey is requested, the BlobCache will first try to serve it from its local store (after a successful checksum verification). If it is not available there or the checksum does not match, the file will be copied from the HA store (if available) to the local store. If this fails or no HA store is configured, a fallback direct download from the BlobServer to the local store is used, via a connection established and managed by the BlobClient. During the transfer, these files are put into a temporary directory and only moved to the job-specific path once they are completely transferred and checksum-verified. This may lead to multiple (concurrent) downloads of the same file, but ensures that no incomplete file is ever served. As an optimisation, the BlobCache may prevent such multiple downloads.
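
A minimal sketch of this lookup order follows; all helper names (localPath, checksumMatches, copyFromHaStore, downloadFromBlobServer) are hypothetical placeholders for the operations described above, not the actual BlobCache/BlobClient API.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

/** Illustrative BlobCache lookup order as described above; names are hypothetical. */
public class BlobCacheDownloadSketch {

    File getFile(String jobId, String blobKey) throws IOException {
        File localFile = localPath(jobId, blobKey);

        // 1) serve from the local store if present and checksum-verified
        if (localFile.exists() && checksumMatches(localFile, blobKey)) {
            return localFile;
        }

        // incomplete transfers go to a temporary file first
        localFile.getParentFile().mkdirs();
        File tempFile = File.createTempFile("blob-", ".tmp", localFile.getParentFile());

        // 2) otherwise copy from the HA store, if one is configured
        if (haStoreAvailable() && copyFromHaStore(jobId, blobKey, tempFile)
                && checksumMatches(tempFile, blobKey)) {
            return publish(tempFile, localFile);
        }

        // 3) fall back to a direct download from the BlobServer via the BlobClient
        downloadFromBlobServer(jobId, blobKey, tempFile);
        if (!checksumMatches(tempFile, blobKey)) {
            throw new IOException("checksum mismatch for " + blobKey);
        }
        return publish(tempFile, localFile);
    }

    /** Only completely transferred, verified files are moved to the job-specific path. */
    private File publish(File tempFile, File localFile) throws IOException {
        Files.move(tempFile.toPath(), localFile.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return localFile;
    }

    // --- placeholders for the operations described in the text ---
    private File localPath(String jobId, String blobKey) { return new File("/tmp/blobs/" + jobId, blobKey); }
    private boolean checksumMatches(File file, String blobKey) { return true; }
    private boolean haStoreAvailable() { return false; }
    private boolean copyFromHaStore(String jobId, String blobKey, File target) { return false; }
    private void downloadFromBlobServer(String jobId, String blobKey, File target) { }
}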

BlobServer Upload

While user jars are being uploaded, the corresponding job is not submitted yet and we cannot bind the jar files to a non-existing job's lifecycle. We also cannot upload each file individually with a reference count of 0, or some files may already be deleted by the time the job is started. Instead, we will upload all jar files together and, only after receiving the last one, put all of them into the staging list of the staged cleanup. The job then needs to be submitted within "blob.retention.interval" seconds, otherwise we cannot guarantee that the jar files still exist. This ensures a proper cleanup if the client aborts/crashes between the upload and the job submission.

Files are first uploaded to the local store and then transferred to the HA store. The latter may be optimised to run in parallel but we may only acknowledge the upload once both are written (if HA is configured).
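
A minimal sketch of this upload flow follows. The class and all helper names (writeToLocalStore, writeToHaStore, addToStagedCleanup) are assumptions made for illustration; the sketch only shows the ordering described above, not the actual BlobServer/BlobClient API.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Illustrative upload flow as described above; class and method names are hypothetical. */
public class BlobUploadSketch {

    /**
     * Uploads all user jars of a not-yet-submitted job. Only after the last file has been
     * received are all of them put into the staged cleanup, so that the client has the full
     * "blob.retention.interval" to submit the job before the jars may be deleted.
     */
    List<String> uploadUserJars(String jobId, List<File> jars) {
        List<String> blobKeys = new ArrayList<>();
        for (File jar : jars) {
            blobKeys.add(storeBlob(jobId, jar)); // refCount stays at 0 for now
        }
        for (String blobKey : blobKeys) {
            addToStagedCleanup(jobId, blobKey);  // cleaned up if the job is never submitted in time
        }
        return blobKeys;
    }

    /** Stores a single BLOB: local store first, then HA store; acknowledge only after both. */
    String storeBlob(String jobId, File file) {
        String blobKey = writeToLocalStore(jobId, file);
        if (highAvailabilityEnabled()) {
            // may be parallelised with the local write, but must complete before the acknowledgement
            writeToHaStore(jobId, blobKey, file);
        }
        return blobKey; // returning the key acts as the acknowledgement to the client
    }

    // --- placeholders for the operations described in the text ---
    private String writeToLocalStore(String jobId, File file) { return "blobKey-placeholder"; }
    private void writeToHaStore(String jobId, String blobKey, File file) { }
    private boolean highAvailabilityEnabled() { return false; }
    private void addToStagedCleanup(String jobId, String blobKey) { }
}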

BlobServer Download

Similarly to the BlobCache, we will first try to serve a file from the local store (checksum-verified) and, if it does not exist, create a local copy from the HA store (if available) - see above.

BlobServer Recovery (HA)

During recovery, the JobManager (or the Dispatcher for FLIP-6) will:

  • fetch all jobs to recover
  • download their BLOBs lazily and increase reference counts appropriately (at the JobManager only after successful job submission)
  • put any other, i.e. orphaned, file in the configured storage path into staged cleanup (see the sketch after this list)
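
A minimal sketch of these recovery steps, assuming hypothetical helpers for lazy registration and staged cleanup; it is not the actual JobManager/Dispatcher code.

import java.util.Set;

/** Illustrative HA recovery as described in the list above; names are hypothetical. */
public class BlobRecoverySketch {

    void recover(Set<String> jobIdsToRecover, Set<String> jobIdsFoundInStoragePath) {
        for (String jobId : jobIdsToRecover) {
            // BLOBs are downloaded lazily; reference counts are increased only after
            // the job has been successfully (re-)submitted at the JobManager
            registerForLazyDownload(jobId);
        }

        // every job directory in the storage path that does not belong to a recovered job
        // is treated as orphaned and handed to the staged cleanup
        for (String jobId : jobIdsFoundInStoragePath) {
            if (!jobIdsToRecover.contains(jobId)) {
                addToStagedCleanup(jobId);
            }
        }
    }

    private void registerForLazyDownload(String jobId) { }
    private void addToStagedCleanup(String jobId) { }
}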

Use Cases Details

Jar files

User-code jar files are uploaded by the job submission client before submitting a job. After successfully uploading all jars, the job is submitted and the JobManager/Dispatcher will increase the reference count at the BlobServer by 1. It will be decreased when the job enters a final state, in which case the <jobId> directory will be deleted anyway. The BlobCache only needs to reference-count the jars in its local store; no further interaction is needed.

RPC Messages

An RPC message may be off-loaded into the BlobServer during job submission or at any time in a job's lifecycle. It may be sent to multiple receivers. In contrast to jar files, we expect messages to be re-creatable, i.e. in case of a recovery, we do not necessarily need these BLOBs to be available and effectively only use the HA store to leverage its distributed file system. We therefore use the BlobServer's non-HA lifecycle guarantee for these.

For a message within a job's lifecycle, we want to be able to delete (temporary) messages once all receivers have successfully downloaded them. Therefore, during message upload, an offloading-capable RpcService will set the reference counter to the number of receivers. Upon successful download and deserialisation of the message on the receiver's side, it will not only reduce the BlobCache's refCount but also acknowledge towards an RpcService at the BlobServer that the message has been received. This will reduce the BlobServer's refCount and eventually lead to the message being deleted. If the job fails, all reference counts are reset to 0 and these files thus become subject to staged cleanup.
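
A minimal sketch of this sender/receiver interaction; the class and all helper names are hypothetical and merely mirror the ref-counting and acknowledgement described above.

/** Illustrative handling of off-loaded RPC messages as described above; names are hypothetical. */
public class OffloadedRpcMessageSketch {

    /** Sender side: upload the message once and set the count to the number of receivers. */
    String offloadMessage(String jobId, byte[] serializedMessage, int numberOfReceivers) {
        String blobKey = uploadNonHaBlob(jobId, serializedMessage); // non-HA lifecycle: re-creatable
        setReferenceCount(jobId, blobKey, numberOfReceivers);
        return blobKey; // the RPC message only carries this key instead of the payload
    }

    /** Receiver side: after a successful download and deserialisation, acknowledge receipt. */
    void onMessageReceived(String jobId, String blobKey) {
        decrementBlobCacheCount(jobId, blobKey); // local BlobCache ref count
        acknowledgeToBlobServer(jobId, blobKey); // BlobServer decrements; at 0 the BLOB enters staged cleanup
    }

    // --- placeholders for the operations described in the text ---
    private String uploadNonHaBlob(String jobId, byte[] data) { return "blobKey-placeholder"; }
    private void setReferenceCount(String jobId, String blobKey, int count) { }
    private void decrementBlobCacheCount(String jobId, String blobKey) { }
    private void acknowledgeToBlobServer(String jobId, String blobKey) { }
}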

Special handling is required for an off-loaded job submission message: if we set the refCount to 1 immediately, we would not have a safety net if the job was never submitted. Therefore, we will use the same technique as for the jar files and upload with an initial refCount of 0, so that, if the job submission RPC message itself (pointing to the BLOB) arrives within "blob.retention.interval" seconds, we can guarantee that the BLOB still exists.

Log Files

Log files are currently only used by the Web-UI to show TaskManager logs. They are downloaded upon request and served afterwards. Each download should decrease the previous log's refCount by 1 and increase the new log's refCount by 1. Logs have non-HA lifecycle guarantees and may even be deleted immediately instead of being put into staged cleanup.
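
A minimal sketch of the per-download ref-count swap described above; the class and helpers are hypothetical.

/** Illustrative ref-count handling for TaskManager log BLOBs; names are hypothetical. */
public class LogBlobSketch {

    private String previousLogBlobKey; // log version currently referenced for the Web-UI

    /** Each new download releases the previous log version and references the new one. */
    void onLogDownloaded(String newLogBlobKey) {
        if (previousLogBlobKey != null) {
            decrement(previousLogBlobKey); // non-HA lifecycle: the old log may even be deleted immediately
        }
        increment(newLogBlobKey);
        previousLogBlobKey = newLogBlobKey;
    }

    private void increment(String blobKey) { }
    private void decrement(String blobKey) { }
}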

As an optimisation, instead of transmitting the same log parts over and over again, we may support uploading log file partitions, i.e. bytes xxxx-yyyy, as BLOBs and use them in the WebUI. This is, however, agnostic to the BLOB store and is supported by the architecture above.

Compatibility, Deprecation, and Migration Plan

...