...

Implementation and Test Plan

Phase | Implementation
Phase 1
  1. Implement Blob Store Snapshot structure: Directory, File, Snapshot index schemas. 
  2. Implement diff algorithm to calculate backup and restore delta.
  3. Implement Util classes that abstract Put/Get/Delete at the File and Directory level.
Phase 2
  1. Implement backup and restore classes for Blob Store
  2. Implement and introduce metrics for backup and restore flows.
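
The diff step in Phase 1 can be illustrated with a minimal sketch. It assumes a snapshot can be reduced to a map from relative file path to content checksum; `SnapshotDiff` and its fields are hypothetical names for illustration, not the actual Samza classes, and the real index schema may carry more metadata.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: delta between a local file listing and a remote snapshot listing. */
public final class SnapshotDiff {
    public final Set<String> toUpload = new TreeSet<>(); // new or changed locally
    public final Set<String> toDelete = new TreeSet<>(); // present only in the remote snapshot
    public final Set<String> toRetain = new TreeSet<>(); // identical on both sides

    /** Both maps are assumed to map relative file path -> content checksum. */
    public static SnapshotDiff between(Map<String, Long> local, Map<String, Long> remote) {
        SnapshotDiff diff = new SnapshotDiff();
        for (Map.Entry<String, Long> entry : local.entrySet()) {
            Long remoteChecksum = remote.get(entry.getKey());
            // equals(null) is false, so files missing remotely are uploaded.
            if (entry.getValue().equals(remoteChecksum)) {
                diff.toRetain.add(entry.getKey());
            } else {
                diff.toUpload.add(entry.getKey());
            }
        }
        for (String path : remote.keySet()) {
            if (!local.containsKey(path)) {
                diff.toDelete.add(path);
            }
        }
        return diff;
    }
}
```

Under the same assumption, swapping the arguments gives the restore direction: the files to fetch from the remote snapshot and the local files to discard.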


Test Scope | Test Scenarios
Flows to test
  1. Backup of a new store
  2. Restore on a new host
  3. Incremental backup/restore of a store
  4. Cleanup for a new container/job 
  5. Cleanup called after a failed container restart
  6. Cleanup called during commit sequence
  7. Partial failure of backup in different stages of commit lifecycle (commit, upload, persist, cleanup).
    1. Ensure that the partial state is cleaned up.
  8. Failure scenarios of restore in different stages of commit lifecycle (commit, upload, persist, cleanup).
    1. Ensure that a container restarting on the same host after partial failure can cleanup/recover and restore correctly (does not get stuck).
  9. Failure scenarios of blob store (missing index blobs, file blobs, retry etc.)
  10. Test that the cleanup/commit sequence during TaskInstance init works with transactional Kafka and blob store.
End-to-end testing
  1. Test that checkpoint read/write version configurations are validated before job launch.
  2. Test upgrade and rollback compatibility for Samza versions with and without the blob store backend.
  3. How do we enforce that backup and restore managers aren't both enabled for the first deployment?
BlobStoreTaskBackupManager


  1. init 

    1. init with checkpoint V1

    2. init with no/null checkpoint 

    3. Test init cleans up unused stores correctly.

  2. upload

    1. No previous checkpoint (first upload)

    2. Previous checkpoint passed during init (subsequent upload)

    3. Test upload handles logged / non-logged / persistent / durable stores correctly (document expectation here). 

    4. Test upload calculates diff from previous checkpoint correctly (during initial start and during post-startup commits)

    5. Test upload returns snapshot blob id and records previous snapshot blob id in the snapshot correctly.

  3. cleanup

    1. Test Cleanup removes TTL of remote snapshot and associated files

    2. Test Cleanup deletes old remote snapshot

    3. Test Cleanup deletes files/subdirs to remove from current checkpoint

    4. Test Cleanup cleans stores removed from config

    5. Test Cleanup after a failed container/job restart

BlobStoreRestoreManager
  1. init

    1. Test init fails for checkpoint V1.

    2. Test init works for no/null checkpoint.

    3. Test init returns blob store backend store SCMs if present in checkpoint.

    4. Test util method to get snapshot indexes from checkpoint.

    5. Test container fails to start with meaningful error message if init fails.

  2. restore

    1. Test that restore restores to the correct store directory depending on store type.

    2. Test that it ignores any files that are not present when upload is called (e.g. offset files).

    3. Test restore handles logged / non-logged / durable / persistent stores correctly.

    4. Test logic for checking if checkpoint directory is identical to remote snapshot.

    5. Test restore handles stores with missing SCM in checkpoint correctly.

    6. Test restore handles multiple stores correctly.

    7. Test restore always deletes main store dir.

    8. Test restore uses previous checkpoint directory if identical to remote snapshot.

    9. Test restore restores from remote snapshot if no previous checkpoint dir.

    10. Test restore restores from remote snapshot if checkpoint dir not identical to remote snapshot.

    11. Test restore recreates subdirs correctly.

    12. Test restore recreates recursive subdirs correctly.

    13. Test restore creates empty files correctly.

    14. Test restore creates empty dirs correctly.

    15. Test restore creates empty sub-dirs / recursive subdirs correctly.

    16. Test restore restores multi-part file contents completely and in correct order.

    17. Test restore verifies checksum for files restored if enabled.
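
The multi-part ordering and checksum checks at the end of the restore list can be pictured with a small sketch. It assumes parts arrive as in-memory byte arrays and that the checksum is CRC32; the plan does not say which checksum is used, and `MultiPartRestore` is a hypothetical name, not a Samza class.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;
import java.util.zip.CRC32;

/** Hypothetical sketch: reassemble a multi-part file and verify its checksum. */
public final class MultiPartRestore {

    /** Concatenates parts in list order; throws if the CRC32 of the result differs from the expected value. */
    public static byte[] assemble(List<byte[]> partsInOrder, long expectedCrc32) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        CRC32 crc = new CRC32();
        for (byte[] part : partsInOrder) {
            out.write(part, 0, part.length);
            crc.update(part, 0, part.length);
        }
        if (crc.getValue() != expectedCrc32) {
            throw new IllegalStateException(
                "Checksum mismatch: expected " + expectedCrc32 + " but got " + crc.getValue());
        }
        return out.toByteArray();
    }

    /** Helper for tests: CRC32 of a complete byte array. */
    public static long crc32Of(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }
}
```

A test along these lines would feed the parts in the wrong order, or drop one, and expect the checksum check to reject the result.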

BlobStoreStateBackendUtil
  1. Test throws exception for checkpoint V1.

  2. Test no-op for null / empty checkpoint.

  3. Test works correctly for missing blob store backend factory entry.

  4. Test works correctly for missing blob store backend factory store entry.

  5. Test throws exception on sync and async blob store errors.

  6. Test gets the right blob ID from remote store.

  7. Test returns the correct pair of SCM and snapshot index.

  8. Test blocks once at the end for all futures instead of blocking for each store.
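
The "block once at the end" expectation in the last item can be sketched with `CompletableFuture.allOf`. This is an illustration under assumptions: `FutureMaps.toMapFuture` is a hypothetical helper, not the utility actually under test, and the per-store futures are assumed to be keyed by store name.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

/** Hypothetical sketch: combine per-store futures so the caller blocks only once. */
public final class FutureMaps {

    /** Returns a future of all results; this method itself never blocks. */
    public static <K, V> CompletableFuture<Map<K, V>> toMapFuture(Map<K, CompletableFuture<V>> futures) {
        CompletableFuture<Void> all =
            CompletableFuture.allOf(futures.values().toArray(new CompletableFuture[0]));
        return all.thenApply(ignored -> {
            Map<K, V> results = new HashMap<>();
            // Every future has completed successfully here, so join() returns immediately.
            futures.forEach((key, future) -> results.put(key, future.join()));
            return results;
        });
    }
}
```

The caller would then call `join()` once on the combined future instead of once per store; if any input future fails, the combined future completes exceptionally.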

Concurrency and Retries
  1. Test CompletableFutureUtil methods.
  2. Test that all operations use an explicit and expected executor (no default executor).
  3. Verify future composition (allOf, toMap, etc.) and blocking (individual vs collected vs nonblocking) for all async methods.
  4. Verify that there is no blocking on caller threads. Document and justify exceptions (e.g. restore thread).
  5. Test BlobStoreManager Impl/BlobStoreUtil error handling and retries.
    1. Test CompletionException unwrapping to identify actual cause.
    2. Test callback order (par/seq dep graph) for all chained operations.
    3. Test async retriable exceptions are transformed correctly.
    4. Test/verify put / get / delete futures always complete (handle sync / async errors correctly).
    5. Test retries for get / put create new input / output streams.
    6. Test error handling for get (sync, future, callback errors).
  6. Test TaskInstance commit flow.
    1. Test async commit stage fails if upload/checkpoint write/cleanup fails.
    2. Verify all async stage operations execute on a separate threadpool.
    3. Test async commit succeeds and unblocks future commits if all async operations succeed.
    4. Test async commit stage fails if any async operations failed.
    5. Verify async commit stage operations are chained correctly.
    6. Test exceptions in async commit stage are propagated to next sync commit stage.
    7. Test sync commit fails if a previous async commit fails.
    8. Test commit skips if previous async commit in progress and < max delay.
    9. Test commit blocks if previous async commit in progress and > max delay.
    10. Test that sync commit times out if previous async commit does not complete within max commit delay.
  7. Test BackupManager/RestoreManager flow
    1. Test all async stage operations execute on a separate threadpool.
    2. Verify/Test error propagation, handling and operation chaining (par/seq dep graph).
    3. Verify/Test timeouts for blocking operations. Document and justify blocking operations.
    4. Test handling of retriable / ignorable (410s) / unrecoverable errors.
    5. Verify/Test idempotency of cleanup / delete / ttl operations.
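
The exception-unwrapping scenario under error handling and retries can be illustrated as follows. This is a sketch, not the helper under test: `FutureErrors.unwrap` is a hypothetical name, and it assumes the actual cause is whatever remains after peeling off `CompletionException` and `ExecutionException` wrappers.

```java
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutionException;

/** Hypothetical sketch: find the actual cause behind future wrapper exceptions. */
public final class FutureErrors {

    /** Strips CompletionException / ExecutionException layers that carry a cause. */
    public static Throwable unwrap(Throwable t) {
        Throwable current = t;
        while ((current instanceof CompletionException || current instanceof ExecutionException)
                && current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }
}
```

Retry and ignorable-error decisions (for example, treating a 410 as ignorable) would then be made on the unwrapped cause rather than on the wrapper type.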

Compatibility, Deprecation, and Migration Plan

...