...

An adjunct data store is particularly useful for storing data delivered from a database through a change data capture (CDC) system. It is important to understand the capabilities and limitations of this feature. In general, a best-effort attempt is made to provide consistency at different levels without sacrificing too much in other respects. The sections below discuss the consistency guarantees provided for unbounded and bounded datasets. Note that the solution for bounded datasets is not included in the scope of this proposal, so later changes are quite possible.

Unbounded dataset

When the content of a database is delivered by a CDC system, the size of the dataset (i.e. the database) is limited, but its changes can continue forever; the stream is therefore unbounded from the receiver's perspective. The concept of versioning isn't necessary here. However, it is important for the adjunct data store to be able to provide a snapshot of the database, and this can be achieved as long as the CDC system provides ordering and at-least-once delivery guarantees. Regardless of whether a stream is partitioned, we guarantee consistency at the container level: once bootstrap is complete, the adjunct data store can be treated as a snapshot of the database (or a fraction of it) within a container. No guarantee is provided at the job level.
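The paragraph above can be illustrated with a minimal sketch (this is not the Samza API; the class and event names here are hypothetical). Because CDC events are keyed upserts and deletes, applying an ordered feed with at-least-once delivery is idempotent, and the local store converges to a snapshot of the source table:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CdcEvent:
    key: str
    value: Optional[str]  # None signals a delete

class AdjunctStore:
    """Per-container key/value store fed by an ordered CDC stream."""

    def __init__(self):
        self._data = {}

    def apply(self, event: CdcEvent):
        if event.value is None:
            self._data.pop(event.key, None)  # delete (tolerates redelivery)
        else:
            self._data[event.key] = event.value  # insert or update

    def get(self, key):
        return self._data.get(key)

# At-least-once delivery: the duplicated ("k1", "v2") event is a no-op,
# so the store still ends up as a consistent snapshot.
store = AdjunctStore()
for e in [CdcEvent("k1", "v1"), CdcEvent("k1", "v2"), CdcEvent("k1", "v2"),
          CdcEvent("k2", "x"), CdcEvent("k2", None)]:
    store.apply(e)
```

This also shows why ordering matters: if the delete of `k2` were reordered before its insert, the snapshot would diverge from the source database.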


Bounded dataset

When the sources are read-only files, for example a machine learning model, they are by nature size-bounded. However, we should expect new versions of the dataset to be produced over time, and it is desirable to incorporate new versions without interrupting current operation. Similar to the unbounded case, a copy of a set of files can be produced by the bootstrap process, after which processing of the main input follows. This requires the delivery system (system connector) to be able to inject markers into a stream to signal the end of a dataset. When an adjunct data store sees the marker, it knows the current dataset is complete; it can then "seal" the store and prepare for the next version. Any data arriving thereafter is stored in the next version. While building a new version, an adjunct data store continues to serve the current version; once the new version is built, it switches to the new version and discards the old one. This works seamlessly, and users never see two versions at the same time.
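The seal-and-switch behavior described above can be sketched as a double-buffered store (again an illustrative sketch, not the Samza API; the marker and class names are assumptions). Records accumulate in a hidden "building" version, and the end-of-dataset marker atomically swaps it in as the serving version:

```python
END_OF_DATASET = object()  # marker injected into the stream by the system connector

class VersionedStore:
    """Serves one complete version while silently building the next."""

    def __init__(self):
        self._serving = {}   # the version reads are served from
        self._building = {}  # the next version, not yet visible

    def on_message(self, msg):
        if msg is END_OF_DATASET:
            # Seal: the building version becomes the serving version;
            # the old serving version is discarded.
            self._serving, self._building = self._building, {}
        else:
            key, value = msg
            self._building[key] = value

    def get(self, key):
        return self._serving.get(key)

store = VersionedStore()
store.on_message(("model", "v1"))
assert store.get("model") is None  # still building, not yet visible
store.on_message(END_OF_DATASET)   # seal: v1 becomes the serving version
store.on_message(("model", "v2"))  # lands in the next version
```

The swap is a single reference exchange, which is why a reader never observes a half-built version or two versions at once.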

When a file is delivered through one stream (unpartitioned), we guarantee a consistent snapshot (copy) of the entire file at container level; when a file is delivered through multiple streams (partitioned), we guarantee a consistent fraction of the snapshot at task level. No guarantee is provided at job level.

For record-based datasets that are addressable by keys, at-least-once delivery semantics are sufficient. However, if the underlying dataset is not key/value in nature, exactly-once semantics might be needed.

As the solution for bounded datasets is out of the scope of this proposal, it is still subject to future changes. No consistency is offered across containers.

Bootstrap

When an AD stream is marked as bootstrap, it is guaranteed that an initial snapshot is built before processing of the input streams starts; otherwise, input streams and AD streams are processed at the same time. After bootstrap, for change capture data, we keep updating the AD store as new updates arrive.
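The two-phase behavior above can be sketched as follows (an illustrative sketch, not the Samza API; `run` and its parameters are hypothetical). The bootstrap stream is drained into the store first, and only then is the main input processed against the resulting snapshot:

```python
def run(bootstrap_events, main_events):
    """Drain the bootstrap AD stream, then process main input against it."""
    store = {}
    processed = []
    # Phase 1: build the initial snapshot from the bootstrap stream.
    for key, value in bootstrap_events:
        store[key] = value
    # Phase 2: process main input, looking up each message in the snapshot.
    for msg in main_events:
        processed.append((msg, store.get(msg)))
    return processed

out = run([("a", 1), ("b", 2)], ["a", "c"])
```

Without the bootstrap flag, the two loops would interleave, and early main-input messages could observe a partially built store.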

...