Status

Current state: [ UNDER DISCUSSION | ACCEPTED | REJECTED ]

Discussion thread: <link to mailing list DISCUSS thread>

JIRA: SAMZA-TBD

Released: 

Problem

Samza today supports RocksDB and MemDB as local data stores, which lets users cache data for later use during stream processing. However, populating a data store is the end user's responsibility. This makes local data stores complex to maintain, especially in corner cases such as reloading data after consumers fall off. To avoid these issues, some users have adopted alternative solutions such as Voldemort or CouchBase. In addition, the table-oriented operations of the fluent API require working data to be made available by the system. On closer inspection, the problem appears generic enough to be addressed by the data infrastructure.

Source streams can be either bounded (such as files) or unbounded, and each kind poses different challenges. This proposal focuses on unbounded source streams.

Motivation

We want an adjunct data (AD) store that acts as a read-only cache and automatically stores streaming data for later use. Adjunct data can be accessed the same way as a key-value store in Samza; in addition, we guarantee a consistent view of the data from a Samza task's perspective. Data can be either partitioned or unpartitioned. If the dataset is small enough to fit in a single RocksDB instance, the same copy is populated in every container via a broadcast stream; if it is too large to fit in one database instance, it is partitioned across the containers of a Samza job.
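
As a rough illustration of the read-only access pattern described above, the following sketch separates the task-facing view (lookups only) from the system-side population path. All names here (AdjunctStoreSketch, ReadOnlyAdjunctStore, InMemoryAdjunctStore, applyUpdate) are hypothetical and are not part of Samza; an in-memory map stands in for RocksDB or MemDB, and the actual interfaces remain to be defined by this proposal.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Samza.
public class AdjunctStoreSketch {

  /** Task-facing view: lookups only, no put or delete. */
  interface ReadOnlyAdjunctStore<K, V> {
    V get(K key); // returns null when the key is absent
  }

  /** In-memory stand-in for a local store such as RocksDB or MemDB. */
  static final class InMemoryAdjunctStore<K, V> implements ReadOnlyAdjunctStore<K, V> {
    private final Map<K, V> data = new HashMap<>();

    // Assumed to be called by the system as it consumes the source stream,
    // never by task code. A later message for a key overwrites the earlier one.
    void applyUpdate(K key, V value) {
      data.put(key, value);
    }

    @Override
    public V get(K key) {
      return data.get(key);
    }
  }

  public static void main(String[] args) {
    InMemoryAdjunctStore<String, String> store = new InMemoryAdjunctStore<>();
    store.applyUpdate("member-1", "profile-v1");
    store.applyUpdate("member-1", "profile-v2");

    // Task code would only ever hold the read-only view.
    ReadOnlyAdjunctStore<String, String> view = store;
    System.out.println(view.get("member-1"));
    System.out.println(view.get("member-2"));
  }
}
```

Keeping the write path off the task-facing interface is one way to express the "read-only cache" requirement in the type system; whether the broadcast or the partitioned layout is used would not change what the task sees.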

 

In theory, an AD store could be either local (RocksDB and MemDB) or centralized (CouchBase). However, we believe the use of a centralized data store is largely a side effect of the lack of a local adjunct data store. For now, we defer support for a centralized adjunct data store until we see clear evidence that it is needed.
Having an adjunct data store potentially enables a number of use cases:
  • Automatic maintenance of a local cache
  • Table-oriented operations for the fluent API

 

Proposed Changes

 

 


Public Interfaces


Implementation and Test Plan


Compatibility, Deprecation, and Migration Plan

 

Rejected Alternatives

 
