Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

These fixes only get the pipeline running again but the inflight intermediate stream data and any samza managed state that was reset is consequently lost. Aside from the data loss, this process of a pipeline upgrade is evidently cumbersome. Therefore, it is important to support the ability to drain and stop a pipeline smoothly prior to making an upgrade which involves an incompatible intermediate schema change. 

The work here is scoped to drain only intermediate stream (shuffle) data and system-managed state. The onus of clearing user state like LocalTable, if required, is on the user. System managed state comprises of window aggregation state and timers. The proposed approach will prematurely emit window panes and fire all timers , thereby, draining all state. Continuation of state post drain is beyond the scope of this work.

Proposed Changes

We propose a drain operation for Samza pipelines which halts the pipeline from consuming new data, completes processing of buffered data and then stops the pipeline.

...

Samza supports two notions of time. By default, all built-in Samza operators use processing time. Samza supports event-time operations through the Beam API. Whenever the user signals a drain, the watermark is advanced to infinity through watermark control messages. Beam’s samza runner keeps a local rocksdb state for timers and window state. Timers will be fired and window data is processed as a consequence of the watermark being advanced to infinity.

Advancing the watermark doesn't have any effect in Samza since windowing is processing-time based. Samza high-level API supports window operations on MessageStream. It keeps a track of window data in local rocksdb state and tracks the triggers in-memory. When the window operator receives drain, all the triggers will fire and results will be emitted from the window operation. This is implemented by overriding the handleDrain in WindowOperatorImpl.

Scope of the approach

...

the watermark is advanced to infinity through watermark control messages. Beam’s samza runner keeps a local rocksdb state for timers and window state. Timers will be fired and window data is processed as a consequence of the watermark being advanced to infinity.

Advancing the watermark doesn't have any effect in Samza since windowing is processing-time based. Samza high-level API supports window operations on MessageStream. It keeps a track of window data in local rocksdb state and tracks the triggers in-memory. When the window operator receives drain, all the triggers will fire and results will be emitted from the window operation. This is implemented by overriding the handleDrain in WindowOperatorImpl

...

.

Implementation and Test Plan

...