This document is intended to summarize various use cases for punctuate.

Terminology

Before going into the details let us first clarify the terms and their meanings for the purpose of this discussion.

TermDescription
Stream Time

KafkaStreams has defined an interface called TimestampExtractor which can be implemented to extract the Event Time from the event contents, the ingestion time, the processing/wall-clock time, or any other logical time.

Stream Time for us would be the value returned by the TimestampExtractor implementation in use.

Punctuate TimeThis is the reference time used to trigger the Punctuate calls. It could be Stream Time [current semantics], System Time, or Stream Time with an expiry timer based on System Time (hybrid or complex punctuate).
Punctuation TimestampThe reference timestamp passed to punctuate method being called. It could be Stream Time of the event that triggered the punctuate [current semantics], System Time in case of System Time based punctuate, or a PunctuationTime object specifying the details in case of a hybrid/complex punctuate.
Output Record TimeThis is the output record timestamp for the events generated in the punctuate method. Currently, it is the internal time of the Stream Task which is the Stream Time.

Punctuate Use Cases

Use Case 1

Quoting the words from Jay Kreps' mail correspondence. 

"You aggregate click and impression data for a reddit like site. Every ten minutes you want to output a ranked list of the top 10 articles ranked by clicks/impressions for each geographical area. I want to be able run this in steady state as well as rerun to regenerate results (or catch up if it crashes)."

BehaviourRequirement
Deterministic output
(reprocessing or delayed processing possible)
(tick)
Output in the absence of events(tick)

 

Use Case 2

In an Event Count Audit system, we want to count the events produced every minute from the producers, recount the events at the first cluster, and at the subsequent mirror clusters at a per minute per producer granularity. We need to account for some late events and also the out of order event times in the same topic.

 

BehaviourRequirement
Deterministic output
(reprocessing or delayed processing possible)
(tick)
Output in the absence of events(tick)

Evaluation Matrix

In the currently listed use cases the Stream time is the Event Time.

 

Behaviour

Stream Time based Punctuate

System Time based Punctuate Stream Time based Punctuate with
System Time based secondary expiry
Deterministic output
(reprocessing or delayed processing possible)
(tick)(error)(tick)
Output in the absence of events(error)(tick)(tick)
Join of same time events across different topics(tick)(error)(warning)
Correct punctuation in the presence of long queueing(tick)(error)(warning)
    

Alternate Representation

Design Semantics Use Case 1Use Case 2

Stream Time based Punctuate

Pros

Deterministic output (reprocessing or delayed processing doesn't change the output)

Deterministic output (reprocessing or delayed processing doesn't change the output)
Cons

Low event geographic areas might not generate a list every 10 minutes. 

Event count output wont happen until an event with event time crossing the punctuate interval arrives.

System Time based Punctuate 

ProsRank list output every 10 minutes, irrespective of event flow.Count flushes in the absence of events 
ConsOutput during a reprocessing won't be same.

Late event will cause another aggregate event during steady state operation.
Burst output of aggregate events at the punctuate times during reprocessing.

Stream Time based Punctuate with System Time based secondary expiryProsDeterministic output (reprocessing or delayed processing doesn't change the output)
Low event geographic areas will generate a list as per the expiry timer.

Deterministic output (reprocessing or delayed processing doesn't change the output)
Count flushes in the absence of events 

Cons?

In case of similar scenarios where processing time is more leading to queueing and late processing, the system timer may expire before the last even within the window arrives.
In case of similar scenarios where events have to be joined across different topics, we might actually need to wait for the event.

    
    
  • No labels