Overview

This page gives an overview of data (re)processing scenarios for Kafka Streams. In particular, it summarizes which use cases are already supported, to what extent, and what future work is needed to enlarge the (re)processing coverage of Kafka Streams.

Use this page as a feature wish list for data (re)processing.

If your use case is missing, feel free to add it (let us know if you need access credentials: http://kafka.apache.org/contact)! Any issue you observe with regard to (re)processing can only be fixed if the community is aware of it. Just describe your scenario and the expected behavior (there is no need to provide a solution when adding a new scenario). It would of course also be helpful if you describe why your scenario is currently not covered.

 

Use Case Scenarios

In the list below, each scenario is described in detail with regard to use case, expected behavior, available tooling, and best-practice guidelines. The goal is to provide a comprehensive overview and step-by-step guidelines for all kinds of (re)processing scenarios.

The scenarios are color coded as follows:

  • green for supported scenarios
  • yellow for scenarios that are not fully supported
    • tooling is missing, but the scenario is clear
    • a manual workaround is available
  • red for unsupported scenarios
    • hard to solve

 

Each scenario below is described in terms of its use cases, expected behavior / requirements, available tooling, step-by-step guideline / best practice, limitations / known issues, and external resources.
Data reprocessing from scratch

Use cases:
  • development and testing
  • rollback after bug fixes in production
  • A/B testing
  • demoing for customers or other stakeholders
  • replay for new business logic (Kappa architecture)

Expected behavior:

  • After running and stopping an application, you want to reset the application back to "zero".
  • Thus, on restart, the application reprocesses your data the same way as in its original run (assuming that the original input data still exists in Kafka in its entirety).

 

Requirements:

  • The application must start consuming its input topics from scratch (no committed offsets)
  • The application's internal state must be empty
  • Auto-created (internal) topics must be empty (or deleted)
Step-by-step guideline:

  1. stop all running application instances
  2. if required:
    1. delete and re-create output topics manually, or
    2. use different/new output topics
  3. run the application reset tool (bin/kafka-streams-application-reset.sh)
  4. before restarting, make sure KafkaStreams#cleanUp() is called on each application instance (see the sketch below)
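
A minimal sketch of step 4 in Java; the application id, bootstrap servers, and topology are placeholders for your own setup:

    import java.util.Properties;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;

    public class ResetAndRestart {
        public static void main(final String[] args) {
            final Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");            // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            final StreamsBuilder builder = new StreamsBuilder();
            // ... define the application's topology here ...

            final KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.cleanUp();  // wipe this instance's local state directory
            streams.start();
        }
    }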
Limitations / known issues:

  • all data from the input topics must still be available
    (i.e., no input data has been lost due to log retention or compaction)
  • no support for handling output topics:
    • by default, a new application run appends data to the originally used output topics
    • manual fixes:
      • delete and recreate the output topics manually
      • change the application to use different/new output topics
Data reprocessing with a specific starting point
(reprocessing from scratch, i.e., with empty state)

Use cases:
  • partial rollback after bug fixes in production
  • A/B testing

Similar to "Data Reprocessing from Scratch". However, instead of restarting the application at offset zero, the user wants to specify a specific starting point.

 

Requirements:

  • same as for "Data Reprocessing from Scratch"
  • allow the user to specify a (valid/consistent) starting point (offsets? timestamp?)

Missing:
API/tooling to set the starting point.

Similar to "Data reprocessing from scratch".

Manual workaround:

Use a consumer client to seek() to the desired starting offsets and then commit() them (see the sketch below). This step must be done after the reset tool was run and before the application is restarted.
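
A minimal sketch of this workaround in Java; topic, partition, offset, and ids are placeholders, and the group.id must equal the application's application.id so that the committed offsets are picked up on restart:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.ByteArrayDeserializer;

    public class SetStartingOffsets {
        public static void main(final String[] args) {
            final Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app");                  // = application.id
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

            // Run AFTER the reset tool and BEFORE restarting the application.
            try (final KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                final TopicPartition tp = new TopicPartition("my-input-topic", 0); // placeholder
                consumer.assign(Collections.singletonList(tp));
                consumer.seek(tp, 12345L);  // desired starting offset (placeholder)
                consumer.commitSync();      // commits the sought position without polling
            }
        }
    }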

  • see "Data Reprocessing from Scratch"

 

Data reprocessing using old application state

Use cases:
  • A/B testing with stateful start
  • rollback after a bug fix in production (a version including a bug was deployed at time X; go back to X and reprocess the data with the fixed application)

Requirements:

  • The new application needs the (historical) state of the old application at point X.
   
Processing cold data

Use cases:
  • development
  • A/B testing
  • processing cold/old/offline topics (i.e., processing topics that do not have active producers)

Expected behavior:

  • the application stops automatically after it has processed all available data

Requirements:

  • the application should have an auto-stop feature (KIP-95)
 

Workaround

A manual stop is required at the moment:

  1. monitor the consumer lag via bin/kafka-consumer-groups.sh
  2. when the consumer lag is zero, stop the application manually (or programmatically; see the sketch below)
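
The same check can also be scripted; below is a sketch in Java, assuming a Kafka version whose AdminClient offers listConsumerGroupOffsets(), a running KafkaStreams instance, suitably filled admin/consumer Properties, and that the group (= application.id, a placeholder here) has already committed offsets for all input partitions:

    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.streams.KafkaStreams;

    // Close the application once the group's total consumer lag reaches zero.
    static void stopWhenCaughtUp(final KafkaStreams streams,
                                 final Properties adminProps,
                                 final Properties consumerProps) throws Exception {
        try (AdminClient admin = AdminClient.create(adminProps);
             KafkaConsumer<byte[], byte[]> probe = new KafkaConsumer<>(consumerProps)) {
            while (true) {
                final Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-app")  // = application.id (placeholder)
                    .partitionsToOffsetAndMetadata()
                    .get();
                final Map<TopicPartition, Long> ends = probe.endOffsets(committed.keySet());
                final long totalLag = committed.entrySet().stream()
                    .mapToLong(e -> ends.get(e.getKey()) - e.getValue().offset())
                    .sum();
                if (totalLag <= 0) {
                    streams.close();
                    return;
                }
                Thread.sleep(10_000L);  // re-check every ten seconds
            }
        }
    }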

 

  
Incremental processing (time driven)

Use cases:
  • "batch like" processing
  • start the application at regular intervals (like a cron job); the application automatically stops processing after a specific amount of (wall-clock) time has passed
Available tooling: not required.
Step-by-step guideline:

  • Put a sleep() after application startup and close the application after the sleep time has passed. To make this robust against failure and restart, sleep() should not get a hard-coded duration, but rather the difference endTime - startupTime (see the sketch below).

or

  • Run the app "forever", as in the regular stream-processing case, and terminate the application from the outside when the "stop time" is reached.
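
A minimal sketch of the first option (endTimeMs is an absolute wall-clock timestamp that would normally come from configuration):

    import org.apache.kafka.streams.KafkaStreams;

    // Sleeping for (endTime - now) rather than a fixed duration means a
    // restart after a crash still stops at the same absolute wall-clock time.
    static void runUntil(final KafkaStreams streams, final long endTimeMs)
            throws InterruptedException {
        streams.start();
        final long remaining = endTimeMs - System.currentTimeMillis();
        if (remaining > 0) {
            Thread.sleep(remaining);
        }
        streams.close();
    }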
Limitations / known issues:

  • not very precise with regard to event-time processing (i.e., the stopping point is not related to the application's progress)
 
Incremental processing (data driven)

Use cases:
  • "batch like" processing
  • start the application at regular intervals (like a cron job)

Expected behavior:
  • the application stops automatically at some point
  • on restart, the application resumes from its previous run
  • while the application is running, new data might be appended to the input topics

Requirement:

  • the application must have an auto-stop feature (KIP-95)
 

Workaround

  • follow the approach for "Incremental processing (time driven)"
  • the processing elapsed time must be shorter than the start interval (e.g., if processing starts every hour, it must take less than an hour)
Offline application upgrade

Use cases:
  • application bug fixes / improvements in production

Expected behavior:
  • the application should be replaced with a newer version
  • the new version resumes where the old version left off
  • no reprocessing of old data

Available tooling: not required.

Step-by-step guideline:
  • stop all running application instances
  • start the new version of your application (same application.id)

New and old application must be "compatible".

Compatible changes:

  • changing a filter condition
  • inserting a new filter/map (record-by-record operation), as sketched below
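
For illustration, a compatible change could look like the following sketch (topic names and logic are made up; the inserted filter is stateless, so the stateful structure of the topology is untouched):

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.Topology;
    import org.apache.kafka.streams.kstream.KStream;

    public class CompatibleChangeExample {
        public static Topology buildV2() {
            final StreamsBuilder builder = new StreamsBuilder();
            final KStream<String, String> input = builder.stream("input-topic"); // placeholder
            // v1 was: input.mapValues(v -> v.toUpperCase()).to("output-topic");
            // v2 (compatible): a stateless filter is inserted in front of the map;
            // record-by-record processing, no stateful operations touched.
            input.filter((key, value) -> value != null && !value.isEmpty())
                 .mapValues(v -> v.toUpperCase())
                 .to("output-topic");
            return builder.build();
        }
    }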

Incompatible changes:

  • changing the structure of the topology DAG
  • changing the data types of stateful operations (like aggregations/joins)
Limitations / known issues:

  • works only if application downtime is acceptable
  • the new application must have a structure similar to the old one
  • only newly produced output is "fixed"
 

Online application upgrade

Use cases:
  • application bug fixes / improvements in production
  • downstream applications consume the data live and are not interested in "correcting" previous results (because the computation already happened and there is no interest in "correcting" old output)

Expected behavior:
  • the application should be replaced with a newer version
  • the new application is deployed in parallel
  • when the new application is "ready to take over", the old application is stopped
  • the new application might start from an older offset and reprocess some data (with or without initial state)
    
Reprocessing of "historical" data

Use cases:
  • reprocess all data from yesterday / last week / April
  • "batch like" processing

Expected behavior:
  • old data should be reprocessed (with a new version of the application or a completely different application)
  • the result must be exact with regard to event-time (i.e., it does not include any older data and also takes late arrivals into account)
  • the new result might replace old results (i.e., update a downstream database)

2 Comments

  1. Comment from user mailing list:

    1) Pause and resume without having to start and stop.
    It should drain all in-flight calculations before doing the actual pause,
    and a notifier that it is actually paused would be helpful.
    This could be much more lightweight than stopping all tasks and
    stores and starting them again, which might take a lot of time.
    
    
    2) Metadata API: if the response can include all topology graphs and
    sub-graphs, with all nodes and edges for each sub-graph and the
    corresponding state store names, it would be easier to build a UI on
    top of it. The pipes between the sub-graphs (which are Kafka topics)
    should also be called out, and the time semantics could be laid out
    on top. This is something like a logical view on top of the physical
    view that the current API provides.
  2. Comment in kafka-dev by Avi Flax (Dec 12, 2016):

    * Two use cases that are of particular interest to me:

        * Just last week I created a simple Kafka Streams app (I called it a
          “script”) to copy certain records from one topic over to another,
          with filtering. I ran the app/script until it reached the end of
          the topic, then I manually shut it down with ctrl-c. This worked
          fine, but it would have been an even better UX to have specified
          the config value `autostop.at` as `eol` and have the process stop
          itself at the desired point. That would have required less manual
          monitoring on my part.

        * With this new mode I might be able to run Streams apps on AWS Lambda.
          I’ve been super-excited about Lambda and similar FaaS services since
          their early days, and I’ve been itching to run Kafka Streams apps on
          Lambda since I started using Streams in April or May. Unfortunately,
          Lambda functions are limited to 5 minutes per invocation — after 5
          minutes they’re killed. I’m not sure, but I wonder if perhaps this new
          autostop.at feature could make it more practical to run a Streams
          app on Lambda - the feature seems like it could potentially be adapted
          to enable a Streams app to be more generally resilient to being
          frequently stopped and started.