Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

- The change made in this proposal is both source backward-compatible and binary backward compatible. Their code can compile and run correctly without change.
- For users who want to enable partition expansion for its input streams, they can do the following:
  - Set grouper class to GroupByPartitionWithFixedTaskNum if the job is using GroupByPartition as grouper class
  - Set grouper class to GroupBySystemStreamPartitionWithFixedTaskNum if the job is using GroupBySystemStreamPartition as grouper class 
  - Change their custom grouper class implementation to extend the new interface if the job is using a custom grouper class implementation.
  - Set job.coordinator.monitor-partition-change to true in the job configuration
  - Run ConfigManager

Rejected Alternatives

1. Allow task number to increase instead of creating a new grouper class.

...

The current proposal relies on the input system to meet the specified operational requirements. While these requirements can be satisfied by Kafka, they may or may not be satisfied by other systems such as Kinesis. We can support partition expansion for more input systems than Kafka if user is able to express the new-partition to old-partition mapping strategy. However, since we currently don't know how user is going to use such config/class to do it, we choose to keep the current SEP simple and only add new config/class when we have specific use-case for them.

 

 

 

4. Use additional repartitioning stage to repartition data from input stream to another internal stream of the old partition count

Here are the pros and cons of the extra re-partitioning stage in comparison to the current proposal.
Pros:
- It doesn't require owner of the Samza job to know the partitioning algorithm of used for the input stream. If the owner of the Samza job is in a different organization than the producer of the input stream, this solution frees different organizations from having to coordinate with each other.
- It doesn't require owner of the Samza job to specify the partitioning algorithm of used for the input stream. Thus less config.
Cons:
- User has to make code change on their side to use the new fluent API.
- The extra partitioning stage would potentially increases latency.
- The extra partitioning stage would incur additional cost due to the extra internal topic. The cost is probably not that much with the new trim() API in Kafka if Samza uses Kafka to store the internal topic. But the cost may be doubled if Samza uses another input system that doesn't provide trim() API to delete data on demand.
It seems reasonable to adopt a hybrid solution, i.e. we still implement the current proposal in SEP-5 so that we enable partition expansion without incurring extra latency/cost and without requiring users to change their code. And user can use the extra partitioning stage if the coordination among different organization is indeed a concern.

Future work

1. Enable task number expansion in Samza

2. Add new config and class for user to specify the new-partition to old-partition mapping strategy based on the input system.