Status

Current state: Accepted

Discussion thread: http://mail-archives.apache.org/mod_mbox/samza-dev/201802.mbox/%3CCAFvExu1GHnphidP_wRriMey-T7Hss4AqAxscOoBFUHuMR5sq%3DQ%40mail.gmail.com%3E

JIRA: SAMZA-1554

...

  1. Support stateful stream processing in standalone stream applications.
  2. Minimize partition movements amongst stateful processors in the rebalance phase.

Proposed Changes

The overall intent behind this approach is to encapsulate the host-aware assignment of tasks to processors within JobModel generation (specifically, within a TaskNameGrouper implementation) in standalone. In the existing host affinity implementation in samza-yarn, this happens outside of JobModel generation (specifically, in a ContainerAllocator implementation). The trouble with replicating this outside of JobModel generation in standalone (in the leader layer) is that it spills across an abstraction boundary into a higher-level layer, which shouldn’t concern itself with the details of assigning tasks to stream processors.

If the leader of a stateful processors group generates an optimal assignment of each task to a particular processor as part of the JobModel, each follower will just pick up its assignments from the JobModel after the rebalance phase and start processing (similar to non-stateful jobs). The goal is to guarantee an optimal assignment that minimizes task movement between processors. Local state of the tasks will be persisted in a directory (local.store.dir) that each processor provides through configuration.

 

High level flow in standalone with HostAffinity

[Figure: Standalone host affinity]

...

  1. Existing generators discard the task to physical host assignment when generating the JobModel and use only the processor to preferred host assignment. However, for standalone it’s essential to consider the task to physical host assignment between successive JobModel generations in order to generate an optimal task to processor assignment. For instance, let’s assume stream processors P1 and P2 run on host H1, and processors P3 and P4 run on host H3. If P1 dies, it is optimal to assign some of the tasks processed by P1 to P2. This cannot be achieved unless the previous task to physical host assignment is taken into account when generating the JobModel.
  2. In an ideal world, any TaskNameGrouper should be usable interchangeably between the yarn and standalone deployment models. Currently, only a subset of the TaskNameGrouper implementations usable in yarn are supported in standalone.

Non Goals

  1. In the embedded samza library model, users are expected to perform manual garbage collection of unused local state stores(to reduce the disk footprint) on nodes.
  2. Monitoring and handling the increase/decrease of input stream partitions of a stateful standalone stream application is out of scope for this feature.

Proposed Changes

JobModel is the data model in samza that logically represents a samza job. The JobModel hierarchy is that samza jobs have one to many containers(ContainerModel), and each container has one to many tasks(TaskModel). Each data model contains relevant information, such as logical id, partition information, etc. Existing host affinity implementation in yarn is accomplished through the following two phases:

  • JobModel generation phase: ApplicationMaster(JobCoordinator) in yarn deployment model generates the Job model(container to task assignment) for the samza job. 
  • ContainerAllocator phase: This happens after the JobModel generation phase and schedules each container to run on a physical host by coordinating with the underlying ClusterManager and orchestrates the execution of the processor.

Here is the list of notable differences in processor and JobModel generation semantics between the yarn and standalone deployment models:

  • The number of containers is a static configuration in the yarn deployment model, and a job restart is required to change it. However, adding a processor to or removing a processor from a processors group in standalone is common and expected behavior.
  • In yarn, a container is assigned a physical host by the ContainerAllocator after the JobModel generation phase. In standalone, the physical host on which a processor is going to run is known before the JobModel generation phase (the ContainerAllocator phase is not needed in standalone to associate the processor with the physical host).

Overall high level changes:

  • Deprecate the different existing flavors of the TaskNameGrouper implementations (each of which primarily groups TaskModels into containers) and provide a single unified contract. The common layer between the yarn and standalone models is the TaskNameGrouper abstraction (which is part of the JobModel generation phase), which will encapsulate the host-aware task assignment to processors. In the existing implementation, only the processor locality is used to generate the task to processor assignments. In the new model, both the last reported task locality and the processor locality of a stream application will be used when generating task to processor assignments in both the yarn and standalone models.
  • Introduce a MetadataStore abstraction to store and retrieve processor and task locality for different deployment models in appropriate storage layers. Kafka will be used as the locality storage layer for yarn, and zookeeper will be used as the storage layer for standalone.
  • A new abstraction, LocationIdProvider, is introduced as a part of this change to generate the locationId for a physical execution environment. All the processors of an application registered from a locationId should be able to share (read/write) their local state stores: any store created by a processor running from a locationId should be readable/writable by other processors running from the same locationId. Any custom LocationIdProvider is expected to honor this contract when generating the locationId. Here are a few reasons for introducing a new abstraction to generate the locationId rather than using the processorId as the locationId:
    • LocationId denotes the physical execution environment required to run a stream processor and uniquely identifies an environment amongst all available physical execution environments. ProcessorId uniquely identifies a stream processor in a processors group. ProcessorId and locationId are two different, logically orthogonal concepts which cannot be unified.

    • The standalone model supports running multiple stream processors from a single JVM on a physical host. If a stream processor running on a physical host dies, it’s optimal to redistribute the tasks of the dead processor to the other processors running on that host. If processorId is used as locationId, this optimal assignment cannot be achieved (since the task to locationId association is not maintained).

    • In the LinkedIn execution environment, locationId will be a composite key comprised of sliceId and sliceInstanceId. In kubernetes, locationId will be the containerId (which will be obtained through the POD API).
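
As a rough illustration of the LocationIdProvider contract described above, the following sketch builds a composite locationId from a slice id and a slice instance id, so that processors started in the same execution environment report equal locationIds. The class names `LocationIdSketch` and `LocationId`, the `compositeProvider` helper, and the "-" separator are assumptions made for illustration and are not part of the proposal.

```java
import java.util.Objects;

final class LocationIdSketch {

  /** Hypothetical value type identifying a physical execution environment. */
  static final class LocationId {
    private final String id;

    LocationId(String id) {
      this.id = Objects.requireNonNull(id, "id");
    }

    @Override
    public boolean equals(Object other) {
      return other instanceof LocationId && ((LocationId) other).id.equals(id);
    }

    @Override
    public int hashCode() {
      return id.hashCode();
    }

    @Override
    public String toString() {
      return id;
    }
  }

  /** Single-method provider contract, mirroring the interface named in the proposal. */
  interface LocationIdProvider {
    LocationId getLocationId();
  }

  /** Hypothetical provider that joins a slice id and a slice instance id into one key. */
  static LocationIdProvider compositeProvider(String sliceId, String sliceInstanceId) {
    LocationId locationId = new LocationId(sliceId + "-" + sliceInstanceId);
    return () -> locationId;
  }

  public static void main(String[] args) {
    LocationIdProvider first = compositeProvider("slice-a", "0");
    LocationIdProvider second = compositeProvider("slice-a", "0");
    LocationIdProvider other = compositeProvider("slice-a", "1");
    // Processors in the same environment share a locationId; others do not.
    System.out.println(first.getLocationId().equals(second.getLocationId()));
    System.out.println(first.getLocationId().equals(other.getLocationId()));
  }
}
```

Two processors that report equal locationIds are the ones expected to be able to read and write each other's local state stores.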

 

Zookeeper is used in standalone for coordination between the stream processors of a stream application. Amongst all the available processors of a stream application, a single processor will be elected as a leader in standalone. In the standalone deployment model, the JobModel is stored in zookeeper. The leader will generate the JobModel and propagate the JobModel to all the other processors in the group. Distributed barrier in zookeeper will be used to block the message processing until the latest JobModel is picked by all the processors in the group. 


ZK Data Model to support host affinity:

In standalone, locality information of the stream processors will be stored separately from the JobModel. In standalone, the JobModel will hold just the task assignments (processor to task assignment and task to system stream partition assignment). Each stream processor, during its startup phase, will store the physical host on which it runs in an appropriate zookeeper locality node (this is analogous to the existing behavior in yarn). The MetadataStore abstraction will be used to read and write stream processor locality information for different deployment models in appropriate storage layers. There will be two implementations of MetadataStore: CoordinatorStreamBasedMetadataStore, which reads/writes processor locality information in the coordinator stream (a kafka topic) for yarn, and ZkMetadataStore, which reads/writes processor locality information in zookeeper for standalone. Local state of the tasks will be persisted in a directory (local.store.dir) that each processor provides through configuration.

Code Block
languagejava
- zkBaseRootPath/$appName-$appId-$JobName-$JobId-$stageId/
    - processors/
        - processor.000001/
            locationId1(stored as value in processor zookeeper node)
        - processor.000002/
            locationId2(stored as value in processor zookeeper node)
        ...
        - processor.00000N/
            locationIdN(stored as value in processor zookeeper node)
    - jobModels/
        - {version}
            JobModelObject(stored as value in jobmodels version zookeeper node)
    - barriers/
        - {version}
            barrier_state(stored as value in barriers version zookeeper node)
    - localityData
        - task01

...

ZK Data Model to support host affinity:

After the rebalancing phase and before the start of processing, each stream processor will register the details of the physical host on which it runs in the localityData zookeeper node. The goal here is to separate the locality information from the JobModel itself (the JobModel will be used to hold the task assignments). The LocalityManager abstraction will be used to read and write locality information for different deployment models in appropriate storage layers. There will be two implementations of LocalityManager: CoordinatorStreamBasedLocalityManager, which reads/writes container locality information for yarn, and ZkLocalityManager, which reads/writes container locality information for standalone. In standalone, the last known physical host on which each samza task ran will be stored in zookeeper and will then be used to assign tasks to stream processors. A stream processor will update the task locality of the tasks assigned to it before it begins processing (this is analogous to the behavior in yarn, where locality is updated in SamzaContainer as a part of the startup sequence).

Code Block
languagejava
- zkBaseRootPath/$appName-$appId-$JobName-$JobId-$stageId/
    - processors/
        - processor.000001/
            locationId1(stored as value in processor zookeeper node)
        - processor.000002/
            locationId2(stored as value in processor zookeeper node)
        ...
        - processor.00000N/
            locationIdN(stored as value in processor zookeeper node)
    - jobModels/
        - {jobModelVersion}
            JobModelObject
    - barriers/
        - {jobModelVersion}
            barrier_state
    - localityData/
        - task01/
            locationId1(stored as value in task zookeeper node)
        - task02/
            locationId2(stored as value in task zookeeper node)
        ...
        - task0N/
            locationIdN(stored as value in task zookeeper node)
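
To make the node layout above concrete, here is a minimal sketch of a helper that composes the zookeeper paths for processor, JobModel, barrier, and task locality nodes. The class name `ZkLocalityPaths` and its method names are hypothetical, and no zookeeper client is involved; the sketch only shows how the path segments from the layout fit together.

```java
final class ZkLocalityPaths {
  private final String root;

  /** Builds the root as zkBaseRootPath/$appName-$appId-$JobName-$JobId-$stageId. */
  public ZkLocalityPaths(String zkBaseRootPath, String appName, String appId,
      String jobName, String jobId, String stageId) {
    this.root = zkBaseRootPath + "/" + String.join("-", appName, appId, jobName, jobId, stageId);
  }

  /** Node whose value is the locationId reported by the processor. */
  public String processorPath(String processorId) {
    return root + "/processors/" + processorId;
  }

  /** Node whose value is the JobModel of the given version. */
  public String jobModelPath(String jobModelVersion) {
    return root + "/jobModels/" + jobModelVersion;
  }

  /** Node whose value is the barrier state of the given JobModel version. */
  public String barrierPath(String jobModelVersion) {
    return root + "/barriers/" + jobModelVersion;
  }

  /** Node whose value is the last reported locationId of the task. */
  public String taskLocalityPath(String taskName) {
    return root + "/localityData/" + taskName;
  }

  public static void main(String[] args) {
    ZkLocalityPaths paths = new ZkLocalityPaths("/samza", "app", "1", "job", "1", "0");
    System.out.println(paths.processorPath("processor.000001"));
    System.out.println(paths.taskLocalityPath("task01"));
  }
}
```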

Local store sandboxing:

In the standalone landscape, users should provide the file system location used to persist local state through stream processor configuration (by defining the local.store.dir configuration). The `local.store.dir` configuration is expected to be preserved across processor restarts so that preexisting local state can be reused. The stream processor java process is expected to be configured by the user to run with sufficient read/write permissions to access the local state directories created by any processor in the group. The local store file hierarchy/organization followed in the samza-yarn deployment model for both the high and low level APIs will be followed in standalone.

Remove coordinator stream bindings from JobModel: 

JobModel is a data access object used to represent a samza job in both the yarn and standalone deployment models. In the existing implementation, JobModel requires LocalityManager (which is tied to the coordinator stream) to read and populate processor locality assignments. However, since zookeeper is used as the JobModel persistence layer and the coordinator stream doesn’t exist in the standalone landscape, it’s essential to remove this LocalityManager binding from JobModel and make JobModel immutable. Any existing implementations (ClusterBasedJobCoordinator, ContainerProcessManager) which depend upon this binding for functional correctness in samza-yarn should read container locality directly from the coordinator stream instead of getting it indirectly via JobModel.

Cleaning up ContainerModel:

ContainerModel is a data access object used in samza to hold the task to system stream partition assignments generated by TaskNameGrouper implementations. ContainerModel currently has two fields (processorId and containerId) used to uniquely identify a processor in a processors group. The standalone deployment model uses the processorId field and the yarn deployment model uses the containerId field to store the unique processor identifier. To achieve uniformity between the two deployment models, the proposal is to remove the duplicate containerId field. This will not require any operational migration.

State store restoration:

 Upon processor restart, nonexistent local stores will be restored using the same restoration sequence followed in yarn deployment model.

Container to physical host assignment:

When assigning tasks to stream processors in a run, the stream processor to which a task was assigned in the previous run will be preferred. If that stream processor is unavailable in the current run, the stream processors running on the physical host of the previous run will be given higher priority. If neither of these conditions is met, the task will be assigned to any stream processor available in the processor group.
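
The three-step preference order above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed TaskNameGrouper implementation: the class `HostAwareAssignmentSketch`, the use of plain strings for task names, processorIds, and locationIds, and the least-loaded tie-breaking are all hypothetical choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.function.Predicate;

final class HostAwareAssignmentSketch {

  /** Returns processorId -> task names; all parameter shapes are illustrative. */
  public static Map<String, List<String>> assign(
      List<String> tasks,
      Map<String, String> previousTaskToProcessor,
      Map<String, String> previousTaskToLocation,
      Map<String, String> liveProcessorToLocation) {
    if (liveProcessorToLocation.isEmpty()) {
      throw new IllegalArgumentException("At least one live processor is required");
    }
    Map<String, List<String>> assignment = new TreeMap<>();
    for (String processorId : liveProcessorToLocation.keySet()) {
      assignment.put(processorId, new ArrayList<>());
    }
    for (String task : tasks) {
      String previousProcessor = previousTaskToProcessor.get(task);
      String previousLocation = previousTaskToLocation.get(task);
      String target;
      if (previousProcessor != null && assignment.containsKey(previousProcessor)) {
        // 1. The processor from the previous run is still available.
        target = previousProcessor;
      } else {
        // 2. Prefer a live processor on the previous location;
        // 3. otherwise fall back to any live processor.
        target = leastLoaded(assignment,
                p -> previousLocation != null
                    && previousLocation.equals(liveProcessorToLocation.get(p)))
            .orElseGet(() -> leastLoaded(assignment, p -> true).get());
      }
      assignment.get(target).add(task);
    }
    return assignment;
  }

  /** Picks the eligible processor with the fewest tasks, breaking ties by processorId. */
  private static Optional<String> leastLoaded(
      Map<String, List<String>> assignment, Predicate<String> eligible) {
    return assignment.keySet().stream()
        .filter(eligible)
        .min(Comparator.comparingInt((String p) -> assignment.get(p).size())
            .thenComparing(Comparator.<String>naturalOrder()));
  }

  public static void main(String[] args) {
    // P1 and P2 ran on H1, P3 and P4 on H3; P1 has died.
    Map<String, String> previousProcessor = new HashMap<>();
    Map<String, String> previousLocation = new HashMap<>();
    String[][] previous = {
        {"t1", "P1", "H1"}, {"t2", "P1", "H1"}, {"t3", "P2", "H1"},
        {"t4", "P3", "H3"}, {"t5", "P4", "H3"}};
    for (String[] row : previous) {
      previousProcessor.put(row[0], row[1]);
      previousLocation.put(row[0], row[2]);
    }
    Map<String, String> live = new HashMap<>();
    live.put("P2", "H1");
    live.put("P3", "H3");
    live.put("P4", "H3");
    // prints {P2=[t1, t2, t3], P3=[t4], P4=[t5]}
    System.out.println(assign(Arrays.asList("t1", "t2", "t3", "t4", "t5"),
        previousProcessor, previousLocation, live));
  }
}
```

In the example, the tasks of the dead processor P1 move to P2 because P2 runs on the same location, so their local state can be reused. The sketch does not attempt to balance load beyond tie-breaking.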

Semantics of host affinity with ‘run.id’ 

The strategy to determine continuation of states within a samza application varies for different deployment environments and input sources. The semantic meaning of run.id is the continuation of states (viz. state-store, checkpoint, config, task-assignments) across application restarts. Samza supports deployment and management of multi-stage data pipeline jobs consuming from bounded (batch) as well as unbounded (streaming) data sources. Host affinity will be supported only within the same run.id of an application.

The strategy to determine if the state from the previous stream application run continues in the current run will vary for different deployment environments and input sources. The semantic meaning of run.id is the continuation of states(viz state-store, checkpoint, config, task-assignments) associated with a stream application across numerous stream application restarts. Host affinity will be supported only within the same run.id of a stream application.

Public Interfaces

Code Block
languagejava
// '+' denotes addition, '-' denotes deletion.
public interface TaskNameGrouper {
  + @Deprecated
  Set<ContainerModel> group(Set<TaskModel> tasks);

  + @Deprecated
  default Set<ContainerModel> group(Set<TaskModel> tasks, List<String> containersIds) {
    return group(tasks);
  }

  /**
   * @param taskModels the taskModels generated by the SSPGrouper.
   * @param taskLocality taskName to locationId mapping of the previous generation.
   * @param processorLocality processorId to locationId mapping.
   * @return the containerModels generated.
   */
  + Set<ContainerModel> group(Set<TaskModel> taskModels, Map<String, LocationId> taskLocality, Map<String, LocationId> processorLocality);
}

+ @Deprecated
public interface BalancingTaskNameGrouper extends TaskNameGrouper {
  + @Deprecated 
  Set<ContainerModel> balance(Set<TaskModel> tasks, LocalityManager localityManager);
}

public class ContainerModel {
  - @Deprecated
  - private final int containerId;
  private final String processorId;
  private final Map<TaskName, TaskModel> tasks;
  + // New field added denoting the physical locationId.
  + private final String locationId;
}

+ public interface LocationIdProvider {
  + // In containerized environments, LocationId is a combination of multiple fields (sliceId, containerId, hostname)
  + // instead of a simple physical hostname, so a class is used to represent it rather than a primitive string.
  + // It will be provided by the execution environment of the processor.
  + LocationId getLocationId();
}

+ public interface LocalityManager {
   // returns the processorId to LocationId mapping.
  + public Map<String, LocationId>  readProcessorLocality();

  // returns the taskName to LocationId mapping.
  + public Map<String, LocationId> readTaskLocality();
 
  // writes the provided processorId to host mapping to the underlying storage.
  + public boolean writeProcessorLocality(Map<String, LocationId> processorLocality);
}

For yarn, the preferred host mapping in the coordinator stream will be used as the locality of processors and tasks that are unchanged between successive generations. If a task or processor is added in a run, in yarn it will be assigned to any host.

Code Block
languagejava
+ public interface MetadataStore {
  + // Gets the value associated with the specified {@code key}.
  + byte[] get(byte[] key);
  
  + // Updates the mapping of the specified key-value pair; Associates the specified {@code key} with the specified {@code value} 
  + void put(byte[] key, byte[] value);
 
  + // Deletes the mapping for the specified {@code key} from this store (if such mapping exists).
  + void remove(byte[] key);
}
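
As a rough sketch of how the MetadataStore contract above could be satisfied, the example below backs the three methods with an in-memory map. It is meant only to illustrate the key/value semantics; the class names `InMemoryMetadataStoreSketch` and `InMemoryMetadataStore` are hypothetical, and returning `null` for a missing key is an assumption, since the interface above does not specify that case. The actual implementations named in this proposal would talk to the coordinator stream or zookeeper instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InMemoryMetadataStoreSketch {

  /** Same three methods as the proposed interface. */
  interface MetadataStore {
    byte[] get(byte[] key);

    void put(byte[] key, byte[] value);

    void remove(byte[] key);
  }

  /** Hypothetical in-memory implementation, for illustration only. */
  static final class InMemoryMetadataStore implements MetadataStore {
    // byte[] uses identity-based equals/hashCode, so keys are wrapped in a
    // ByteBuffer, which compares by content.
    private final Map<ByteBuffer, byte[]> entries = new ConcurrentHashMap<>();

    @Override
    public byte[] get(byte[] key) {
      byte[] value = entries.get(ByteBuffer.wrap(key));
      // Assumption: a missing key yields null.
      return value == null ? null : value.clone();
    }

    @Override
    public void put(byte[] key, byte[] value) {
      // Defensive copies keep later caller-side mutations out of the store.
      entries.put(ByteBuffer.wrap(key.clone()), value.clone());
    }

    @Override
    public void remove(byte[] key) {
      entries.remove(ByteBuffer.wrap(key));
    }
  }

  public static void main(String[] args) {
    MetadataStore store = new InMemoryMetadataStore();
    byte[] key = "processor.000001".getBytes(StandardCharsets.UTF_8);
    store.put(key, "locationId1".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(store.get(key), StandardCharsets.UTF_8));
    store.remove(key);
    System.out.println(store.get(key) == null);
  }
}
```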

The locationId reported by the live processors of the group and the last reported task locality will be used to calculate the task to container assignment in standalone. The preferred host mapping will be used for task and processor locality in the case of yarn. Any new task/processor for which the grouping is unknown (unavailable in the preferred host/task-locality mapping in the underlying storage layer) will be treated as any_host during assignment.

Here are a few reasons supporting the modification of the TaskNameGrouper interface and the removal of LocalityManager from its interface methods:

  • The multiple group methods in the TaskNameGrouper interface and the additional balance method in BalancingTaskNameGrouper are logically synonymous: they all exist to generate ContainerModels based upon the input task models and past locality assignments. It’s sensible to combine them into one interface method with adequate parameters and simplify things.

  • Any future TaskNameGrouper implementation could hold references to LocalityManager (a live object) and create object hierarchies based upon that reference. This will clutter the ownership of LocalityManager and could potentially create an unintentional resource leak.

  • Any TaskNameGrouper implementation should be usable in both the yarn and standalone deployment models. However, the TaskNameGrouper interface definition has an explicit and tight binding with the coordinator stream through LocalityManager, and existing TaskNameGrouper implementations employ LocalityManager to read/write the locality mapping from and to the coordinator stream. The coordinator stream doesn’t exist in the standalone landscape, and this prohibits usage of some TaskNameGrouper implementations in standalone.

  • The number of processors is a static configuration in the yarn deployment model, and a job restart is required to change it. However, adding a processor to or removing a processor from a processors group in standalone is common and expected behavior. Existing generators discard the task to physical host assignment when generating the JobModel. However, for standalone it’s essential to consider the task to physical host assignment between successive JobModel generations in order to accomplish an optimal task to processor assignment. For instance, let’s assume stream processors P1 and P2 run on host H1, and processor P3 runs on host H3. If P1 dies, it is optimal to assign some of the tasks processed by P1 to P2. This cannot be achieved unless the previous task to physical host assignment is taken into account when generating the JobModel.

  • Logically, a TaskNameGrouper implementation just requires the previous generation container models (to get the previous task to preferred host mapping and the previous task to SystemStreamPartition mapping), which can be passed in through the interface method to generate the new mapping. Any modifications to existing assignments should be done outside of the TaskNameGrouper implementation. This will make any implementation a pure function that simply operates on the passed-in data.

After this change, the TaskNameGrouper interface will have one method clearly defining the contract, and all other methods in TaskNameGrouper will be deprecated (and eventually removed). The host-aware assignment of tasks to stream processors in standalone will be housed in a TaskNameGrouper implementation that supports this feature.

Implementation and Test Plan

  • Modify the existing interfaces and classes as per the proposed solution.

  • Add unit tests to test and validate compatibility and functional correctness. 

  • Add integration tests in the samza standalone samples to verify the host affinity feature. 

  • Add an integration test to verify that there are minimal partition movements during rolling upgrade.

  • Verify compatibility: Jackson, a java serialization/deserialization library, is used to convert data model objects in samza into JSON and back. After removing the containerId field from ContainerModel, it should be verified that deserialization of old ContainerModel data with the new ContainerModel spec works.

  • Some TaskNameGrouper implementations assume the comparability of the integer containerId present in ContainerModel (for instance, GroupByContainerCount). Modify existing TaskNameGrouper implementations to take in a collection of string processorIds, as opposed to assuming that containerId is an integer that lies within the [0, N-1] interval (without incurring any change in functionality).
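
To illustrate the last item, the sketch below distributes tasks over an arbitrary collection of string processorIds instead of integer containerIds in [0, N-1]. It is a simplified stand-in and not the actual GroupByContainerCount code: the class name `GroupByProcessorIdsSketch`, the use of plain task-name strings, and the sorted round-robin strategy are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

final class GroupByProcessorIdsSketch {

  /** Returns processorId -> task names, assigning sorted tasks round-robin. */
  public static Map<String, List<String>> group(
      Collection<String> taskNames, Collection<String> processorIds) {
    if (processorIds.isEmpty()) {
      throw new IllegalArgumentException("At least one processorId is required");
    }
    // Sorting makes the result independent of the iteration order of the
    // inputs, without assuming that the ids are integers.
    List<String> sortedProcessors = new ArrayList<>(new TreeSet<>(processorIds));
    List<String> sortedTasks = new ArrayList<>(new TreeSet<>(taskNames));

    Map<String, List<String>> result = new LinkedHashMap<>();
    for (String processorId : sortedProcessors) {
      result.put(processorId, new ArrayList<>());
    }
    for (int i = 0; i < sortedTasks.size(); i++) {
      String processorId = sortedProcessors.get(i % sortedProcessors.size());
      result.get(processorId).add(sortedTasks.get(i));
    }
    return result;
  }

  public static void main(String[] args) {
    // prints {processor-a=[t0, t2, t4], processor-b=[t1, t3]}
    System.out.println(group(
        Arrays.asList("t3", "t0", "t4", "t1", "t2"),
        Arrays.asList("processor-b", "processor-a")));
  }
}
```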

Compatibility, Deprecation, and Migration Plan

  • We are not changing the existing data storage format of the ContainerModel in coordinator stream for yarn deployment model.
  • The containerId field in ContainerModel, which was deprecated in samza 0.13, will be removed in a future release. Open source users using the containerId field from ContainerModel should migrate to the processorId field in ContainerModel.
  • All of the existing methods in TaskNameGrouper and BalancingTaskNameGrouper will be deprecated. 
  • It’s recommended that users recompile their deployables after migrating to the samza version that has this feature.
  • A compatibility test will be added to verify that deprecating/changing the TaskNameGrouper API does not alter existing behavior.

...

LocalityManager will be turned into an interface, and there will be two implementations of LocalityManager: CoordinatorStreamBasedLocalityManager, which reads/writes container locality information for yarn, and ZkLocalityManager, which reads/writes container locality information for standalone.

Cons: 

...

  • Any TaskNameGrouper implementation could hold references to LocalityManager(a live object) and create object hierarchies based upon that reference. This will clutter the ownership of LocalityManager and could potentially create an unintentional resource leak.

Approach 2

GroupByContainerIds is the only TaskNameGrouper currently supported in standalone. Implement the host aware task to stream processors assignment for standalone in GroupByContainerIds.

...