Versions Compared

Comment: Clarify enumerate/list

...

The number of partitions may also be very large, and Copycat does not require that all partitions be listed. A somewhat extreme example would be metrics collected from a large number of hosts, e.g. application stats collected via JMX or OS-level stats collected via Ganglia. Logically, we can represent these as a very large number of partitions, with one partition per host per metric (perhaps hundreds of thousands or even millions across a data center) and Unix timestamps as offsets.
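As a rough illustration of this model, the sketch below keys each partition by a (host, metric) pair and uses the Unix timestamp of the latest sample as that partition's offset. The names `Partition` and `record_sample`, the host and metric names, and the choice of a plain dictionary are assumptions made for illustration only; they are not part of the Copycat API.

```python
from typing import Dict, Tuple

# Hypothetical partition key: one partition per host per metric.
Partition = Tuple[str, str]


def record_sample(offsets: Dict[Partition, int], host: str, metric: str,
                  timestamp: int) -> None:
    """Advance the offset of one (host, metric) partition to a sample's Unix timestamp."""
    key = (host, metric)
    # Offsets only move forward; an older or duplicate sample leaves the offset unchanged.
    if timestamp > offsets.get(key, -1):
        offsets[key] = timestamp


# Partitions appear only as samples arrive; nothing lists them up front.
offsets: Dict[Partition, int] = {}
record_sample(offsets, "web01", "jvm.heap.used", 1438387200)
record_sample(offsets, "web01", "os.cpu.load", 1438387205)
record_sample(offsets, "web01", "jvm.heap.used", 1438387260)
```

Because the map is populated lazily, its size tracks the metrics actually seen rather than a precomputed list of every possible partition.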

...

In this simple case, we can easily list all the partitions and assign them by dividing the list between the tasks. However, this approach is not required. The Connector is responsible for assigning partitions to tasks, and it can use any approach that makes sense. For example, the metrics example from earlier might look like this:

 

Instead of listing every single metric across all hosts or application processes, the connector might only divide work between tasks at the granularity of hosts, and might even specify this as a range of hosts rather than listing the full set of hosts. So in the example, the connector could generate configs that specify the range of hosts each task should handle (e.g. server.range=a-m and server.range=n-z), and tasks, which are implemented as part of the same connector plugin, know to handle all metrics for all servers with hostnames in that range. It is important that each metric be its own partition so that offsets can be tracked for each individually (enabling correct handling of failures), but the Copycat framework does not need to know the full list of partitions or exactly how they are assigned to tasks.
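The range-based assignment described above could be sketched roughly as follows. Only the `server.range` key and the `a-m` / `n-z` values come from the text; the function names, the rule of matching on the first character of the hostname, and the sample hostnames are assumptions for illustration, not the Copycat API.

```python
from typing import Dict, List


def task_configs(ranges: List[str]) -> List[Dict[str, str]]:
    """Connector side: emit one config per task, each naming a range of hosts."""
    return [{"server.range": r} for r in ranges]


def in_range(hostname: str, server_range: str) -> bool:
    """Task side: check whether a hostname falls in a range such as 'a-m'.

    Assumes a non-empty hostname and matches on its first character only.
    """
    low, high = server_range.split("-")
    return low <= hostname[0].lower() <= high


def owning_task(hostname: str, configs: List[Dict[str, str]]) -> int:
    """Return the index of the task whose range covers the hostname."""
    for index, config in enumerate(configs):
        if in_range(hostname, config["server.range"]):
            return index
    raise ValueError(f"no task range covers host {hostname!r}")


configs = task_configs(["a-m", "n-z"])
```

With these configs, a hypothetical host `db07` falls to the first task and `web01` to the second. A hostname that starts with a digit matches neither range and raises `ValueError`, so a real connector would need ranges that cover its whole hostname space. The configs name only host ranges, so no component has to list the per-metric partitions in advance.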

...