...

  1. Use-case #1 is for fault tolerance: putting more than one replica of a shard on the same node does not add any redundancy.
  2. Use-cases #2 and #3 are for fault tolerance and cost minimization:
    1. You want to survive the loss of one of three zones, so each shard's replicas need to be distributed across at least two zones.
    2. You want to have (approximately) equal capacity for each shard in these zones so a zone outage doesn't eliminate a majority of your capacity.
    3. You want to have at least one replica of each shard in a given zone so that you can minimize cross-AZ traffic for searching (which is chargeable in AWS).
    4. Taking all of the above into account, either all shards of a collection must be hosted in the same two zones, or all shards must be hosted equally across all three zones, to provide fault tolerance while minimizing inter-AZ cost (see the placement sketch after this list).
  3. Use-case #4 is useful for partitioning write vs. read workloads, e.g. you might want to pin TLOG replicas to a node type optimized for indexing and PULL replicas to nodes optimized for searching.
  4. Use-case #5 is for workload partitioning between analytics, search and .system collections, so that collections serving those workloads live on nodes optimized for them.
  5. Use-case #6 is useful for implementing autoscaling node groups, where a specific number of nodes is always available and the rest come and go without causing data loss or moving data each time we scale down. It is also useful for workload partitioning between analytics and search, e.g. we might want to dedicate a few replicas to streaming expressions and Spark jobs on nodes optimized for those, and keep the other replicas for search only.
  6. Use-case #7 is for balanced utilization of all nodes. This is tricky when disk usage varies or collections are heavily/lightly loaded.
    Multi-tenant use-cases (think a collection per tenant) are trickier because you also want to limit the blast radius when a node or zone goes down.
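To make use-cases #2 and #3 concrete, here is a small sketch (plain Java, not Solr's placement API; the zone, shard and replication-factor values are made up for illustration) that spreads the replicas of each shard round-robin across availability zones, so every zone holds replicas of every shard and a single zone outage removes at most one replica per shard and roughly a third of the capacity:

import java.util.*;

public class ZoneAwarePlacementSketch {

    /** Assigns the replicas of each shard to zones, rotating the starting zone per shard. */
    static Map<String, List<String>> placeAcrossZones(List<String> shards,
                                                      List<String> zones,
                                                      int replicationFactor) {
        Map<String, List<String>> placement = new LinkedHashMap<>();
        for (int s = 0; s < shards.size(); s++) {
            List<String> targets = new ArrayList<>();
            for (int r = 0; r < replicationFactor; r++) {
                // Offset by the shard index so replicas don't all start in the first zone.
                targets.add(zones.get((s + r) % zones.size()));
            }
            placement.put(shards.get(s), targets);
        }
        return placement;
    }

    public static void main(String[] args) {
        List<String> zones = List.of("us-east-1a", "us-east-1b", "us-east-1c");
        List<String> shards = List.of("shard1", "shard2", "shard3");
        // With replicationFactor=3, every shard lands in all three zones, so losing one
        // zone removes at most one replica per shard and about a third of the capacity.
        placeAcrossZones(shards, zones, 3).forEach((shard, zs) ->
                System.out.println(shard + " -> " + zs));
    }
}

The same helper can be restricted to two zones (pass only two zone names) to cover the "all shards in the same two zones" variant described above.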

...

From Ilan:

A minimalistic autoscaling implementation would offer the following properties, expressed in deliberately vague terms:

  • Prevent multiple replicas of a shard from being placed on the same node,
  • Try to spread replicas across all nodes (random placement is OK),
  • Try to spread the replicas of different shards of the same collection across all nodes (so that concurrent execution of queries on multiple replicas can use the available compute power),
  • Try to spread leaders across all nodes.

Then, a periodic task/trigger that moves replicas and leaders around would correct any imbalance resulting from the above operations, which can therefore be imperfect, hence the use of "try" in the descriptions (this would also support adding a new, empty node, for example).
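A minimal sketch of how the placement rules above could look in code (plain Java, not Solr's actual plugin API; the node names and replica-count bookkeeping are assumptions for illustration): pick the least-loaded node among those that do not already host a replica of the shard, and let the periodic rebalancing task correct any remaining drift.

import java.util.*;

public class MinimalPlacementSketch {

    /** Chooses the node with the fewest replicas of the collection that does not already host this shard. */
    static Optional<String> chooseNode(Map<String, Integer> replicasPerNode,
                                       Set<String> nodesHostingShard) {
        return replicasPerNode.entrySet().stream()
                .filter(e -> !nodesHostingShard.contains(e.getKey())) // never co-locate replicas of a shard
                .min(Map.Entry.comparingByValue())                    // spread: prefer the least-loaded node
                .map(Map.Entry::getKey);
    }

    public static void main(String[] args) {
        // Hypothetical cluster state: replicas of one collection already placed per node.
        Map<String, Integer> replicasPerNode = new HashMap<>(Map.of("node1", 2, "node2", 1, "node3", 1));
        Set<String> nodesHostingShard = Set.of("node2"); // the shard already has a replica on node2

        chooseNode(replicasPerNode, nodesHostingShard).ifPresent(node -> {
            replicasPerNode.merge(node, 1, Integer::sum);
            System.out.println("place new replica on " + node); // node3 in this example
        });
    }
}

Leader spreading could follow the same pattern by counting leaders per node instead of replicas, and the group/AZ spreading mentioned below could reuse it with a node-group key in place of the node name.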

On top of the above, being able to spread not only across nodes but across groups of nodes (for example, groups representing AZs) would be helpful.
It would also be nice to have replicas added automatically when nodes go down.

...

Please add your user stories in the following sections...

...