Each AsterixDB cluster today consists of one or more Node Controllers (NCs) where the data is stored and processed.
Each NC has a predefined set of storage partitions (iodevices).
When data is ingested into the system, the data is hash-partitioned across the total number of storage partitions in the cluster.
Similarly, when the data is queried, each NC will start as many threads as the number of storage partitions it has in order to read and process the data in parallel.
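
The following is a minimal sketch of this behavior, not AsterixDB's actual code: it shows how a record's key might be hashed to one of the cluster's storage partitions, and how an NC could start one reader thread per local storage partition. The class and method names are illustrative.

```java
// Illustrative sketch only; not AsterixDB's implementation.
public final class HashPartitioningSketch {

    /** Map a record's primary key to one of the cluster's storage partitions. */
    static int storagePartitionOf(Object primaryKey, int totalStoragePartitions) {
        // Math.floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(primaryKey.hashCode(), totalStoragePartitions);
    }

    /** On a query, an NC starts one reader thread per local storage partition. */
    static void scanLocalPartitions(int[] localStoragePartitions) throws InterruptedException {
        Thread[] readers = new Thread[localStoragePartitions.length];
        for (int i = 0; i < readers.length; i++) {
            final int partition = localStoragePartitions[i];
            readers[i] = new Thread(() -> {
                // Read and process the data of this storage partition.
                System.out.println("scanning storage partition " + partition);
            });
            readers[i].start();
        }
        for (Thread t : readers) {
            t.join();
        }
    }
}
```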

While this shared-nothing architecture has its advantages, it has its drawbacks too.
One major drawback is the time needed to scale the cluster.
Adding a new NC to an existing cluster of (n) nodes means writing a completely new copy of the data, which must be re-hash-partitioned across the total number of storage partitions of the (n + 1) nodes.
This operation could take several hours or even days, which is unacceptable in the cloud age.

This APE is about adding a new deployment (cloud) mode to AsterixDB by implementing compute-storage separation to take advantage of the elasticity of the cloud.
This will require the following:

  1. Moving from the dynamic data partitioning described earlier to a static data partitioning based on a number of storage partitions that is configurable but fixed for the lifetime of a cluster.
  2. Introducing the concept of a "compute partition", where each NC will have a fixed number of compute partitions. This number could be based on the number of CPU cores the NC has.

This will decouple the number of storage partitions being processed on an NC from the number of its compute partitions.
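
As a concrete illustration of this decoupling, the sketch below assumes a cluster-wide storage partition count fixed at creation time and derives an NC's compute partition count from its CPU cores. The constant and its value are assumptions for illustration only.

```java
// Sketch of static partitioning: the storage partition count is fixed for the
// cluster's lifetime, while each NC derives its compute partition count from
// its hardware. Names and values are illustrative assumptions.
public final class StaticPartitioningSketch {

    // Chosen once at cluster creation and never changed afterwards (assumed value).
    static final int STORAGE_PARTITION_COUNT = 128;

    /** Each NC gets a fixed number of compute partitions, e.g. one per CPU core. */
    static int computePartitionCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("storage partitions (cluster-wide, fixed): " + STORAGE_PARTITION_COUNT);
        System.out.println("compute partitions on this NC: " + computePartitionCount());
    }
}
```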

When an AsterixDB cluster is deployed using the cloud mode, we will do the following:

  • The Cluster Controller will maintain a map containing the assignment of storage partitions to compute partitions (sketched in the code after this list).
  • New writes will be written to the NC's local storage and uploaded to an object store (e.g., AWS S3), which will be used as a highly available shared filesystem between NCs.
  • On queries, each NC will start as many threads as it has compute partitions to process its currently assigned storage partitions.
  • On scaling operations, we will simply update the assignment map and NCs will lazily cache any data of newly assigned storage partitions from the object store.
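
The sketch below illustrates the assignment map and a scale-out that only rewrites the map; no data is re-partitioned, and NCs would later cache newly assigned storage partitions from the object store on demand. The class, record, and method names are illustrative assumptions, not the actual AsterixDB API, and round-robin is used only as a stand-in for whatever balanced assignment policy is chosen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Cluster Controller's storage-to-compute partition assignment
// and of a scaling operation that only updates the map.
public final class PartitionAssignmentSketch {

    /** A compute partition is identified by its NC and a local index on that NC. */
    record ComputePartition(String ncId, int index) {
    }

    /** storage partition id -> compute partition currently responsible for it */
    static Map<Integer, ComputePartition> assign(int storagePartitions, List<ComputePartition> computePartitions) {
        Map<Integer, ComputePartition> map = new HashMap<>();
        for (int sp = 0; sp < storagePartitions; sp++) {
            // Round-robin keeps the load roughly even; any balanced policy would do.
            map.put(sp, computePartitions.get(sp % computePartitions.size()));
        }
        return map;
    }

    public static void main(String[] args) {
        List<ComputePartition> before = List.of(
                new ComputePartition("nc1", 0), new ComputePartition("nc1", 1),
                new ComputePartition("nc2", 0), new ComputePartition("nc2", 1));
        Map<Integer, ComputePartition> assignment = assign(8, before);
        System.out.println("before scale-out: " + assignment);

        // Scale-out: a new NC joins; only the assignment map changes. NCs then
        // lazily cache any newly assigned storage partitions from the object store.
        List<ComputePartition> after = new ArrayList<>(before);
        after.add(new ComputePartition("nc3", 0));
        after.add(new ComputePartition("nc3", 1));
        assignment = assign(8, after);
        System.out.println("after scale-out:  " + assignment);
    }
}
```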

Improvement tickets:
