Rebalance API and Internal Implementation

HTTP API

POST:
example:
DELETE:
example

Internal Implementation:

Metadata entity:
- Dataset:
  for each dataset record, there is one field called "rebalanceCount". If the dataset has never been rebalanced, it is Missing.
- Nodegroup:
  when a dataset foo is created, we internally create a nodegroup with name foo_i (or just foo if i=0) where i=foo.rebalanceCount based on the current available nodes. Then, we let the nodegroup of foo be foo_i.
Primary/secondary index file directory layout:
- If the rebalanceCount of the dataset is Missing, the file directory layout of indexes is the same as before - index files are directly under in the dataset's directory.
- If the rebalanceCount of the dataset is larger than 0, index files are under a nested directory in the dataset's directory with name rebalanceCount.
For each shadow dataset foo, repeat the following process:
1. create a new node group foo_i (where i= foo.rebalance_count is missing? 1: foo.rebalance_count + 1) that contains the current available nodes, if the node group has already been occupied, we let the new node group have name foo_<uuid>;
2. create an uncommitted dataset with same name foo (on node group foo_<i>) using node group foo_<i> with the same rebalance_count; (in the following description, we will call this dataset "rebalance target" and call the original dataset foo "rebalance source".)
3. drop any leftover files for rebalance target;
4. upsert all documents from rebalance source to rebalance target on all partitions
5. check the existence of foo – if foo does not exist in metadata, drop the files for rebalance target. Update the metadata entity of dataset foo switch to the rebalance target.
6. drop files of the rebalance source and drop node group foo_<i-1>
- There are three metadata transactions for step 1 to 6:
1. 1. step 1-4, locks – read lock on foo and read lock on node group foo.nodegroup.
  2. step 5: write lock on foo, conditional read lock on node group foo.nodegroup.
  3. step 6: read lock on foo and node group foo_(i-1) (the same as foo.nodegroup in metadata transaction a)

- The locks in metadata transaction a to c makes sure that read-only queries are allowed for the most time except metadata transaction b.
- We make sure step 5 and 6 can sustain InterruptedException, which means we will keep retrying metadata transaction 5 and 6 until success, in the event of interrupted exceptions.
Concurrency:
Since we cut the rebalance process into three metadata transactions, other metadata write operations could interleave with the rebalance process.
- CASE 1: if foo is dropped between metadata transaction a and b. At the beginning of step 5, we check the existence of foo and drop target files if foo is dropped between transaction a and b.
- CASE 2: if foo is dropped between metadata transaction b and c. This time, it's the rebalance target that gets dropped. Therefore, step 6 is independent to the drop operation.
Idempotent property:

- The whole rebalance process is idempotent in case it fails in the middle:
  - In metadata transaction a, since there is no metadata modifications and step 3 drops leftover files for the rebalance target, it is idempotent if failure or crash happens at any point.
  - In metadata transaction b and c, since they sustain InterruptedException, they will never be interrupted in the middle as long as the node doesn't crash.
  - In metadata transaction b and c, if the node or JVM crashes,
    - if the metadata entity was switched (depending on what's there in the metadata node), the system uses the rebalance source as foo;
    - if the metadata entity was switched, the system uses the rebalance target as foo;
  - In the event of failures, there could be leaked source files (from metadata transaction a) which will be reclaimed in the next rebalance operation, or leaked target files (from metadata transaction b) which will not be reclaimed.

Page tree

Rebalance API and Internal Implementation