...

  • Metadata entity:
    • Dataset: 
      Each dataset record has a field called "rebalanceCount". If the dataset has never been rebalanced, the field is Missing.
    • Nodegroup:
      When a dataset foo is created, we internally create a nodegroup named foo_i (or just foo if i=0), where i=foo.rebalanceCount, from the currently available nodes. We then set the nodegroup of foo to foo_i.

  • Primary/secondary index file directory layout
    • If the rebalanceCount of the dataset is 0, the file directory layout of indexes is the same as before.
    • If the rebalanceCount of the dataset is larger than 0, index files are placed in a nested directory inside the dataset's directory, named after the rebalanceCount value.
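The naming and layout rules above can be sketched as two small helpers. This is only an illustration: the function names, the treatment of a Missing rebalanceCount as 0, and the per-index subdirectory are assumptions made for the example, not details taken from the actual implementation.

```python
from pathlib import PurePosixPath
from typing import Optional


def nodegroup_name(dataset: str, rebalance_count: Optional[int]) -> str:
    """Return foo for count 0 (or a Missing count), otherwise foo_<count>."""
    count = rebalance_count or 0  # assumption: Missing behaves like 0
    return dataset if count == 0 else f"{dataset}_{count}"


def index_dir(dataset_dir: str, index: str,
              rebalance_count: Optional[int]) -> PurePosixPath:
    """Return the directory holding one index's files (hypothetical layout)."""
    count = rebalance_count or 0
    base = PurePosixPath(dataset_dir)
    # Count 0: layout unchanged.
    # Count > 0: nest under a directory named after the count.
    return base / index if count == 0 else base / str(count) / index


print(nodegroup_name("foo", 2))          # foo_2
print(index_dir("storage/foo", "idx", 3))  # storage/foo/3/idx
```

Keeping the count-0 case identical to the old layout means datasets that were never rebalanced need no file migration.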

  • For each shadow dataset foo, repeat the following process:
    1. create a new node group foo_<i> (where i = foo.rebalance_count + 1) that contains the currently available nodes; if that node group name is already taken, name the new node group foo_<uuid> instead;

    2. create an uncommitted dataset with the same name foo using node group foo_<i> with the same rebalance_count (in the following description, we call this dataset the "rebalance target" and the original dataset foo the "rebalance source");

    3. drop any leftover files for the rebalance target;

    4. upsert all documents from the rebalance source to the rebalance target on all partitions;

    5. check the existence of foo: if foo does not exist in metadata, drop the files for the rebalance target; otherwise, update the metadata entity of dataset foo to switch to the rebalance target;

    6. drop the files of the rebalance source and drop node group foo_<i-1>.


    • There are three metadata transactions for steps 1 to 6:

      a. steps 1-4: read lock on foo and read lock on node group foo.nodegroup.

      b. step 5: write lock on foo, conditional read lock on node group foo.nodegroup.

      c. step 6: read lock on foo and on node group foo.nodegroup_(i-1) (the same as foo.nodegroup in metadata transaction a).


    • The locks in metadata transactions a to c ensure that read-only queries are allowed most of the time, except during metadata transaction b.

  • Concurrency:
          Since we cut the rebalance process into three metadata transactions, other metadata write operations could interleave with the rebalance process.
    • CASE 1: foo is dropped between metadata transactions a and b. At the beginning of step 5, we check the existence of foo and, if it no longer exists, drop the rebalance target files.
    • CASE 2: foo is dropped between metadata transactions b and c. This time, it is the rebalance target that gets dropped. Therefore, step 6 is independent of the drop operation.
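The three metadata transactions and the two interleaving cases can be illustrated with a toy in-memory model. Everything here is hypothetical: the `Cluster` class, the `txn_a`/`txn_b`/`txn_c` functions, and the dictionary-based "files" stand in for real metadata and storage, and locking is not modelled. The sketch also assumes that the committed target records rebalance_count = i.

```python
import uuid


class Cluster:
    """Toy stand-in for metadata and storage (hypothetical model)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.nodegroups = {}  # node group name -> list of nodes
        self.datasets = {}    # dataset name -> committed metadata record
        self.files = {}       # (dataset, node group) -> list of records

    def create_dataset(self, name, records):
        self.nodegroups[name] = list(self.nodes)
        self.datasets[name] = {"nodegroup": name, "rebalance_count": 0}
        self.files[(name, name)] = list(records)

    def drop_dataset(self, name):
        meta = self.datasets.pop(name)
        self.files.pop((name, meta["nodegroup"]), None)
        self.nodegroups.pop(meta["nodegroup"], None)


def txn_a(cluster, name):
    """Build the rebalance target while the source stays readable."""
    source = cluster.datasets[name]
    count = source["rebalance_count"] + 1
    target_ng = f"{name}_{count}"
    if target_ng in cluster.nodegroups:  # name already occupied
        target_ng = f"{name}_{uuid.uuid4().hex}"
    cluster.nodegroups[target_ng] = list(cluster.nodes)
    cluster.files.pop((name, target_ng), None)  # drop leftover target files
    # Copy ("upsert") every record from the source to the uncommitted target.
    source_files = cluster.files[(name, source["nodegroup"])]
    cluster.files[(name, target_ng)] = list(source_files)
    return source["nodegroup"], target_ng, count


def txn_b(cluster, name, target_ng, count):
    """Switch metadata to the target, or clean up if foo was dropped."""
    if name not in cluster.datasets:  # CASE 1: dropped between a and b
        cluster.files.pop((name, target_ng), None)
        return False
    cluster.datasets[name] = {"nodegroup": target_ng, "rebalance_count": count}
    return True


def txn_c(cluster, name, source_ng):
    """Drop the source files and node group, whatever happened to the target."""
    cluster.files.pop((name, source_ng), None)
    cluster.nodegroups.pop(source_ng, None)


cluster = Cluster(["n1", "n2"])
cluster.create_dataset("foo", [1, 2, 3])
cluster.nodes.append("n3")  # a node joins before the rebalance
src, tgt, cnt = txn_a(cluster, "foo")
if txn_b(cluster, "foo", tgt, cnt):
    txn_c(cluster, "foo", src)
print(cluster.datasets["foo"])  # {'nodegroup': 'foo_1', 'rebalance_count': 1}
```

Because `txn_c` only touches the source's files and node group, it behaves the same whether or not the dataset was dropped after `txn_b`, which mirrors CASE 2.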

  • Idempotent property: