  • We should also periodically check that ledgers are available. Where should this check run from?

The problem can be split into two parts: detection and recovery.

h1. Detection

Currently, we have no automated mechanism to check whether a bookie contains all the ledger entries it should. Missing entries can go unnoticed and leave ledgers underreplicated across the whole system. We need a mechanism to ensure that a bookie contains the entries that ZooKeeper says it does.

The brute-force mechanism here would be for each bookie to get a list of the ledger fragments it should have, then read every entry in each fragment and check that its checksum is correct. A lighter approach would be to check only the first and last entry of each fragment. Even this could be expensive on systems that have many small ledgers, though, since the number of reads grows with the number of fragments.
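The two checking strategies above can be sketched as follows. This is only an illustration under stated assumptions: `Fragment`, `read_entry`, and `bad_entries` are hypothetical names, not BookKeeper APIs, and CRC32 stands in for whatever checksum the entries actually carry.

```python
import zlib
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical description of a ledger fragment a bookie should hold."""
    ledger_id: int
    first_entry: int
    last_entry: int  # inclusive


# Hypothetical reader: returns (payload, stored_checksum),
# or None if the bookie does not have the entry.
ReadEntry = Callable[[int, int], Optional[Tuple[bytes, int]]]


def entries_to_check(fragment: Fragment, full: bool) -> List[int]:
    """Pick entry ids: all of them (brute force) or only first and last."""
    if full:
        return list(range(fragment.first_entry, fragment.last_entry + 1))
    # A set avoids reading the same entry twice in a single-entry fragment.
    return sorted({fragment.first_entry, fragment.last_entry})


def bad_entries(fragment: Fragment, read_entry: ReadEntry,
                full: bool = False) -> List[int]:
    """Return the checked entry ids that are missing or fail the checksum."""
    bad = []
    for entry_id in entries_to_check(fragment, full):
        result = read_entry(fragment.ledger_id, entry_id)
        if result is None:
            bad.append(entry_id)  # entry is missing
            continue
        payload, stored_checksum = result
        # Assumption: CRC32 stands in for the real entry checksum.
        if zlib.crc32(payload) != stored_checksum:
            bad.append(entry_id)  # entry is corrupt
    return bad
```

In this sketch the light check issues at most two reads per fragment, so it cannot notice an entry missing from the middle of a fragment; only the full scan catches that case.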

What about the case where a whole bookie disappears?

h1. Recovery

Once we detect that a fragment is underreplicated, who should run the process to recover it? How do we prevent two actors from attempting to recover a fragment at the same time and potentially overloading the system?
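One possible answer to the second question is to have each actor claim a fragment before recovering it, so that only the first claimant proceeds. The sketch below shows the idea with an in-memory table; `RecoveryClaims` and `recover_if_unclaimed` are hypothetical names. A real deployment would need the claim to live in a shared coordination service (for example, a ZooKeeper ephemeral node created atomically), which is an assumption here rather than a decided design.

```python
import threading
from typing import Callable, Dict, Tuple

FragmentKey = Tuple[int, int]  # hypothetical key: (ledger_id, first_entry)


class RecoveryClaims:
    """In-memory stand-in for a shared, atomic create-if-absent claim table."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owners: Dict[FragmentKey, str] = {}

    def try_claim(self, key: FragmentKey, worker: str) -> bool:
        """Claim the fragment for `worker`; fail if someone already holds it."""
        with self._lock:
            if key in self._owners:
                return False
            self._owners[key] = worker
            return True

    def release(self, key: FragmentKey, worker: str) -> None:
        """Drop the claim, but only if `worker` is the current holder."""
        with self._lock:
            if self._owners.get(key) == worker:
                del self._owners[key]


def recover_if_unclaimed(claims: RecoveryClaims, key: FragmentKey, worker: str,
                         recover: Callable[[FragmentKey], None]) -> bool:
    """Run `recover` only if this worker wins the claim; always release after."""
    if not claims.try_claim(key, worker):
        return False  # another actor is already recovering this fragment
    try:
        recover(key)
        return True
    finally:
        claims.release(key, worker)
```

Releasing in a `finally` block means a failed recovery does not leave the fragment claimed forever within this process. A crashed actor is a separate problem, and it is the reason a session-bound claim such as an ephemeral node would be attractive.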