This page covers fsck and autorecovery of ledgers and bookies. The goal is to discuss and agree on a clear story of what needs to be done, from which we can then derive a list of JIRAs.

State of the plan (as of 14 June 2012)

We have two new entities running on bookies: the auditor and the recovery worker.

The Auditor

There is only one auditor in the system at any time. Each bookie runs an auditor thread, and these threads use ZooKeeper to elect which one acts as the auditor. If the auditor fails, the election is run again.
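The election can be pictured with a small in-memory sketch. The plan does not say which election recipe is used, so this assumes the common ZooKeeper pattern in which each candidate creates an ephemeral sequential node and the lowest sequence number wins. The `Election` class and the bookie names are hypothetical and do not talk to a real ZooKeeper.

```python
class Election:
    """In-memory stand-in for a ZooKeeper election znode (hypothetical)."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> bookie id

    def join(self, bookie_id):
        """Register a candidate, like creating an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = bookie_id
        return seq

    def leave(self, seq):
        """Drop a candidate, like an ephemeral node vanishing when its owner fails."""
        self._candidates.pop(seq, None)

    def auditor(self):
        """Assumed rule: the candidate with the lowest sequence number is the auditor."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = Election()
seqs = {b: election.join(b) for b in ["bookie-1", "bookie-2", "bookie-3"]}
first = election.auditor()    # elected auditor
election.leave(seqs[first])   # the current auditor fails
second = election.auditor()   # rerunning the election yields a new auditor
```

In this sketch the election is rerun simply by asking for the lowest remaining sequence number, which keeps exactly one auditor as long as any bookie is alive.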

The role of the auditor is to watch for bookie failures and, when a failure occurs, to mark all ledgers with fragments on that bookie for rereplication.

The ledgers to be rereplicated are placed under a znode, /ledgers/underreplicated.
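As a rough illustration of the auditor's job, the sketch below selects every ledger that has a fragment on a failed bookie. The metadata layout (ledger id mapped to a list of fragments, each a list of bookie names) and the function name are assumptions made for the example, and a plain set stands in for the children of /ledgers/underreplicated.

```python
def ledgers_to_rereplicate(ledger_fragments, failed_bookie):
    """Return the sorted ids of ledgers with at least one fragment on failed_bookie."""
    return sorted(
        ledger_id
        for ledger_id, fragments in ledger_fragments.items()
        if any(failed_bookie in bookies for bookies in fragments)
    )


# Hypothetical metadata: ledger id -> fragments, each fragment a list of bookies.
ledger_fragments = {
    1: [["b1", "b2"], ["b2", "b3"]],
    2: [["b1", "b3"]],
    3: [["b2", "b3"]],
}

# Stand-in for the children of /ledgers/underreplicated.
underreplicated = set()
underreplicated.update(ledgers_to_rereplicate(ledger_fragments, "b1"))
```

Note that the whole ledger is marked even if only one of its fragments lived on the failed bookie, which matches the description above.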

The recovery worker

Each bookie in the cluster runs a recovery worker. The recovery worker watches /ledgers/underreplicated for new ledgers to appear. When a ledger appears, the recovery worker locks it and runs rereplication on it. If the recovery worker fails to acquire the lock, it moves on to the next ledger.
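The lock-or-skip behaviour of the recovery worker can be sketched as follows. This is a single-process illustration only: `UnderreplicatedLedgers`, `run_worker_once`, and the worker names are hypothetical, and a real implementation would presumably need an atomic lock in ZooKeeper rather than a dictionary.

```python
class UnderreplicatedLedgers:
    """In-memory stand-in for /ledgers/underreplicated plus its locks (hypothetical)."""

    def __init__(self, ledger_ids):
        self.pending = list(ledger_ids)
        self.locks = {}  # ledger id -> worker holding the lock

    def try_lock(self, ledger_id, worker):
        """Take the lock if it is free; return False if another worker holds it."""
        if ledger_id in self.locks:
            return False
        self.locks[ledger_id] = worker
        return True

    def mark_replicated(self, ledger_id):
        """Remove a recovered ledger and release its lock."""
        self.pending.remove(ledger_id)
        self.locks.pop(ledger_id, None)


def run_worker_once(store, worker, rereplicate):
    """Lock and rereplicate the first free ledger; return its id, or None."""
    for ledger_id in list(store.pending):
        if not store.try_lock(ledger_id, worker):
            continue  # another worker holds the lock, so try the next ledger
        rereplicate(ledger_id)
        store.mark_replicated(ledger_id)
        return ledger_id
    return None


store = UnderreplicatedLedgers([7, 8])
store.try_lock(7, "worker-a")  # worker-a is already busy with ledger 7
done = []
picked = run_worker_once(store, "worker-b", done.append)
```

Here worker-b skips ledger 7 because it is locked and recovers ledger 8 instead, so two workers never rereplicate the same ledger at once.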

Open Questions

We should also periodically check that ledgers are available. Where should this check run from?

The problem can be split into two parts, detection and recovery.

h1. Detection

Currently, we have no automated mechanism to check whether a bookie contains all the ledger entries it should, which can potentially lead to underreplication across the whole system. We need a mechanism to ensure that a bookie contains the entries which ZooKeeper says it does.

The brute-force mechanism here would be for each bookie to get a list of the ledger fragments it should have, then read all entries in each fragment and check that each checksum is correct. A lighter approach would be to check only the first and last entry of a fragment. Even this could be expensive on systems with many small ledgers, though.
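The trade-off between the two approaches can be seen in a small sketch. It assumes each stored entry is kept as a (payload, checksum) pair and uses CRC32 from Python's zlib as a stand-in for whatever checksum the ledger actually uses; the function names and data layout are hypothetical.

```python
import zlib


def entries_to_check(first_entry, last_entry, full_scan=False):
    """Pick entry ids to verify: every entry, or only the first and last."""
    if full_scan:
        return list(range(first_entry, last_entry + 1))
    return sorted({first_entry, last_entry})


def check_fragment(stored, first_entry, last_entry, full_scan=False):
    """Return the checked entry ids that are missing or fail their checksum."""
    bad = []
    for entry_id in entries_to_check(first_entry, last_entry, full_scan):
        record = stored.get(entry_id)
        if record is None:
            bad.append(entry_id)  # entry is missing from this bookie
            continue
        payload, checksum = record
        if zlib.crc32(payload) != checksum:
            bad.append(entry_id)  # entry is present but corrupt
    return bad


# A fragment covering entries 0..4, stored as entry id -> (payload, checksum).
stored = {i: (b"entry-%d" % i, zlib.crc32(b"entry-%d" % i)) for i in range(5)}
del stored[2]                             # a missing middle entry
stored[4] = (b"entry-X", stored[4][1])    # a corrupted last entry

light = check_fragment(stored, 0, 4)                  # first and last only
full = check_fragment(stored, 0, 4, full_scan=True)   # brute force
```

The light check finds the corrupted last entry but misses the hole at entry 2, which only the full scan detects. That is the cost of the cheaper approach.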

What about the case where a whole bookie disappears?

h1. Recovery

Once we detect that a fragment is underreplicated, who should run the process to recover it? How do we prevent two actors from attempting to recover a fragment at the same time and potentially overloading the system?
