This page discusses fsck-style checking and autorecovery of ledgers and bookies, so that we can agree on a clear story of what needs to be done and then derive a list of JIRAs from it.

State of the plan (as of 14 June 2012)

...

Within the system there is only one auditor. Each bookie runs an auditor thread, and the threads use ZooKeeper to elect which one acts as the auditor. If the auditor fails, the election is run again. As per the latest discussion, the auditor may be started as a separate process instead of running alongside the bookies.
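The election behaviour can be illustrated with a small in-memory sketch. It assumes the common ZooKeeper recipe in which each candidate creates an ephemeral sequential node and the lowest sequence number wins; the plan above does not say which recipe is used, so treat this as one possible reading. `ElectionRegistry`, `join`, `leave` and `auditor` are hypothetical names, and no real ZooKeeper client is involved.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a ZooKeeper election path (hypothetical)."""

    def __init__(self):
        self._seq = itertools.count()
        self._candidates = {}  # sequence number -> bookie id

    def join(self, bookie_id):
        """Register a candidate, like creating an ephemeral sequential node."""
        seq = next(self._seq)
        self._candidates[seq] = bookie_id
        return seq

    def leave(self, seq):
        """Drop a candidate, like an ephemeral node vanishing on session loss."""
        self._candidates.pop(seq, None)

    def auditor(self):
        """Return the candidate with the lowest sequence number, or None."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


registry = ElectionRegistry()
seats = {b: registry.join(b) for b in ["bookie-1", "bookie-2", "bookie-3"]}

first = registry.auditor()    # exactly one auditor at a time
registry.leave(seats[first])  # the current auditor fails
second = registry.auditor()   # the election is run again
```

With a real ZooKeeper ensemble, the `leave` step would happen automatically when the failed auditor's session expires, and the remaining candidates would learn of it through a watch.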

The role of the auditor is to watch for bookie failures and, when a failure occurs, to mark all ledgers with fragments on that bookie for rereplication.
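As a rough sketch of that marking step, the function below selects every ledger that has a fragment on the failed bookie. The `ledger_fragments` mapping (ledger id to the set of bookies holding its fragments) and the function name are assumptions made for illustration; how the auditor actually obtains this mapping is not specified here.

```python
def ledgers_to_rereplicate(ledger_fragments, failed_bookie):
    """Return the ids of all ledgers with a fragment on failed_bookie.

    ledger_fragments: dict mapping ledger id -> set of bookie ids
    (hypothetical layout, assumed for this sketch).
    """
    return sorted(
        ledger_id
        for ledger_id, bookies in ledger_fragments.items()
        if failed_bookie in bookies
    )


fragments = {
    1: {"bookie-1", "bookie-2"},
    2: {"bookie-2", "bookie-3"},
    3: {"bookie-1", "bookie-3"},
}

# In-memory stand-in for the set of ledgers marked as underreplicated.
underreplicated = set()
underreplicated.update(ledgers_to_rereplicate(fragments, "bookie-2"))
```

Ledger 3 has no fragment on `bookie-2`, so it is left unmarked.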

...

Each bookie in the cluster runs a recovery worker. The recovery worker watches /ledgers/underreplicated for new ledgers to appear. When a ledger appears, the recovery worker locks it and runs rereplication on it. If the recovery worker fails to acquire the lock, it tries the next ledger.
On successful rereplication, the recovery worker deletes the ledger from /ledgers/underreplicated and releases the lock.
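The worker loop described above can be sketched against an in-memory stand-in for /ledgers/underreplicated. `UnderreplicatedQueue`, `try_lock`, `complete` and `run_worker_once` are hypothetical names, the `rereplicate` callback is a placeholder for the real copy step, and what happens when rereplication fails is left out because the plan does not cover it.

```python
class UnderreplicatedQueue:
    """In-memory stand-in for /ledgers/underreplicated plus its locks."""

    def __init__(self, ledger_ids):
        self.ledgers = set(ledger_ids)
        self.locks = {}  # ledger id -> worker holding the lock

    def try_lock(self, ledger_id, worker):
        """Take the lock if it is free; return False if another worker has it."""
        if ledger_id in self.locks:
            return False
        self.locks[ledger_id] = worker
        return True

    def complete(self, ledger_id, worker):
        """Delete the ledger from the queue and release the lock."""
        if self.locks.get(ledger_id) != worker:
            raise RuntimeError("worker does not hold the lock")
        self.ledgers.discard(ledger_id)
        del self.locks[ledger_id]


def run_worker_once(queue, worker, rereplicate):
    """Make one pass over the queue; return the ledgers this worker handled."""
    handled = []
    for ledger_id in sorted(queue.ledgers):  # sorted() copies, so removal is safe
        if not queue.try_lock(ledger_id, worker):
            continue  # locked by someone else: try the next ledger
        rereplicate(ledger_id)
        queue.complete(ledger_id, worker)
        handled.append(ledger_id)
    return handled


queue = UnderreplicatedQueue([3, 1, 2])
queue.try_lock(2, "worker-b")  # another worker already holds ledger 2
done = run_worker_once(queue, "worker-a", lambda ledger_id: None)
```

After the pass, ledgers 1 and 3 are gone from the queue, while ledger 2 stays until `worker-b` finishes with it.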

Open Questions

We should also periodically check that ledgers are available. Where should this check run from?

...
