This page discusses fsck and autorecovery of ledgers and bookies, so that we can agree on a clear story of what needs to be done and then derive a list of JIRAs from it.

State of the plan (as of 14 June 2012)

...

Within the system there is only one auditor. Each bookie runs an auditor thread, and the threads use ZooKeeper to elect which one acts as the auditor. If the auditor fails, the election is run again. As per the latest discussion, the auditor may instead be started as a separate process rather than running alongside the bookies.
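
The election behaviour can be illustrated with a small in-memory simulation. This is only a sketch: it assumes the common ZooKeeper recipe in which each candidate registers an ephemeral sequential node and the lowest sequence number wins, which this page does not specify. The class ElectionSim and its methods are hypothetical names and do not call any real ZooKeeper client.

```python
class ElectionSim:
    """In-memory stand-in for ephemeral sequential znodes (hypothetical, not a ZooKeeper client)."""

    def __init__(self):
        self._next_seq = 0
        self._candidates = {}  # sequence number -> bookie id

    def join(self, bookie_id):
        """Register a candidate; mimics creating an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self._candidates[seq] = bookie_id
        return seq

    def fail(self, bookie_id):
        """Drop a failed bookie's entry; mimics its ephemeral node disappearing."""
        self._candidates = {
            seq: b for seq, b in self._candidates.items() if b != bookie_id
        }

    def auditor(self):
        """Return the current auditor: the candidate with the lowest sequence number."""
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


election = ElectionSim()
for bookie in ["bookie-1", "bookie-2", "bookie-3"]:
    election.join(bookie)

first = election.auditor()
election.fail(first)          # the auditor dies, so the election is run again
second = election.auditor()
```

At any time exactly one candidate is the auditor, and a failure simply promotes the next registered candidate.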

The role of the auditor is to watch for bookie failures and, when a failure occurs, to mark every ledger that has fragments on the failed bookie for rereplication.
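
As a rough sketch of that marking step, the function below selects the ledgers that need rereplication after a bookie failure. The data layout (a dict from ledger id to a list of fragments, each fragment being the set of bookies that store it) and the name ledgers_to_rereplicate are assumptions made for illustration, not the actual metadata format.

```python
def ledgers_to_rereplicate(ledger_fragments, failed_bookie):
    """Return the sorted ids of ledgers with at least one fragment on failed_bookie.

    ledger_fragments: dict mapping ledger id -> list of fragments, where each
    fragment is a set of bookie ids (assumed layout, for illustration only).
    """
    return sorted(
        ledger_id
        for ledger_id, fragments in ledger_fragments.items()
        if any(failed_bookie in fragment for fragment in fragments)
    )


# Hypothetical cluster state: three ledgers spread over four bookies.
fragments = {
    1: [{"bookie-1", "bookie-2"}, {"bookie-2", "bookie-3"}],
    2: [{"bookie-3", "bookie-4"}],
    3: [{"bookie-1", "bookie-4"}],
}
marked = ledgers_to_rereplicate(fragments, "bookie-1")
```

In the plan described here, the auditor would then record each returned ledger under /ledgers/underreplicated.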

...

Each bookie in the cluster runs a recovery worker. The recovery worker watches /ledgers/underreplicated for new ledgers to appear. When a ledger appears, the recovery worker locks it and runs rereplication on it. If the recovery worker fails to acquire the lock, it moves on to the next ledger.
On successful rereplication, the recovery worker deletes the ledger from /ledgers/underreplicated and releases the lock.
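
The worker loop described above can be sketched as follows. UnderreplicatedStore is a hypothetical in-memory stand-in for the /ledgers/underreplicated subtree and its locks, and rereplicate is a caller-supplied function. Releasing the lock when rereplication fails, so that another worker can retry, is an assumption; the page only describes the success path.

```python
class UnderreplicatedStore:
    """In-memory stand-in for /ledgers/underreplicated plus per-ledger locks (hypothetical)."""

    def __init__(self, ledger_ids):
        self.pending = set(ledger_ids)  # ledgers waiting for rereplication
        self.locks = {}                 # ledger id -> name of the worker holding the lock

    def try_lock(self, ledger_id, worker):
        """Acquire the lock if it is free; return False if another worker holds it."""
        if ledger_id in self.locks:
            return False
        self.locks[ledger_id] = worker
        return True

    def release(self, ledger_id, worker):
        """Release the lock, but only if this worker is the holder."""
        if self.locks.get(ledger_id) == worker:
            del self.locks[ledger_id]

    def mark_replicated(self, ledger_id):
        """Remove the ledger from the underreplicated set."""
        self.pending.discard(ledger_id)


def run_worker_pass(store, worker, rereplicate):
    """Make one pass over the pending ledgers and return the ids that were rereplicated."""
    done = []
    for ledger_id in sorted(store.pending):  # sorted() copies, so removal is safe
        if not store.try_lock(ledger_id, worker):
            continue  # locked by another worker: try the next ledger
        try:
            if rereplicate(ledger_id):
                store.mark_replicated(ledger_id)
                done.append(ledger_id)
        finally:
            # Assumption: the lock is released on failure as well as on success.
            store.release(ledger_id, worker)
    return done


store = UnderreplicatedStore([1, 2, 3])
store.try_lock(2, "worker-b")  # another worker is already handling ledger 2
finished = run_worker_pass(store, "worker-a", lambda ledger_id: True)
```

After this pass, ledgers 1 and 3 are gone from the pending set, while ledger 2 stays pending and locked by the other worker.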

Open Questions

  • We should also periodically check that ledgers are available. Where should this check run from?

...