Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

This document describes how Majordodo takes advantage of Bookeeper in order to implement a replicated state machine with no need of sharing any other medium other than BookKeeper. The global status of the system is kept on a bunch of machines (Brokers), being every change to the status first written to the log and then applied to memory.

An introduction to Majordodo

Majordodo is a an opensource (Apache 2 Licensed) Distributed Resource Manager developed at Diennea, with the main purpose to allocate resources to execute lots of concurrent tasks submitted by lots of users.

...

Majordodo is built upon Apache BookKeeper and Apache ZooKeeper, leveraging these powerful systems to implement replication and face all the usual distributed computing issues.

Majordodo and ZooKeeper

Majordodo Clients use ZooKeeper to discover active Brokers on the network. On the Broker side Majordodo exploits ZooKeeper for many situations, using it directly to address leader election, advertise the presence of services on the network and keep metadata about BookKeeper ledgers. BookKeeper, in turn, uses ZooKeeper for Bookie discovery and for ledger metadata storage.

...

In order to support Brokers discovery each Broker advertises its presence on an ephemeral sequential node (such as '/majordodo/discoverypath/brokers0000007791').

BookKeeper Usage Overview

Apache BookKeeper is used to implement a distributed commit log with a shared-nothing architecture: no shared disk or database is needed to let the Brokers share the same view of the global status of the system. BookKeeper is ideal for replicating the state of Brokers, where the leader Broker keeps the global view of the status of the system in memory and logs every change to a Ledger. BookKeeper is used as a write-ahead commit log, that is, every change to the status is written to the log and then applied to the in-memory status. Other Brokers (referred to as 'followers') 'tail' the log and apply each change to their own copy of the status.

Since BookKeeper ledgers can be written only once, if another Broker starts the recovery process and opens the ledger for reading it, automatically fences the previous writer so as to allow no more writes on that ledger. Thus, in case of leadership change, for instance in case of temporary network failures, the 'past' leader Broker is no longer able to log entries and, as a consequence, cannot 'change' anymore the global status of the system in memory.

Tailing the Log and Replicating System Status

Followers continuously read BookKeeper logs and replay each action on their local copy of system state.

...

This 'tail' operation is very fast because of the internal design of the Bookie service that keeps in memory the last entries written by the client.

Snapshots

The only shared structures between Brokers are the ZooKeeper 'filesystem' and the BookKeeper ledgers, but logs cannot be retained forever, accordingly each Broker must periodically take a snapshot of its own in-memory view of the status and persist it to disk in order to recover quickly and in order to let BookKeeper release space and resources.

...

If no local snapshot is available the booting Broker discovers an active Broker in the network and downloads a valid snapshot from the network. For a Majordodo Broker with 500.000 active tasks the medium gzipped snapshot file size is around 20 MBytes, depending on task data payload.

Ledgers Management

The leader Broker keeps the list of active ledgers on ZooKeeper, storing for each ledger its id and creation timestamp.

...