Cluster Design Note

A Qpid cluster is a group of brokers co-operating to provide the
illusion of a single "virtual broker" with some extra qualities:

  • The cluster continues to provide service as long as at least one member survives.
  • If a client is disconnected unexpectedly from a cluster member it can fail-over to another cluster member, giving the impression of uninterupted service.

This design discusses clustering at the AMQP protocol layer,
i.e. members of a cluster have distinct AMQP addresses and AMQP
protocol commands are exchanged to negotiate reconnection and
failover. "Transparent" failover in this context means transparent to
the application: the AMQP client library will be aware of disconnects
and must take action to fail over.

More precisely we define transparent failover to mean this: in the
event of failover, the sequence of protocol commands sent and received
by the client on each channel, excluding failover-related commands, is
identical to the sequence that would have been sent and received if no
failure had occurred.

Given this definition the failover component of an AMQP client library
can simply hide all failover-related commands from the application
(indeed from the rest of the library) without breaking any semantics.
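
A minimal sketch of this definition (Python; the command names here are
hypothetical placeholders, not a real command set): strip the
failover-related commands from a recorded per-channel trace and compare
with the failure-free trace.

Code Block

# Illustrative sketch, not part of any Qpid API. Command names are placeholders.

FAILOVER_COMMANDS = {"connection.redirect", "session.resume", "session.resume-ok"}

def application_view(trace):
    """The command sequence the application sees: everything except
    failover-related commands, which the client library hides."""
    return [cmd for cmd in trace if cmd not in FAILOVER_COMMANDS]

def failover_is_transparent(trace_with_failover, failure_free_trace):
    # Failover is transparent iff the filtered sequence matches the
    # sequence that would have occurred with no failure.
    return application_view(trace_with_failover) == failure_free_trace

# Example: a publish/ack conversation interrupted by a failover.
trace = ["basic.publish", "session.resume", "session.resume-ok", "basic.ack"]
assert failover_is_transparent(trace, ["basic.publish", "basic.ack"])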

Requirements

TODO: Define levels of reliability we want to provide - survive one
node failure, survive multiple node failures, survive total failure,
network partitions etc. Does durable/non-durable message distinction
mean anything in a reliable cluster? I.e. can we lose non-durable
messages on a node failure? Can we lose them on orderly shutdown or
total failure?

TODO: The requirements triangle. Concrete performance data.

Clients only need standard AMQP to talk to a cluster; they must
understand some AMQP extensions in order to fail over or resume
sessions. We will also use AMQP for point-to-point communication
within the cluster.

Ultimately we may propose extensions to the AMQP spec, but for the
initial implementation we can use existing extension points:

  • Field table parameters to various AMQP methods (declare() arguments etc.)
  • Field table in message headers.
  • System exchanges and queues.

Abstract Model and Terms

A quick re-cap of AMQP terminology and introduction to some new terms:

A broker is a container for 3 types of broker components: queues,
exchanges and bindings. Broker components represent resources
available to multiple clients, and are not affected by clients
connecting and disconnecting. Persistent broker components are
unaffected by shut-down and re-start of a broker.

A client uses the components contained in a broker via the AMQP
protocol. The client components are connection, channel, consumer and
session. (Footnote: the "session" concept is not fully defined in AMQP
0-8 or 0-9 but is under discussion; this design note will define a
session that may be proposed to AMQP.) Client components represent the
relationship between a client and a broker.

TODO: Where do transactions fit in the model? They are also a kind of
relationship component, but Dtx transactions may span more than just
client/broker.

A client's interaction with an unclustered individual broker
(footnote: an individual broker by this definition is really a broker
behind a single AMQP address; such a broker might in fact be a cluster
using technologies like TCP failover/load balancing, but that is
outside the scope of this design, which focusses on clustering at the
AMQP protocol layer, where cluster members have separate AMQP
addresses) is defined by AMQP 0-8/0-9: create a connection to the
broker's address, create channels, exchange AMQP commands (which may
create consumers), disconnect. After a disconnect the client can
reconnect to the same broker address. Broker components created by the
previous connection are preserved but client components are not. In
the event of a disorderly disconnect the outcome of commands in flight
can only be determined by their effects on broker components.

A broker cluster (or just cluster) is a "virtual" broker implemented
by several member brokers (or just members). A cluster has many AMQP
addresses - the addresses of all its members - all semantically
equivalent for client connection (footnote: they may not be equivalent
on other grounds, e.g. network distance from the client, load, etc.).
The cluster members co-operate to ensure:

  • all broker components are available via any cluster member.
  • all broker components remain available if a member fails, provided at least one member remains active.
  • clients disconnected by member failure or network failure can reconnect to another member and resume their session.

A session is an identity for a collection of client components (i.e. a
client-broker relationship) that can outlive a single connection. AMQP
0-9 provides some support for sessions in the `resume` command.

If a connection is closed in an orderly manner by either end, the
session is also closed. However if there is an abrupt disconnect with
no Connection.close command, the session remains viable for some
(possibly long) timeout period. The client can reconnect to a failover
candidate and resume.

A session is like an extended connection: if you cut out the failover
commands the conversation on a session is exactly the conversation the
client would have had on a single connection to an individual broker
with no failures.

If a connection is in a session, events in the AMQP 0-8 spec that are
triggered by closing the connection (e.g. deleting auto-delete queues)
are instead triggered by the close (or timeout) of the session.

Note the session concept could also be used in the absence of
clustering to allow a client to disconnect and resume a long-running
session with the same broker. This is outside the scope of this
design.

Cluster State

We have to consider several kinds of state:

  • Wiring: existence of exchanges, queues and the bindings between them.
  • Content: messages on queues and references under construction.
  • Cluster membership: the list of members.
  • Conversational state: sessions, channels, prefetch windows, consumers, acks, etc.

Wiring and content need to be replicated in a fault-tolerant way
among the brokers so that all brokers can provide access to all
resources. Persistent data also needs to be stored persistently for
recovery from total failure or deliberate shutdown.

Cluster membership needs to be replicated among brokers and passed to
the client. The client needs to know which members are candidates for
failover.

Conversational state relates to a client-broker relationship:

  • session identity.
  • open channels, channel attributes (qos).
  • consumers.
  • commands "in flight" - sent but not acknowledged.
  • acknowledged command history.

To resume successfully, the conversational state must be re-established
on both sides. There are several options for how much state is
stored where. We'll outline the solution that minimizes broker-side
replication, but it's not clear yet whether this is the optimal choice.

To minimize conversational replication on the broker, the broker must
replicate at least:

  • session identities: to recognize the session on reconnect.
  • history of acknowledged commands: to ignore duplicates resent by the client.
  • commands in flight: to resend to client and/or handle client responses.

Everything else can be re-established by the client:

  • re-open the channels.
  • re-set qos settings.
  • re-create consumers.
  • ignore incoming duplicates during reconnect.
  • re-send all unacknowledged commands.

Note that all of the above can be accomplished with normal AMQP commands.
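
A minimal sketch of this split, using hypothetical names rather than
any real Qpid API: the broker side keeps only session identity,
acknowledged-command history and in-flight commands; the client side
re-creates everything else and resends unacknowledged commands.

Code Block

# Illustrative sketch only; class and method names are hypothetical.

from dataclasses import dataclass, field

@dataclass
class BrokerSessionState:
    session_id: str
    acked_ids: set = field(default_factory=set)    # history of acknowledged commands
    in_flight: dict = field(default_factory=dict)  # command-id -> command, not yet acked

    def receive(self, command_id, command):
        if command_id in self.acked_ids:
            return None                # duplicate resent by client after failover: ignore
        self.in_flight[command_id] = command
        return command                 # process normally

    def client_acked(self, command_id):
        self.in_flight.pop(command_id, None)
        self.acked_ids.add(command_id)

@dataclass
class ClientSessionState:
    channels: dict                     # channel-id -> (qos, consumers)
    unacked: dict                      # command-id -> command sent but not acknowledged

def client_resume(client, state: ClientSessionState):
    """Re-establish everything the broker did not replicate, using normal AMQP."""
    for channel_id, (qos, consumers) in state.channels.items():
        ch = client.open_channel(channel_id)   # re-open the channels
        ch.set_qos(qos)                        # re-set qos settings
        for consumer in consumers:
            ch.consume(consumer)               # re-create consumers
    for command_id, command in sorted(state.unacked.items()):
        client.resend(command_id, command)     # broker drops any duplicates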

The broker could replicate more conversational state to relieve the
client of re-creating it. It's not clear yet where the best tradeoff
lies, since it's not possible to have zero conversational state on the
broker. Minimizing the broker's role seems like a good approach:
replicating data affects the whole cluster, while keeping it on the
client affects only one client-broker connection.

Replication

The different types of state are replicated in different ways to
strike the best performance/reliability balance.

Cluster membership and Wiring

Membership changes are replicated to the entire cluster symmetrically
using a virtual synchrony protocol.

Connected clients are also notified of changes to the set of fail-over
candidates for that client. Clients are notified over AMQP by binding
a queue to a special system exchange.

Wiring is low volume and can also be replicated via virtual
synchrony. Cluster membership plus wiring make up the common "picture"
that every member has of the cluster.

Queue Content

Message content is too high volume to replicate to the entire cluster.
To limit the extent of replication each queue or reference plays
exactly one of the following roles in each broker:

  • Primary: content owner, all modifications are done by primary.
  • Proxy: Forwards client requests to the primary.
  • Backup: A proxy that also receives a replica of the content and can take over if the primary fails.
  • Fragment: special case for shared queues, see below.

A single cluster member may contain a mix of primary, backup and proxy
queues. (TODO: mechanism for establishing primary, backup etc.)

The client is unaware of the distinction; it sees an identical picture
regardless of which broker it connects to.

TODO: Ordering issues with proxies and put-back messages (reject,
transaction rollback) or selectors.

Fragmented Shared Queues

A shared queue has reduced ordering requirements and increased
distribution requirements. "Fragmenting" a shared queue is a special
type of replication where each of a set of brokers holds a disjoint
subset (fragment) of the messages on the queue. The idea is to
distribute load over independent fragments hosted in separate brokers.

Each fragment's content is replicated to backups independently just
like a normal queue.

The appearance of a single queue is created by collaboration between
fragments. Fragments store incoming messages in the local queue, and
serve local consumers from the local queue whenever possible. Only
when a fragment cannot satisfy its consumers does it consume messages
from other fragments in the group.

Proxies to a fragmented queue will consume from the "nearest" fragment
if possible.
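
A minimal sketch of the fragment policy described above (Python,
hypothetical names): incoming messages always go on the local
fragment, local consumers are served locally first, and peers are only
consulted when the local fragment is empty.

Code Block

# Illustrative sketch only; not a real Qpid data structure.

from collections import deque

class Fragment:
    def __init__(self, peers=None):
        self.local = deque()       # this broker's disjoint subset of the shared queue
        self.peers = peers or []   # other fragments in the group

    def publish(self, message):
        # Incoming messages always go on the local fragment.
        self.local.append(message)

    def deliver(self):
        # Serve local consumers from the local fragment whenever possible.
        if self.local:
            return self.local.popleft()
        # Only when the local fragment cannot satisfy its consumers does
        # it consume messages from other fragments in the group.
        for peer in self.peers:
            if peer.local:
                return peer.local.popleft()
        return None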

TODO: Proxies can play a more active role. Ordering guarantees: we can
provide "same producer to same consumer preserves order" since
messages from the same producer always go on the same fragment
queue. This may break down in the presence of failover unless we
remember which fragment received messages from the client and proxy to
the same one on the failover replica.

Conversational State

The minimum conversational state a broker needs to replicate for failover is:

  • session identities: to recognize the session on reconnect.
  • history of processed request-ids: to ignore duplicates resent by client.
  • all requests in flight: to resend to client and/or handle client responses.
  • open incoming references.

Session identities and processed request history are very low volume
and could be replicated with virtual synchrony. However, commands in flight
and open incoming references are probably too large to be replicated this
way.

Enter the session backup: the broker directly connected to a client is
the session primary. The session primary has one or more session backup
brokers. Conversational state is replicated only to the session
backups, not to the entire group. On failure the client must reconnect
to one of the session backups; other members will not accept the
session.
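
A minimal sketch of the resulting resume rule (hypothetical names):
only the session primary or one of its session backups holds the
replicated conversational state, so only they can accept the resumed
session.

Code Block

# Illustrative sketch only; not a real Qpid component.

class SessionDirectory:
    def __init__(self):
        self.sessions = {}   # session uuid -> (primary broker, set of backup brokers)

    def register(self, session_id, primary, backups):
        self.sessions[session_id] = (primary, set(backups))

    def can_resume(self, session_id, broker):
        if session_id not in self.sessions:
            return False                      # unknown or timed-out session
        primary, backups = self.sessions[session_id]
        return broker == primary or broker in backups

directory = SessionDirectory()
directory.register("uuid-1", primary="A", backups=["B"])
assert directory.can_resume("uuid-1", "B")        # session backup: accepted
assert not directory.can_resume("uuid-1", "C")    # other member: refused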

Shared storage

In the case of high-performance, reliable shared storage (e.g. GFS)
queue content can be stored rather than replicated to backups. In that
case the conversations with content backups can be dehydrated; on
failover the backup can recover the data from the shared store.

For commodity hardware cases we need a solution with no shared store
as described above.

TODO: Would we want to use shared store for conversational state?

Client-Broker Protocol

TODO: How it looks in client protocol terms - system queues &
exchanges, connect argument tables.

Membership updates, choosing a replica, reconnection, rebuilding
conversational state (the client stores a representation of its state),
unacked messages and duplicates, hiding failover.

Broker-Broker Protocol

Broker-broker communication uses normal AMQP over specially identified
connections and channels (identified in the connection negotiation
argument table.)

Proxying: Proxies simply forward methods between client and primary
and create consumers on behalf of the client. TODO: any issues with
message put-back and transactions?

Queue/fragment replication: Use AMQP transfer commands to transfer content
to backup(s). TODO: propagating transactions.

Session replication: AMQP on special connections. The primary must
replicate a command (and get confirmation that it was replicated)
before responding; however this can be done as asynchronous streams.
The primary forwards all outgoing requests and incoming responses to
the session backup, which can track the primary's request/response
tables and retransmit messages.

Note that the dehydrated requests sent to the session backup should
reference the content backup so that the backup can recover content
after a failure. Alternatively content could be sent to both - what's
the tradeoff?

Persistence and Recovery

Use cases for persistence

Durable messages: must be stored to disk across broker shutdowns.

Flow-to-disk: to cope with memory overflow on full queues.

Reliability: recover after a crash.

We need a common solution - it would make no sense to be writing the
same message to disk multiple times for different purposes. The
solution may be tunable to offer better performance for individual
cases but should cover the case where all 3 are required.

Competing failure modes:

Tibco: fast when running clean, but performance over time shows GC
"spikes". A single journal for all queues; "holes" in the log have to
be garbage collected to re-use the log, so one slow consumer affects
everyone because it causes fragmentation of the log.

MQ: write to journal, write journal to DB, read from DB.
Consistent & reliable but slow.

Street homegrown solutions: transient MQ with home grown
persistence. Can we get more design details for these solutions?

Persistence approach

Try to avoid the problems with:

  • journal-per-queue
  • multiple segment files per journal
  • GC to "snapshot" (per queue? per broker? think about queue migration.)
  • Disk thrashing - how do we distribute the parts to avoid it?

Need to think about how we consolidate journal into a "snapshot" or
"database" to reclaim space.

No write on fast consume: Optimization - if we can deliver (and get
ack) faster than we write then no need to write. How does this
interact with HA?

Async journalling: writing to the client, writing to the journal, acks
from the client and acks from the journal are separate async streams?
So if we get the client ack before the journalling stream has written
the journal, we cancel the write? But what kind of ack info do we
need? Need a diagram of interactions, failure points and responses at
each point. Start simple and optimize, but don't rule out optimizations.

What about persistence-free reliability?

Is memory-only replication with no disk a viable option for high-speed
transient message flow? We would lose messages in a total failure, or
in multiple failures where all backups fail, but we can survive single
failures and will run a lot faster than a disk-based solution.

Virtual synchrony

TODO: Wiring & membership via virtual synchrony
TODO: journaling, speed. Will file-per-q really help with disk burnout?

Configuration

Connection issues: potential for a big N*N connection web if primaries
and backups are not grouped.

Simplifying patterns:

  • Virtual hosts as units of replication.
  • Backup rings: all primary components in a broker use the same backup broker and vice-versa. Backups form rings.
  • Disk management issues?
  • Shared storage issues?

Dynamic cluster configuration

  • Failover: the primary use case.
  • Add node: backup, proxy, primary case?
  • Redirect clients from loaded broker (pretend failure)
  • Move queue primary from loaded broker/closer to consumers?
  • Re-start after failover.

Issue: unit of failover/redirect is connection/channel but "working
set" of queues and exchanges is unrelated. Use virtual host as unit
for failover/relocation? It's also a queue namespace...

If a queue moves we have to redirect its consumers; we can't redirect
entire channels. Channels in the same session may move between
connections. Or rather, do we depend on the broker to proxy?

Backups: chained backups rather than multi-backup? Ring backup?
What about split brain, elections, quorums etc.

Should new backups acquire state from primary, from disk or possibly
both? Depends on GFS/SAN vs. commodity hw?

Open Questions

Issues: double failure in backup ring: A -> B -> C. Simultaneous
failure of A and B. C doesn't have the replica data to take over for A.

Java/C++ interworking - is there a requirement? Fail over from C++ to Java?
Common persistence formats?

Implementation Breakdown

The following are independently useful units of work that combine to
give the full story:

Proxy Queues: Useful in federation. Pure-AMQP proxies for exchanges
might also be useful but are not needed for current purpose as we will
use virtual synchrony to replicate wiring.

Fragmented queues: Over pure AMQP (no VS); useful by themselves for
unreliable, high-volume shared queue federation.

Virtual Synchrony Cluster: Multicast membership and total ordering
protocol for brokers. Not useful alone, but useful with proxies and/or
fragments for dynamic federations.

Primary-backup replication: Over AMQP, no persistence. Still offers some
level of reliability in a simple primary-backup pair.

Persistence: Useful on its own for flow-to-disk and durable messages.
Must meet the performance requirements of reliable journalling.

...

Footnote: Exclusive or auto-delete queues are deleted on disconnect; we'll return to this point.

NB: this is similar to Old Clustering Design Note in many ways. The principal differences are:

  • Active state propagation between primary/backup only. Propagation to other proxies is pulled on demand.
  • Use the OpenAIS protocol for more efficient cluster state management.
  • Transaction logging for fast persistent backup in case of more widespread failure.

The two designs should be reviewed & merged.

Clustering design

Summary

Note: in this document the term "resource" means "Queue or Exchange".

A broker cluster provides the illusion of a single broker: any cluster member can serve clients for any clustered resource.

For each clustered resource a cluster broker has exactly one of these roles:

  • Primary: primary owner of the Queue/Exchange.
  • Backup: becomes primary if the primary fails. <- comment - I would expect that you would want to be able to configure the number of backups used in a cluster per resource (1, 2, 3). Don't know if there is any benefit beyond 3 ->
  • Proxy: forwards to primary. <- assume by default, everyone that is not primary is a proxy in a cluster ->

The application is partitioned by assigning groups of queues/exchanges to primary brokers. The same broker can be primary for one exchange, backup for another, proxy for another, etc.

Primary/backup failover provides software/hardware fault tolerance.

Proxies provide load distribution and act as connection concentrators.

TODO: There are many possible configurations for these roles, need to identify some basic use cases and show corresponding configuration.

Wherever it makes sense we will use AMQP itself to communicate within the cluster, using the following extension points to carry cluster data over standard AMQP: <- comment - thus AMQP for 1 to 1 and AIS for 1- n where n >=3 ->

  • Field table parameters to various AMQP methods (declare() arguments etc.)
  • Field table in message headers.
  • System exchanges and queues.
  • Alternate implementations of Queues, Exchanges, Bindings.

Sessions

With failover a single client-broker "session" may involve multiple connects/disconnects and multiple broker and/or client processes.

The client generates a UUID to identify each new session.

Extension points in the existing protocol are used to create a session:

Code Block

# AMQP methods to start a session.

# Client requests a timeout
client.start(server-properties={
  qpid.session.start:<uuid>
  qpid.session.timeout:<in seconds> })

# Broker may impose a lower timeout.
server.start-ok(client-properties={qpid.session.timeout:<seconds>})

The broker keeps the session for the timeout period if the client connection is unexpectedly closed, or if the client calls Connection.close() with reply-code=KEEP_SESSION (new code). Any other type of connection closure terminates the session.

If the broker is configured for HA then sessions will survive failover. If not, sessions will survive temporary client disconnect.
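
The retention and resume rules above can be summarised in a small
sketch (Python, with hypothetical names; the KEEP_SESSION constant and
the 60-second cap are assumptions for illustration only).

Code Block

# Illustrative sketch of broker-side session retention; not a real Qpid API.

import time

KEEP_SESSION = "KEEP_SESSION"   # assumed new Connection.close reply-code

class SessionTable:
    def __init__(self):
        self.sessions = {}   # uuid -> {"timeout": seconds, "expires": None while connected}

    def start(self, uuid, requested_timeout, max_timeout=60):
        # The broker may impose a lower timeout than the client requested.
        timeout = min(requested_timeout, max_timeout)
        self.sessions[uuid] = {"timeout": timeout, "expires": None}
        return timeout

    def connection_closed(self, uuid, reply_code=None, orderly=True):
        session = self.sessions.get(uuid)
        if session is None:
            return
        if not orderly or reply_code == KEEP_SESSION:
            # Keep the session for the timeout period so the client can resume.
            session["expires"] = time.time() + session["timeout"]
        else:
            del self.sessions[uuid]            # any other closure ends the session

    def resume(self, uuid):
        session = self.sessions.get(uuid)
        if session is None or (session["expires"] is not None
                               and time.time() > session["expires"]):
            raise Exception("session timed out or unknown")  # broker closes with an exception
        session["expires"] = None
        return session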

Code Block

# Client resuming a disconnected session:
client.start(server-properties={qpid.session.resume:<uuid>})

# Broker returns the status of the session.
server.start-ok(client-properties={qpid.session.resume:<session status>})

TODO: define <session status>. It gives the status of the session as known to the broker - things like resumed channel ids, outstanding/unacked requests etc.

If the session has timed out or does not exist, the broker closes the connection with an exception.

Client Failover

Finding Alternate Brokers

The Connection.open-ok known-hosts parameter provides the initial list of alternate brokers. Note each "host" can be "host:port".

A client can bind a temporary queue to the predeclared exchange "amq.qpid" with routing-key="cluster". The broker will publish a message whenever the alternate broker list changes. The message has an empty body and the following header:

Code Block

  qpid.known.hosts = longstr{ shortstr host, shortstr host ... };
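
The encoding above packs the hosts into a longstr as a sequence of
AMQP shortstrs (one length byte followed by that many bytes). A
minimal Python sketch of decoding it, assuming that encoding:

Code Block

# Illustrative sketch; assumes each host is encoded as an AMQP shortstr.

def decode_known_hosts(data: bytes):
    hosts, i = [], 0
    while i < len(data):
        length = data[i]                               # shortstr length byte
        hosts.append(data[i + 1:i + 1 + length].decode("utf-8"))
        i += 1 + length
    return hosts

# Each "host" can be "host:port".
assert decode_known_hosts(b"\x0dbroker-a:5672\x0dbroker-b:5672") == \
    ["broker-a:5672", "broker-b:5672"]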

Choosing a Broker

Client chooses randomly among alternate brokers.

Sophisticated strategies for choosing brokers (e.g. nearest-in-network, least-loaded) may be considered in later versions.

The client resumes the session as described under "Sessions".
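
A minimal sketch of the client-side failover loop implied by the last
two sections; connect() and resume_session() are hypothetical helpers,
not part of any real client API.

Code Block

# Illustrative sketch of client failover; not a real client API.

import random

def failover(known_hosts, session_uuid, connect, resume_session):
    """Pick randomly among alternate brokers until one accepts the session."""
    candidates = list(known_hosts)
    while candidates:
        host = random.choice(candidates)
        try:
            connection = connect(host)
            # Resume as described under "Sessions": qpid.session.resume:<uuid>
            return resume_session(connection, session_uuid)
        except Exception:
            candidates.remove(host)    # broker down or session refused: try another
    raise Exception("no cluster member accepted the session")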

Clustered Resources

Each resource on a clustered broker falls into one of these categories:

  • Local: not visible to other cluster members.
  • Proxied: visible via all cluster members but not fault-tolerant.
  • Fault tolerant, transient: Mirrored to backup, not logged to disk.
  • Fault tolerant, durable: Mirrored and logged to disk.

Transient resources and transient messages survive failover to an active backup, but not failure of both primary and backup.

Durable fault-tolerant resources and persistent messages on durable, fault tolerant queues survive any failover including failure and restart of the entire cluster.

Local resources do not survive failure of their broker.

Proxied resources: when a broker joins a cluster it will create proxies for all clustered resources. Cluster brokers automatically keep their proxy set up to date as brokers join and leave the cluster and as resources are created/destroyed on their primary broker.

Creating Clustered/Proxied Resources

A Qpid client can explicitly request creation of a clustered or proxy
resource by adding to the "arguments" parameter of declare methods.

Code Block

# Create a local proxy to a resource on <host>.
# Fail if the resource does not exist at host or host is invalid.
declare(arguments={qpid.cluster.proxy:<host>})

# Create a clustered resource on remote primary <host> and a local proxy.
# If host="" or the current broker then create a clustered resource
# with the local broker as primary.
declare(arguments={qpid.cluster.primary:<host>})

Clustered resources can be declared on any clustered broker from any other clustered broker. Declarations are forwarded to the primary.

For compatibility with non-Qpid clients a broker can be configured (via AMQP?) with a list of rules mapping queue/exchange names to primary hosts. If a normal declare() name matches such a rule, the broker will either create a proxy to the resource or attempt to create the remote clustered resource, based on rule configuration. TODO: details.

Proxies

Proxies use only AMQP to communicate. They can be used independently of clustering as federated connection concentrators. They use standard AMQP on both sides so they can interoperate with non-Qpid clients and brokers.

Pure-proxy brokers can be used as connection concentrators and to cross network boundaries - e.g. TCP to IGMP forwarding, running in DMZ etc.

The basic idea is to create proxy exchange/queue instances with special proxy implementations to give the illusion of a common queue/exchange setup on each broker. The behavior of these proxy objects is as follows:

TODO: namespace issue, see Questions section below.

Publish to proxy exchange

Local proxy exchange acts as client to remote exchange. Any message published to proxy exchange is forwarded to remote exchange.

Proxy exchanges have the same name as their remote counterpart so they can be used as if they were the remote counterpart.

Comment - To maintain QoS I assume the async reply as defined by the transport SIG would not be issued from the proxy to the publisher until successfully enqueued by the Primary.

Consume from proxy queue

A proxy queue consumes, gets and acknowledges messages from the remote queue on behalf of its clients. It must not acknowledge any message until it has received acknowledgement from its client. It should use flow control to avoid consuming messages unless it has a client ready to consume them.

TODO: draw out the scenarios for avoiding "stuck message syndrome" where a proxy queue consumes a message but then has nobody to pass it on to.

Comment - To maintain QoS I assume the dequeue from the Primary would not be issued by the proxy until successfully receiving the reply sequence from the consumer.
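
A minimal sketch of this acknowledgement bookkeeping (hypothetical
names): the proxy only acknowledges to the primary after its own
client has acknowledged.

Code Block

# Illustrative sketch only; not a real Qpid component.

class ProxyQueue:
    def __init__(self, remote_queue):
        self.remote = remote_queue
        self.pending = {}            # local delivery-tag -> remote delivery-tag

    def on_remote_deliver(self, remote_tag, message, client):
        # Forward to the client but do NOT ack the primary yet.
        local_tag = client.deliver(message)
        self.pending[local_tag] = remote_tag

    def on_client_ack(self, local_tag):
        # Only now is it safe to acknowledge (dequeue) on the primary.
        remote_tag = self.pending.pop(local_tag)
        self.remote.ack(remote_tag)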

Bind local queue to proxy exchange

Proxy exchange creates private queue on remote broker and binds to remote exchange with identical binding arguments. Proxy exchange consumes from private remote queue and enqueues messages on the local queue.

Bind proxy queue to local exchange

The proxy queue acts as client to the remote broker. All messages delivered to the proxy queue are sent to the remote queue via the remote default exchange and default (queue name) binding.

Bind proxy queue to proxy exchange

Given: remote Queue Q on remote broker Bq, remote Exchange X on remote broker Bx. Local broker is B.

If Bq=Bx then simply bind on the remote broker.

If Bq or Bx is a Qpid broker then create a proxy on the Qpid broker for the resource on the non-qpid broker and bind it to the resource on the Qpid broker.

If neither Bq nor Bx is a Qpid broker: bind the proxy queue to the proxy exchange (combine the rules of the two preceding use cases). This means traffic from Bx will be routed through B and forwarded to Bq, which may or may not be a useful thing to do.
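
A sketch of the case analysis above (Python, with hypothetical
broker/resource objects chosen purely for illustration):

Code Block

# Illustrative sketch of the three binding cases; names are hypothetical.

def bind_proxy_queue_to_proxy_exchange(local, queue, exchange, binding_args):
    bq, bx = queue.remote_broker, exchange.remote_broker
    if bq == bx:
        # Same remote broker: simply bind on the remote broker.
        bq.bind(queue.remote_name, exchange.remote_name, binding_args)
    elif bq.is_qpid:
        # Create a proxy for X on the Qpid broker Bq and bind the real queue to it.
        proxy_x = bq.declare_proxy_exchange(exchange.remote_name, bx)
        bq.bind(queue.remote_name, proxy_x, binding_args)
    elif bx.is_qpid:
        # Create a proxy for Q on the Qpid broker Bx and bind it to the real exchange.
        proxy_q = bx.declare_proxy_queue(queue.remote_name, bq)
        bx.bind(proxy_q, exchange.remote_name, binding_args)
    else:
        # Neither is a Qpid broker: bind locally on B, routing traffic from
        # Bx through B and forwarding it on to Bq.
        local.bind(queue.local_name, exchange.local_name, binding_args)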

Resource lifecycle

Explicit resource delete commands are forwarded to the primary broker.

In a Qpid cluster, the cluster is notified via the cluster map (see below) so resources can be reclaimed on all proxy brokers.

If Qpid brokers act as proxies for non-Qpid brokers outside a cluster it may not be possible to immediately notify all the proxies. Therefore brokers must also clean up proxy resources if a NOT_FOUND exception indicates that a resource no longer exists, and should propagate the exception back to the client.

This could still leave dead proxies in idle brokers. Proxy brokers may periodically check the existence of resources they proxy for (using declare) and clean up if the proxied resource no longer exists.

Auto-delete queues: proxies for auto-delete queues should mimic their clients, in particular they must not keep the queue open for consume if they have no clients consuming. On the other hand they must not allow the queue to be deleted if they still have consuming clients.

Connection management

TODO: details:

  • Clients must be unaware of disconnect/reconnect by proxies.
  • Proxies shouldn't hold unused connections open for no reason.
  • Proxies shouldn't create/teardown connections too frequently.
  • Proxies should share connections where possible.

Cluster Communication

Broker sessions.

Sessions are considered resources by the brokers - each session has primary and backup brokers. The primary for a session is the broker where the session is created.

Unlike queues and exchanges, sessions cannot be proxied. In the event of failure, sessions are transferred to another broker. After a failure a client can re-establish the session on any cluster broker.

<- comment - how does the client know which node in the cluster to reconnect to, and how is the list of nodes in the cluster published to the client? Assume it uses the system exchange ->

A session is active from the time a client connects up to the failure of the connection, the broker or the client. If the broker sees a connection failure, it marks the session disconnected.

If a client attempts to resume the session somewhere in the cluster, the session is transferred to that broker and becomes active again. That broker becomes the primary broker for the session.

If a broker fails, all its sessions are transferred to the backup and marked disconnected. As above, a client can resume a session at any broker; the session will be transferred and activated.

Active sessions cannot be transferred - a client can only be connected to a session via one broker at a time.

Resource sets

A cluster broker can take different roles with respect to different resources (queues, exchanges, sessions). A "resource set" is a set of resources owned by the same broker. The resource set is the unit of transfer, for example if a primary broker fails its entire resource set is taken over by the backup.

TODO: Failover merges the primary's resource set with the backup's. Do we also need to allow resource sets to be split so a primary broker can offload part of its resource set onto another broker? - I would say yes; also, on failure another backup needs to be identified since one node was lost, and all backups on the node that was primary also need to be replicated to another node.

The cluster map.

The cluster map lists

  • all brokers that are members of the cluster.
  • all cluster resources with the primary and backup broker for each.

OpenAIS will be used to give all brokers a consistent view of the cluster map:

  • membership events (join, leave, failed.)
  • cluster map updates
  • detecting failed brokers.
  • resource changes (create, delete, transfer)

All other traffic between brokers will use AMQP. The majority of traffic can be done using the proxy techniques described above. Other communication is via system exchange and queues.

comment - if a high number of backups is configured it might be more efficient to run AMQP over AIS for this distribution; however this can be seen as an optimization and done later.

TODO: is there a need for any other protocols?

Choosing primaries and backups

Options:

  • pre-configured static resources with fixed primary/backup.
  • client/admin specifies primary (and backup?) on resource creation.
  • dynamic: cluster automatically (re)assigns roles as brokers join.

Probably need a combination: a way to set up an initial static assignment, a way for admins to intervene and adjust assignments and a way for the cluster to respond dynamically to changes. At a minimum it must be able to respond to a node failure.

For example: brokers could be arranged in a ring with each broker being the backup for the preceding one. To add a new broker you choose a point in the ring to insert it and the adjacent brokers adjust their responsibilities accordingly. This implies we can transfer "backupship" as well as ownership.

comment - I believe that in any scheme we need to be able to transfer "backupship" as well as "ownership", since when a node in the cluster fails, both may be present on that node.
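
A minimal sketch of the ring arrangement described above (the broker
names and functions are hypothetical): each broker is backed up by the
next broker in the ring, and inserting a broker splits one link.

Code Block

# Illustrative sketch of a backup ring; not a real Qpid component.

def backup_for(ring, broker):
    """In a ring, each broker is backed up by the next broker in the ring."""
    i = ring.index(broker)
    return ring[(i + 1) % len(ring)]

def insert_broker(ring, new_broker, position):
    """Adding a broker splits one link; the adjacent brokers adjust
    their backup responsibilities accordingly."""
    ring.insert(position, new_broker)
    return ring

ring = ["A", "B", "C"]             # A backed up by B, B by C, C by A
assert backup_for(ring, "A") == "B"
insert_broker(ring, "D", 1)         # now A is backed up by D, and D by B
assert backup_for(ring, "A") == "D"
assert backup_for(ring, "D") == "B"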

Broker Failover

Transient resources and transient messages survive failover to an active backup, but not failure of both primary and backup.

Durable fault-tolerant resources and persistent messages on durable, fault tolerant queues survive any failover including failure and restart of the entire cluster.

In active/active clustering the primary broker forwards all commands and messages (persistent and transient) to the backup. The backup keeps an in-memory representation of the state of the primary. It still acts as a proxy for resources owned by the primary; it does not act on the primary's behalf while the primary is alive.

Suppose A is backed up by B, and B is backed up by C. If broker A fails, its primary and backup resource sets are transferred to broker B. B becomes primary for those resources formerly owned by A and becomes backup for those resources formerly backed up by A. C becomes backup for those resources formerly owned by A.

A clustered broker also writes a transaction log of commands related to durable resources and persistent messages. Thus if both primary and backup fail, or if the primary fails without any backup, another broker can be elected to take over by recovering the state of the resources from the log files.

TODO: possible optimization - forward commands without message content to the backup and store messages persistently. The backup keeps a model of the state but only loads messages if actually required by failover. This trades off normal-running performance/network load against recovery time.

TODO: Elections - detail scenarios and strategy, research what OpenAIS has to offer.

TODO: Transfer of state to the third broker C in the failover scenario. I.e. how does C become the backup for resources transferred from A to B? How does it catch up with the state of those resources?

Logs and persistence.

Use GFS or shared storage offered by the platform to share storage between brokers. Two types of storage to consider:

Transaction logs: Optimized for writing. A flat file with new entries always appended at the end. A single broker writes to each log file (minimising lock contention). Not optimal for reading - you need to read the entire sequence of log events to reconstruct the state of a broker, which may include reading commands to create resources that are subsequently deleted. When the log reaches a certain size, start a new log file.

State snapshot: Image of the state of a set of resources at some point in time. More efficient for reading, more compact but slower to write because data structures must be reorganized.

A log consolidator reads from the transaction logs and constructs a state snapshot, deleting transaction logs as they are incorporated into the snapshot. So to do a full recovery you first load the snapshot, then apply the logs.
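
A minimal sketch of that recovery order; the log-entry format here is
a toy chosen purely for illustration, not a proposed on-disk format.

Code Block

# Illustrative sketch of recovery from a snapshot plus transaction logs.

def recover(snapshot, logs):
    """Full recovery: first load the snapshot, then apply the logs in order."""
    state = dict(snapshot)                 # state snapshot: efficient to read
    for log in logs:                       # transaction logs not yet consolidated
        for entry in log:
            apply_entry(state, entry)
    return state

def apply_entry(state, entry):
    op, key, value = entry
    if op == "create":
        state[key] = value
    elif op == "delete":
        # A log may create a resource that is subsequently deleted.
        state.pop(key, None)
    elif op == "enqueue":
        state.setdefault(key, []).append(value)
    elif op == "dequeue":
        if state.get(key):
            state[key].pop(0)

snapshot = {"q1": ["m1"]}
logs = [[("enqueue", "q1", "m2"), ("dequeue", "q1", None)]]
assert recover(snapshot, logs) == {"q1": ["m2"]}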

Questions

Lots more detail needed in failover. There are many ways you could configure these components, need to identify what makes sense, how much flexibility to offer and what simplifying assumptions we can make.

Need some profiling to see how various options perform in real life.

Split brain and quorums: How do we achieve agreement that a broker has failed, e.g. using a quorum device? Does OpenAIS solve this for us or is there more to it?

Elections & state transfer: much more detail needed.

Naming problem for proxying and clustering:

  • Which names are local, which must be consistent across the cluster?
  • How/when to deal with clashes of local and remote names?
  • Use explicitly separate namespaces for proxy/cluster resources?
  • Use virtual hosts? Do all cluster/proxy resources live in a special virtual host?
  • Relationship between FT storage and durable storage?

Proxies forward requests to primary only or also to hot backup?

UUID algorithm?