Architecture

Trogdor

The Fault Injector is named Trogdor.

Daemons

Trogdor has two Java daemons: an agent and a coordinator.

The agent process is responsible for actually implementing the faults.  For example, it might run iptables, send signals to processes, generate a lot of load, or do something else to disrupt the computer it is running on.  We run an agent process on each node where we would like to potentially inject faults.  So it will run alongside the brokers, ZooKeeper nodes, etc.

The coordinator process is responsible for communicating with the agent processes and for scheduling faults.  For example, the coordinator can be instructed to create a fault immediately on several nodes.  Or it can be instructed to create faults over time, based on a pseudorandom seed.
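The seed-driven mode can be pictured with a short sketch.  This is not Trogdor's actual scheduling code; the function name and fields here are illustrative, but it shows why a fixed pseudorandom seed makes a fault schedule reproducible:

```python
import random

# Illustrative sketch (not Trogdor's real API): given the same seed, the
# coordinator can regenerate exactly the same fault schedule on every run.
def schedule_faults(seed, fault_types, count, max_start_ms=60000):
    rng = random.Random(seed)
    schedule = []
    for _ in range(count):
        schedule.append({
            "class": rng.choice(fault_types),
            "startMs": rng.randrange(0, max_start_ms),
            "durationMs": rng.randrange(1000, 30000),
        })
    return sorted(schedule, key=lambda f: f["startMs"])

# Two runs with the same seed produce an identical schedule.
a = schedule_faults(42, ["NetworkPartitionFaultSpec", "ProcessStopFaultSpec"], 3)
b = schedule_faults(42, ["NetworkPartitionFaultSpec", "ProcessStopFaultSpec"], 3)
assert a == b
```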

Both the coordinator and the agent expose a REST interface that accepts objects serialized via JSON.  There is also a command-line program which makes it easy to send messages to either one without manually crafting the JSON message body.
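Under the hood, the command-line program just builds a JSON body and sends it over HTTP.  A rough sketch of that step is below; the port number and the "/coordinator/faults" path are assumptions for illustration, not Trogdor's documented endpoints:

```python
import json
import urllib.request

# A fault spec serialized to JSON, as the REST interface expects.
spec = {
    "class": "org.apache.kafka.trogdor.fault.NetworkPartitionFaultSpec",
    "startMs": 1000,
    "durationMs": 30000,
    "partitions": [["node1", "node2"], ["node3"]],
}
body = json.dumps(spec).encode("utf-8")

# Hypothetical port and path, used only to show the shape of the request.
req = urllib.request.Request(
    "http://localhost:8889/coordinator/faults",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would actually send it; omitted here so the
# sketch stays self-contained.
print(req.get_method(), req.full_url)
```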

FaultSpecs

A FaultSpec is the specification for a fault.  For example, this fault spec describes a network partition:

{
    "class": "org.apache.kafka.trogdor.fault.NetworkPartitionFaultSpec",
    "startMs": 1000,
    "durationMs": 30000,
    "partitions": [["node1", "node2"], ["node3"]]
}

FaultSpecs should always be serializable to and from JSON.  Fault specs should be platform independent: notice that there was nothing about iptables, or Windows, or anything platform specific in the NetworkPartitionFaultSpec.

All fault specifications have a class name identifying the fault type, a start time (startMs), and a duration (durationMs).

In general, we plan on customizing FaultSpecs to Apache Kafka.  So there should be a fault spec for cutting off communication with ZooKeeper, a fault spec for suspending the active controller, and so forth.

Platform

Trogdor's Platform class encapsulates support for a particular computing environment.

There are at least two important computing environments for us: the local Linux environment used for development and testing, and the cloud environments managed by ducktape for system tests.

Node

A node is an individual element of a cluster.  Each node has a unique name.  Currently, we assume that all the nodes in the cluster are known ahead of time, and that a Trogdor agent process is running on each node.

Fault

Fault objects implement particular induced problems.  Each fault object has the FaultSpec it was created from, along with the platform-specific logic needed to activate and deactivate the fault.

Unlike the FaultSpec class, the Fault class is platform-specific.  For example, a network partition fault may be implemented differently on the cloud versus on a local Linux environment, even though the specification is the same.

It makes sense to use composition to create more specific faults out of more general faults.  For example, a fault for suspending the active Kafka controller could be built off of a fault for suspending a generic process.  A fault for partitioning the ZooKeeper nodes from the broker nodes could be built on top of the generic network partition fault.
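The composition idea can be sketched in a few lines.  These class names are illustrative stand-ins, not Trogdor's real Java classes; the point is that the Kafka-specific fault delegates to the generic one rather than reimplementing it:

```python
# Illustrative sketch of building a specific fault out of a generic one by
# composition rather than inheritance.  Class names are hypothetical.
class ProcessSuspendFault:
    """Generic fault: suspend a named process on a node."""
    def __init__(self, node, process_name):
        self.node = node
        self.process_name = process_name

    def activate(self):
        # A real Linux implementation might send SIGSTOP here.
        return "suspended %s on %s" % (self.process_name, self.node)

class ControllerSuspendFault:
    """Kafka-specific fault, built on top of the generic one."""
    def __init__(self, controller_node):
        # Composition: delegate to the generic fault.
        self._inner = ProcessSuspendFault(controller_node, "kafka")

    def activate(self):
        return self._inner.activate()

print(ControllerSuspendFault("node1").activate())
```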

Code

The code for Trogdor lives under ./tools/src/main/java/org/apache/kafka/trogdor.

There are several directories:

Ducktape Integration

What if you want to perform fault injection as part of a system test?  For this case, we need ducktape integration.

TrogdorService

There is a TrogdorService which will handle starting the Trogdor coordinator and Trogdor agents for you.  This is similar to the existing KafkaService, ZooKeeperService, etc. classes which handle starting other daemons.

However, unlike other service daemons, TrogdorService operates on nodes which already have a service running.  That is why the constructor of TrogdorService takes an agent_nodes parameter.  TrogdorService will start Trogdor agents on the agent_nodes that are passed in, but it will not take "ownership" of these nodes.  The services which originally allocated those nodes (KafkaService, ZooKeeperService, etc.) are responsible for de-allocating them.  For this reason, TrogdorService should be shut down before other services.

TrogdorService allocates a single node of its own for the Trogdor Coordinator.  As mentioned earlier, it is better to run the coordinator on its own node so that it is not affected by the faults that it is creating elsewhere in the cluster.
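The ownership rules above can be summarized in a small sketch.  These are stand-in classes, not the real ducktape Service API; they just show that the service starts agents on nodes it does not own, plus a coordinator on the one node it does own:

```python
# Hypothetical sketch of TrogdorService's node ownership, with stand-in
# classes rather than the real ducktape API.
class TrogdorService:
    def __init__(self, context, agent_nodes):
        # agent_nodes are owned by KafkaService, ZooKeeperService, etc.
        self.agent_nodes = agent_nodes
        # The coordinator node is the only node this service allocates.
        self.coordinator_node = context.allocate_node()

    def start(self):
        started = ["agent@%s" % n for n in self.agent_nodes]
        started.append("coordinator@%s" % self.coordinator_node)
        return started

class FakeContext:
    def allocate_node(self):
        return "trogdor-node"

svc = TrogdorService(FakeContext(), ["broker1", "broker2", "zk1"])
print(svc.start())
```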

Changes to AK Ducktape Services

Some of the ducktape service implementations assume that they are the only thing running on their node.  They make assumptions which generally fall into two categories:

We need to fix these services so that TrogdorAgent can run on the same node.  The fixes are mostly easy:

Upgrading to the Latest Ducktape

There has not been a new ducktape release in a while.  Apache Kafka is still using the 0.6.0 release of ducktape, which was made in mid-2016.

To upgrade AK to the latest ducktape, we need several things:

Kibosh

Overview

Kibosh is a fault-injecting filesystem for Linux.  It is written in C using FUSE.

Kibosh acts as a pass-through layer on top of existing filesystems.  Usually, it simply forwards each request on to the underlying filesystem, but it can be configured to inject arbitrary faults.

Injecting Faults

Faults are injected by writing JSON to the control file.  The control file is a virtual file which does not appear in directory listings, but which is used to control the filesystem behavior.

Example

# Mount the filesystem.
$ ./kibosh /kibosh_mnt /mnt

# Verify that there are no faults set
$ cat /kibosh_mnt/kibosh_control
{"faults":[]}

# Configure a new fault.
$ echo '{"faults":[{"type":"unreadable", "code":5}]}' > /kibosh_mnt/kibosh_control
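The same fault JSON can of course be written programmatically.  The sketch below mirrors the shell session above; a real run would write to kibosh_control under the Kibosh mount point, but here a temporary file stands in for it so the example is self-contained:

```python
import json
import os
import tempfile

# The fault JSON from the example above: make the filesystem unreadable,
# returning error code 5 (EIO).
fault = {"faults": [{"type": "unreadable", "code": 5}]}

# Stand-in for /kibosh_mnt/kibosh_control.
control_path = os.path.join(tempfile.mkdtemp(), "kibosh_control")
with open(control_path, "w") as f:
    json.dump(fault, f)

# Reading the control file back shows the currently configured faults.
with open(control_path) as f:
    assert json.load(f) == fault
```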


Code

Kibosh is written using the Linux kernel C coding style.  The only exception is that indentation uses 4 spaces rather than hard tab characters.

We try to minimize the native dependencies that are used.  Kibosh has three dependencies:

For convenience, the source code for json-parser is included in the main repo as json.c and json.h.  It is BSD licensed.

Code here: https://github.com/confluentinc/kibosh

Pull Requests