...

Here is some information on actually running Kafka as a production system. This page is meant as a place for people to record their operational and monitoring practices, to help gather knowledge about successfully running Kafka in production. Feel free to add a section for your configuration if you have anything you want to share. There is nothing magical about most of these configurations; you may be able to improve on them, but they may serve as a helpful starting place.

...

Disk throughput is important. We have 8x7200 rpm SATA drives in a RAID 10 array. In general this is the performance bottleneck, and more disks means more throughput. Depending on how you configure flush behavior you may or may not benefit from more expensive disks (if you flush often, higher-RPM SAS drives may be better).
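As a rough illustration of the flush tradeoff (not a recommendation), flush behavior is controlled in the broker's server.properties. The values below are placeholders, and the property names are those used in recent Kafka releases; check the docs for your version:

    # server.properties -- illustrative values only
    # fsync the log after this many messages have been appended
    log.flush.interval.messages=10000
    # fsync the log if a message has been sitting unflushed this long (ms)
    log.flush.interval.ms=1000

Flushing more aggressively trades throughput for a smaller window of unflushed data, which is why the disk choice above interacts with these settings.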

OS Settings

We use Linux. Ext4 is the filesystem, and we run using software RAID 10. We haven't benchmarked filesystems, so others may be superior.

...

Our monitoring is done through a centralized monitoring system custom to LinkedIn, but it keys off the JMX stats exposed by Kafka. The easiest way to see what is available is to start a Kafka broker and/or client, fire up JConsole, and take a look.
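As a minimal sketch of reading those stats programmatically rather than through JConsole (it assumes a broker started with JMX enabled, e.g. JMX_PORT=9999 in the broker's environment; the host and port here are assumptions, not defaults), something like this lists the Kafka MBeans:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class KafkaJmxPeek {
        public static void main(String[] args) throws Exception {
            // Assumes the broker exposes JMX at localhost:9999 (e.g. via JMX_PORT=9999).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                // Print every MBean registered under a kafka* domain.
                for (ObjectName name : conn.queryNames(new ObjectName("kafka*:*"), null)) {
                    System.out.println(name);
                }
            }
        }
    }

From there you can read individual attributes (message rates, flush times, etc.) and feed them into whatever monitoring system you run.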

...

Zookeeper is essential for the correct operation of Kafka. There are a number of things that must be done to keep Zookeeper running happily, as we have learned the hard way; hopefully Dave and Neha will add this, since I don't know what we did.

...

  1. Redundancy in the physical/hardware/network layout: try not to put them all in the same rack, use decent (but don't go nuts) hardware, and try to keep redundant power and network paths, etc.
  2. I/O segregation: if you do a lot of write traffic you'll almost definitely want the transaction logs on a different disk group than application logs and snapshots (writes to the Zookeeper service involve a synchronous write to disk, which can be slow).
  3. Application segregation: unless you really understand the application patterns of other apps you want to install on the same box, it can be a good idea to run Zookeeper in isolation (though this can be a balancing act with the capabilities of the hardware).
  4. Use care with virtualization: it can work, depending on your cluster layout, read/write patterns, and SLAs, but the tiny overheads introduced by the virtualization layer can add up and throw off Zookeeper, as it can be very time sensitive.
  5. Zookeeper configuration and monitoring: it's Java, so make sure you give it 'enough' heap space (we usually run them with 3-5G, but that's mostly due to the data set size we have here). Unfortunately we don't have a good formula for it. As far as monitoring goes, both JMX and the four-letter commands are very useful; they overlap in some cases (and in those cases we prefer the four-letter commands, which seem more predictable, or at the very least work better with the LI monitoring infrastructure). A small probe using the four-letter commands is sketched after this list.
  6. Don't overbuild the cluster: large clusters, especially with a write-heavy usage pattern, mean a lot of intracluster communication (quorums on the writes and subsequent cluster member updates), but don't underbuild it either (and risk swamping the cluster).
  7. Try to run on a 3-5 node cluster: Zookeeper writes use quorums, and inherently that means having an odd number of machines in a cluster. Remember that a 5 node cluster will cause writes to slow down compared to a 3 node cluster, but will allow more fault tolerance.
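As promised in point 5, here is a minimal sketch of a health probe built on the four-letter commands. It assumes Zookeeper is listening on localhost on the standard client port 2181 (both assumptions); the commands themselves ("ruok", "stat", "mntr") are the standard ones Zookeeper answers on its client port:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class ZkFourLetter {
        // Send a four-letter command to a Zookeeper server and return
        // whatever it writes back before closing the connection.
        static String send(String host, int port, String cmd) throws IOException {
            try (Socket socket = new Socket(host, port)) {
                socket.getOutputStream().write(cmd.getBytes(StandardCharsets.US_ASCII));
                socket.shutdownOutput();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                InputStream in = socket.getInputStream();
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n); // Zookeeper closes the socket when done
                }
                return out.toString(StandardCharsets.US_ASCII.name());
            }
        }

        public static void main(String[] args) throws IOException {
            // "ruok" answers "imok" if the server is up; "stat" dumps
            // latency numbers, connection counts, and the znode count.
            System.out.println(send("localhost", 2181, "ruok"));
            System.out.println(send("localhost", 2181, "stat"));
        }
    }

The "mntr" command returns much the same data as "stat" but in a key-value form that is easier to parse into a monitoring system.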

Overall, we try to keep the Zookeeper system as small as will handle the load (plus standard capacity planning for growth) and as simple as possible. We try not to do anything fancy with the configuration or application layout compared to the official release, and we keep it as self-contained as possible. For these reasons we tend to skip the OS-packaged versions, since they have a tendency to put things in the OS-standard hierarchy, which can be 'messy', for want of a better way to word it.