Versions Compared

Comment: Added suggestion from Taylor Gautier highlighting Kafka's lack of a message index, since it's a misconception many might share. (I also thought for a while that getOffsetsBefore() was a lot more precise than it really is).

...

  • Kafka has topics, and topics have numbered partitions starting from 0. A topic can be created at runtime just by writing to it, but the number of partitions per topic is determined by broker configuration.
  • Kafka stores messages on disk, in a series of large, append-only log files broken up into segments. Each topic+partition is a directory of these segment files. For more details, see What are Segment Files.
  • An offset is just the byte offset in a given log for a topic+partition. The messages don't have any other unique identifier. They're simply stored back to back in the segment files, and you ask for them by their byte offset.
  • Kafka tends to do the simplest thing possible and relies on smarter clients to keep bookkeeping. When producing messages, the driver has to specify what topic and partition to send the message to. When requesting messages, the driver has to specify what topic, partition, and offset it wants them pulled from.
  • While you can request "old" messages if you know their topic, partition, and offset, Kafka does not have a message index. You cannot efficiently query Kafka for the N-1000th message, or ask for all messages written between 30 and 35 minutes ago.
  • The broker does not keep track of what the client has read; that bookkeeping is also left to the client. More advanced setups use ZooKeeper to help with this tracking, but that is currently beyond the scope of this document.
  • The protocol is a work in progress, and new point releases can introduce backwards-incompatible changes.
  • The broker runs on port 9092 by default.
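
To make the "offset is a byte offset, and the client does the bookkeeping" model above concrete, here is a minimal in-memory sketch. It is not Kafka code and does not reproduce Kafka's on-disk or wire format: the 4-byte length prefix is a simplified framing assumed for illustration, and ToyPartitionLog and ToyConsumer are hypothetical names.

```python
import struct


class ToyPartitionLog:
    """In-memory stand-in for one topic+partition log (hypothetical, not Kafka)."""

    def __init__(self):
        self._data = bytearray()

    def append(self, payload: bytes) -> int:
        """Append one message; return the byte offset at which it starts."""
        offset = len(self._data)
        # Assumed toy framing: 4-byte big-endian length, then the payload.
        self._data += struct.pack(">I", len(payload)) + payload
        return offset

    def fetch(self, offset: int, max_bytes: int):
        """Return (messages, next_offset) for whole messages in the byte window."""
        messages = []
        end = min(offset + max_bytes, len(self._data))
        pos = offset
        while pos + 4 <= end:
            (size,) = struct.unpack_from(">I", self._data, pos)
            if pos + 4 + size > end:
                break  # message does not fit in the window; stop before it
            messages.append(bytes(self._data[pos + 4 : pos + 4 + size]))
            pos += 4 + size
        return messages, pos


class ToyConsumer:
    """The client, not the log, remembers how far it has read."""

    def __init__(self, log: ToyPartitionLog, offset: int = 0):
        self.log = log
        self.offset = offset

    def poll(self, max_bytes: int = 1024):
        messages, self.offset = self.log.fetch(self.offset, max_bytes)
        return messages


log = ToyPartitionLog()
log.append(b"hello")   # starts at byte offset 0
log.append(b"kafka!")  # starts at byte offset 9 (4-byte prefix + 5 payload bytes)
consumer = ToyConsumer(log)
print(consumer.poll(), consumer.offset)
```

Note that the log object holds no per-consumer state at all: two consumers reading the same log would each carry their own offset, which is the division of labour the list above describes.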

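The missing message index can be illustrated the same way. In the sketch below (again a toy length-prefixed framing assumed for illustration, not Kafka's real format, with hypothetical function names), the only way to find where the n-th message starts is to walk every message before it, which is why "give me the N-1000th message" is not a cheap query.

```python
import struct


def build_log(payloads):
    """Concatenate payloads, each with a 4-byte length prefix (assumed toy framing)."""
    return b"".join(struct.pack(">I", len(p)) + p for p in payloads)


def offset_of_nth_message(log: bytes, n: int) -> int:
    """Return the byte offset of the n-th message (0-based) by scanning from byte 0."""
    pos = 0
    for _ in range(n):
        if pos + 4 > len(log):
            raise IndexError("log ends before message %d" % n)
        (size,) = struct.unpack_from(">I", log, pos)
        pos += 4 + size  # skip this message; there is no index to jump ahead with
    if pos + 4 > len(log):
        raise IndexError("log ends before message %d" % n)
    return pos


toy_log = build_log([b"a", b"bcd", b"ef"])
print(offset_of_nth_message(toy_log, 2))
```

The cost of the lookup grows with n, whereas a fetch at a known byte offset costs the same wherever it lands in the log.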
...