Unfortunately the original design documents for Traffic Server were lost, so no comprehensive documentation exists. This document attempts to ameliorate that loss by giving some basic background. Specific questions can be asked on dev@trafficserver.apache.org and will be answered by extending this document (at least, that is the idea).

Overview

The disk cache is structured as a cyclone cache: a circular buffer in which new documents overwrite the oldest ones. Documents are broken into "fragments" (confusingly, the header for each document or fragment is a "struct Doc") no bigger than a fixed maximum (currently 1 MB), a size that amortizes seek overhead while keeping down the memory used by a connection, which buffers an entire fragment at once.
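For orientation, here is a minimal sketch of what such a fragment header might carry. The field names and widths are illustrative assumptions, not the actual layout of "struct Doc" in the source:

    // Illustrative sketch only; the real "struct Doc" has more fields
    // and a different layout.
    #include <cstdint>

    struct FragmentHeader {     // plays the role of "struct Doc"
      uint32_t magic;           // sanity marker validated on every read
      uint32_t len;             // length of this fragment, header included
      uint64_t total_len;       // total length of the whole document
      uint8_t  first_key[16];   // 128-bit key of the document's "head"
      uint8_t  key[16];         // 128-bit key of this particular fragment
      uint32_t hlen;            // length of any HTTP header data carried
    };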

Documents with multiple fragments are laid out such that the data fragments are written first and the "head" is written last. When such a document is read, both the "head" and the "earliest" data fragment are validated (tested to ensure that they haven't been overwritten by the write cursor) to verify that the entire document is present on disk, since the two bookend the fragments between them.
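A sketch of the bookend check may make this concrete. It assumes fragments record which pass ("generation") of the circular buffer wrote them; all names here are hypothetical, not the real implementation:

    #include <cstdint>

    struct Cursor { uint64_t offset; uint64_t generation; };

    // A fragment is still intact only if the write cursor has not lapped
    // it since it was written.
    bool fragment_alive(uint64_t frag_offset, uint64_t frag_generation,
                        const Cursor &wc) {
      if (frag_generation == wc.generation)
        return true;                              // written during this pass
      return frag_generation == wc.generation - 1 &&
             frag_offset >= wc.offset;            // still ahead of the cursor
    }

    // A multi-fragment document is trusted only if both bookends survive.
    bool document_complete(const Cursor &wc,
                           uint64_t head_off, uint64_t head_gen,
                           uint64_t earliest_off, uint64_t earliest_gen) {
      return fragment_alive(head_off, head_gen, wc) &&
             fragment_alive(earliest_off, earliest_gen, wc);
    }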

Documents which are in use (being read) or "pinned" in the cache can't be overwritten, so they are "evacuated" from in front of the write cursor: each fragment is read and rewritten after the write cursor. The index is updated for each fragment at the time it is written, except for the "earliest" fragment. A "lookaside" directory holds the "earliest" key so that readers can still find it, but it is not written to the index until after the "head" fragment is written. This ensures that either the entire document is evacuated or it is lost (not corrupted) in the case of a crash.
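The ordering invariant is the important part. Below is a simplified, synchronous model of evacuating one document; the real code is asynchronous and callback-driven, and the types and container choices here are stand-ins:

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Key = std::string;   // stand-in for the 128-bit cache key

    struct Fragment { Key key; std::vector<uint8_t> data; };

    struct Cache {
      std::unordered_map<Key, uint64_t> index;      // key -> disk offset
      std::unordered_map<Key, uint64_t> lookaside;  // "earliest" keys in flight
      uint64_t write_cursor = 0;

      uint64_t write_behind_cursor(const Fragment &f) {
        uint64_t off = write_cursor;
        write_cursor += f.data.size();    // pretend to write; advance cursor
        return off;
      }

      // frags[0] is the "earliest" fragment; the "head" comes last.
      void evacuate(const std::vector<Fragment> &frags) {
        const Key &earliest = frags[0].key;
        for (const Fragment &f : frags) {
          uint64_t off = write_behind_cursor(f);
          if (f.key == earliest)
            lookaside[f.key] = off;       // readers can still find it here
          else
            index[f.key] = off;
        }
        // Only now that the "head" is on disk is the "earliest" fragment
        // published to the index; a crash before this line loses the
        // document cleanly instead of leaving a corrupt copy.
        index[earliest] = lookaside[earliest];
        lookaside.erase(earliest);
      }
    };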

The index maps a 128-bit "key" to a fragment. The fragment's "Doc" header tells the cache whether it is a "head" or a single-fragment document. Given a key, the key of the next fragment in the document is computed deterministically using the "next_CacheKey()" function.
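To illustrate the key chain, the sketch below derives each fragment key deterministically from the previous one. The mixing function is a stand-in (splitmix64), not the actual mapping used by "next_CacheKey()":

    #include <array>
    #include <cstdint>
    #include <cstdio>

    using CacheKey = std::array<uint64_t, 2>;   // 128 bits

    CacheKey next_key(const CacheKey &k) {
      // Any fixed, well-mixed function of the previous key works, as long
      // as reader and writer agree on it.
      auto mix = [](uint64_t x) {
        x += 0x9e3779b97f4a7c15ULL;
        x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
        x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
      };
      return {mix(k[0]), mix(k[1] ^ k[0])};
    }

    int main() {
      CacheKey k = {0x1234, 0x5678};   // key of the "head"
      for (int i = 1; i <= 3; ++i) {   // keys of the next three fragments
        k = next_key(k);
        std::printf("fragment %d key: %016llx%016llx\n", i,
                    (unsigned long long)k[0], (unsigned long long)k[1]);
      }
    }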

The index is stored entirely in memory and requires 10 bytes per fragment; this is what makes zero-seek cache misses possible. The combination of zero-seek misses, a RAM cache for popular documents, and write aggregation (which groups many small writes into one large one) dramatically reduces the number of seeks compared with a naive file-per-document approach.
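To make the memory cost concrete (illustrative numbers only, since the average fragment size depends entirely on the workload): a 1 TB disk whose fragments average 32 KB holds about 33 million fragments, so the in-memory index costs roughly 33M × 10 bytes, or about 330 MB of RAM.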

The index is stored on disk in two copies (A and B), one of which is "clean" while the other is being written from memory. The index is broken up into buckets (currently 4 directory entries per bucket), which are in turn grouped into "segments" of 64K directory entries, each segment maintaining a freelist of directory entries. The net effect is that, with very high probability, a hit or a miss is resolved within one L1 cache line (the 4 entries of a bucket), and collisions will not overflow the index until occupancy is very high (>90%).
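A sketch of the bucket probe, using the sizes from the text; the entry layout, tag width, and hash split are stand-ins rather than the real directory format:

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <vector>

    struct DirEntry {      // 10 bytes per entry in the real directory
      uint16_t tag;        // partial key bits used to confirm a match
      uint64_t offset;     // disk location (packed more tightly in reality)
    } __attribute__((packed));

    constexpr std::size_t ENTRIES_PER_BUCKET  = 4;
    constexpr std::size_t ENTRIES_PER_SEGMENT = 64 * 1024;

    struct Directory {
      std::vector<DirEntry> entries;   // all segments, contiguous

      std::optional<uint64_t> probe(uint64_t key_hash, uint16_t key_tag) const {
        if (entries.empty())
          return std::nullopt;
        std::size_t bucket = key_hash % (entries.size() / ENTRIES_PER_BUCKET);
        // The 4 candidate entries (40 bytes) share one L1 cache line, so a
        // hit or a miss normally costs a single cache-line fill.
        for (std::size_t i = 0; i < ENTRIES_PER_BUCKET; ++i) {
          const DirEntry &e = entries[bucket * ENTRIES_PER_BUCKET + i];
          if (e.offset != 0 && e.tag == key_tag)   // offset 0 == free here
            return e.offset;
        }
        return std::nullopt;   // overflow via the segment freelist elided
      }
    };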

Cache Consistency

The cache is completely consistent, up to and including kicking the power cord out, provided the volatile write buffer on consumer disk drives is disabled. You need to use:

hdparm -W0

The cache validates that all the data for a document is available and will silently treat a partial document as a "miss" on read. There is no "gentle" shutdown for Traffic Server; you just kill the process, so the "recovery" code (the equivalent of fsck) runs every time Traffic Server starts up.

On startup the two versions of the index are checked, and the last valid one is read into memory. Traffic Server then moves forward from the last snapshotted write cursor position, reading all the fragments written to disk since then and updating the directory as it goes (as in a log-structured file system). It stops at the write before the last valid write header it sees, because a single write is not necessarily atomic (the drive may reorder sectors). The updated index is then written over the invalid copy (in case of a crash during startup) and the system starts.
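Schematically, the recovery pass looks something like the following. The names are hypothetical and torn-write detection is reduced to a flag; the real code validates per-write headers and sequence numbers:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // One aggregated write: a header plus the fragments it carried.
    struct WriteBlock {
      bool header_valid;                     // magic/sequence check passed
      std::vector<uint64_t> fragment_keys;   // fragments in this write
    };

    struct Index {
      void update(uint64_t /*key*/, uint64_t /*offset*/) {}   // stub
    };

    // Scan forward from the last snapshotted write cursor, replaying writes
    // into the in-memory index. Because a single write is not atomic
    // (sectors may be reordered), a write is only replayed once the *next*
    // write header has also been seen intact.
    uint64_t recover(const std::vector<WriteBlock> &log, Index &dir,
                     uint64_t start_offset) {
      uint64_t cursor = start_offset;
      for (std::size_t i = 0; i + 1 < log.size(); ++i) {
        if (!log[i].header_valid || !log[i + 1].header_valid)
          break;                             // stop at the first torn write
        for (uint64_t key : log[i].fragment_keys)
          dir.update(key, cursor);
        ++cursor;                            // advance past this write
      }
      return cursor;                         // new write cursor position
    }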
