The current state of affairs when it comes to cache lookup URLs is quite a mess. This document describes the outline of the changes I'd like to make. Part of this includes eliminating a few APIs, which depend on a strange concept of a 2nd cache state machine. The proposal includes the alternatives I have for this, but bear in mind that the implementation of those alternatives belongs in different project plans.

The problem

There are a number of issues with how we deal with cache URLs in the code right now:

  • There are a number of URL objects that are set up to store potential cache keys. It's quite a mess, and some of them are already dead code. Some of them come from the 2nd cache state machine (more details later). This makes the code overly complicated and suboptimal.
  • The current API (TSCacheUrlSet()) to modify the cache key is inflexible and slow. The reason for this is twofold:
    1. The API takes a string, which has to be parsed as a URL creating a new URL object internally. This is not only slow, but means the cache key has to be a parseable URL.
    2. To create the cache key (an INK_MD5), we have to stringify this URL object again (undoing the parsing), and do an MD5 over this string. These are simply wasted cycles.
  • There is no easy way to make a small modification to the cache key. You have to create an entirely new URL cache key.

Proposal

I'm breaking this proposal up into three sections, since it involves pretty invasive changes to not only the core, but also the elimination of an experimental API.

API additions

We will use the term "Key" or "Cache Key" to mean the original data (by default the URL) that externally identifies the object. The term "ID" will be used for the post-hash data used internally to identify objects in the cache.

We also need a new type to represent the post-hash data. It should include a data pointer (either void* or uint8_t*) and a size indicator (an enumerated type or a literal data length):

  TSCacheID

The following APIs would be added:

  TSReturnCode TSHttpTxnCacheKeySet(TSHttpTxn txnp, void *data, size_t len);
  TSReturnCode TSHttpTxnCacheKeyUpdate(TSHttpTxn txnp, void *data, size_t len);
  TSCacheID* TSHttpTxnCacheIDGet(TSHttpTxn txnp);
  TSReturnCode TSCacheKeyGenerate(TSCacheID *result, void *data, size_t data_len);

The Set() API will hash the data provided and replace the old cache key for the transaction completely. The Update() API will modify the currently active cache key, which the core defaults to a hash of the remapped request URL, by in effect appending the data provided to the original cache key. This is the best way to make incremental changes to the cache key, e.g. in a plugin implementing generation IDs over a set of URLs. IDGet() will return the result of hashing the key. Generate() exposes the hashing function: it computes the hash ID from the input and stores it in the TSCacheID.
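The difference between Set() and Update() can be sketched with a small stand-alone model. Everything here is an assumption for illustration: the sketch_ names are hypothetical, the ID is a single 64-bit value rather than the data pointer and size described above, and 64-bit FNV-1a stands in for the real hash because its state can be carried forward byte by byte. The actual implementation would use the core's hashing function and might have to keep extra state to support appending.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a 64-bit constants. FNV-1a is only a stand-in for the core's hash. */
#define SKETCH_FNV_OFFSET 14695981039346656037ULL
#define SKETCH_FNV_PRIME  1099511628211ULL

/* Hypothetical stand-in for TSCacheID: the post-hash ID of the cache key. */
typedef struct {
  uint64_t hash;
} sketch_cache_id;

/* Fold len bytes of data into an existing hash state and return the result. */
uint64_t sketch_hash_feed(uint64_t state, const void *data, size_t len)
{
  const uint8_t *p = (const uint8_t *)data;
  for (size_t i = 0; i < len; i++) {
    state ^= p[i];
    state *= SKETCH_FNV_PRIME;
  }
  return state;
}

/* Models Set(): discard the old key and hash only the new data. */
void sketch_key_set(sketch_cache_id *id, const void *data, size_t len)
{
  id->hash = sketch_hash_feed(SKETCH_FNV_OFFSET, data, len);
}

/* Models Update(): continue hashing from the current ID, which gives the
 * same result as hashing the original key with the new data appended. */
void sketch_key_update(sketch_cache_id *id, const void *data, size_t len)
{
  id->hash = sketch_hash_feed(id->hash, data, len);
}
```

In this model, calling sketch_key_set() with a URL and then sketch_key_update() with a generation tag yields the same ID as calling sketch_key_set() once on the concatenation of the two, without ever building or parsing the concatenated string.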

NOTE: We also need a way to optionally preserve the data used to generate the cache key for debugging purposes. Users that manipulate the cache key (e.g. using the cacheurl plugin) need to be able to verify their changes.

NOTE: We should also study the TSCacheKey API and determine whether it makes sense to extend or repurpose it.

API removal

I'm proposing that we eliminate the 2nd cache state machine and all the APIs related to it. This includes:

  tsapi TSReturnCode TSHttpTxnNewCacheLookupDo(TSHttpTxn txnp, TSMBuffer bufp, TSMLoc url_loc);
  tsapi TSReturnCode TSHttpTxnSecondUrlTryLock(TSHttpTxn txnp);

There's currently only one plugin using these APIs, the rfc5861 plugin. Instead of the old functionality, I propose we make two changes to the core (and additional new APIs):

  • Allow the cache core to tag a cache entry to be served stale, and for how long. This functionality would be exposed through three paths:
    1. Explicitly set via a plugin using new APIs.
    2. Implicitly set via the stale-while-revalidate directive of the Cache-Control: header, as specified in RFC 5861. This could be overridden by a plugin, as above.
    3. Defaulted via records.config and/or hosting.config settings. If in records.config, it could be configured per-remap via conf_remap plugin.
  • Allow the HTTP State Machine to restart the cache-sm any number of times (not just once or twice). This requires both changes to the core and additions to the APIs. It's a more generic version of the current 2nd cache state machine.
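The three paths for setting the stale window in the first item can be sketched as a simple precedence rule: a plugin value wins, then the stale-while-revalidate directive, then the configured default. This is a minimal sketch under stated assumptions, not a proposed implementation. All sketch_ names and SKETCH_STALE_UNSET are hypothetical, and the directive lookup is deliberately simplified (case-sensitive substring match, no quoted-string handling).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker meaning "no value supplied on this path". */
#define SKETCH_STALE_UNSET (-1L)

/* Extract the seconds from "stale-while-revalidate=<seconds>" in a
 * Cache-Control value. Returns SKETCH_STALE_UNSET when the directive is
 * absent or has no usable number. Simplified: case-sensitive matching. */
long sketch_parse_swr(const char *cache_control)
{
  static const char directive[] = "stale-while-revalidate=";
  const char *p = cache_control ? strstr(cache_control, directive) : NULL;
  char *end;
  long secs;

  if (p == NULL) {
    return SKETCH_STALE_UNSET;
  }
  p += sizeof(directive) - 1; /* skip past the directive name and '=' */
  secs = strtol(p, &end, 10);
  if (end == p || secs < 0) {
    return SKETCH_STALE_UNSET;
  }
  return secs;
}

/* Resolve the stale window: plugin override first, then the response
 * header, then the configured default. */
long sketch_stale_window(long plugin_secs, const char *cache_control,
                         long config_default_secs)
{
  long header_secs;

  if (plugin_secs != SKETCH_STALE_UNSET) {
    return plugin_secs;
  }
  header_secs = sketch_parse_swr(cache_control);
  if (header_secs != SKETCH_STALE_UNSET) {
    return header_secs;
  }
  return config_default_secs;
}

/* An entry may be served stale once it is past its freshness lifetime,
 * but only while it is no more than window seconds past it. */
int sketch_may_serve_stale(long age, long freshness_lifetime, long window)
{
  return age >= freshness_lifetime && age - freshness_lifetime <= window;
}
```

For example, with a response carrying "max-age=600, stale-while-revalidate=30" and no plugin override, the resolved window is 30 seconds, so an entry aged 620 seconds could still be served stale while one aged 640 seconds could not.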

It's not within the scope of this cleanup project to implement either of these new features.

Code cleanup

There's a large amount of code around this. I'm suggesting we remove the 2nd cache state machine and reduce the (fairly large) number of URL objects stored. Instead, we store a single INK_MD5 for the cache key, and keep it updated through the state machine as well as via APIs. I will also provide benchmarks on the effects of this once completed, but we already have some anecdotal evidence of how expensive modifying the cache key is today via the current API.

We should also investigate how the existing cache key (TSCacheKey, for non-HTTP caches) can possibly interact with this.
