Background

These are the lecture notes I wrote for a guest lecture in Fall 2022 for a Brown University Software Engineering course. I wanted to introduce basic concepts in caching and then CDNs as a way to explore challenges in software engineering related to design, testing and performance.

Caching

Caching is one of the simplest and most widely applicable algorithms I’ve run into. It’s the idea that if you use something often enough, it’s worth it to give up some space to keep it close at hand. You’re making a tradeoff between time and/or effort required to get the item vs using the easily accessible space to store it in. Let’s talk about some examples of what different kinds of caches look like:

Examples

Books - my bedside table is much smaller than my bookcase, but it’s worthwhile to put the books I’m currently reading there so I don’t have to get out of bed when I want to read them.
Operating system page cache - computer architectures typically include inexpensive slow “cold” storage media (solid state drives, spinning disks) and more expensive and fast media - Random Access Memory (RAM). When computer programs work with data, everything has to be mapped into the memory space of the running process in fixed-size units called “pages”. To speed this up, your operating system opportunistically stores these pages in the “page cache”, a fixed size region also in RAM.
CDNs - Content Distribution Network - intermediary caches where requests to your website will go first. If the request result has been looked up before at the cache, it will be available to others. CDNs typically have a large global distribution, they allow their customers (people running websites) to store content closer to their users to handle much more traffic that their server (”backend”) is able to. Effectively, you’re renting space on someone else’s computers, similar to cloud computing.

Correctness

What does it mean for a cache to be correct? A cache is a form of storage. There are a bunch of these: filesystems, databases, hashmaps, email inboxes. They all require that objects stored in them are retained and able to be accessed again. This isn’t actually true for a cache - it’s a best effort system with limited space. It’s free to not store or delete as much as it wants. What we think about as the “correctness” problems for a cache are:

Will this result in a user receiving incorrect information?
Will this result in a user seeing someone else’s information?

The downside of something not being in a cache is that you have to work slightly hard to go and get it somewhere less convenient. i.e. I have to walk to my bookshelf, you won’t ever get wrong information. However, caching something not meant to be cached in a CDN (for example, the personal details of the previous person who accessed your site) is really bad. The information is incorrect for all other users and risks leaking private information. And, like other forms of storage, the things stored in it should not spontaneously change (i.e. experience data corruption). This would result in incorrect information for users. Users should only be able to access the information they stored in the cache. Notice it’s not ‘always’ - it doesn’t need to be there. But they shouldn’t get something they’re not allowed to see.

What is the simplest implementation of a totally correct cache?

An empty cache that admits no objects. But that’s not very useful!

Efficiency

So, we’ve talked about what it means for a cache to be correct, but that’s not the same as it being actually helpful. What does it mean for a cache to be good?

When you look something up, it’s usually there. If it’s never there, you’re wasting time and might as well just always use the other source.
You can look for an object and get it out of the cache quickly. If it’s slower than the other source, you might as well just use that.

These are the specific terms used to describe those concepts:

Cache Hit Ratio (CHR) - this is the percentage of the time that I look for an item in the cache and it’s there for me (a HIT). Larger is better.

HIT Latency - how quickly do we get back the object when it’s in the cache?

MISS Latency - how quickly do we get back the object when it’s not in the cache? We might tolerate a slower time here, because ideally it will be an abnormal case. We have to both get the object and store it in the cache, but future lookups should get the benefit of the HIT latency.

The CHR a cache achieves depends a lot on how it’s being used. The main external variables in question are:

Working Set Size - this is the size of the set of objects that tend to be accessed through the cache. If the working set size fits within the whole cache, we expect the CHR to be 100.

Object Size - how big are the objects themselves? You can fit more, smaller objects in a cache. Are the objects all the same size? Unevenly sized objects can result in large objects taking up more of the cache than we might want.

Access Patterns In what order and with what frequency are objects accessed? Caches with a large working set size can still be quite effective (have high CHRs) if object popularity has certain distributions or different objects tend to be accessed at different times. For example, it takes me a couple weeks to finish books and I keep reading the same ones until I’m finished. Therefore, my CHR for my bedside table is very high, even if I own many books and have a small bedside table. I only MISS when I finish a book and start a new one. However, if my bedtime reading pattern was instead random chapters from a different random book every night, my expected CHR would only be |side table| / |books|. When designing a cache, you have to keep these variables in mind and tailor it for the types of workloads you’re expecting.

So how can we make a cache work better?

Increase the size of the cache
Only store really useful items
Better data structures

Cache Size - how much space we have available in the cache. Intuitively, a larger cache tends to work better. But caches are usually expensive to operate.

Cache Admission Policy - what goes in the cache in the first place? Most caches have a pretty liberal admission policy and rely on the replacement policy to winnow out undesirable objects, but the admission policy could do things like set a maxium object size, or look at other characteristics of the object that indicate whether or not it’s likely to be accessed again.

Cache Replacement Policy - when the cache fills up, what do we kick out? The most common strategies here are Least Recently Used (LRU) and Least Frequently Used (LFU).

LRU - When an object is accessed, move it to the front of the queue. When it’s time to evict, delete the object at the end of the queue.
LFU - Keep a counter with each object, tracking the number of times it’s been used. Keep the objects in a data structure that it’s easy to keep sorted - min heap.

The effectiveness of these policies unfortunately does still depend on the access pattern. Neither is always going to be successful. They’re both trying to make predictions about what’s going to happen in the future based on the path. LRU makes the bet that objects are temporally popular and the recent access implies that it will be used again soon. But what about something that I use infrequently but consistently.

Data Structures - give the object identity (title, address, url), we need to be able to quickly locate it. The best bet is a hash table. However, in order to make efficient use of space, we need to also be able to delete things quickly once they become stale. If we’re using LRU, we need to use a data structure that is amenable to being sorted based on access frequency.

CDNs

As we discussed, a CDN is a specific type of cache. The design and implementation need to take into account the characteristics of the wider system it’s operating in, i.e. the internet. You could imagine caches operating at various levels of the networking stack, but CDNs work at the HTTP level. This means they operate in terms of HTTP Request and Responses and they’re aware of the HTTP protocol and need to obey the HTTP specifications. In fact, caching is such an important part of web traffic, the HTTP specifications have been extended to include a number of features designed to take advantage of and customize caching behavior.

Cacheability

The first piece of information we need to know about caching is how to determine if an object is cacheable (allowed to be stored in a cache or not). HTTP provides a list of “heuristics” - baseline characteristics and the Cache-Control header allows further customization. Cache-Control Header The Cache-Control header can actually be present on both Requests and Responses and it’s contents (referred to as caching “directives”) represent instructions to “shared caches” (such as CDNs) and “private caches”, like your personal web browser. For the purposes of the rest of this talk, I’ll be discussing Cache-Control headers present on Responses (the data returned from a server) and the caching behavior of CDNs. The RFC states:

A cache MUST NOT store a response to a request unless:

the no-store cache directive is not present in the response (see Section 5.2.2.5);
the private response directive is either not present or allows a shared cache to store a modified response; see Section 5.2.2.7);
the Authorization header field is not present in the request (see Section 11.6.2 of [HTTP]) or a response directive is present that explicitly allows shared caching (see Section 3.5); and
the response contains at least one of the following:
- a public response directive (see Section 5.2.2.9);
- a private response directive, if the cache is not shared (see Section 5.2.2.7);
- an Expires header field (see Section 5.3);
- a max-age response directive (see Section 5.2.2.1);
- a status code that is defined as heuristically cacheable (see Section 4.2.2).

Testing

This specification is really important! It’s how we avoid incorrectly caching things that should not be cached. So, we should definitely test it. But this specification is pretty non-trivial and covers a lot of possible states. The first option for testing is to enumerate different possibilities and assert that they behave correctly. The presence of “no-store” is a binary condition, it’s either there or it isn’t. The same goes for private, public, Authorization, Expires and max-age. But all of those can be present at the same time! That’s 2^5 = 32 test cases. And then we have status codes. There are 12 status codes that are “heuristically cacheable” (200, 203, 204, 206, 300, 301, 308, 404, 405, 410, 414, and 501), but there are 63 that have assigned meanings (and up to 600 possible). 32 * 63 = 2016! I don’t want to write that many tests!

There are a lot of situations like this in software engineering - you have a huge space of possible inputs and it’s not feasible to write an explicit test case for each of them. We can do “example testing” where we write out an example scenario and state the exact way we expect it to behave. You can pick “representative cases” and try to think of interesting edge cases, but it’s inherently limited in scope. We can also do fuzz testing which generates a huge amount of random inputs and fails if the program panics or crashes in a bad way. The covers a much wider space, but the testing is much less nuanced - “crash or not crash” misses a bunch of other things which might be going wrong.

A testing method which combines these two strategies is called “property based testing”. It was popularized with a Haskell framework called QuickCheck, but has been ported to many different software stacks. The idea is that, like fuzz testing, you programatically generate a wide range of inputs and feed them to your software. Then, you have another program that accepts the results and makes assertions about their “properties”.

Then, as part of my test, I can pass such data as needed to make the determination (Response, Cached?) to a function that evaluates the rules of the RFC and tells me if it was right or not. Another really cool feature of property-based testing frameworks is that they iterate to find the “minimal reproducing test case”. Running these tests consistently is really helpful in detecting whether or not new bug fixes or features unintentionally changed this behavior.

I’ll go on to talk about some more features of HTTP caching and CDNs and how, like any new feature, they always add complications.

Expiration

Imagine you have a website that posts weather forecasts that are recomputed every hour. The webpage contents remain the same for an hour, but it is updated. The content is “static-ish” since it changes pretty infrequently. To take advantage of a CDN, but avoid serving your users out of date weather information, you can assign an expiration value to your HTTP responses using the Cache-Control: max-age header. This means we have a new property that we need to enforce for our cache. We can store the data for at most the number of seconds specified.

How should we implement object expiration?

1st Defintion: Object expires at “object lookup time” + max-age.

Slow backends

External servers can take a long time to respond and this first attempt produces bad results in the following scenario:

Request object at time T1
It’s a MISS, forward to backend
Server is slow to respond and sends object with max-age=1
Object is expired before it enters the cache!

2nd Definition: Object expires at “cache admission time” + max-age

Sharding

One of the main levers CDNs have to get better CHRs for users is having a larger cache. There are inherent scaling limitations for a single server - you need storage space (disks) for the objects and you need RAM for your data structures that allow you to find the objects. A solution to this is “horizontal scaling” - add more computers to a local network and use them collectively as the cache. This allows you to store way more objects but also to handle way more total requests.

The way it works is that when a request arrives at the cache, it first arrives on a random node which uses a hashing function to determine which node is the “primary” for that object and look for it there. If it’s there, we use it. If not, we make a request to the website’s server to get the original. We also store a copy of the object on the node that it was requested from, so that the burden of very popular objects gets spread across the cluster. There’s a fair amount to get right here. To gracefully accommodate various types of failures, the cluster nodes need to be aware of the health status of all the others and if the “primary” for an object goes away, we need a hash function that can calculate a new primary, without redistributing all the keys. These are big themes that show up in Distributed Systems works generally.

Now that our software is running across a collection of computers, it’s also now a lot harder to test. Here’s an example of yet another bug with the expiration implementation that we didn’t notice until we started using property-based testing:

Request object at time T1
Stored on primary, expiration time = T1 + max-age.
Request object from non-primary node at time T2
Stored on non-primary, expiration time = T2 + max-age

In other words, the object’s expiration time is refreshed eveytime it’s propogated to a member of the cluster and it can live much longer than the original max-age.

3rd Definition: Object expires at “backend response time” + max-age

Conclusions

Caches are truly everywhere in software systems. Becoming familiar with them will give you an intuition about a lot of types of problems. Evening just clearing your browser cache when a page won’t update.

When building products, it’s critical to figure out what it’s required for something to be correct. Next, figure out what it means for it to be good. Explore the usage patterns and system constraints that contribute to performance and explore what levers you have available to improve things. This could be adding system resources, more efficient algorithms and picking the right data structures. Finally, every new feature has the possibility of negatively interacting with every other existing feature. Figure out a test plan that gets you coverage without exponential work.