note · 2026-03-03

Stale-while-revalidate for embeddings: the corner cases

SWR works well for HTTP, less obviously for embeddings. A pragmatic walk through when it helps, when it bites, and what to do about model migrations.

#caching #invalidation #swr

The stale-while-revalidate pattern is well-loved in HTTP caching: serve the existing cached entry immediately, kick off a background refresh, return the fresh entry on the next call. It’s the cleanest answer to “don’t make the user wait for the cache miss.” For embeddings, it sometimes does what you want and sometimes does something subtly wrong. This is a note on which is which.

Where SWR fits cleanly

The case where SWR is uncontroversial: you’ve decided to migrate from embedding model A to embedding model B, and you’d rather not block your read path while ten million vectors get recomputed.

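The setup: reads ask for model B’s vector. On a miss, the cache serves the existing model A vector immediately and enqueues a background recompute with B, so the next read for that text gets B.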

This is a sensible migration strategy. It avoids the “schedule a backfill, wait two days, cut over atomically” dance, at the cost of running both models for a while. The corner case it papers over — serving A’s vector when the caller asked for B — only matters if the downstream consumer (your retrieval ranker) treats vectors from different models as comparable. They are not comparable. So this only works if you also tag the served vector with the model it came from, and your retriever knows to fall back gracefully.

In practice, most retrievers don’t. So this version of SWR is really “serve B if you have it, otherwise serve nothing and enqueue.” That’s a fine pattern, but it’s not quite SWR — it’s lazy backfill.
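A minimal sketch of that lazy-backfill read path, assuming an in-memory dict as the cache and a simple queue for pending work. Every name here (`TARGET_MODEL`, `get_vector`, `drain_backfill`, `fake_embed_b`) is hypothetical and is not EmbedCache's API; the embedder is a stand-in so the example runs on its own.

```python
from collections import deque
from typing import Optional

# Hypothetical names throughout; this is not EmbedCache's actual API.
TARGET_MODEL = "model-b"

cache: dict[tuple[str, str], list[float]] = {}  # (model_id, text) -> vector
backfill_queue: deque[str] = deque()            # texts waiting to be embedded with B


def get_vector(text: str) -> Optional[list[float]]:
    """Return the model-B vector if cached; otherwise enqueue and return None."""
    vector = cache.get((TARGET_MODEL, text))
    if vector is None and text not in backfill_queue:
        # Never fall back to a model-A vector: it lives in a different space.
        backfill_queue.append(text)
    return vector


def fake_embed_b(text: str) -> list[float]:
    """Stand-in for the real model-B embedder call."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def drain_backfill() -> int:
    """Background worker step: embed everything queued, return how many were done."""
    done = 0
    while backfill_queue:
        text = backfill_queue.popleft()
        cache[(TARGET_MODEL, text)] = fake_embed_b(text)
        done += 1
    return done
```

The caller has to treat `None` as "no vector yet" and skip or defer that item. Nothing stale is ever served, which is why this is lazy backfill and not SWR.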

Where SWR bites: the deterministic-output assumption

HTTP SWR assumes that a refresh might produce different bytes than the stale entry, and the whole point is to eventually surface the fresh bytes. Embeddings break this assumption in both directions.

If you embed the same text with the same model, the vector is deterministic, up to small floating-point differences (ONNX precision edge cases, for example). Refreshing the cache costs you the embedder call and produces the identical vector you already had. There’s no fresher truth, just a duplicate computation.

If you embed the same text with a different model, the vector is different, but it isn’t a “fresher” version of the old one; it’s a different vector in a different space. Calling it a refresh is a category error.

So when someone proposes “SWR for embeddings,” the first question is: what’s the staleness signal? In HTTP, it’s a TTL or a max-age. In embeddings, the only legitimate staleness signal is a model version change. Time alone doesn’t stale a vector.

The TTL trap

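A common mistake: setting a TTL on embedding cache entries. The reasoning is “be safe, refresh occasionally in case the model changed.” The reality is that while the model is unchanged, every expiry pays for an embedder call that reproduces the vector you already had, and when the model does change, entries roll over one at a time as their TTLs happen to lapse.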

TTL on embeddings is the wrong tool. The right tool is an explicit model version in the cache key. When you change models, you change the key, and every entry is a deterministic miss until it gets recomputed. You get a clean cutover under your control, not a stochastic smear.
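One way to put the model version in the key, as a sketch only. The key layout, the `emb:` prefix, and the function name `embedding_cache_key` are assumptions made for illustration, not something EmbedCache prescribes.

```python
import hashlib


def embedding_cache_key(text: str, model_id: str, model_version: str) -> str:
    """Build a content-addressed key that changes whenever the model does.

    Assumes model_id and model_version contain no ':' characters.
    """
    text_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"emb:{model_id}:{model_version}:{text_hash}"


cache: dict[str, list[float]] = {}

old_key = embedding_cache_key("hello world", "model-a", "1")
cache[old_key] = [0.1, 0.2, 0.3]  # placeholder vector, not a real embedding

# Same text, same model: same key. The entry stays valid with no TTL.
same_key = embedding_cache_key("hello world", "model-a", "1")

# Same text, new model: different key, so a guaranteed miss until recomputed.
new_key = embedding_cache_key("hello world", "model-b", "1")
```

With this layout a model change needs no explicit invalidation step. Old entries are no longer addressed by any read and can be evicted whenever convenient.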

The interesting corner case: chunker drift

Here’s where it gets uncomfortable. EmbedCache supports multiple chunkers — words, llm-concept, llm-introspection. The vector you get for a document depends on both the embedder and the chunker. The chunker is upstream of the embedding, and its output is not deterministic if it’s LLM-driven.

Run an LLM chunker over the same document twice and you may get two slightly different chunk sets. Two different chunk sets give you two different sets of vectors. Now your cache key needs to include not just the model and the text, but the chunker version and an attestation that the chunks came from the same chunker run.

The clean answer: cache the chunker output separately from the embedder output. The chunker takes (document, chunker_id) → list[chunk]. The embedder takes (chunk, model_id) → vector. Each has its own cache, each is content-addressed by its own input hash. The composition is deterministic if both pieces are.

The not-clean answer: cache (document, chunker_id, model_id) → list[vector] and accept that re-running an LLM chunker against the same document will produce a cache miss because the chunker output drifted. This is the practical default for most pipelines. The chunker drift shows up as a low-grade cache miss rate that nobody can quite explain.

What SWR actually buys you

Strip the pattern down. SWR for embeddings makes sense when:

  1. You have a model migration in flight and want to avoid a blocking backfill.
  2. Your read path can tolerate serving an old-model vector temporarily.
  3. You have a way to mark the migration done — usually a kill switch that disables the fallback once the cache is mostly populated.
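The three conditions above can be sketched as a read path that tags every vector with its model and puts the old-model fallback behind a switch. This is an illustration under assumptions: `TaggedVector`, `MigratingReader`, and `fallback_enabled` are hypothetical names, and the cache is a plain dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedVector:
    model_id: str              # which model produced the vector
    values: tuple[float, ...]


class MigratingReader:
    """Read path during an A -> B migration. All names here are hypothetical."""

    def __init__(self, old_model: str, new_model: str) -> None:
        self.old_model = old_model
        self.new_model = new_model
        self.fallback_enabled = True  # the kill switch
        self.cache: dict[tuple[str, str], TaggedVector] = {}
        self.pending: list[str] = []  # texts queued for recomputation with new_model

    def get(self, text: str) -> Optional[TaggedVector]:
        fresh = self.cache.get((self.new_model, text))
        if fresh is not None:
            return fresh
        if text not in self.pending:
            self.pending.append(text)
        if self.fallback_enabled:
            # May still be None if the old model never embedded this text.
            return self.cache.get((self.old_model, text))
        return None
```

The tag is what lets a retriever notice it was handed an old-model vector and decide what to do with it. Setting `fallback_enabled` to `False` turns the same read path into plain lazy backfill.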

It doesn’t make sense when:

  1. The two models produce vectors in incompatible spaces (almost always the case) and your retriever can’t handle the mix.
  2. You’re using TTL as a proxy for “should I refresh.” There is no refresh; there is only “did the model change.”
  3. Your chunker is LLM-driven and non-deterministic. SWR will not save you from chunker drift; only a chunker output cache will.

The pattern we actually recommend

For an EmbedCache deployment with model migration in mind:

This is less elegant than “just turn on SWR” — but it survives contact with a model migration, which the elegant version doesn’t.

The deeper point: HTTP-shaped caching patterns transplant poorly into embedding workloads because embeddings have an extra dimension — the model — that HTTP cache entries don’t. Most cache-design intuition treats freshness as a one-dimensional time axis. For embeddings, freshness is a two-dimensional space (time × model_version), and only one of those axes deserves invalidation. Get that one fact right and most of the corner cases collapse.


Written for the EmbedCache project by Skelf Research.
