Stale-while-revalidate for embeddings: the corner cases
SWR works well for HTTP, less obviously for embeddings. A pragmatic walk through when it helps, when it bites, and what to do about model migrations.
The stale-while-revalidate pattern is well-loved in HTTP caching: serve the existing cached entry immediately, kick off a background refresh, return the fresh entry on the next call. It’s the cleanest answer to “don’t make the user wait for the cache miss.” For embeddings, it sometimes does what you want and sometimes does something subtly wrong. This is a note on which is which.
Where SWR fits cleanly
The case where SWR is uncontroversial: you’ve decided to migrate from embedding model A to embedding model B, and you’d rather not block your read path while ten million vectors get recomputed.
The setup looks like:
- Your cache key includes model_id. The entries for A and B are distinct rows.
- Your read path asks for B. If the row exists, serve it. If not, serve A’s row as a fallback and enqueue a background job to compute B’s row.
- Over time, the working set migrates. Cold rows that nobody asks for stay on A; hot rows migrate first.
This is a sensible migration strategy. It avoids the “schedule a backfill, wait two days, cut over atomically” dance, at the cost of running both models for a while. The corner case it papers over — serving A’s vector when the caller asked for B — only matters if the downstream consumer (your retrieval ranker) treats vectors from different models as comparable. They are not comparable. So this only works if you also tag the served vector with the model it came from, and your retriever knows to fall back gracefully.
In practice, most retrievers don’t. So this version of SWR is really “serve B if you have it, otherwise serve nothing and enqueue.” That’s a fine pattern, but it’s not quite SWR — it’s lazy backfill.
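A minimal sketch of that lazy-backfill read path, assuming an in-memory dict as the cache and a deque as the job queue. The names content_key, get_vector, and run_backfill are hypothetical and are not EmbedCache APIs; the embedder is passed in as a placeholder callable.

```python
import hashlib
from collections import deque
from typing import Callable, Optional

# Hypothetical in-memory stand-ins for a real cache and a real job queue.
cache: dict[tuple[str, str], list[float]] = {}
backfill_queue: deque[tuple[str, str]] = deque()


def content_key(text: str) -> str:
    """Content-address the input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def get_vector(text: str, model_id: str) -> Optional[list[float]]:
    """Serve the requested model's vector, or nothing at all on a miss."""
    vector = cache.get((content_key(text), model_id))
    if vector is None:
        # Never substitute another model's vector; queue the work instead.
        job = (text, model_id)
        if job not in backfill_queue:
            backfill_queue.append(job)
    return vector


def run_backfill(embed: Callable[[str, str], list[float]]) -> None:
    """Drain the queue. `embed(text, model_id)` is a placeholder embedder."""
    while backfill_queue:
        text, model_id = backfill_queue.popleft()
        cache[(content_key(text), model_id)] = embed(text, model_id)
```

The caller sees None until the background job has run, so the retriever never has to reason about mixed vector spaces.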
Where SWR bites: the deterministic-output assumption
HTTP SWR assumes that a refresh might produce different bytes than the stale entry, and the whole point is to eventually surface the fresh bytes. Embeddings break this assumption in both directions.
If you embed the same text with the same model, the vector is deterministic up to floating-point noise (ONNX precision edge cases, for example). Refreshing the cache costs you the embedder call and produces the identical vector you already had. There’s no fresher truth, just a duplicate computation.
If you embed the same text with a different model, the vector is different, but it isn’t a “fresher” version of the old one; it’s a different vector in a different space. Calling it a refresh is a category error.
So when someone proposes “SWR for embeddings,” the first question is: what’s the staleness signal? In HTTP, it’s a TTL or a max-age. In embeddings, the only legitimate staleness signal is a model version change. Time alone doesn’t stale a vector.
The TTL trap
A common mistake: setting a TTL on embedding cache entries. The reasoning is “be safe — refresh occasionally in case the model changed.” The reality is:
- If the model didn’t change, the refresh is wasted compute.
- If the model did change, every entry will refresh at a random time within the TTL window, smearing the model-migration cost across a rolling window instead of concentrating it where you can manage it.
- The downstream retriever ends up with a mix of old-model and new-model vectors during the smear, which is the worst of both worlds.
TTL on embeddings is the wrong tool. The right tool is an explicit model version in the cache key. When you change models, you change the key, and every entry is a deterministic miss until it gets recomputed. You get a clean cutover under your control, not a stochastic smear.
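As an illustration, here is a small sketch of a versioned key, assuming a plain dict as the store. The cache_key function and the key layout are hypothetical and chosen only to show the idea.

```python
import hashlib


def cache_key(text: str, model_id: str, model_version: str) -> str:
    """Combine the explicit model identity with a hash of the content."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{model_id}:{model_version}:{content_hash}"


# Hypothetical store: a dict standing in for the real cache backend.
store: dict[str, list[float]] = {}
store[cache_key("hello world", "model-a", "1")] = [0.1, 0.2]

# Same text, same model version: a hit, however old the entry is.
hit = store.get(cache_key("hello world", "model-a", "1"))

# Same text, new model version: a guaranteed miss, with no TTL involved.
miss = store.get(cache_key("hello world", "model-a", "2"))
```

Because the version is part of the key, old rows are never overwritten by accident; they can be deleted in bulk once the new version is fully populated.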
The interesting corner case: chunker drift
Here’s where it gets uncomfortable. EmbedCache supports multiple chunkers: words, llm-concept, llm-introspection. The vector you get for a document depends on both the embedder and the chunker. The chunker is upstream of the embedding, and its output is not deterministic if it’s LLM-driven.
Run an LLM chunker over the same document twice and you may get two slightly different chunk sets. Two different chunk sets give you two different sets of vectors. Now your cache key needs to include not just the model and the text, but the chunker version and an attestation that the chunks came from the same chunker run.
The clean answer: cache the chunker output separately from the embedder output. The chunker takes (document, chunker_id) → list[chunk]. The embedder takes (chunk, model_id) → vector. Each has its own cache, and each is content-addressed by its own input hash. The composition is deterministic if both pieces are, and a cached chunk list pins the output of a non-deterministic chunker after its first run.
The not-clean answer: cache (document, chunker_id, model_id) → list[vector] and accept that re-running an LLM chunker against the same document will produce a cache miss because the chunker output drifted. This is the practical default for most pipelines. The chunker drift shows up as a low-grade cache miss rate that nobody can quite explain.
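The clean answer can be sketched as two small caches composed together. This assumes in-memory dicts and caller-supplied chunker and embedder callables. All function names here are hypothetical; only the (document, chunker_id) and (chunk, model_id) key shapes come from the text above.

```python
import hashlib
from typing import Callable

Chunker = Callable[[str], list[str]]
Embedder = Callable[[str], list[float]]

# Two independent caches, each keyed by a hash of its own input.
chunk_cache: dict[tuple[str, str], list[str]] = {}
vector_cache: dict[tuple[str, str], list[float]] = {}


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_for(document: str, chunker_id: str, chunker: Chunker) -> list[str]:
    """(document, chunker_id) -> list[chunk]; the chunker runs once per key."""
    key = (digest(document), chunker_id)
    if key not in chunk_cache:
        chunk_cache[key] = chunker(document)
    return chunk_cache[key]


def vector_for(chunk: str, model_id: str, embedder: Embedder) -> list[float]:
    """(chunk, model_id) -> vector; the embedder runs once per key."""
    key = (digest(chunk), model_id)
    if key not in vector_cache:
        vector_cache[key] = embedder(chunk)
    return vector_cache[key]


def embed_document(
    document: str,
    chunker_id: str,
    chunker: Chunker,
    model_id: str,
    embedder: Embedder,
) -> list[list[float]]:
    """Compose the two caches: cached chunks first, then cached vectors."""
    chunks = chunks_for(document, chunker_id, chunker)
    return [vector_for(chunk, model_id, embedder) for chunk in chunks]
```

Once a document's chunk list is cached, a drifting chunker is not consulted again for that document, so the vectors stay stable across repeated calls.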
What SWR actually buys you
Strip the pattern down. SWR for embeddings makes sense when:
- You have a model migration in flight and want to avoid a blocking backfill.
- Your read path can tolerate serving an old-model vector temporarily.
- You have a way to mark the migration done — usually a kill switch that disables the fallback once the cache is mostly populated.
It doesn’t make sense when:
- The two models produce vectors in incompatible spaces (almost always the case) and your retriever can’t handle the mix.
- You’re using TTL as a proxy for “should I refresh.” There is no refresh; there is only “did the model change.”
- Your chunker is LLM-driven and non-deterministic. SWR will not save you from chunker drift; only a chunker output cache will.
The pattern we actually recommend
For an EmbedCache deployment with model migration in mind:
- Key cache entries on (content_hash, model_id, chunker_id, chunker_version). All four.
- Treat “miss because model_id changed” as a deterministic event, not a staleness event. Plan the migration as a backfill job, not as organic refresh traffic.
- If you need read-path availability during migration, run a fallback query against the previous model_id row and serve it with an explicit “served from previous model” tag in the response. The retriever decides whether to use it.
- Don’t put TTLs on cache rows. Put a created_at and an embedder_fingerprint so you can audit drift, but let entries live until explicitly invalidated.
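A rough sketch of the four-part key and the tagged fallback, assuming a dict of rows. CacheKey, Served, and lookup are hypothetical names for illustration, and the audit fields (created_at, embedder_fingerprint) are left out to keep the example short.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    content_hash: str
    model_id: str
    chunker_id: str
    chunker_version: str


@dataclass
class Served:
    vector: list[float]
    model_id: str  # the model that actually produced the vector
    from_previous_model: bool  # explicit tag for the retriever


def lookup(
    rows: dict[CacheKey, list[float]],
    key: CacheKey,
    previous_model_id: Optional[str] = None,
) -> Optional[Served]:
    """Serve the requested row, else a tagged row from the previous model."""
    vector = rows.get(key)
    if vector is not None:
        return Served(vector, key.model_id, from_previous_model=False)
    if previous_model_id is not None:
        # Same content and chunker, previous model: the fallback query.
        old = rows.get(replace(key, model_id=previous_model_id))
        if old is not None:
            return Served(old, previous_model_id, from_previous_model=True)
    return None
```

Passing previous_model_id=None turns the fallback off, which is one way to implement the kill switch once the backfill is mostly done.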
This is less elegant than “just turn on SWR” — but it survives contact with a model migration, which the elegant version doesn’t.
The deeper point: HTTP-shaped caching patterns transplant poorly into embedding workloads because embeddings have an extra dimension, the model, that HTTP cache entries don’t have. Most cache-design intuition treats freshness as a one-dimensional time axis. For embeddings, freshness is a two-dimensional space (time × model_version), and only the model axis deserves invalidation. Get that one fact right and most of the corner cases collapse.
Written for the EmbedCache project by Skelf Research. Found a mistake? Open an issue.