Content-hash keying vs LRU: which actually saves money

There is a quiet, recurring argument in pipeline reviews about how to cache embeddings. One side says: hash the input bytes, key the cache on the hash, never evict. The other side says: use an LRU with a sensible ceiling, evict cold entries, keep the working set small. Both sides will quote hit ratios at you. Both sides are sometimes right.

This is a note on which one to pick, and why the answer is almost always “content-hash, no eviction” for the workloads people actually have.

What each strategy is actually doing

A content-hash cache keys entries on sha256(normalized_text) || model_id. The store is keyed by content. Two callers asking for the same embedding hit the same row regardless of who asked first or when. Eviction is optional and usually not needed; if the corpus is stable, the cache plateaus at “one entry per unique chunk in the corpus” and stays there forever.
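A minimal sketch of that key derivation. The normalization step is an assumption (Unicode NFC plus whitespace collapsing; this note does not prescribe one), and the names normalize and cache_key are hypothetical, not any library's API.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    """Assumed normalization: NFC form, whitespace runs collapsed, ends trimmed."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def cache_key(text: str, model_id: str) -> str:
    """Hex SHA-256 of the normalized text, joined with the model id."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    # Including model_id keeps vectors from different models apart.
    return f"{digest}:{model_id}"
```

Whatever normalization you choose, it has to be deterministic across machines and locales; otherwise identical chunks hash to different keys and the cache silently misses.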

An LRU cache keys entries on whatever the caller hands it — often the raw text, sometimes a normalized variant. The store is bounded by entry count or byte budget. When the budget is exceeded, the least-recently-used entry is dropped. Two callers asking for the same embedding hit the same row only if the second caller arrived before the first one’s entry got evicted.

These are structurally different caches. The content-hash one remembers. The LRU one summarizes recent traffic.

The case for LRU (when it actually fits)

LRU is the right answer for one specific shape: chat-style workloads where users embed novel inputs constantly and the hot set is small. Think: a semantic-search box where most queries are typos of “pricing,” “login,” and “where is my order.” You want the hot 200 queries memoized, you don’t care about the long tail, and you don’t want the cache to grow to gigabytes because someone pasted a novel in.

LRU is also the right answer when your storage budget is genuinely constrained — say, you’re running a sidecar with 50MB of RSS — and the queries-per-minute rate is high enough that LRU’s natural temperature ranking will keep the working set in cache.

Both of those conditions are uncommon in document-embedding pipelines.

Why content-hash usually wins for documents

Document corpora — the thing most RAG pipelines embed — have two properties LRU doesn’t reward:

Stability. A given chunk’s bytes don’t change between ingests. If you embedded chunk_id=4711 on Monday, re-embedding the same bytes with the same model on Friday gives you the same vector, up to floating-point noise. Caching by content hash means the second embedding is free forever, not just until LRU evicts it.
Bursty access. Ingest workloads are not steady-state. You don’t re-embed a chunk every minute — you re-embed it once at ingest, and then maybe not for a week. LRU’s recency signal is exactly wrong here; the chunk you embedded a week ago is the one most likely to be re-requested when the next ingest runs, but LRU will have aged it out.

Concretely: a documentation corpus of 12,000 chunks, re-indexed weekly. Each ingest touches every chunk exactly once. An LRU with a budget of 1,000 entries will have a cache hit ratio of approximately zero across ingest runs — every chunk you ask for was evicted to make room for the 1,000 chunks that came after it during the previous ingest. A content-hash store with no eviction has a hit ratio of approximately 100% from the second ingest onward. The difference is the entire embedding API bill for that corpus.
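The arithmetic above can be checked with a small simulation. This is a sketch under one stated assumption: each ingest visits the 12,000 chunks in the same order. The function names are hypothetical.

```python
from collections import OrderedDict


def lru_hits(order, capacity, cycles=2):
    """Return hits per ingest cycle for an LRU holding `capacity` keys."""
    cache = OrderedDict()
    hits_per_cycle = []
    for _ in range(cycles):
        hits = 0
        for key in order:
            if key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
        hits_per_cycle.append(hits)
    return hits_per_cycle


def keep_all_hits(order, cycles=2):
    """Return hits per ingest cycle for a store that never evicts."""
    seen = set()
    hits_per_cycle = []
    for _ in range(cycles):
        hits = 0
        for key in order:
            if key in seen:
                hits += 1
            else:
                seen.add(key)
        hits_per_cycle.append(hits)
    return hits_per_cycle


chunks = range(12_000)
lru_result = lru_hits(chunks, capacity=1_000)   # [0, 0]
keep_result = keep_all_hits(chunks)             # [0, 12000]
```

If the ingest order changes between runs, the LRU can recover some hits, but never more than its capacity per cycle: each chunk is requested once per ingest, so only entries left over from the previous run can hit. That caps it at 1,000 of 12,000 requests.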

“But LRU is what every cache library defaults to”

It is. That is because the people writing cache libraries are thinking about HTTP response caches, JSON-RPC memoization, and template renders: workloads with an open-ended key space and recency-correlated access. Embedding inputs in a document pipeline have a bounded key space that is swept cyclically (every chunk appears exactly once per ingest) and are recency-anticorrelated (the chunk you embedded longest ago is the one you’ll re-embed next).

This is one of the rare places where the default cache strategy in your stdlib is actively wrong for the workload.

The hybrid that sometimes makes sense

There is a middle position: content-hash keying with a size-bounded eviction policy, where eviction kicks in only above a generous ceiling. This is what you want if you have mixed traffic — a primary document corpus plus an open-ended user-query stream. The document chunks accumulate and never get evicted (their count is bounded by the corpus). The user queries grow unbounded until they hit the ceiling, at which point the oldest are dropped.

Implemented naively, this is “content-hash key + LRU eviction at N entries.” Implemented well, you give the document chunks a permanence flag so the LRU never touches them. EmbedCache’s SQLite-backed store is this shape by default — every entry persists; eviction is something you opt into, not the default.
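A rough sketch of the permanence-flag idea, kept in memory for brevity. This is not EmbedCache's API; the class name PinnedLRU and its parameters are hypothetical, and a real version would sit on a persistent store.

```python
from collections import OrderedDict


class PinnedLRU:
    """Content-keyed store: pinned entries are permanent, the rest are LRU-bounded."""

    def __init__(self, max_unpinned: int):
        self.max_unpinned = max_unpinned
        self._pinned = {}                # document chunks, never evicted
        self._unpinned = OrderedDict()   # user queries, oldest first

    def get(self, key):
        if key in self._pinned:
            return self._pinned[key]
        if key in self._unpinned:
            self._unpinned.move_to_end(key)  # refresh recency
            return self._unpinned[key]
        return None

    def put(self, key, vector, pinned=False):
        if pinned:
            self._unpinned.pop(key, None)  # promote if it was unpinned
            self._pinned[key] = vector
            return
        if key in self._pinned:
            return  # never demote a pinned entry
        self._unpinned[key] = vector
        self._unpinned.move_to_end(key)
        while len(self._unpinned) > self.max_unpinned:
            self._unpinned.popitem(last=False)  # drop least recently used
```

The ingest path would call put(key, vector, pinned=True); the query path would leave pinned at its default, so only query entries ever compete for the bounded space.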

Things that look like caching but aren’t

A few patterns get mistaken for caches and cause confusion when their hit ratios are catastrophic:

In-memory dicts that don’t survive restart. These are warmups. Useful at process startup, useless across deploys.
Vector stores used as caches. Qdrant, pgvector, et al. store vectors keyed by document ID. If you have to re-embed before you can query the vector store, the vector store is not your cache. The cache lives in front of the embedder.
HTTP response caching on the embedding endpoint. Hosted embedders usually disable response caching at the CDN layer; even if they didn’t, embedding requests are POSTs with your text in the body, which a generic edge cache will not cache by default.

The cache you want is in your process, keyed by your content hash, and backed by a store that survives a kill -9. Everything else is adjacent.
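One way such a store could look, using only the Python standard library. This is a sketch, not EmbedCache's implementation: the class name, table layout, and JSON encoding of vectors are all assumptions chosen for readability.

```python
import json
import sqlite3


class SqliteEmbeddingCache:
    """Persistent key -> vector store with no eviction."""

    def __init__(self, path: str):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            "key TEXT PRIMARY KEY, vector TEXT NOT NULL)"
        )
        self.conn.commit()

    def get(self, key: str):
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def put(self, key: str, vector) -> None:
        # The `with` block commits the transaction on success,
        # so the row is committed before put() returns.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO embeddings (key, vector) VALUES (?, ?)",
                (key, json.dumps(vector)),
            )

    def get_or_embed(self, key: str, embed):
        """Return the cached vector, calling `embed()` only on a miss."""
        cached = self.get(key)
        if cached is not None:
            return cached
        vector = embed()
        self.put(key, vector)
        return vector

    def close(self) -> None:
        self.conn.close()
```

Here key is the content-hash key and embed is whatever callable reaches your embedder. JSON is a convenient but bulky encoding; a packed binary column would be smaller.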

What to actually measure

The single useful number is hit ratio over a full ingest cycle. Pick the window that matches your re-ingest cadence — daily, weekly, on-CI — and measure how many embedding requests resolve from cache vs. go to the embedder over that window.
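A small sketch of how that number could be collected: one counter per ingest cycle, incremented at the point where the cache is consulted. HitCounter and embed_with_stats are hypothetical names, and the cache argument is any dict-like store (in practice a persistent one, not a bare dict).

```python
from dataclasses import dataclass


@dataclass
class HitCounter:
    hits: int = 0
    misses: int = 0

    @property
    def ratio(self) -> float:
        """Hits divided by total lookups; 0.0 when nothing was looked up."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def embed_with_stats(key, cache, embed, stats):
    """Look up `key` in a dict-like cache, calling `embed()` only on a miss."""
    if key in cache:
        stats.hits += 1
        return cache[key]
    stats.misses += 1
    cache[key] = embed()
    return cache[key]
```

Create a fresh HitCounter at the start of each cycle and log its ratio at the end, so each reading covers one full re-ingest window.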

For a document corpus on a content-hash cache: this number should approach 100% on every cycle after the first one. If it doesn’t, your key derivation is non-deterministic (locale-dependent string normalization is the usual culprit) or your store isn’t actually persisting.

For a chat corpus on an LRU: this number should track the traffic share of your hot set. If your top 200 queries account for 60% of traffic and your LRU holds 200 entries, expect a hit ratio approaching 60%, a little lower in practice because long-tail queries briefly displace hot ones.

Anything else is a cache that isn’t doing the thing the architecture diagram claims it does. Most pipelines are in that state.

The fix is the boring one: pick the strategy that matches your access pattern, write down the cache key explicitly, and stop pretending an unbounded dict in a gunicorn worker counts.

A practical migration path

If you have an existing LRU cache in front of an embedder and you’re debating switching to content-hash with no eviction, the migration is mostly mechanical. If the LRU already keys on a hash of the normalized text, the key derivation carries over unchanged and you’re changing the store, not the key; if it keys on raw text, add the normalize-and-hash step first. Move from your in-process LRU to a SQLite-backed (or Redis-backed) content-hash store. Drop the eviction policy. Watch the hit ratio climb on the second ingest cycle.

The one place this gets uncomfortable is storage budget. An LRU is self-limiting; a no-eviction store grows monotonically. For most document corpora this is fine — the corpus is bounded, the cache is bounded. For unbounded user-generated content, you do need an upper ceiling. Set one generously (say, 10× your expected steady-state size) and revisit it quarterly, rather than letting LRU evict your warmest entries every Tuesday.

A second consideration: cache invalidation when the corpus itself changes. A content-hash cache doesn’t know that a document was deleted; the orphaned vectors will sit there forever unless you sweep them. The sweep is cheap — walk your current corpus, build a set of expected content hashes, drop everything in the cache that isn’t in the set — and it can run nightly. Most teams never bother and the cache just grows; that’s fine until it isn’t.
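The sweep described above is a set difference. A minimal sketch, assuming the cache behaves like a mutable mapping from content-hash key to vector; sweep_orphans is a hypothetical name.

```python
def sweep_orphans(cache, current_keys) -> int:
    """Delete cache entries whose key is not in the current corpus.

    Returns the number of entries removed.
    """
    expected = set(current_keys)
    # Collect first, then delete, so the mapping is not mutated mid-iteration.
    orphans = [key for key in cache if key not in expected]
    for key in orphans:
        del cache[key]
    return len(orphans)
```

Run it only after a full corpus walk has completed; a partial list of expected keys would delete entries that are still live.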

The deeper point raised by all of this: caches are usually treated as an implementation detail, but for embedding workloads the cache is the architecture. The embedder is a function; the cache is the state. Picking the wrong cache shape is picking the wrong state model. The content-hash-with-no-eviction shape models the embedder as a pure function over a stable corpus. The LRU shape models it as a hot-path optimizer over a stream of novel inputs. Both are real workloads. Most RAG pipelines are the first one. Cache them that way.

Written for the embedcache project by Skelf Research. Found a mistake? Open an issue.
