Embedding API spend is a tax on stable inputs
Most of the budget you burn on hosted embedding APIs pays for vectors you've already computed. A cache turns that line item into a one-time cost.
Pull a month of egress logs from a production RAG pipeline that calls a hosted embedding endpoint and you will usually find the same shape: a burst of unique chunks at the start of the month, then a torrent of repeat traffic for the rest of it. The same chunk hashes, the same model identifier, the same 1536-dimension float vectors coming back. You paid for them the first time. You paid again on every re-ingest. You paid a third time because the worker that re-embeds "modified" documents does not check whether the bytes changed; it checks whether the file’s mtime moved.
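The mtime problem is cheap to fix. A minimal sketch, assuming the pipeline can keep a mapping from path to the last content digest it saw; `needs_reembedding` and the `seen` dictionary are hypothetical names for illustration, not part of any particular indexer:

```python
import hashlib
from pathlib import Path


def content_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reembedding(path: Path, seen: dict[str, str]) -> bool:
    """True only when the file's bytes differ from the digest recorded last time.

    `seen` maps a path string to the digest stored on the previous run. It is a
    stand-in (an assumption) for whatever persistent index the pipeline keeps.
    """
    digest = content_digest(path)
    key = str(path)
    if seen.get(key) == digest:
        return False  # mtime may have moved, but the bytes did not
    seen[key] = digest
    return True
```

A touched-but-unchanged file returns False here, which is the case an mtime check gets wrong.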
This is the tax on stable inputs. The embedding API does not know that your content barely changes. Your pipeline does not tell it. So it charges you, correctly from its own perspective, every time you ask.
How big is the tax?
The number is embarrassingly large on most pipelines that haven’t been audited for it. A representative shape: a documentation site with 12,000 chunks at 400 tokens each. Re-embedded weekly because the build pipeline re-runs the indexer on every deploy. At a typical hosted price of $0.02 per million input tokens, that’s 12,000 × 400 × 52 ≈ 250M tokens per year, roughly $5 per year for that one corpus. Trivial.
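The arithmetic above is easy to rerun against your own numbers. A small sketch; the function name is made up for illustration, and the price is the same illustrative $0.02 per million tokens, not a quote from any vendor:

```python
def annual_embedding_cost(
    chunks: int,
    tokens_per_chunk: int,
    runs_per_year: int,
    usd_per_million_tokens: float,
) -> tuple[int, float]:
    """Return (tokens per year, USD per year) when every run re-embeds every chunk."""
    tokens = chunks * tokens_per_chunk * runs_per_year
    return tokens, tokens / 1_000_000 * usd_per_million_tokens


# The documentation-site example: 12,000 chunks, 400 tokens each, weekly runs.
tokens, usd = annual_embedding_cost(12_000, 400, 52, 0.02)
print(f"{tokens:,} tokens/year, ${usd:.2f}/year")
# prints: 249,600,000 tokens/year, $4.99/year
```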
Multiply by a few hundred corpora, multiply by a team that re-ingests on every CI run instead of every content change, multiply by the moment someone adds a second embedding model “for comparison” — and the tax crosses into actually-noticeable territory. The teams who notice are usually the ones running a backfill against an OpenAPI corpus or a code search index where chunk counts go into the tens of millions.
The other tax, the one no invoice tells you about, is latency. A cold embedding endpoint at 200ms per request, behind the eight-thread pool you cap yourself at to stay under the rate limit, is a backfill that runs overnight. A cache hit at 3ms is a backfill that takes minutes.
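To put rough numbers on that, here is a back-of-envelope sketch. The one-million-chunk corpus is a hypothetical figure chosen for illustration, and the model assumes one chunk per request with eight workers kept busy in both cases:

```python
def backfill_seconds(requests: int, seconds_per_request: float, workers: int) -> float:
    """Wall-clock estimate, assuming the workers stay evenly loaded."""
    return requests * seconds_per_request / workers


CHUNKS = 1_000_000  # hypothetical corpus size, not a figure from any real pipeline

cold = backfill_seconds(CHUNKS, 0.200, 8)  # every request goes to the hosted endpoint
warm = backfill_seconds(CHUNKS, 0.003, 8)  # every request is a cache hit

print(f"cold: {cold / 3600:.1f} hours")
print(f"warm: {warm / 60:.1f} minutes")
```

Under those assumptions the cold path comes out at roughly seven hours and the warm path at roughly six minutes.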
Why the cache rarely exists
There is a sociological reason embedding caches don’t get built. The endpoint feels stateless from the caller’s side — text in, vector out — and the natural place to memoize the call is “inside the embedder,” which nobody owns. The embedding client library is a thin HTTP wrapper. The pipeline framework is a DAG of calls. The vector store wants vectors, not raw text. No layer in the stack has both halves of the cache key.
So everyone adds a TODO and ships.
The cache key is not subtle. It is hash(text_bytes) || model_id || maybe(preprocessing_version). A SHA-256 of the chunk text, concatenated with a short model identifier, optionally tagged with a version for the tokenizer or any normalization step that came before the embedder. Two fields, possibly three. The store can be a hash table, a Redis instance, a SQLite file. EmbedCache picks the SQLite file because it survives a restart with no orchestration and because the vector blob fits cleanly into a BLOB column.
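A minimal sketch of that key and a SQLite-backed store, using only the standard library. This is an illustration of the idea, not EmbedCache's actual schema or API: the table layout, the `|` separator, the float32 packing, and the names `cache_key` and `SqliteVectorCache` are all assumptions.

```python
import hashlib
import sqlite3
import struct
from typing import Optional


def cache_key(text: str, model_id: str, prep_version: str = "") -> str:
    """hash(text_bytes) || model_id || optional preprocessing version."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Assumes model_id and prep_version do not themselves contain "|".
    return "|".join((digest, model_id, prep_version))


class SqliteVectorCache:
    """Vectors stored as little-endian float32 blobs, keyed by cache_key()."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors ("
            "key TEXT PRIMARY KEY, vec BLOB NOT NULL)"
        )

    def get(self, key: str) -> Optional[list[float]]:
        row = self.db.execute(
            "SELECT vec FROM vectors WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        blob = row[0]
        return list(struct.unpack(f"<{len(blob) // 4}f", blob))

    def put(self, key: str, vector: list[float]) -> None:
        blob = struct.pack(f"<{len(vector)}f", *vector)
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO vectors (key, vec) VALUES (?, ?)",
                (key, blob),
            )
```

Because the file is the store, a second process that opens the same path sees every vector the first one wrote.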
The honest part
This argument is sometimes pitched as “stop paying OpenAI.” That framing is half-right. The other half is that hosted embedding APIs are excellent at what they do: they’re the lowest-effort way to ship a v1, and the quality of text-embedding-3-large is genuinely hard to match on-box for some retrieval tasks. The case for caching them is not that you should stop using them. It’s that you should stop re-calling them for inputs they’ve already seen.
A cache wrapper around a hosted embedder is one architecture. The LangChain CacheBackedEmbeddings pattern is exactly this: keep the hosted client, put a LocalFileStore in front. That’s the right call when (a) you have a stable hosted model whose quality you trust, and (b) your bottleneck is repeat cost rather than throughput.
EmbedCache makes the other bet: replace the hosted call entirely with a locally-run FastEmbed model and cache the local result. This is the right call when (a) your cost or rate-limit pain is bad enough that removing the network hop is the cleanest fix, (b) your retrieval quality target is met by a BGE or MiniLM at the dimensions you can afford, and (c) you’d rather own a 200–800 MB ONNX model in RAM than a rate-limit ticket with vendor support.
Both are valid. They are different products. The thing to avoid is having neither — which is the default state of most pipelines today.
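Either bet can sit behind the same thin wrapper. The sketch below is library-agnostic: `embed_fn` is a hypothetical callable standing in for a hosted client or a local model, `CachedEmbedder` is an illustrative name, and none of this is LangChain's or EmbedCache's actual API. The store is any mutable mapping; a plain dict works for a demo, but a persistent mapping is what you would want in practice.

```python
import hashlib
from typing import Callable, MutableMapping, Sequence

Vector = list[float]
EmbedFn = Callable[[Sequence[str]], list[Vector]]


class CachedEmbedder:
    """Send only cache misses to the underlying embedder."""

    def __init__(
        self, embed_fn: EmbedFn, model_id: str, store: MutableMapping[str, Vector]
    ) -> None:
        self.embed_fn = embed_fn
        self.model_id = model_id
        self.store = store

    def _key(self, text: str) -> str:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return f"{self.model_id}:{digest}"

    def embed(self, texts: Sequence[str]) -> list[Vector]:
        keys = [self._key(t) for t in texts]
        # Collect each missing text once, preserving first-seen order.
        missing: dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self.store and key not in missing:
                missing[key] = text
        if missing:
            vectors = self.embed_fn(list(missing.values()))
            if len(vectors) != len(missing):
                raise ValueError("embedder returned the wrong number of vectors")
            for key, vector in zip(missing, vectors):
                self.store[key] = vector
        return [self.store[k] for k in keys]
```

Duplicates within a single batch are also collapsed, so a batch of three texts with one repeat costs two embeddings, not three.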
What “good enough” looks like
The bar for an embedding cache to pay for itself is low. You need:
- A stable, deterministic cache key. SHA-256 of the normalized text plus the model identifier is enough. Beware of locale-dependent string normalization that subtly shifts under your feet.
- A store that survives process restart. In-memory dictionaries are not caches; they are warmups.
- An eviction policy that matches your access pattern. For document corpora that don’t churn, you don’t need eviction at all — you need a drop policy keyed to “this corpus was deleted.” For chat histories or user-generated content, LRU with a generous ceiling is fine.
- A way to namespace by model. When you A/B a new embedder, you do not want the old vectors served up as cache hits. The model identifier belongs in the key, not in a sidecar table.
Notably absent from that list: anything exotic. No probabilistic data structures, no Bloom filters, no custom compression. At float32, a vector is about 1.5KB at 384 dimensions and 6KB at 1,536. SQLite will handle a hundred million of those without making you think.
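The first and last items on that list fit in a few lines. A sketch, assuming Unicode NFC plus newline normalization is the only preprocessing you do; `normalize_text` and `namespaced_key` are illustrative names, and the model identifiers are example strings:

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Locale-independent normalization: NFC form, LF newlines, trimmed ends."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.strip()


def namespaced_key(text: str, model_id: str) -> str:
    """The model identifier is part of the key, so models never share hits."""
    digest = hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
    return f"{model_id}:{digest}"


# "é" as one code point and as "e" plus a combining accent hash identically.
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(namespaced_key(composed, "bge-small") == namespaced_key(decomposed, "bge-small"))

# The same text under two different models yields two different keys.
print(namespaced_key(composed, "bge-small") == namespaced_key(composed, "text-embedding-3-large"))
```

The first comparison prints True and the second prints False. If the normalization ever changes, bump a preprocessing version in the key rather than reusing old entries.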
The thing you actually want to measure
If you take one number away from this: measure your cache hit ratio on embedding calls, separately from the rest of your pipeline metrics. A production RAG service that re-indexes weekly should be running at >95% embedding cache hit ratio in steady state. If yours is at 0% — and it probably is, because nobody wired the cache — that gap is the tax. Most of it is recoverable on a Tuesday afternoon.
The remaining 5% is the new content that genuinely needs to be embedded. That’s the line item your monthly invoice should reflect. Anything above it is your pipeline failing to remember what it already knew.
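Measuring the ratio needs nothing more than two counters around the cache lookup. A minimal sketch; `CacheStats` is a hypothetical helper, and in a real service the counts would likely feed whatever metrics system you already run:

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    """Counts embedding-cache lookups so the hit ratio can be reported."""

    hits: int = 0
    misses: int = 0

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


stats = CacheStats()
for was_hit in [True] * 95 + [False] * 5:
    stats.record(was_hit)
print(f"embedding cache hit ratio: {stats.hit_ratio:.0%}")
# prints: embedding cache hit ratio: 95%
```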
A small worked example
Take a 500-page documentation site, chunked at ~400 tokens per chunk with 50-token overlap. That’s roughly 1,800 chunks. The indexer is a CI job that runs on every merge to main (say 30 merges in a busy week) and does not dedupe across runs. That’s 30 × 1,800 = 54,000 embedding calls per week, against the same 1,800 unique inputs. Share of calls that did new work: 1,800 / 54,000 ≈ 3.3%. Repeat-cost ratio: 96.7%.
Now add a cache. The first run computes 1,800 embeddings. Every subsequent run hits the cache for each chunk that has not changed. New embedding calls per week, in steady state, equal the number of chunks that genuinely changed, usually a few dozen on a typical doc site. Cache hit ratio: 98%+. The savings come from eliminating wasted work, not from a clever algorithm.
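The worked example can be replayed as a toy simulation. The assumptions are loud ones: one chunk changes between consecutive runs (a stand-in for "a few dozen" per week), chunk identity is modeled as an (index, version) pair rather than a real content hash, and the week starts with a cold cache.

```python
def simulate(runs: int, chunks: int, changed_per_run: int, use_cache: bool) -> int:
    """Return the number of embedding calls made across `runs` indexer runs."""
    versions = [0] * chunks
    cache: set[tuple[int, int]] = set()
    calls = 0
    for run in range(runs):
        if run > 0:
            # Edit a few chunks between runs by bumping their version.
            for i in range(changed_per_run):
                idx = (run * changed_per_run + i) % chunks
                versions[idx] += 1
        for idx, version in enumerate(versions):
            key = (idx, version)
            if use_cache and key in cache:
                continue  # cache hit: no embedding call
            calls += 1
            cache.add(key)
    return calls


without_cache = simulate(runs=30, chunks=1_800, changed_per_run=1, use_cache=False)
with_cache = simulate(runs=30, chunks=1_800, changed_per_run=1, use_cache=True)
print(without_cache, with_cache)
# prints: 54000 1829
```

Even counting the cold first run, the cached week makes 1,829 calls instead of 54,000. In later weeks the 1,800-call warmup disappears and only the edits remain.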
The math gets more interesting at higher chunk counts and tighter re-ingest cadences. A code-search index over a 200k-file monorepo re-indexed on every merge can easily run into double-digit percentages of total compute budget being spent on inputs that haven’t changed. For those workloads, the cache stops being a nice-to-have and starts being a load-bearing component of the pipeline.
The cost of not caching is mostly invisible
The last awkward fact: the embedding bill is rarely large enough on its own to trigger a review. It’s a line item, it’s recurring, it grows slowly. The review that catches it is usually a broader “why is our AI infra spend climbing?” audit, where the embedding line shows up as ten or twenty percent of total cost and turns out to be trivially reducible. The pattern repeats across teams: the cache is not built because nobody had a clear ownership boundary for it; it gets built six months later when someone with a finance background runs the numbers. Building it earlier costs an afternoon. Building it later costs the difference between what you spent and what you should have spent — which, depending on the corpus, is the price of a mid-range laptop per quarter.
Written for the EmbedCache project by Skelf Research. Found a mistake? Open an issue.