note · 2026-05-12

Embedding API spend is a tax on stable inputs

Most of the budget you burn on hosted embedding APIs pays for vectors you've already computed. A cache turns that line item into a one-time cost.

#cost #rag #caching

Pull a month of egress logs from any production RAG pipeline that calls a hosted embedding endpoint and you will find the same shape: a burst of unique chunks at the start of the month, then a torrent of repeat traffic for the rest of it. The same chunk hashes, the same model identifier, the same 1536-dimension float vectors coming back. You paid for them the first time. You paid again on every re-ingest. You paid a third time because the worker that re-embeds "modified" documents does not check whether the bytes changed; it checks whether the file’s mtime moved.
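The mtime problem is cheap to avoid. A minimal sketch, assuming documents live on disk and that a simple path-to-digest manifest is acceptable (the `seen` dict and both function names are hypothetical, chosen for illustration): compare a SHA-256 of the bytes rather than the timestamp.

```python
import hashlib
import os
import tempfile
import time
from pathlib import Path
from typing import Dict, List


def content_digest(path: Path) -> str:
    """SHA-256 of the file's bytes; unaffected by mtime changes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reembedding(path: Path, seen: Dict[str, str]) -> bool:
    """Return True only when the bytes differ from the last recorded digest.

    `seen` is a hypothetical in-memory manifest mapping path -> digest.
    """
    digest = content_digest(path)
    key = str(path)
    if seen.get(key) == digest:
        return False
    seen[key] = digest
    return True


def demo() -> List[bool]:
    seen: Dict[str, str] = {}
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "guide.md"
        doc.write_text("Install with pip.", encoding="utf-8")
        first = needs_reembedding(doc, seen)      # new file: embed it
        future = time.time() + 3600
        os.utime(doc, (future, future))           # touch: mtime moves, bytes do not
        touched = needs_reembedding(doc, seen)    # same bytes: skip
        doc.write_text("Install with uv.", encoding="utf-8")
        edited = needs_reembedding(doc, seen)     # real edit: embed again
    return [first, touched, edited]
```

In practice the manifest would be persisted between runs; the in-memory dict here only keeps the sketch short.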

This is the tax on stable inputs. The embedding API does not know that your content barely changes. Your pipeline does not tell it. So it charges you, correctly from its own perspective, every time you ask.

How big is the tax?

The number is embarrassingly large on most pipelines that haven’t been audited for it. A representative shape: a documentation site with 12,000 chunks at 400 tokens each. Re-embedded weekly because the build pipeline re-runs the indexer on every deploy. At a typical hosted price of $0.02 per million input tokens, that’s 12,000 × 400 × 52 ≈ 250M tokens per year, roughly $5 per year for that one corpus. Trivial.
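The arithmetic above, written out so the inputs can be swapped for your own corpus. The function name is illustrative; the figures are the ones from the paragraph.

```python
from typing import Tuple


def annual_embedding_cost(
    chunks: int,
    tokens_per_chunk: int,
    runs_per_year: int,
    usd_per_million_tokens: float,
) -> Tuple[int, float]:
    """Return (tokens embedded per year, cost in USD) when every run re-embeds everything."""
    tokens = chunks * tokens_per_chunk * runs_per_year
    return tokens, tokens / 1_000_000 * usd_per_million_tokens


# 12,000 chunks x 400 tokens, re-embedded 52 times a year at $0.02 per million tokens.
tokens, usd = annual_embedding_cost(12_000, 400, 52, 0.02)
print(f"{tokens:,} tokens/year, ${usd:.2f}/year")
```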

Multiply by a few hundred corpora, multiply by a team that re-ingests on every CI run instead of every content change, multiply by the moment someone adds a second embedding model “for comparison” — and the tax crosses into actually-noticeable territory. The teams who notice are usually the ones running a backfill against an OpenAPI corpus or a code search index where chunk counts go into the tens of millions.

The other tax, the one no invoice tells you about, is latency. A cold embedding endpoint at 200ms per request, called from the eight-thread pool you are capped at to stay under the rate limit, is a backfill that runs overnight. A cache hit at 3ms is a backfill that takes minutes.
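A back-of-envelope version of that latency claim, as a sketch. It assumes one chunk per request, work spread evenly over the workers, and a corpus of 1.5 million chunks; the corpus size is a hypothetical figure chosen for illustration and does not come from any measured pipeline.

```python
def backfill_seconds(requests: int, latency_s: float, workers: int) -> float:
    """Wall-clock estimate: requests split evenly over `workers`, each worker serial."""
    return requests * latency_s / workers


CHUNKS = 1_500_000  # hypothetical corpus size (assumption, not from the note)
WORKERS = 8         # the eight-thread pool from the text

cold_hours = backfill_seconds(CHUNKS, 0.200, WORKERS) / 3600   # about 10.4 hours
warm_minutes = backfill_seconds(CHUNKS, 0.003, WORKERS) / 60   # about 9.4 minutes
print(f"cold: {cold_hours:.1f} h, cached: {warm_minutes:.1f} min")
```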

Why the cache rarely exists

There is a sociological reason embedding caches don’t get built. The endpoint feels stateless from the caller’s side — text in, vector out — and the natural place to memoize the call is “inside the embedder,” which nobody owns. The embedding client library is a thin HTTP wrapper. The pipeline framework is a DAG of calls. The vector store wants vectors, not raw text. No layer in the stack has both halves of the cache key.

So everyone adds a TODO and ships.

The cache key is not subtle. It is hash(text_bytes) || model_id || maybe(preprocessing_version). A SHA-256 of the chunk text, concatenated with a short model identifier, optionally tagged with a version for the tokenizer or any normalization step that came before the embedder. Two fields, possibly three. The store can be a hash table, a Redis instance, a SQLite file. EmbedCache picks the SQLite file because it survives a restart with no orchestration and because the vector blob fits cleanly into a BLOB column.
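A minimal sketch of that key and a SQLite-backed store, using only the standard library. This is not EmbedCache's actual schema or API: the table layout, the class and function names, and the float32 packing are all assumptions made for illustration. The key format also assumes identifiers do not contain the `:` separator.

```python
import hashlib
import sqlite3
import struct
from typing import Callable, List, Optional, Sequence


def cache_key(text: str, model_id: str, prep_version: str = "") -> str:
    """hash(text_bytes) || model_id || optional preprocessing version."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Assumes model_id and prep_version contain no ':' (illustrative format).
    return f"{digest}:{model_id}:{prep_version}"


class SqliteEmbeddingCache:
    """Hypothetical store: one row per key, vector packed as float32 in a BLOB."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS embeddings ("
            "key TEXT PRIMARY KEY, dim INTEGER NOT NULL, vec BLOB NOT NULL)"
        )

    def get(self, key: str) -> Optional[List[float]]:
        row = self.db.execute(
            "SELECT dim, vec FROM embeddings WHERE key = ?", (key,)
        ).fetchone()
        if row is None:
            return None
        dim, blob = row
        return list(struct.unpack(f"<{dim}f", blob))

    def put(self, key: str, vector: Sequence[float]) -> None:
        blob = struct.pack(f"<{len(vector)}f", *vector)  # values stored at float32 precision
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO embeddings (key, dim, vec) VALUES (?, ?, ?)",
                (key, len(vector), blob),
            )


def embed_cached(
    text: str,
    model_id: str,
    embed_fn: Callable[[str], Sequence[float]],
    cache: SqliteEmbeddingCache,
    prep_version: str = "",
) -> List[float]:
    """Return the cached vector if present; otherwise call embed_fn once and store it."""
    key = cache_key(text, model_id, prep_version)
    hit = cache.get(key)
    if hit is not None:
        return hit
    vector = list(embed_fn(text))
    cache.put(key, vector)
    return vector


def demo():
    calls = []

    def fake_embed(text: str) -> List[float]:  # stand-in for a real embedder
        calls.append(text)
        return [0.5, -0.25, float(len(text))]

    cache = SqliteEmbeddingCache()
    a = embed_cached("hello", "model-a", fake_embed, cache)
    b = embed_cached("hello", "model-a", fake_embed, cache)  # served from SQLite
    embed_cached("hello", "model-b", fake_embed, cache)      # new model id: miss
    return a, b, len(calls)
```

Passing a file path instead of `:memory:` gives the restart-surviving behaviour the paragraph describes.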

The honest part

This argument is sometimes pitched as “stop paying OpenAI.” That framing is half-right. The other half is that hosted embedding APIs are excellent at what they do: they’re the lowest-effort way to ship a v1, and the quality of text-embedding-3-large is genuinely hard to match on-box for some retrieval tasks. The case for caching them is not that you should stop using them. It’s that you should stop calling them again for inputs they’ve already seen.

A cache wrapper around a hosted embedder is one architecture. The LangChain CacheBackedEmbeddings pattern is exactly this — keep the hosted client, put a LocalFileStore in front. That’s the right call when (a) you have a stable hosted model you trust the quality of, and (b) your bottleneck is repeat cost rather than throughput.

EmbedCache makes the other bet: replace the hosted call entirely with a locally-run FastEmbed model and cache the local result. This is the right call when (a) your cost or rate-limit pain is bad enough that removing the network hop is the cleanest fix, (b) your retrieval quality target is met by a BGE or MiniLM at the dimensions you can afford, and (c) you’d rather own a 200–800 MB ONNX model in RAM than a rate-limit ticket with vendor support.

Both are valid. They are different products. The thing to avoid is having neither — which is the default state of most pipelines today.

What “good enough” looks like

The bar for an embedding cache to pay for itself is low. You need:

Notably absent from that list: anything exotic. No probabilistic data structures, no Bloom filters, no custom compression. The vector is a few kilobytes depending on dimension and dtype: about 1.5KB for 384 float32 dimensions, about 6KB for 1536. SQLite will handle a hundred million of those without making you think.

The thing you actually want to measure

If you take one number away from this: measure your cache hit ratio on embedding calls, separately from the rest of your pipeline metrics. A production RAG service that re-indexes weekly should be running at >95% embedding cache hit ratio in steady state. If yours is at 0% — and it probably is, because nobody wired the cache — that gap is the tax. Most of it is recoverable on a Tuesday afternoon.
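One way to get that number is to count hits and misses at the point where the cache is consulted. A minimal sketch with an in-memory store; the class names are hypothetical, and the key is the raw text only for brevity (a real key would also carry the model identifier).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EmbeddingCacheStats:
    hits: int = 0
    misses: int = 0

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class CountingCache:
    """In-memory cache that records every lookup as a hit or a miss."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.stats = EmbeddingCacheStats()

    def embed(self, text: str) -> List[float]:
        if text in self._store:
            self.stats.hits += 1
            return self._store[text]
        self.stats.misses += 1
        vector = self._embed_fn(text)
        self._store[text] = vector
        return vector


def demo() -> EmbeddingCacheStats:
    cache = CountingCache(lambda text: [float(len(text))])  # stand-in embedder
    for _ in range(20):  # 20 indexer runs over the same 5 chunks
        for chunk in ("a", "bb", "ccc", "dddd", "eeeee"):
            cache.embed(chunk)
    return cache.stats
```

Exporting `hits` and `misses` as two counters, rather than the ratio itself, lets a dashboard compute the ratio over any window.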

The remaining 5% is the new content that genuinely needs to be embedded. That’s the line item your monthly invoice should reflect. Anything above it is your pipeline failing to remember what it already knew.

A small worked example

Take a 500-page documentation site, chunked at ~400 tokens per chunk with 50-token overlap. That’s roughly 1,800 chunks. The site is re-indexed by a CI job that runs on every merge to main (say 30 merges in a busy week), and the indexer doesn’t dedupe across runs. That’s 30 × 1,800 = 54,000 embedding calls per week, against the same 1,800 unique inputs. Share of calls that were actually needed: 1,800 / 54,000 ≈ 3.3%. Repeat-cost ratio: 96.7%. The cache hit ratio is 0%, because there is no cache.

Now add a cache. The first run computes 1,800 embeddings. Every subsequent run hits the cache for all 1,800, minus whatever changed since the previous run. New embedding calls per week, in steady state, equals the number of chunks that genuinely changed, usually a few dozen on a typical doc site. Cache hit ratio: 98%+. The savings come from eliminating wasted work, not from a clever algorithm.
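The same example as a small simulation. The 1,800 chunks and 30 runs per week come from the text; the figure of 36 changed chunks in a steady-state week is an assumption standing in for "a few dozen". Chunks are modelled as (id, version) pairs and the cache as a set, which is a simplification of a real content-hash key.

```python
from typing import Dict, Set, Tuple


def run_indexer(corpus: Dict[int, int], cache: Set[Tuple[int, int]]) -> int:
    """Index every chunk, skipping (id, version) pairs already cached; return API calls made."""
    calls = 0
    for item in corpus.items():
        if item not in cache:
            cache.add(item)
            calls += 1
    return calls


def simulate(chunks: int = 1_800, runs: int = 30, changed: int = 36) -> Dict[str, float]:
    corpus = {chunk_id: 0 for chunk_id in range(chunks)}
    cache: Set[Tuple[int, int]] = set()

    # Week 1: cold cache. Only the first run pays; the other runs are all hits.
    cold_week_calls = sum(run_indexer(corpus, cache) for _ in range(runs))

    # Week 2: `changed` chunks get edited (assumed figure), then the same 30 runs.
    for chunk_id in range(changed):
        corpus[chunk_id] += 1
    steady_week_calls = sum(run_indexer(corpus, cache) for _ in range(runs))

    lookups = chunks * runs
    return {
        "uncached_calls_per_week": lookups,
        "needed_share_without_cache": chunks / lookups,
        "cold_week_calls": cold_week_calls,
        "steady_week_calls": steady_week_calls,
        "steady_hit_ratio": 1 - steady_week_calls / lookups,
    }


result = simulate()
```

Under these assumptions the steady-state hit ratio comes out above 99.9%; it drops toward the 98% figure when a few dozen chunks change on every run rather than once a week.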

The math gets more interesting at higher chunk counts and tighter re-ingest cadences. A code-search index over a 200k-file monorepo re-indexed on every merge can easily run into double-digit percentages of total compute budget being spent on inputs that haven’t changed. For those workloads, the cache stops being a nice-to-have and starts being a load-bearing component of the pipeline.

The cost of not caching is mostly invisible

The last awkward fact: the embedding bill is rarely large enough on its own to trigger a review. It’s a line item, it’s recurring, it grows slowly. The review that catches it is usually a broader “why is our AI infra spend climbing?” audit, where the embedding line shows up as ten or twenty percent of total cost and turns out to be trivially reducible. The pattern repeats across teams: the cache is not built because nobody had a clear ownership boundary for it; it gets built six months later when someone with a finance background runs the numbers. Building it earlier costs an afternoon. Building it later costs the difference between what you spent and what you should have spent — which, depending on the corpus, is the price of a mid-range laptop per quarter.


Written for the EmbedCache project by Skelf Research. Found a mistake? Open an issue.
