Stop recomputing embeddings.
Start shipping faster.
EmbedCache generates text embeddings locally using FastEmbed and caches them in SQLite. No API keys, no per-token billing, no rate limits. Use it as a Rust library or run it as a REST service.
$ embedcache
listening on http://127.0.0.1:8081
models loaded: BGESmallENV15, AllMiniLML6V2
cache: ./cache.db (sqlite)
$ curl -X POST localhost:8081/v1/embed \
-d '{"text":["hello world"]}'
{"vectors":[[ -0.021, 0.114, ... ]],
"model":"BGESmallENV15",
"cache":"hit",
"elapsed_ms":3}
22+ models, no network
BGE, MiniLM, Nomic, multilingual E5 — all run on-box via FastEmbed. No OpenAI, no Cohere, no egress bill.
SQLite, survives restart
Process a URL or a string once; repeat requests for the same content and model are served from the cache in single-digit milliseconds, even after a restart.
Library or service
Embed the crate in your Rust app, or run the REST API behind your pipeline. Swagger, ReDoc, RapiDoc, and Scalar API docs are included.
What it is honest about
Yes
- Local inference via fastembed (CPU; ONNX runtime).
- SQLite-backed cache keyed by content + model.
- REST endpoints: /v1/embed, /v1/process, /v1/params.
- Optional LLM-based semantic chunking via Ollama or OpenAI.
- 22+ embedding models from the FastEmbed catalogue.
Not
- A drop-in cache wrapper for OpenAI / Cohere / Voyage. It replaces them.
- A vector database. Pair it with Qdrant, pgvector, LanceDB, etc.
- A managed cloud service. You run it.
- A GPU-required system. CPU is the path; GPU is not advertised.
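What "keyed by content + model" means in practice can be shown with a small sketch. This is not EmbedCache's implementation or API: `cache_key`, `SketchCache`, and `fake_embed` are hypothetical names, an in-memory `HashMap` stands in for the SQLite table, and FNV-1a stands in for whatever content digest the real cache uses.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit hash. Deterministic across runs, so it can stand in for a
/// content digest here; a real cache would likely use a stronger hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical key format: model name plus a hash of the text.
fn cache_key(model: &str, text: &str) -> String {
    format!("{}:{:016x}", model, fnv1a64(text.as_bytes()))
}

/// In-memory stand-in for the SQLite table. `computed` counts how many
/// times the embedding function actually ran.
struct SketchCache {
    store: HashMap<String, Vec<f32>>,
    computed: usize,
}

impl SketchCache {
    fn new() -> Self {
        SketchCache { store: HashMap::new(), computed: 0 }
    }

    /// Return the cached vector for (model, text), or compute and store it.
    fn get_or_compute<F>(&mut self, model: &str, text: &str, embed: F) -> Vec<f32>
    where
        F: FnOnce(&str) -> Vec<f32>,
    {
        let key = cache_key(model, text);
        if let Some(hit) = self.store.get(&key) {
            return hit.clone();
        }
        let vector = embed(text);
        self.computed += 1;
        self.store.insert(key, vector.clone());
        vector
    }
}

/// Placeholder for model inference: returns a one-element "embedding".
fn fake_embed(text: &str) -> Vec<f32> {
    vec![text.len() as f32]
}

fn main() {
    let mut cache = SketchCache::new();
    cache.get_or_compute("BGESmallENV15", "hello world", fake_embed); // miss
    cache.get_or_compute("BGESmallENV15", "hello world", fake_embed); // hit
    cache.get_or_compute("AllMiniLML6V2", "hello world", fake_embed); // miss: other model
    println!("model runs: {}", cache.computed); // prints "model runs: 2"
}
```

Because the model name is part of the key, the same text embedded with a different model is a separate entry, so switching models never returns a vector from the wrong embedding space.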
Recent notes
- 2026-05-12 Embedding API spend is a tax on stable inputs
Most of the budget you burn on hosted embedding APIs pays for vectors you've already computed. A cache turns that line item into a one-time cost.
- 2026-04-08 Content-hash keying vs LRU: which actually saves money
Two cache strategies for embeddings — one keyed on content, one keyed on access recency. They optimize for different things, and only one of them is correct for stable corpora.
- 2026-03-03 Stale-while-revalidate for embeddings: the corner cases
SWR works well for HTTP, less obviously for embeddings. A pragmatic walk through when it helps, when it bites, and what to do about model migrations.