v0.1 — Rust · GPL-3.0 · on crates.io

Stop recomputing embeddings.

embedcache generates text embeddings locally with FastEmbed and caches them in SQLite keyed by content hash and model. No hosted API, no per-token billing, no rate limits. Use it as a Rust library or run the REST service.



embedcache — REST service

$ cargo install embedcache
$ embedcache
listening on http://127.0.0.1:8081
models loaded: BGESmallENV15, AllMiniLML6V2
cache: ./cache.db (sqlite)

$ curl -X POST localhost:8081/v1/embed \
    -H "content-type: application/json" \
    -d '{"text":["hello world"]}'
{
  "model": "BGESmallENV15",
  "cache": ["miss"],
  "vectors": [[ -0.021, 0.114, ... ]]
}

What is embedcache?

embedcache is a local embedding generator with a content-hash cache, written in Rust and built on the FastEmbed crate. It owns the embedder rather than memoizing a hosted one: run the model on your own hardware and store each vector in SQLite so you never pay to compute it twice.

What it is

A Rust library and REST API that runs embedding models locally via FastEmbed.
A SQLite cache keyed by content hash + model — the same input is embedded once.
22+ ONNX models (BGE, MiniLM, Nomic, multilingual E5) that run on CPU.
Optional LLM-driven semantic chunking through Ollama or OpenAI.

What it is not

A caching proxy in front of OpenAI/Cohere/Voyage — it replaces the embedder.
A vector database — pair it with Qdrant, pgvector, LanceDB, or Milvus.
A managed cloud service — you install it and run it yourself.
A GPU-required system — CPU via ONNX runtime is the primary path.

The problems embedcache solves

Most of the pain in an embedding pipeline is not the model — it is paying for, waiting on, and leaking to a hosted embedder. embedcache moves that work on-box.

💸

Stable inputs get billed repeatedly

The problem

The same document chunks pass through the same hosted model and produce the same vector — over and over — because the cache was never wired up.

How embedcache helps

embedcache keys every vector by content hash + model id in SQLite. Re-embedding an unchanged chunk is a cache read, not another API call.

Embedding spend is a tax on stable inputs →

⏳

Rate limits stall backfills

The problem

Embedding a large corpus against a hosted API means throttling, retries, and multi-day jobs governed by someone else’s quota.

How embedcache helps

Local inference has no per-minute quota. You are bounded by your own CPU, and cache hits skip inference entirely.

Batch RAG ingestion →

🔒

Content leaves your network

The problem

Sending every chunk to a third-party embedding endpoint means your text — and its metadata — crosses a boundary you do not control.

How embedcache helps

The model runs on the box you already own. Text is embedded in-process; nothing is sent to an external embedding provider.

Air-gapped embeddings →

📌

Reproducibility drifts

The problem

Hosted embedding models change under you. A vector you computed last quarter may not match one you compute today against the “same” endpoint.

How embedcache helps

Bundled ONNX model weights are pinned. The same model id and the same input produce the same vector, and the cache records exactly which model wrote each row.

How caching works →

Two ways to call it

Hit the REST service from any language, or depend on the crate and skip the network hop entirely. Either way, repeated inputs come back from the cache.

REST POST /v1/embed

# Embed an array of strings — cached by content hash + model
curl -X POST localhost:8081/v1/embed \
  -H "content-type: application/json" \
  -d '{
    "model": "BGESmallENV15",
    "text": ["invoices are due monthly", "invoices are due monthly"]
  }'

# Second identical string is served from the SQLite cache
{
  "model": "BGESmallENV15",
  "vectors": [[...384 floats...], [...384 floats...]],
  "cache": ["miss", "hit"]
}
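
The per-item "cache" array in the response above can be pictured with a small sketch. This is not embedcache's implementation: `embed_batch`, `CacheStatus`, and the in-memory `HashMap` (standing in for SQLite, and keyed by the raw text instead of a content hash for brevity) are all hypothetical names chosen for illustration, and `embed` is whatever function runs the model.

```rust
use std::collections::HashMap;

/// Whether a vector came from the cache or had to be computed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CacheStatus {
    Hit,
    Miss,
}

/// Embed a batch, calling `embed` only for texts that are not cached yet.
/// The cache key is (model id, text); results keep the input order.
fn embed_batch(
    cache: &mut HashMap<(String, String), Vec<f32>>,
    model: &str,
    texts: &[&str],
    embed: impl Fn(&str) -> Vec<f32>,
) -> (Vec<Vec<f32>>, Vec<CacheStatus>) {
    let mut vectors = Vec::with_capacity(texts.len());
    let mut statuses = Vec::with_capacity(texts.len());
    for &text in texts {
        let key = (model.to_string(), text.to_string());
        // Clone the cached vector (if any) so the cache can be mutated below.
        let cached = cache.get(&key).cloned();
        match cached {
            Some(vector) => {
                vectors.push(vector);
                statuses.push(CacheStatus::Hit);
            }
            None => {
                let vector = embed(text);
                cache.insert(key, vector.clone());
                vectors.push(vector);
                statuses.push(CacheStatus::Miss);
            }
        }
    }
    (vectors, statuses)
}
```

Because the first occurrence of a string is stored before the second is looked up, a batch that repeats a string computes it once and reports the repeat as a hit.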

Rust embedcache crate

// Use the crate directly; no server needed
use embedcache::{Cache, Model};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cache = Cache::open("cache.db")?;
    let vectors = cache.embed(Model::BGESmallENV15, &["hello world"])?;

    // vectors[0] is a Vec<f32> of length 384.
    // Call again with the same text -> returned from SQLite.
    assert_eq!(vectors[0].len(), 384);
    Ok(())
}

Illustrative snippets. See the docs for the exact API surface.

Everything in one small binary

Local models, a persistent cache, a REST surface, and optional semantic chunking — nothing to provision, no API keys to rotate.

Local embeddings + cache

The foundation: run the model on-box and never recompute the same vector.

Local inference

Embeddings are generated on the box you already own via the FastEmbed crate and the ONNX runtime. No calls to OpenAI, Cohere, or Voyage. No per-token bill, no rate limits.


Content-hash cache

Every vector is stored in SQLite keyed by the content hash plus the model identifier. The same input under the same model is never embedded twice.
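
To make the keying idea concrete, here is a minimal sketch using only the standard library. It is an illustration under stated assumptions, not the crate's code: `CacheKey`, `VectorCache`, and `get_or_embed` are hypothetical names, a `HashMap` stands in for the SQLite table, and 64-bit FNV-1a is used only because it is short and stable. A persistent cache would more plausibly use a cryptographic digest such as SHA-256.

```rust
use std::collections::HashMap;

/// 64-bit FNV-1a, used here only as a compact, stable example hash.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Hypothetical cache key: model identifier plus content hash.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct CacheKey {
    model: String,
    content_hash: u64,
}

/// In-memory stand-in for the SQLite table.
struct VectorCache {
    rows: HashMap<CacheKey, Vec<f32>>,
    computed: usize, // number of times the embedder actually ran
}

impl VectorCache {
    fn new() -> Self {
        VectorCache { rows: HashMap::new(), computed: 0 }
    }

    /// Return the cached vector, or run `embed` once and store the result.
    fn get_or_embed(
        &mut self,
        model: &str,
        text: &str,
        embed: impl Fn(&str) -> Vec<f32>,
    ) -> Vec<f32> {
        let key = CacheKey {
            model: model.to_string(),
            content_hash: fnv1a64(text.as_bytes()),
        };
        if let Some(vector) = self.rows.get(&key) {
            return vector.clone();
        }
        let vector = embed(text);
        self.computed += 1;
        self.rows.insert(key, vector.clone());
        vector
    }
}
```

Including the model in the key matters: the same text under a different model is a different row, so switching models never returns a vector from the wrong embedding space.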


22+ bundled models

BGE small / base / large, MiniLM, Nomic, and multilingual E5 ship from the FastEmbed catalogue. Pick per request; enable a subset with ENABLED_MODELS.


Serving surface

Use it as a Rust crate or a self-hosted REST service with interactive docs.

Library or service

Depend on the embedcache crate directly in a Rust app, or run the standalone REST service in front of any language stack.


REST API

POST /v1/embed embeds an array of texts; POST /v1/process fetches a URL, chunks the content, and embeds the chunks; GET /v1/params lists the available models and chunkers.


Four docs UIs

Swagger, ReDoc, RapiDoc, and Scalar are mounted out of the box at /swagger, /redoc, /rapidoc, and /scalar for exploring the API interactively.


Chunking

Whitespace by default; optional LLM-driven semantic boundaries.

Whitespace chunking

The default "words" chunker splits on whitespace — always available, fast, and requires no LLM. Good enough for most ingest pipelines.
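
As a rough picture of whitespace chunking, the sketch below splits on whitespace and groups the words into fixed-size chunks. The function name `chunk_words` and the `max_words` parameter are assumptions made for this example; the actual "words" chunker may size or join its chunks differently.

```rust
/// Split `text` on whitespace and group the words into chunks of at most
/// `max_words` words, joined by single spaces.
/// `max_words` is a hypothetical parameter for this sketch.
fn chunk_words(text: &str, max_words: usize) -> Vec<String> {
    assert!(max_words > 0, "max_words must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words
        .chunks(max_words)
        .map(|chunk| chunk.join(" "))
        .collect()
}
```

Runs of spaces and newlines collapse to a single separator, so the same words always produce the same chunk text, which keeps content hashes stable across reformatted input.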


Optional LLM chunking

llm-concept and llm-introspection use Ollama or OpenAI to find semantic boundaries instead of whitespace splits, for callers who want cleaner chunks.


Operations

CPU-first, env-configured, single-file state, open source.

CPU-first

The primary supported path is CPU via the ONNX runtime. No GPU is required to run any bundled model.


Env-var configuration

SERVER_HOST, SERVER_PORT, DB_PATH, ENABLED_MODELS, and the LLM_* variables configure the service from the environment or a .env file.
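
A sketch of how such environment-driven configuration might be read in Rust, using only the standard library. Everything beyond the variable names is an assumption: the `Config` struct and `config_from` are hypothetical, the fallback values simply mirror the startup log shown earlier (127.0.0.1, 8081, ./cache.db), and ENABLED_MODELS is assumed to be a comma-separated list. The LLM_* variables are left out.

```rust
use std::env;

/// Hypothetical settings struct for this sketch.
#[derive(Debug, PartialEq)]
struct Config {
    host: String,
    port: u16,
    db_path: String,
    enabled_models: Vec<String>, // empty means "no filter" in this sketch
}

/// Build a Config from any key -> value lookup, so the same code can be
/// driven by the real environment or by a test fixture.
fn config_from(lookup: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let host = lookup("SERVER_HOST").unwrap_or_else(|| "127.0.0.1".to_string());
    let port = match lookup("SERVER_PORT") {
        Some(raw) => raw
            .parse::<u16>()
            .map_err(|e| format!("SERVER_PORT: {e}"))?,
        None => 8081,
    };
    let db_path = lookup("DB_PATH").unwrap_or_else(|| "./cache.db".to_string());
    // Assumed format: comma-separated model ids, whitespace ignored.
    let enabled_models: Vec<String> = lookup("ENABLED_MODELS")
        .map(|raw| {
            raw.split(',')
                .map(|m| m.trim().to_string())
                .filter(|m| !m.is_empty())
                .collect()
        })
        .unwrap_or_default();
    Ok(Config { host, port, db_path, enabled_models })
}

/// Read the settings from the process environment.
fn config_from_env() -> Result<Config, String> {
    config_from(|key| env::var(key).ok())
}
```

Passing the lookup as a closure keeps parsing separate from the process environment, so an invalid SERVER_PORT can be rejected at startup and tested without touching real variables.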


Single-file state

The SQLite cache file is the only state the service owns. Back it up, ship it between environments, or delete it to start cold.


Open source

embedcache is published on crates.io and developed in the open under GPL-3.0. Read the code, file issues, send patches.


22+ models, all local

Pick a model per request. These five are a representative slice of the FastEmbed catalogue that ships with embedcache.

AllMiniLML6V2 384-dim

Fast, general purpose

BGESmallENV15 384-dim

Best quality/speed balance

BGEBaseENV15 768-dim

Higher quality

BGELargeENV15 1024-dim

Highest quality among bundled English models

MultilingualE5Base 768-dim

100+ languages
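
Since vector length depends on the model, downstream code (for example a vector database collection) usually needs the dimension up front. The sketch below encodes the five models and dimensions listed above in a local enum. This `Model` type and `check_vector` are hypothetical helpers written for this example, not the crate's own types.

```rust
/// Local, hypothetical enum covering only the five models listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Model {
    AllMiniLML6V2,
    BGESmallENV15,
    BGEBaseENV15,
    BGELargeENV15,
    MultilingualE5Base,
}

impl Model {
    /// Output dimension as stated in the list above.
    fn dim(self) -> usize {
        match self {
            Model::AllMiniLML6V2 | Model::BGESmallENV15 => 384,
            Model::BGEBaseENV15 | Model::MultilingualE5Base => 768,
            Model::BGELargeENV15 => 1024,
        }
    }
}

/// Reject a vector whose length does not match the model that produced it.
fn check_vector(model: Model, vector: &[f32]) -> Result<(), String> {
    if vector.len() == model.dim() {
        Ok(())
    } else {
        Err(format!(
            "{:?} expects {} dimensions, got {}",
            model,
            model.dim(),
            vector.len()
        ))
    }
}
```

A check like this catches the common mistake of mixing vectors from two models in one index.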


22+ ONNX models bundled via FastEmbed

3 REST endpoints: /v1/embed, /v1/process, /v1/params

4 docs UIs: Swagger, ReDoc, RapiDoc, Scalar

No per-token embedding bill

Explore the docs

Everything about embedcache, one page at a time — from the quickstart to how the cache key actually works.

Own your embedder

embedcache is open source (GPL-3.0). Install the crate, run the service, and stop paying to recompute vectors you already have.


Running embedcache in production?

Tell us about your models, throughput, and cache size — we read every message.

Get in touch
