Choosing a self-hosted vector database for on-device LLM search: Milvus, pgvector or Chroma?

When I started evaluating self-hosted vector databases for on-device LLM search, I expected a straightforward tradeoff: pick the fastest engine and you're done. Reality was messier. The right choice depends on workload patterns, hardware constraints, embedding strategy, and how much operational complexity you’re willing to accept. Below I walk through what I learned comparing Milvus, pgvector and Chroma—practical differences, deployment notes, and which tool I reach for depending on the project.

What I mean by "on-device LLM search"

When I say on-device LLM search, I mean embedding-based retrieval that runs close to the user or in an environment where data privacy and low latency are priorities: mobile apps, edge devices, local desktops, or tightly controlled server instances. "On-device" here doesn't necessarily mean a CPU-only phone; it can include an edge server with limited CPU/RAM or a small GPU. The key constraints are resource limits, privacy concerns, and the need to avoid heavy cloud dependencies.

Core questions I asked

Early on I framed a few practical questions that guided my testing:

How easy is it to run locally and keep data private?

What are memory and disk footprints for modest datasets (millions of vectors)?

Does it support the ANN algorithm I prefer (HNSW, IVF, PQ, etc.)?

How well does it integrate with common embedding pipelines and LLM stacks?

How does it handle operational concerns: backups, consistency, replication, schema evolution?
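The footprint question is the easiest one to answer on paper before running anything. Here is a minimal sketch, assuming float32 or int8 components and an HNSW graph whose bottom layer stores roughly 2*M neighbour ids of 4 bytes each. That graph figure is my own rough approximation, not a number from any of the three engines, and the function names are purely illustrative.

```python
def raw_vector_bytes(num_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed for the vectors alone (float32 by default)."""
    return num_vectors * dim * bytes_per_component


def hnsw_link_bytes(num_vectors: int, m: int = 16) -> int:
    """Rough layer-0 graph overhead: about 2*M neighbour ids of 4 bytes each.

    Assumption for illustration only; upper layers and per-engine
    bookkeeping are ignored.
    """
    return num_vectors * 2 * m * 4


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    n, dim = 1_000_000, 1536
    for label, width in (("float32", 4), ("int8", 1)):
        total = raw_vector_bytes(n, dim, width) + hnsw_link_bytes(n)
        print(f"{label}: {to_gib(total):.2f} GiB for {n:,} vectors of dim {dim}")
```

Treat the result as a lower bound: metadata, write-ahead logs and the engine's own overhead come on top.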

Quick feature snapshot

Feature	Milvus	pgvector	Chroma
Primary model	Dedicated vector DB	Extension to PostgreSQL	Lightweight vector store + SDK
ANN algorithms	HNSW, IVF+PQ, flat, GPU-accelerated	HNSW and IVFFlat indexes (exact scan without an index)	HNSW with tunable parameters
Scaling	Distributed, sharding, HA	Scale via PostgreSQL tooling (sharding/replication)	Single-node; enterprise options for distributed
Metadata/SQL	Rich metadata + SDKs	Full SQL + relational joins	Metadata support but not full SQL
Ops complexity	Higher (microservices, etc.)	Familiar to DBAs	Low—single process
Licensing	Open source (community, some enterprise features)	Open source (Postgres + extension)	Open source core; commercial offerings

Milvus: heavy duty, feature rich

Milvus felt like the "enterprise vector database" in my tests. It has mature clustering, supports GPU acceleration, and implements multiple indexes (HNSW, IVF+PQ) that let you tune the recall/latency tradeoff across millions or billions of vectors. If your on-device environment is actually an edge cluster (a set of servers close to users) or you anticipate scaling to very large corpora, Milvus is compelling.
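The recall side of that tradeoff is only meaningful if you measure it against exact results. A minimal sketch of how that could look, assuming cosine distance and a brute-force scan as ground truth; the "approximate" result here is a made-up stand-in, since the real one would come from whichever index you are tuning.

```python
import math
import random


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: positions of the k nearest vectors."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index returned."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    corpus = [[rng.gauss(0, 1) for _ in range(32)] for _ in range(500)]
    query = [rng.gauss(0, 1) for _ in range(32)]
    truth = exact_top_k(query, corpus, k=10)
    # Stand-in for an ANN result: pretend the index missed the last two hits.
    approx = truth[:8] + [-1, -2]
    print(f"recall@10 = {recall_at_k(approx, truth):.2f}")
```

Averaging this over a few hundred real queries, alongside the latency of each index setting, gives you the curve you are actually tuning.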

What I liked:

Production-ready clustering and replication.

GPU support for both indexing and search—huge speedups if you can provision a small GPU.

Multiple index types and built-in management primitives for reindexing and partitioning.

What I didn’t love:

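Operational complexity. In its distributed deployment Milvus runs as several services (data, index, etc.) and in practice needs Docker or Kubernetes for smooth operations. It also ships a single-node standalone mode and an embedded Milvus Lite, but those give up the clustering that makes it attractive in the first place.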

Memory/disk footprint can be significant for small deployments—overkill for strictly on-device single-node use.

When I pick Milvus: distributed edge servers or a small private cloud where I need scale, HA, and hardware acceleration.

pgvector: simplicity and SQL power

pgvector is an extension to PostgreSQL that makes vectors first-class citizens in a relational DB. In practice it’s the most pragmatic option if you value SQL, transactional guarantees and the ability to mix vector search with relational queries.

What I liked:

Familiar operational model: run Postgres, install extension, you're done. That’s a big win for teams with DBAs.

Full SQL + joins mean you can do hybrid queries (text filters + vector ranking) without glue code.

Lightweight for smaller datasets and easy to backup/replicate using existing Postgres tooling.

What I didn’t love:

pgvector index performance lags specialized engines for very large corpora—though HNSW support and external indexers improve things.

Lacks built-in GPU acceleration and the advanced indexing variety of Milvus.

When I pick pgvector: when I need transactional integrity, tight relational joins, or to bolt vector search onto an existing Postgres-backed app.
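To show what "hybrid queries without glue code" can look like, here is a sketch under stated assumptions: a hypothetical docs table with a tenant_id column and a 1536-dimension embedding, an HNSW index using pgvector's cosine operator class, and named placeholders in the style a driver such as psycopg accepts. The table, column and index names are mine, not from any real schema.

```python
# Hypothetical schema; adjust the names and the dimension to your own data.
CREATE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS docs (
    id bigserial PRIMARY KEY,
    tenant_id integer NOT NULL,
    body text NOT NULL,
    embedding vector(1536)
);
CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
    ON docs USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
"""

# Relational filter plus vector ranking in one statement.
# <=> is pgvector's cosine-distance operator.
HYBRID_SQL = """
SELECT id, body, embedding <=> %(query)s::vector AS distance
FROM docs
WHERE tenant_id = %(tenant_id)s
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Format numbers as pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def hybrid_params(query_embedding, tenant_id, k=5):
    """Build the parameter dict expected by HYBRID_SQL."""
    return {
        "query": to_vector_literal(query_embedding),
        "tenant_id": tenant_id,
        "k": k,
    }
```

With a connection in hand, the call would be along the lines of cur.execute(HYBRID_SQL, hybrid_params(embedding, tenant_id=7)). Query-time recall can then be adjusted per session with SET hnsw.ef_search, without rebuilding the index.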

Chroma: developer ergonomics first

Chroma is designed around the developer experience: a simple Python/JS SDK, fast prototyping, and straightforward persistence. It’s very appealing when you want a small, self-contained vector store that you can embed within an application or run as a single service on an edge device.

What I liked:

Extremely easy to get started—pip install, create a collection, and you can query in minutes.

Good for local-first workflows: low overhead, file-based persistence, and no heavy infra.

Integrates well with Python ML stacks and open-source LLM tooling.

What I didn’t love:

Single-node by default; scaling to distributed setups requires enterprise features or custom engineering.

Less mature for very large datasets or strict durability guarantees compared to Postgres or Milvus.

When I pick Chroma: proof-of-concept, desktop or single-server deployments, or prototypes where developer speed is the priority.

Practical tradeoffs and tips from my tests

Here are a few practical lessons I picked up while actually benchmarking and building prototypes.

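Measure for your embedding dimension. 1536-dim embeddings (common for OpenAI's ada-002 and some other embedding models) weigh a lot in RAM: at float32 each vector takes 1536 × 4 = 6,144 bytes, so a million vectors need about 6.1 GB before any index overhead. If your use case is mobile, consider smaller embedding models or quantized embeddings.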

Index choice matters more than DB choice for latency. HNSW with tuned M/efConstruction/efSearch often delivers the best latency/recall balance on CPUs.

Batch your inserts and avoid single-vector writes in tight loops. All three solutions benefit from bulk insertion strategies.

Hybrid search (filter by metadata before vector ranking) is easiest with pgvector due to SQL. With Milvus and Chroma you need more application-side orchestration unless you rely on built-in filter support.

For absolute privacy, keep everything on-device. Chroma or pgvector (local Postgres) are easiest to operate fully offline; Milvus can run offline too, but it’s heavier.
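The batching tip is simple to apply regardless of engine. A minimal sketch, assuming you wrap your client's bulk call in a callable; insert_many below is a hypothetical stand-in for that wrapper, not a function from any of the three libraries.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of at most batch_size items, preserving order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def bulk_insert(records, insert_many, batch_size=1000):
    """Call insert_many once per batch; return the number of calls made.

    insert_many is a hypothetical callable wrapping your client's
    bulk-insert API.
    """
    calls = 0
    for batch in batched(records, batch_size):
        insert_many(batch)
        calls += 1
    return calls
```

With 2,500 records and a batch size of 1,000 this makes three round trips instead of 2,500. The right batch size depends on vector dimension and the engine's request limits, so it is worth measuring rather than guessing.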

Example deployment patterns I used

Here are three patterns that reflect common needs I encounter:

Single-device, privacy-first (desktop app): Chroma or local Postgres+pgvector. Chroma for rapid dev; pgvector if you require SQL features.

Edge cluster with modest GPUs: Milvus. Use GPU for indexing and small inference servers for the LLM; tune IVF+PQ to reduce memory.

Web app with rich relational data: pgvector layered into your existing Postgres to simplify joins and backups.
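For the edge-cluster pattern, it helps to see why product quantization cuts memory so sharply. The arithmetic below is a sketch assuming standard PQ with 8-bit codes and an illustrative choice of 96 sub-vectors for 1536-dim embeddings. It counts only the per-vector codes and ignores codebooks, IVF centroids and ids.

```python
def float32_bytes_per_vector(dim):
    """Uncompressed size of one float32 vector."""
    return dim * 4


def pq_bytes_per_vector(dim, num_subvectors, bits_per_code=8):
    """Size of one PQ code: one small integer per sub-vector."""
    if dim % num_subvectors != 0:
        raise ValueError("dim must be divisible by num_subvectors")
    return (num_subvectors * bits_per_code + 7) // 8


if __name__ == "__main__":
    dim, m = 1536, 96  # m = 96 is an illustrative choice, not a recommendation
    raw = float32_bytes_per_vector(dim)
    code = pq_bytes_per_vector(dim, m)
    print(f"raw: {raw} B, PQ code: {code} B, ratio: {raw / code:.0f}x")
```

The compression is lossy, so recall should be re-measured after any change to the number of sub-vectors.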

Checklist to choose for your project

Do you need distributed scaling and GPU acceleration? Choose Milvus.

Are SQL and transactional guarantees important? Choose pgvector.

Are developer speed and low operational burden more important than extreme scale? Choose Chroma.

If you want, tell me about your dataset size, embedding model and hardware (CPU / RAM / GPU) and I’ll sketch a concrete deployment and index configuration tailored to your constraints.
