Choosing a self-hosted vector database for on-device LLM search: Milvus, pgvector or Chroma?

When I started evaluating self-hosted vector databases for on-device LLM search, I expected a straightforward tradeoff: pick the fastest engine and you're done. Reality was messier. The right choice depends on workload patterns, hardware constraints, embedding strategy, and how much operational complexity you're willing to accept. Below I walk through what I learned comparing Milvus, pgvector and Chroma: the practical differences, deployment notes, and which tool I reach for depending on the project.

What I mean by "on-device LLM search"

When I say on-device LLM search, I mean embedding-based retrieval that runs close to the user, or in an environment where data privacy and low latency are priorities: mobile apps, edge devices, local desktops, or tightly controlled server instances. "On-device" here doesn't necessarily mean a CPU-only phone; it can also be an edge server with limited CPU and RAM or a small GPU. The key constraints are resource limits, privacy concerns, and the need to avoid heavy cloud dependencies.

Core questions I asked

Early on I framed a few practical questions that guided my testing:

  • How easy is it to run locally and keep data private?
  • What are memory and disk footprints for modest datasets (millions of vectors)?
  • Does it support the ANN algorithm I prefer (HNSW, IVF, PQ, etc.)?
  • How well does it integrate with common embedding pipelines and LLM stacks?
  • How does it handle operational concerns: backups, consistency, replication, schema evolution?
Quick feature snapshot

| Feature | Milvus | pgvector | Chroma |
|---|---|---|---|
| Primary model | Dedicated vector DB | Extension to PostgreSQL | Lightweight vector store + SDK |
| ANN algorithms | HNSW, IVF+PQ, flat; GPU-accelerated | HNSW, IVFFlat (exact search when no index is used) | HNSW; customizable |
| Scaling | Distributed, sharding, HA | Scale via PostgreSQL tooling (sharding/replication) | Single-node; enterprise options for distributed |
| Metadata/SQL | Rich metadata + SDKs | Full SQL + relational joins | Metadata support but not full SQL |
| Ops complexity | Higher (microservices, etc.) | Familiar to DBAs | Low (single process) |
| Licensing | Open source (community, some enterprise features) | Open source (Postgres + extension) | Open source core; commercial offerings |

Milvus: heavy duty, feature rich

Milvus felt like the "enterprise vector database" in my tests. It has mature clustering, supports GPU acceleration, and implements multiple indexes (HNSW, IVF+PQ) that let you tune the recall/latency tradeoff across millions or billions of vectors. If your on-device environment is actually an edge cluster (a set of servers close to users) or you anticipate scaling to very large corpora, Milvus is compelling.

What I liked:

  • Production-ready clustering and replication.
  • GPU support for both indexing and search, which brings large speedups if you can provision a small GPU.
  • Multiple index types and built-in management primitives for reindexing and partitioning.

What I didn’t love:

  • Operational complexity. A distributed Milvus deployment runs as several services (data, index, etc.) and in practice needs Docker or Kubernetes to operate smoothly.
  • Memory/disk footprint can be significant for small deployments, which makes it overkill for strictly on-device single-node use.

When I pick Milvus: distributed edge servers or a small private cloud where I need scale, HA, and hardware acceleration.
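Whichever index you tune, the recall side of that tradeoff has to be measured against exact search. Here is a minimal sketch of recall@k in plain Python; the random data and the simulated "approximate" result are stand-ins I made up for illustration, not output from Milvus or any other engine.

```python
import math
import random


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: indices of the k nearest vectors."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine_distance(query, vectors[i]))
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true neighbours that the approximate result recovered."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(32)] for _ in range(500)]
query = [random.gauss(0, 1) for _ in range(32)]

truth = exact_top_k(query, vectors, k=10)
# Simulated ANN answer: keeps 8 true neighbours and returns 2 wrong ids.
wrong = [i for i in range(len(vectors)) if i not in truth][:2]
approx = truth[:8] + wrong

print(f"recall@10 = {recall_at_k(approx, truth):.2f}")  # recall@10 = 0.80
```

In a real benchmark, `approx` would come from the engine under test, and you would average recall over a few hundred queries while recording latency for each index setting.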

pgvector: simplicity and SQL power

pgvector is an extension to PostgreSQL that makes vectors first-class citizens in a relational DB. In practice it’s the most pragmatic option if you value SQL, transactional guarantees and the ability to mix vector search with relational queries.

What I liked:

  • Familiar operational model: run Postgres, install the extension, and you're done. That’s a big win for teams with DBAs.
  • Full SQL + joins mean you can do hybrid queries (metadata or text filters + vector ranking) without glue code.
  • Lightweight for smaller datasets and easy to back up and replicate using existing Postgres tooling.

What I didn’t love:

  • pgvector index performance lags specialized engines for very large corpora, though HNSW support and external indexers improve things.
  • Lacks built-in GPU acceleration and the advanced indexing variety of Milvus.

When I pick pgvector: when I need transactional integrity, tight relational joins, or to bolt vector search onto an existing Postgres-backed app.
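To make the hybrid-query point concrete, here is a rough sketch of a filter-then-rank query. The `documents` table and its `id`, `title`, `category` and `embedding` columns are hypothetical, and the `%s` placeholders assume a psycopg-style driver; `<=>` is pgvector's cosine-distance operator. The code only builds the SQL and parameters, so it runs without a database.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as pgvector text input, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_hybrid_query(embedding, category, limit=5):
    """Return (sql, params): filter on a relational column, rank by distance.

    Table and column names are made up for illustration.
    """
    sql = (
        "SELECT id, title "
        "FROM documents "
        "WHERE category = %s "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    params = (category, to_vector_literal(embedding), limit)
    return sql, params


sql, params = build_hybrid_query([0.1, 0.2, 0.3], "manuals", limit=3)
print(sql)
print(params)  # ('manuals', '[0.1,0.2,0.3]', 3)
```

With a driver such as psycopg you would pass both values to `cursor.execute(sql, params)`. Keeping the distance expression directly in `ORDER BY` is the form the pgvector documentation uses for index-backed searches.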

Chroma: developer ergonomics first

Chroma is designed around the developer experience: a simple Python/JS SDK, fast prototyping, and straightforward persistence. It’s very appealing when you want a small, self-contained vector store that you can embed within an application or run as a single service on an edge device.

What I liked:

  • Extremely easy to get started: pip install, create a collection, and you can query in minutes.
  • Good for local-first workflows: low overhead, file-based persistence, and no heavy infra.
  • Integrates well with Python ML stacks and open-source LLM tooling.

What I didn’t love:

  • Single-node by default; scaling to distributed setups requires enterprise features or custom engineering.
  • Less mature for very large datasets or strict durability guarantees compared to Postgres or Milvus.

When I pick Chroma: proof-of-concept, desktop or single-server deployments, or prototypes where developer speed is the priority.

Practical tradeoffs and tips from my tests

Here are a few practical lessons I picked up while actually benchmarking and building prototypes.

  • Measure for your embedding dimension. 1536-dimensional embeddings (the size produced by OpenAI's text-embedding-ada-002, among others) are heavy in RAM: at 4 bytes per float32 component each vector is about 6 KB, so a million vectors need roughly 6 GB before any index overhead. If your use case is mobile, consider a smaller embedding model or quantized embeddings.
  • Index choice matters more than DB choice for latency. HNSW with tuned M/efConstruction/efSearch often delivers the best latency/recall balance on CPUs.
  • Batch your inserts and avoid single-vector writes in tight loops. All three solutions benefit from bulk insertion strategies.
  • Hybrid search (filter by metadata before vector ranking) is easiest with pgvector because of SQL. With Milvus and Chroma you need more application-side orchestration unless you rely on their built-in filter support.
  • For absolute privacy, keep everything on-device. Chroma or pgvector (local Postgres) are easiest to operate fully offline; Milvus can run offline too, but it’s heavier.
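The batching tip can be sketched with a small helper. This is a generic illustration, not any particular client's API: `insert_batch` is a hypothetical callable standing in for whatever bulk-insert call your store exposes, and the batch size of 1000 is an arbitrary starting point to tune.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def bulk_insert(records, insert_batch, batch_size=1000):
    """Call insert_batch once per chunk; return the number of calls made."""
    calls = 0
    for batch in batched(records, batch_size):
        insert_batch(batch)
        calls += 1
    return calls


# Stand-in for a real client: appending to a list plays the role of the store.
stored = []
records = [(f"doc-{i}", [0.0, 0.1 * i]) for i in range(2500)]
calls = bulk_insert(records, stored.extend, batch_size=1000)

print(calls, len(stored))  # 3 2500
```

Three round trips instead of 2500 is the whole point; the right batch size depends on vector dimension and on how much memory the client and server can spare per request.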
Example deployment patterns I used

Here are three patterns that reflect common needs I encounter:

  • Single-device, privacy-first (desktop app): Chroma or local Postgres+pgvector. Chroma for rapid dev; pgvector if you require SQL features.
  • Edge cluster with modest GPUs: Milvus. Use GPU for indexing and small inference servers for the LLM; tune IVF+PQ to reduce memory.
  • Web app with rich relational data: pgvector layered into your existing Postgres to simplify joins and backups.
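To see why product quantization helps on memory-constrained edge hardware, here is a back-of-the-envelope sketch. It assumes standard PQ with 8-bit codes (one byte per subquantizer) and counts only the stored codes; codebooks and IVF list structures add overhead on top. The choice of 96 subquantizers is illustrative, not a recommendation.

```python
def raw_bytes(num_vectors, dim, bytes_per_component=4):
    """Uncompressed size, float32 by default."""
    return num_vectors * dim * bytes_per_component


def pq_code_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Size of the PQ codes alone (codebooks and IVF lists not included)."""
    return num_vectors * num_subquantizers * bits_per_code // 8


n, dim = 1_000_000, 1536
raw = raw_bytes(n, dim)
compressed = pq_code_bytes(n, num_subquantizers=96)  # 16 dims per subvector

print(f"float32:           {raw / 1e9:.2f} GB")         # 6.14 GB
print(f"PQ (m=96, 8 bits): {compressed / 1e6:.0f} MB")  # 96 MB
print(f"compression:       {raw // compressed}x")       # 64x
```

The compression is lossy, so recall drops as the codes get shorter; measure it on your own data before committing to a setting.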
Checklist to choose for your project

  • Do you need distributed scaling and GPU acceleration? Choose Milvus.
  • Are SQL and transactional guarantees important? Choose pgvector.
  • Are developer speed and low operational burden more important than extreme scale? Choose Chroma.
If you want, tell me about your dataset size, embedding model and hardware (CPU / RAM / GPU) and I’ll sketch a concrete deployment and index configuration tailored to your constraints.

