AI

Reducing hallucinations in retrieval-augmented chatbots for customer support teams

When customer support teams adopt retrieval-augmented generation (RAG) to power chatbots, the promise is compelling: fast, contextually aware answers grounded in a company's own documentation. In practice, however, one problem keeps surfacing: hallucinations. These are fluent, plausible-sounding responses that confidently state incorrect facts or invent citations. I've worked with product and security teams who’ve felt that a seemingly...

Choosing a self-hosted vector database for on-device LLM search: Milvus, pgvector or Chroma?

When I started evaluating self-hosted vector databases for on-device LLM search, I expected a straightforward tradeoff: pick the fastest engine and you're done. Reality was messier. The right choice depends on workload patterns, hardware constraints, embedding strategy, and how much operational complexity you’re willing to accept. Below I walk through what I learned comparing Milvus, pgvector and Chroma—practical differences, deployment...

Comparing on-device speech recognition engines for offline dictation workflows

When I moved several long-form writing workflows entirely offline, the single biggest friction point was reliable, accurate dictation that respected privacy and worked without an internet connection. Cloud ASR (automatic speech recognition) is great for accuracy, but for sensitive notes, interviews, or fieldwork where connectivity is spotty, on-device speech recognition is the only realistic option. I spent months evaluating and integrating...

Understanding model distillation: make your LLM run fast on a laptop without cloud costs

I remember the first time I tried to run a modern language model on my laptop: it was slow, memory-starved, and I spent more time swapping RAM than actually getting useful responses. Since then I’ve tested pruning, quantization, on-device runtimes and — most importantly — model distillation. Distillation is the technique that finally let me run capable models locally without paying cloud fees or sacrificing privacy. In this piece I’ll...
