Guides

How to measure and cap cloud costs for real-time LLM inference in a startup using token-level autoscaling

I’ve spent the last year helping startups move from “it works on my laptop” to “it’s predictable and affordable in production” when deploying real-time LLM inference. One recurring headache is cloud costs that explode unpredictably because inference usage is measured in tokens, not requests—and tokens vary wildly. In this guide I’ll walk through how I measure token-level costs, build token-aware autoscaling, and put practical...

Read more...

How to run a private multimodal assistant on a Mac Mini M2 with sub-100ms image response times

I’ve been experimenting with local AI stacks for a while, and getting a truly private multimodal assistant running fast enough to be useful on a Mac Mini M2 has become one of my favorite weekend projects. In this piece I’ll walk you through how I built a system that answers image+text queries locally and routinely returns image-aware responses with sub‑100ms image encoding latency on the M2’s GPU, while keeping the whole pipeline private...

Read more...

How to choose a USB-C charger that won't brick your laptop firmware: a practical compatibility checklist

I learned the hard way that not all USB‑C chargers are created equal. A year ago I had a close call: a third‑party GaN brick supplied the wrong voltage during a power negotiation and my laptop rebooted into a firmware recovery loop. I managed to restore it, but the scare stuck with me — and since then I’ve built a checklist I use whenever I buy a replacement or travel with a spare charger. Below I share that checklist and the practical...

Read more...

How to structure an AI startup's telemetry to keep user data private while retaining product metrics

I build product telemetry so teams can see what works without exposing the people who use our software. Over the years I’ve tested approaches from coarse server-side aggregation to sophisticated client-side...

Read more...

Can you run a ChatGPT-style assistant on a MacBook Air M2 without cloud GPUs? A practical latency and cost checklist

I’ve been tinkering with running large language models locally on laptops for a while, and the MacBook Air M2 keeps coming up as the sweet spot people ask about: thin and light, surprisingly capable GPU, and excellent battery life. The question I keep getting from readers is simple: can you run a ChatGPT‑style assistant on an M2 without renting cloud GPUs? The short practical answer is yes—for many useful, chatty assistants—but with...

Read more...

How to run a cost‑predictable on‑device LLM using llama.cpp on a midrange laptop

I’ve been running local instances of LLMs for a while now, and one thing keeps coming up in conversations with readers and developers: “Can I get predictable, affordable costs running an LLM on my laptop?” The short answer is yes — with llama.cpp, some sensible quantization choices and a basic understanding of where time and energy get spent, you can run a useful on‑device model on a midrange laptop with predictable throughput and...

Read more...

Step‑by‑step playbook for replacing third‑party analytics SDKs with privacy‑friendly in‑house telemetry in a startup

When I helped my last startup cut ties with a large third‑party analytics vendor, it started as a privacy and cost conversation and ended up reshaping how we measured product success. Replacing an off‑the‑shelf SDK with an in‑house telemetry pipeline is more than engineering work: it’s a product, legal and operations effort. Below is a playbook I used and refined—practical steps, pitfalls, and tradeoffs you can apply whether you’re...

Read more...

How to run a private GPT-style assistant on an Intel NUC with minimal latency and cost

I run a private GPT-style assistant at home on an Intel NUC because I wanted low latency, full data control and predictable running costs. Over the past year I iterated on hardware, models and deployment patterns until I hit a sweet spot: sub-second response times for short prompts, multi-second but usable answers for longer generations, and monthly costs that are basically power + occasional SSD replacements. Below I walk through what worked...

Read more...

How to migrate a 50-person agency from Google Workspace and Slack to self-hosted Nextcloud and Matrix with minimal downtime

Migrating a 50-person agency off Google Workspace and Slack onto self-hosted Nextcloud and Matrix is one of those projects that sounds daunting until you break it into small, testable steps. I've led migrations like this and the single best lever to keep downtime minimal is planning for parallel operation: run the new stack alongside the old, replicate data and workflows, then flip users over in small cohorts. Below I share a practical, hands-on...

Read more...

How to run a privacy-preserving fine-tuned LLM on a Raspberry Pi 5 without cloud costs

I wanted to run a useful, private large language model (LLM) from my home lab without paying recurring cloud bills or leaking sensitive data to third parties. After a few evenings of tinkering I got a workflow that works reliably on a Raspberry Pi 5: fine‑tune (or adapt) a model on my local workstation, quantize it, and serve a compact, privacy-preserving instance on the Pi. In this guide I’ll walk you through the practical steps,...

Read more...