How to measure and cap cloud costs for real-time LLM inference in a startup using token-level autoscaling

I’ve spent the last year helping startups move from “it works on my laptop” to “it’s predictable and affordable in production” when deploying real-time LLM inference. One recurring headache is cloud costs that explode unpredictably because inference usage is measured in tokens, not requests—and tokens vary wildly. In this guide I’ll walk through how I measure token-level costs, build token-aware autoscaling, and put practical caps and fallbacks in place so you can ship fast without a surprise invoice.

Why token-level thinking matters

Most autoscaling systems react to requests/sec, CPU, memory, or latency. For LLMs those metrics are incomplete. A single request can generate 10 tokens or 10,000 tokens. If your autoscaler only sees request rate, it might undersize capacity for huge prompts or oversize for short ones. Token-level autoscaling treats the true work unit—tokens—as the first-class metric. That gives you predictable throughput, lower cost variance, and better user experience.

How I measure cost per token

Start by understanding two things: (1) how many tokens you process, and (2) how much compute time and cloud spend that processing consumes. Here’s a pragmatic workflow I use.

  • Instrument every request pipeline to log token counts. Count prompt tokens and generated tokens separately. If you use OpenAI/Anthropic/etc., log the token counts returned by the API. For self-hosted models, use tokenizers (e.g., Hugging Face tokenizers) to estimate prompt tokens and count generated tokens in the model output.
  • Correlate token counts with inference latency and GPU/CPU usage. Capture per-request latency, and emit a lightweight metric for tokens_processed and inference_time_ms.
  • Aggregate into tokens/sec per instance. This is the core metric for autoscaling.
  • Map cloud billing to throughput. Export billing data (AWS Cost Explorer, GCP Billing Export to BigQuery, Azure Cost Management) and correlate daily/hourly spend with aggregated token counts. Use simple spreadsheets or SQL to compute an empirical cost-per-token (or cost-per-1k-tokens) for each model and instance type.
  • Empirical cost-per-token is critical because models differ drastically. A small LLM on CPU might cost $0.0001 per 1k tokens; a high-end GPU model can be orders of magnitude higher.
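
The aggregation step above can be sketched in a few lines. This is a minimal illustration, not a production exporter: `RequestRecord` and `TokenMeter` are hypothetical names, the 60-second window is an assumption, and timestamps are passed in explicitly so the behavior is easy to check.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class RequestRecord:
    timestamp: float          # seconds, when the request finished
    prompt_tokens: int
    generated_tokens: int
    inference_time_ms: float


class TokenMeter:
    """Keeps recent request records and reports tokens/sec over a sliding window."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self.records = deque()

    def record(self, rec):
        self.records.append(rec)

    def tokens_per_second(self, now):
        # Drop records that have aged out of the window.
        cutoff = now - self.window_seconds
        while self.records and self.records[0].timestamp <= cutoff:
            self.records.popleft()
        total = sum(r.prompt_tokens + r.generated_tokens for r in self.records)
        return total / self.window_seconds


meter = TokenMeter(window_seconds=60.0)
meter.record(RequestRecord(100.0, prompt_tokens=400, generated_tokens=200, inference_time_ms=850.0))
meter.record(RequestRecord(130.0, prompt_tokens=1000, generated_tokens=800, inference_time_ms=2400.0))
rate = meter.tokens_per_second(now=150.0)   # (600 + 1800) tokens / 60 s
print(rate)
```

In a real deployment the same per-request numbers would be emitted as metrics from the proxy or sidecar rather than held in memory, but the arithmetic is the same.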

    Simple formula to reason about costs

    I use this baseline formula to turn metrics into predictable numbers:

    Hourly cost = InstancePricePerHour + (Storage + Networking per hour)
    Tokens/hour capacity = Average tokens/sec * 3600 (per instance)
    Cost per 1k tokens = (Hourly cost / Tokens/hour capacity) * 1000

    Plugging actual numbers from your telemetry gives a realistic cost model you can use for pricing and budget caps.
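
As a worked example of the formula, here is a small helper. The function name and the prices are made up for illustration; plug in your own instance price, overhead, and measured throughput.

```python
def cost_per_1k_tokens(instance_price_per_hour, overhead_per_hour, avg_tokens_per_sec):
    """Cost per 1k tokens for one instance, following the baseline formula."""
    if avg_tokens_per_sec <= 0:
        raise ValueError("avg_tokens_per_sec must be positive")
    hourly_cost = instance_price_per_hour + overhead_per_hour   # storage + networking folded into overhead
    tokens_per_hour = avg_tokens_per_sec * 3600
    return hourly_cost / tokens_per_hour * 1000


# Hypothetical numbers: a $3.20/hour GPU instance, $0.40/hour of storage and
# networking, and a measured average of 1,000 tokens/sec.
example_cost = cost_per_1k_tokens(3.20, 0.40, 1000)
print(f"${example_cost:.4f} per 1k tokens")
```

With those inputs the instance costs $3.60 per hour and processes 3.6 million tokens per hour, which works out to roughly $0.001 per 1k tokens.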

    Designing token-aware autoscaling

    There are several architectural ways to implement token-level autoscaling. Here are patterns I’ve used or recommend, ordered from simplest to most robust.

  • Proxy-level token counting + metrics exporter: Put an API gateway or sidecar that parses requests and responses to count tokens. Expose a Prometheus metric tokens_processed_total and tokens_per_second. Use that metric as the custom metric for autoscaling (Kubernetes HPA or KEDA).
  • Per-instance token budget: Give each worker a tokens/sec capacity (derived from profiling). The autoscaler scales instances so total capacity ≥ observed tokens/sec + buffer.
  • Queue + concurrency + token-weighted workers: Insert a queue and let workers pull tasks. Each task consumes part of the worker's token budget; if a worker's remaining budget is too low for the next task, it stops pulling. This prevents one worker from being overloaded with huge requests.
  • Serverless with token budget function: On serverless platforms, embed token counting and throttle based on an adjustable per-minute token burn limit stored in Redis.
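
To make the token-weighted worker pattern concrete, here is a minimal in-memory sketch. `Task`, `Worker`, and the budget numbers are hypothetical; a real system would use a proper queue service and would estimate a task's tokens from its prompt length plus its max_tokens setting.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    estimated_tokens: int     # assumed: prompt tokens + requested max tokens


@dataclass
class Worker:
    name: str
    token_budget: int         # tokens this worker may have in flight at once
    in_flight: dict = field(default_factory=dict)

    def remaining(self):
        return self.token_budget - sum(self.in_flight.values())

    def try_pull(self, queue):
        """Pull the head task only if it fits in the remaining budget."""
        if not queue or queue[0].estimated_tokens > self.remaining():
            return False
        task = queue.popleft()
        self.in_flight[task.task_id] = task.estimated_tokens
        return True

    def finish(self, task_id):
        # Release the budget held by a completed task.
        self.in_flight.pop(task_id, None)


queue = deque([Task("a", 3000), Task("b", 6000), Task("c", 500)])
worker = Worker(name="gpu-0", token_budget=8000)
pulled_a = worker.try_pull(queue)         # "a" fits: 3000 <= 8000
pulled_b = worker.try_pull(queue)         # "b" needs 6000, only 5000 left
worker.finish("a")
pulled_b_retry = worker.try_pull(queue)   # budget is free again, "b" fits
```

One design caveat: this sketch only looks at the head of the queue, so a task larger than every worker's budget would block forever. Reject such tasks at the gateway or cap their token estimate before enqueueing.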

    Implementation details I use

    My go-to stack for Kubernetes deployments:

  • Prometheus to scrape token counts and tokens/sec derived from the proxy/sidecar.
  • KEDA or the Kubernetes HPA with external/custom metrics to scale based on tokens/sec.
  • Redis for short-term token accounting (fast decrement/increment for token budgets) and single-source quota enforcement across instances.
  • Ingress proxy (Envoy or Nginx) or a dedicated token-aware gateway that records prompt+response tokens and attaches that to tracing logs.

    Practical autoscaler tuning

    Tuning is where theory meets reality. Here are rules I’ve applied:

  • Scale on tokens/sec, not requests. Use a rolling average (30–60s) to avoid thrashing from bursty users.
  • Set minimum and maximum instances. Minimum ensures low-latency baseline; maximum caps cost exposure.
  • Use a buffer factor. If a single instance handles 10k tokens/sec at 90% GPU utilization, don’t wait until load reaches 10k per instance to scale. Provision for about 1.3× the observed tokens/sec so sudden growth doesn’t cause latency spikes.
  • Differentiate prompt vs generation tokens. Prompts are often heavy-cost but predictable; generation tokens are where you get runaway costs if you forget to cap max tokens per response.
  • Add a cooldown period (60–120s) for downscaling to keep warm GPUs and avoid repeated scale-up/down cycles.
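
The rules above combine into a small scaling decision. The sketch below is only an illustration of the logic: in Kubernetes the HPA or KEDA runs this loop for you, and `desired_replicas`, `TokenAutoscaler`, and all the numbers are assumptions. The input is expected to already be a 30–60s rolling average of tokens/sec.

```python
import math


def desired_replicas(avg_tokens_per_sec, per_instance_capacity,
                     buffer=1.3, min_replicas=1, max_replicas=10):
    """Replicas needed so total capacity covers load * buffer, clamped to limits."""
    needed = math.ceil(avg_tokens_per_sec * buffer / per_instance_capacity)
    return max(min_replicas, min(max_replicas, needed))


class TokenAutoscaler:
    """Scales up immediately, scales down only after a cooldown."""

    def __init__(self, per_instance_capacity, min_replicas=1, max_replicas=10,
                 buffer=1.3, cooldown_seconds=90.0):
        self.per_instance_capacity = per_instance_capacity
        self.min_replicas = min_replicas
        self.max_replicas = max_replicas
        self.buffer = buffer
        self.cooldown_seconds = cooldown_seconds
        self.replicas = min_replicas
        self.last_change = float("-inf")

    def step(self, avg_tokens_per_sec, now):
        target = desired_replicas(avg_tokens_per_sec, self.per_instance_capacity,
                                  self.buffer, self.min_replicas, self.max_replicas)
        if target > self.replicas:
            self.replicas, self.last_change = target, now
        elif target < self.replicas and now - self.last_change >= self.cooldown_seconds:
            self.replicas, self.last_change = target, now
        return self.replicas


scaler = TokenAutoscaler(per_instance_capacity=10_000, min_replicas=1, max_replicas=5)
r1 = scaler.step(18_000, now=0.0)     # needs ceil(2.34) = 3 replicas
r2 = scaler.step(4_000, now=30.0)     # load dropped, but still inside the cooldown
r3 = scaler.step(4_000, now=120.0)    # cooldown elapsed, scale down
r4 = scaler.step(100_000, now=130.0)  # would need 13, capped at max_replicas
```

The max_replicas clamp is what bounds cost exposure; everything past that point has to be handled by the caps and fallbacks rather than by adding capacity.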

    Capping costs: engineering and policy controls

    Autoscaling limits help, but you need explicit cost caps and graceful degradation strategies:

  • Hard token caps per request: Enforce a max_tokens parameter per request. For chat apps, default to a reasonable value (e.g., 512 or 1024) and require opt-in for higher limits.
  • Per-tenant/day token budgets: Store quotas and decrement per request. When a tenant is near budget, degrade to smaller models or present a friendly message.
  • Model downgrades as fallback: If a budget threshold is hit or the autoscaler reaches max instances, route to a cheaper model (e.g., a smaller Llama 2 variant or a distilled model). Implement this as a policy in the gateway.
  • Rate limiting by weighted tokens: Instead of counting requests, rate limit by tokens per minute. This stops attackers or misconfigured clients from burning through tokens, whether with rapid-fire calls or a few very large requests.
  • Preemptive alerts and soft caps: Emit alerts when spend approaches X% of your monthly budget. At higher thresholds, enable soft caps that reduce how many tokens get generated (e.g., a shorter max tokens limit or tighter per-tenant rate limits).
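
Several of these controls can live in one gateway policy function. The sketch below is one possible shape under stated assumptions: the model names, the 4096 opt-in ceiling, and the 80% downgrade threshold are hypothetical, and tenant usage would normally come from a shared store such as Redis rather than function arguments.

```python
from dataclasses import dataclass

DEFAULT_MAX_TOKENS = 512     # default per-request cap
OPT_IN_MAX_TOKENS = 4096     # hypothetical ceiling for tenants that opted in
DOWNGRADE_AT = 0.8           # hypothetical: downgrade once 80% of the budget is used


@dataclass
class Decision:
    allowed: bool
    model: str
    max_tokens: int
    reason: str


def decide(requested_max_tokens, prompt_tokens, tokens_used_today,
           daily_budget, at_max_instances, high_limit_opt_in=False):
    # 1. Hard cap on generated tokens per request.
    cap = OPT_IN_MAX_TOKENS if high_limit_opt_in else DEFAULT_MAX_TOKENS
    max_tokens = min(requested_max_tokens, cap)

    # 2. Per-tenant daily budget, checked against the worst-case spend.
    worst_case = prompt_tokens + max_tokens
    if worst_case > daily_budget - tokens_used_today:
        return Decision(False, "", 0, "daily token budget exhausted")

    # 3. Fall back to a cheaper model near the budget or at max capacity.
    if at_max_instances or tokens_used_today >= DOWNGRADE_AT * daily_budget:
        return Decision(True, "small-model", max_tokens, "downgraded to cheaper model")

    return Decision(True, "large-model", max_tokens, "ok")


normal = decide(2000, 300, 10_000, 100_000, at_max_instances=False)
near_budget = decide(256, 300, 85_000, 100_000, at_max_instances=False)
over_budget = decide(512, 300, 99_500, 100_000, at_max_instances=False)
print(normal, near_budget, over_budget, sep="\n")
```

Checking the worst case (prompt plus max tokens) before the request runs is deliberately conservative; you can reconcile against actual generated tokens afterwards and credit back the difference.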

    Monitoring and visibility

    Visibility is non-negotiable. My checklist:

  • Dashboards showing tokens/sec by model, token cost per 1k, spend by model, and spend by tenant.
  • Traces showing tokens consumed per trace and resulting inference time.
  • Billing exports correlated with token metrics for daily reconciliation.
  • Alerting on sudden token spikes, rising cost-per-token, or sustained high GPU utilization.

    Examples and a sample threshold table

    Metric                  | Threshold          | Action
    tokens/sec (cluster)    | 800k               | Scale up + add 30% buffer
    tokens/sec (per tenant) | 2000               | Rate limit / notify tenant
    Hourly spend            | 80% of budget/hour | Disable large-model completions except whitelisted users
    Model GPU utilization   | >90% for 10 min    | Scale up / shift traffic to cheaper model

    These numbers are illustrative—your thresholds depend on your profiling and model mix.

    Operational tips and pitfalls

    Some gotchas I’ve learned the hard way:

  • Don’t rely on cloud billing in real time for autoscaling decisions—billing exports are delayed. Use telemetry and local cost models.
  • Be careful with batching: it improves throughput but increases tail latency for small requests, and it can skew per-request token accounting if you forget to attribute generated tokens back to the individual requests in a batch.
  • Test worst-case prompts during staging: long-context inputs can saturate memory and blow up token counts.
  • Beware of hidden costs like networking (large context windows across GPUs), storage for long-term logs, and model loading times (cold-start costs).

    If you want, I can turn this into a checklist or provide a sample Kubernetes HPA manifest and a small proxy script that emits tokens/sec to Prometheus. Tell me which model stack you’re using (OpenAI, self-hosted Hugging Face, or a managed inference service) and I’ll tailor the implementation details.

