Dev.to
5/12/2026
I shipped local LLM features two months ago. Production never ran them once.

Short summary

The author deployed local Gemma 4 inference for TextStack's vocabulary-learning features, then discovered after 60 days that the Ollama container had never pulled the model; the failure stayed hidden behind a fallback. After two model swaps and production testing, they settled on gemma4:e2b, which achieved a 100% success rate across 63,000 requests with a p95 latency of 20.5 ms. Running inference locally eliminated the $2.50/book/user OpenAI cost, making local inference economically viable for self-hosted deployments.

  • Silent failure: local LLM features deployed but never executed for 60+ days due to missing model pull
  • Two model swaps: production data showed that e2b was the right choice, despite the initial selection of e4b
  • Cost transformation: reduced per-distractor cost from ~$0.05 (cloud) to ~$0 (local), enabling self-hosting
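
One way to surface this kind of silent failure is a startup check that fails loudly when the model is missing, instead of letting a fallback absorb the error. The sketch below is not the author's code. It assumes Ollama is reachable at its default address (`http://localhost:11434`) and relies on Ollama's `GET /api/tags` endpoint, which lists locally pulled models; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
REQUIRED_MODEL = "gemma4:e2b"          # model tag named in the summary


def model_is_available(tags_response: dict, required: str) -> bool:
    """Return True if `required` appears in a parsed /api/tags response."""
    names = {m.get("name") for m in tags_response.get("models", [])}
    return required in names


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally pulled models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def assert_model_ready(base_url: str = OLLAMA_URL,
                       required: str = REQUIRED_MODEL) -> None:
    """Raise at startup if the model was never pulled (hypothetical helper)."""
    if not model_is_available(fetch_tags(base_url), required):
        raise RuntimeError(
            f"Ollama at {base_url} has no model {required!r}; "
            "pull it before serving traffic."
        )
```

Calling `assert_model_ready()` during service startup, or from a container health check, turns a missing model into a visible deployment error. Counting how often the fallback path runs would be a complementary signal.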

Generated with AI, which can make mistakes.
