Dev.to
5/12/2026

I shipped local LLM features two months ago. Production never ran them once.
Short summary
The author deployed local Gemma 4 inference for TextStack's vocabulary-learning features, then discovered 60 days later that the Ollama container had never pulled the model: the feature had been failing silently behind a fallback. After model swaps and production testing, they settled on gemma4:e2b, which reached a 100% success rate across 63,000 requests with a p95 latency of 20.5 ms. This eliminated the $2.50/book/user OpenAI cost, making local inference economically viable for self-hosted deployments.
- Silent failure: the local LLM features were deployed but never executed for 60+ days because the model was never pulled
- Model choice: after two model swaps, production data showed gemma4:e2b was the right fit, despite the initial selection of e4b
- Cost transformation: per-distractor cost dropped from ~$0.05 (cloud) to ~$0 (local), enabling self-hosting
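
The silent failure described above is the kind of problem a startup check can surface. The following is a minimal sketch, not the author's code: it assumes Ollama is reachable at its default address and uses the `GET /api/tags` endpoint, which lists locally available models. The function names and the choice to raise an error instead of falling back are illustrative assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: Ollama's default address
REQUIRED_MODEL = "gemma4:e2b"          # model tag named in the summary


def model_is_pulled(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in an Ollama /api/tags response body."""
    names = {m.get("name") for m in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """GET /api/tags, which lists the models available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def assert_model_ready(base_url: str = OLLAMA_URL, model: str = REQUIRED_MODEL) -> None:
    """Hypothetical startup check: raise instead of quietly falling back."""
    if not model_is_pulled(fetch_tags(base_url), model):
        raise RuntimeError(f"Ollama at {base_url} has not pulled {model!r}")
```

Calling `assert_model_ready()` at service startup, or from a health endpoint, would turn a missing model into a visible error rather than a fallback that hides it.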