I shipped local LLM features two months ago. Production never ran them once.

Short summary

The author deployed local Gemma 4 inference for TextStack's vocabulary-learning features, then discovered after 60 days that the Ollama container had never pulled the model; the failure stayed hidden behind a fallback. After two model swaps and production testing, they determined that gemma4:e2b was the best fit: it handled 63,000 requests with a 100% success rate and a p95 latency of 20.5 ms. This eliminated the $2.50 per book per user OpenAI cost, making local inference economically viable for self-hosted deployments.

• Silent failure: the local LLM features were deployed but never executed for 60+ days because the model was never pulled
• Model selection: two model swaps and analysis of production data showed that e2b was the right choice, despite the initial selection of e4b
• Cost transformation: per-distractor cost fell from ~$0.05 (cloud) to ~$0 (local), enabling self-hosting
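
The silent failure described above is the kind of problem a startup check can surface. The following is a minimal sketch, not the author's actual fix: it assumes Ollama is reachable on its default port (11434) and uses Ollama's `/api/tags` endpoint, which lists locally available models. The function names and the decision to raise an error instead of falling back are illustrative assumptions.

```python
import json
import urllib.request

# Assumption: Ollama's default address; adjust for a container network.
OLLAMA_URL = "http://localhost:11434"
# Model tag named in the summary above.
REQUIRED_MODEL = "gemma4:e2b"


def model_is_present(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = {entry.get("name", "") for entry in tags_payload.get("models", [])}
    return model in names


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Fetch the list of locally pulled models from Ollama."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def assert_model_ready(model: str = REQUIRED_MODEL, base_url: str = OLLAMA_URL) -> None:
    """Fail loudly at startup instead of silently using a fallback."""
    if not model_is_present(fetch_tags(base_url), model):
        raise RuntimeError(f"Ollama has not pulled {model!r}; run `ollama pull {model}`")


if __name__ == "__main__":
    assert_model_ready()
    print(f"{REQUIRED_MODEL} is available")
```

Running a check like this at deploy time, or exposing it through a health endpoint, turns a missing model into a visible error instead of a quiet fallback to the cloud provider.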

Generated with AI, which can make mistakes.

#ai-tools #open-source #ai-agents #certification-education #product-launch #research-breakthrough

Read full article at Dev.to
