Dev.to
6/18/2026

Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task
Short summary
An experimental benchmark compares Claude Sonnet 4 against five local open-source models (two Qwen variants, Gemma, Hermes, and Devstral) on a real agentic coding task: building a blog admin tag manager with API routes and a UI. Results: Sonnet 4 succeeded cleanly in about 10 minutes with zero intervention; only one local model (Qwen3-Coder 30B) shipped code, and that code was incomplete; the other four failed. Verdict: local models on consumer GPUs aren't ready for complex coding tasks yet.
- Sonnet 4 completed the coding task in ~10 minutes with zero human intervention, and its build passed on the first try
- Only one local model (Qwen3-Coder 30B) delivered working code, and that code was incomplete and came only after a significant struggle with file management
- The remaining four local models (Qwen 35B, Gemma 12B, Hermes 14B, Devstral 24B) failed the task entirely on consumer hardware