Dev.to
6/18/2026
Model Showdown Round 7: Five Local Models vs. One Cloud Model on a Real Coding Task

Short summary

An experimental benchmark compares Claude Sonnet 4 against five local open-source models (two Qwen variants plus Gemma, Hermes, and Devstral) on a real agentic coding task: building a blog admin tag manager with API routes and UI. Results: Sonnet 4 finished cleanly in about 10 minutes with zero intervention; only one local model (Qwen3-Coder 30B) shipped code, and it was incomplete; the other four failed. Verdict: local models on consumer GPUs aren't ready for complex coding tasks yet.

  • Sonnet 4 completed the coding task in ~10 minutes with zero human intervention, and its build passed on the first try
  • Only one local model (Qwen3-Coder 30B) delivered working code, and that code was incomplete and came after a significant struggle with file management
  • The remaining four local models (Qwen 35B, Gemma 12B, Hermes 14B, Devstral 24B) failed the task entirely on consumer hardware

Generated with AI, which can make mistakes.
