Dev.to
5/10/2026

Model Showdown Round 3: Ditching Ollama in Favor of llama.cpp
Short summary
This benchmark compares five open-source models (Qwen 3.5, Gemma 4, Devstral, Codestral, DeepSeek R1) running directly on llama.cpp instead of Ollama, which removes the wrapper's abstraction overhead. Qwen 3.5 performed best across coding and agentic tasks at 206 tokens/second. The article provides complete hardware specs, quantization choices, and deployment configs. The switch freed 44 GB of disk space and enabled fine-grained control over context windows, batch sizes, and reasoning budgets for Coder Agents.
- Benchmarked 5 local LLMs using llama.cpp directly instead of the Ollama wrapper, for hardware-level control
- Qwen 3.5 (MoE) won across all categories at 206 tokens/sec, beating Gemma 4 and the other models
- Provided complete systemd deployment configs with inference tuning flags and model-specific chat templates
Generated with AI, which can make mistakes.