Dev.to
5/12/2026
One AI Model Scored 99. I Still Voted for the One That Scored 95.

Short summary

Claude 3 Haiku scored 99 against Llama-4-Scout's 95 on VibeCode Arena's automated evaluation metrics, but the author preferred Llama's output because it felt better to use and matched the original brief more closely. The core insight: technical quality scores and actual user preference do not correlate perfectly, which exposes a gap in how developers evaluate AI-generated software. As AI code generation becomes a commodity, the bottleneck shifts from generation to evaluation, and developers must judge output across several overlapping dimensions: code cleanliness, hidden implementation quality, accessibility, and raw user experience.

  • Claude scored higher (99 vs 95) but Llama felt better in actual use
  • Technical metrics and user preference don't always align; both matter
  • Evaluation, not generation, is now the bottleneck for AI-generated code
