Dev.to
5/8/2026

A protocol for auditing AI agent harnesses
Short summary
Three papers from Tsinghua, Fudan, and Stanford (March 2026) show why common agent harness optimizations fail: verifiers and multi-candidate sampling reduce accuracy by 5-8 percentage points because they recycle the doer's failure modes instead of introducing genuinely new signals. The protocol detects this with a three-layer audit (L0 trace utility, L1 module ablation, L2 manifest verification) that predicts which changes help and which hurt without running full benchmarks. It is immediately actionable for anyone building coding agents or other AI systems.
- Verifiers and multi-candidate sampling fail because they share the baseline agent's blind spots; same-model components regress OSWorld accuracy 5-8pp
- Rule: harness modules win only if they introduce new signals; recycled signals always lose
- Three-layer audit (L0→L1→L2) predicts fix precision (33.7% vs random baseline) and catches failures before benchmark runs
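
To make the L1 module-ablation idea concrete, here is a minimal Python sketch. It is not the papers' implementation: the names `accuracy`, `ablate`, `Runner`, and the toy task set are all hypothetical, and the numbers are invented for illustration only. The sketch assumes each harness variant can be treated as a function that returns whether a task was solved, then reports each module's accuracy change against the baseline in percentage points.

```python
from typing import Callable, Dict, List

Task = str
Runner = Callable[[Task], bool]  # hypothetical: returns True if the task is solved


def accuracy(run: Runner, tasks: List[Task]) -> float:
    """Percentage of tasks solved by a given harness configuration."""
    if not tasks:
        raise ValueError("need at least one task")
    return 100.0 * sum(run(t) for t in tasks) / len(tasks)


def ablate(baseline: Runner, variants: Dict[str, Runner],
           tasks: List[Task]) -> Dict[str, float]:
    """Accuracy change of each variant versus baseline, in percentage points."""
    base = accuracy(baseline, tasks)
    return {name: accuracy(run, tasks) - base for name, run in variants.items()}


# Toy data (invented, not taken from the papers).
tasks = [f"task-{i}" for i in range(20)]
solved_by_baseline = set(tasks[:10])


def baseline(t: Task) -> bool:
    return t in solved_by_baseline


def with_same_model_verifier(t: Task) -> bool:
    # Shares the baseline's blind spots and wrongly rejects one correct answer.
    return t in solved_by_baseline and t != "task-0"


def with_new_signal(t: Task) -> bool:
    # Adds information the baseline lacked, so two extra tasks get solved.
    return t in solved_by_baseline or t in {"task-10", "task-11"}


deltas = ablate(
    baseline,
    {"same_model_verifier": with_same_model_verifier,
     "new_signal_module": with_new_signal},
    tasks,
)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}pp")
```

In this toy setup the recycled-signal module comes out negative and the new-signal module positive, which mirrors the rule in the summary. A real audit would replace the stub runners with actual harness runs over recorded traces.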



