arXiv cs.CL
6/16/2026

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents
Short summary
The study compares augmented web agents (with memory/workflow/skill modules) against budget-matched vanilla baselines. Across three LLMs and multiple domains, vanilla actors often match or exceed augmented performance while using fewer total tokens. The findings suggest that apparent gains from augmentation frequently vanish once token budgets are matched.
- Augmented agents do not consistently outperform vanilla baselines when token budgets are matched
- The study spans Gemini Flash, GPT-5.4-mini, and Qwen 3.6-27B across WebArena and WorkArena tasks
- Run-to-run variance is material and should be reported as a core evaluation criterion
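
To make the last two points concrete, here is a minimal sketch of reporting success rate with run-to-run spread under a shared token cap. It is not the paper's evaluation code: the `Run` record, the `summarize` helper, the budget value, and every number below are hypothetical, and budget matching is approximated here by simply discarding runs that exceed the cap, which is an assumption about how one might do it.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Run:
    """One evaluation run (hypothetical record, not from the paper)."""
    success_rate: float  # fraction of tasks solved in this run
    total_tokens: int    # all tokens spent, including module overhead


def summarize(runs, token_budget):
    """Mean and sample std of success rate over runs within the budget."""
    kept = [r.success_rate for r in runs if r.total_tokens <= token_budget]
    if len(kept) < 2:
        raise ValueError("need at least two in-budget runs to estimate spread")
    return mean(kept), stdev(kept)


# Made-up numbers for illustration only.
vanilla = [Run(0.40, 90_000), Run(0.44, 95_000), Run(0.42, 98_000)]
augmented = [Run(0.46, 140_000), Run(0.41, 99_000), Run(0.43, 97_000)]

budget = 100_000  # assumed shared token cap
for name, runs in (("vanilla", vanilla), ("augmented", augmented)):
    m, s = summarize(runs, budget)
    print(f"{name}: {m:.3f} +/- {s:.3f} (budget {budget} tokens)")
```

In this toy data the best augmented run is also the one that exceeds the cap, so dropping it removes the apparent advantage, which is the kind of effect the summary describes.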
Generated with AI, which can make mistakes.