arXiv cs.CL
5/12/2026

AIPO: Learning to Reason from Active Interaction
Short summary
AIPO is a reinforcement learning framework that enhances LLM reasoning by enabling models to consult specialized collaborative agents when encountering reasoning bottlenecks. Unlike methods requiring complete trajectory guidance, AIPO uses fine-grained agent feedback to efficiently expand capability boundaries. Experiments across multiple reasoning benchmarks show consistent improvements with robust generalization.
- Multi-agent interaction framework (Verify, Knowledge, and Reasoning agents) guides LLM exploration
- Improves the reasoning capability boundary without sample-inefficient trajectory-level supervision
- Validated across AIME, MATH500, GPQA-Diamond, and LiveCodeBench with consistent gains
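
As a rough illustration of the interaction pattern summarized above, the following toy sketch routes a consultation request to one of three agents and appends the feedback to the running transcript. Everything here is an assumption made for illustration: the `ASK <agent>: <query>` request format, the `rollout` function, and the stub agents are hypothetical and are not taken from the paper, which may define the interaction and the training objective quite differently. No reinforcement learning is performed in this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical agent stubs. The three roles mirror the agent names in the
# summary, but their behavior is invented for this sketch.
Agent = Callable[[str], str]


def verify_agent(query: str) -> str:
    return f"[verify] checked: {query}"


def knowledge_agent(query: str) -> str:
    return f"[knowledge] fact for: {query}"


def reasoning_agent(query: str) -> str:
    return f"[reasoning] hint for: {query}"


AGENTS: Dict[str, Agent] = {
    "verify": verify_agent,
    "knowledge": knowledge_agent,
    "reasoning": reasoning_agent,
}


def rollout(steps: List[str]) -> List[str]:
    """Replay scripted model steps, answering consultation requests.

    A step of the assumed form 'ASK <agent>: <query>' is treated as the
    model asking for help at a bottleneck. The named agent's feedback is
    appended right after the request, so later steps could condition on it.
    Requests naming an unknown agent are recorded but receive no feedback.
    """
    transcript: List[str] = []
    for step in steps:
        transcript.append(step)
        if step.startswith("ASK "):
            # Split on the first colon only, so queries may contain colons.
            name, _, query = step[4:].partition(":")
            agent = AGENTS.get(name.strip())
            if agent is not None:
                transcript.append(agent(query.strip()))
    return transcript
```

In this sketch the feedback is local to a single step rather than a full solution trajectory, which is the contrast with trajectory-level guidance that the summary draws.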
Generated with AI, which can make mistakes.