arXiv cs.LG
5/13/2026

TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment
Short summary
TMPO (Trajectory Matching Policy Optimization) addresses reward hacking in diffusion model alignment by matching trajectory reward distributions rather than maximizing scalar rewards, which preserves output diversity. It also incorporates Dynamic Stochastic Tree Sampling, which shares denoising prefixes across trajectories to reduce training cost on large-scale models. Experiments show a 9.1% improvement in generative diversity while maintaining competitive downstream performance.
- New method that prevents reward hacking in diffusion models through trajectory-level distribution matching
- Reduces training costs through Dynamic Stochastic Tree Sampling with shared denoising prefixes
- Achieves a 9.1% improvement in output diversity over state-of-the-art methods
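
To make the cost argument behind shared denoising prefixes concrete, here is a minimal counting sketch. It is not the paper's algorithm: the function names, the fixed branching schedule, and the step counts are all hypothetical, chosen only to show why a tree of trajectories needs fewer denoising steps than the same number of independently sampled trajectories. The summary does not say how TMPO actually chooses its branch points.

```python
from typing import Dict, Tuple


def independent_cost(num_leaves: int, num_steps: int) -> int:
    """Denoising steps needed when every trajectory is sampled from scratch."""
    return num_leaves * num_steps


def tree_cost(num_steps: int, branch_steps: Dict[int, int]) -> Tuple[int, int]:
    """Return (total denoising steps, number of leaves) for a prefix-sharing tree.

    branch_steps is a hypothetical, fixed schedule mapping a step index to a
    branching factor: just before that step runs, every active trajectory
    splits into that many children, which all share the prefix computed so far.
    """
    active = 1  # trajectories currently being denoised
    total = 0   # denoising steps executed so far
    for step in range(num_steps):
        active *= branch_steps.get(step, 1)
        total += active
    return total, active


# Assumed example: 10 denoising steps, binary branching at steps 2, 5 and 8.
steps, leaves = tree_cost(10, {2: 2, 5: 2, 8: 2})
baseline = independent_cost(leaves, 10)
print(f"tree: {steps} steps for {leaves} trajectories, independent: {baseline} steps")
```

Under these assumed numbers the tree produces 8 trajectories in 36 denoising steps, against 80 for independent sampling. The saving grows when branching happens later, since longer prefixes are shared, at the price of trajectories that differ only in their final steps.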
Generated with AI, which can make mistakes.