arXiv cs.LG
5/13/2026
TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Short summary

TMPO (Trajectory Matching Policy Optimization) addresses reward hacking in diffusion model alignment by matching trajectory reward distributions instead of maximizing scalar rewards, which preserves output diversity. It incorporates Dynamic Stochastic Tree Sampling, which shares denoising prefixes across trajectories, to reduce training costs on large-scale models. Experiments demonstrate a 9.1% improvement in generative diversity over state-of-the-art methods while maintaining competitive downstream performance.
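To make the contrast between scalar reward maximization and distribution matching concrete, here is a toy sketch. It is not the paper's objective: the temperature-scaled softmax target, the KL divergence, and the reward values are all assumptions chosen for illustration. It only shows that a policy concentrated on the single best trajectory has zero entropy, while a reward-shaped target distribution keeps probability mass on several trajectories.

```python
import math

def softmax(xs, temperature=1.0):
    """Turn rewards into a probability distribution (assumed target form)."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)

def kl(p, q):
    """KL(p || q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical rewards for four sampled trajectories.
rewards = [1.0, 0.9, 0.8, 0.2]

# Assumed reward-induced target distribution over the trajectories.
target = softmax(rewards, temperature=0.5)

# A policy that only maximizes the scalar reward collapses onto one trajectory.
greedy = [1.0, 0.0, 0.0, 0.0]

print("entropy of greedy policy:", entropy(greedy))
print("entropy of matched target:", entropy(target))
print("KL(greedy || target):", kl(greedy, target))
```

In this toy setting, a policy that matches the target exactly has zero divergence from it and retains nonzero entropy, which is the intuition behind preserving diversity.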

  • New method to prevent reward hacking in diffusion models by using trajectory-level distribution matching
  • Reduces training costs through Dynamic Stochastic Tree Sampling with shared denoising prefixes
  • Achieves 9.1% improvement in output diversity over state-of-the-art methods
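The cost saving from shared denoising prefixes can be illustrated by counting denoiser calls. The sketch below is a simplification under stated assumptions: branch points are fixed in advance and every branch splits by the same factor, whereas the summary describes the sampling as dynamic and stochastic. The function names and the 10-step schedule are hypothetical.

```python
def independent_cost(num_steps, num_trajectories):
    """Denoiser calls when every trajectory is sampled from scratch."""
    return num_steps * num_trajectories

def tree_cost(num_steps, branch_steps, branch_factor):
    """Denoiser calls when trajectories share prefixes in a tree.

    branch_steps: set of step indices at which each active path splits
    (an assumed, fixed schedule). Returns (calls, number_of_leaves).
    """
    active = 1
    calls = 0
    for step in range(num_steps):
        if step in branch_steps:
            active *= branch_factor  # each active path forks here
        calls += active              # one denoiser call per active path
    return calls, active

calls, leaves = tree_cost(num_steps=10, branch_steps={3, 6}, branch_factor=2)
print("tree calls:", calls, "for", leaves, "trajectories")
print("independent calls:", independent_cost(10, leaves))
```

With these numbers the tree yields 4 trajectories in 25 denoiser calls instead of 40, because the first three steps are computed once and the next three are computed twice rather than four times.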

Generated with AI, which can make mistakes.
