TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment

Short summary

TMPO (Trajectory Matching Policy Optimization) addresses reward hacking in diffusion model alignment by matching trajectory reward distributions instead of maximizing a scalar reward, which preserves output diversity. It uses Dynamic Stochastic Tree Sampling to reduce training cost on large-scale models. Experiments report a 9.1% improvement in generative diversity over state-of-the-art methods while maintaining competitive downstream performance.
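To make the contrast between maximizing a reward and matching a reward distribution concrete, here is a minimal sketch. It is not the paper's objective: it assumes, purely for illustration, that the target over a group of sampled trajectories is a softmax of their rewards with a hypothetical temperature `beta`, and compares it with the one-hot target that pure reward maximization pushes toward. All function names are illustrative.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matching_target(rewards, beta=0.5):
    """Assumed target: probability mass spread in proportion to exp(reward / beta)."""
    return softmax(rewards, temperature=beta)


def greedy_target(rewards):
    """Target implied by pure reward maximization: all mass on the best trajectory."""
    best = rewards.index(max(rewards))
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]


def entropy(dist):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def kl_divergence(p, q):
    """KL(p || q); assumes q is strictly positive wherever p is."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


# Hypothetical rewards for four sampled trajectories.
rewards = [0.9, 0.8, 0.7, 0.2]
target = matching_target(rewards)

# A policy that spreads mass uniformly over the group (also hypothetical).
policy = [0.25, 0.25, 0.25, 0.25]

print("entropy of matching target:", entropy(target))
print("entropy of greedy target:  ", entropy(greedy_target(rewards)))
print("KL(target || policy):      ", kl_divergence(target, policy))
```

The matching target keeps non-zero mass on several good trajectories, while the greedy target has zero entropy. That collapse is the loss of diversity the summary associates with reward hacking.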

• New method to prevent reward hacking in diffusion models by using trajectory-level distribution matching
• Reduces training costs through Dynamic Stochastic Tree Sampling with shared denoising prefixes
• Achieves 9.1% improvement in output diversity over state-of-the-art methods
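The saving from shared denoising prefixes can be illustrated with simple step counting. The sketch below assumes a single branch point after which all trajectories diverge. This is a simplification: the summary does not describe how Dynamic Stochastic Tree Sampling actually chooses branch points, and the function names and numbers are hypothetical.

```python
def independent_cost(num_trajectories, num_steps):
    """Denoising steps when every trajectory is sampled from scratch."""
    return num_trajectories * num_steps


def shared_prefix_cost(num_trajectories, num_steps, branch_step):
    """Denoising steps when the first `branch_step` steps are computed once
    and shared, and each trajectory then runs its own remaining steps."""
    if not 0 <= branch_step <= num_steps:
        raise ValueError("branch_step must lie between 0 and num_steps")
    return branch_step + num_trajectories * (num_steps - branch_step)


# Hypothetical setting: 8 trajectories, 20 denoising steps, branching at step 10.
full = independent_cost(8, 20)          # 8 * 20 = 160
shared = shared_prefix_cost(8, 20, 10)  # 10 + 8 * 10 = 90
print(full, shared, full - shared)      # 160 90 70
```

Under these assumptions, the later the branch point, the more computation is shared, at the price of trajectories that differ only in their final steps.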

Generated with AI, which can make mistakes.

Read full article at arXiv cs.LG
