arXiv cs.LG
5/13/2026

$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin
Short summary
ξ-DPO reformulates preference optimization around a ratio reward margin, ξ, removing the need to tune SimPO's β and γ hyperparameters. The margin can be interpreted directly from the distribution of initial reward gaps, so it is set from data rather than by trial and error. This simplifies reference-free preference optimization while keeping it theoretically grounded.
- Addresses SimPO's hyperparameter tuning burden through a mathematical reformulation of the objective
- The ratio reward margin (ξ) can be determined from data, eliminating repeated tuning cycles
- Directly applicable to LLM preference optimization and reference-free RLHF workflows
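
For context on which hyperparameters are being removed, here is a minimal sketch of the per-pair SimPO loss that ξ-DPO starts from, as commonly stated: a length-normalized log-likelihood reward scaled by β, with a target margin γ. This shows the baseline only, not the ξ-DPO objective, which this summary does not spell out. The function name `simpo_loss`, its argument names, and the default values of `beta` and `gamma` are illustrative assumptions.

```python
import math


def simpo_loss(logp_chosen, logp_rejected, len_chosen, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO-style loss for a single preference pair (baseline sketch).

    logp_chosen / logp_rejected: summed token log-probabilities of the
    preferred and dispreferred responses under the policy.
    len_chosen / len_rejected: response lengths in tokens.
    beta, gamma: the two hyperparameters SimPO requires tuning
    (defaults here are arbitrary placeholders, not recommendations).
    """
    # Reference-free implicit rewards: length-normalized log-likelihoods.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected

    # Reward gap minus the target margin gamma.
    z = reward_chosen - reward_rejected - gamma

    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical pair: both responses are 10 tokens long.
    print(simpo_loss(-10.0, -20.0, 10, 10, beta=2.0, gamma=1.0))
```

In this form β sets the scale of the reward gap and γ sets an absolute margin on it, so the two interact and are typically searched jointly; that search is the tuning cost the summary says ξ-DPO avoids.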
Generated with AI, which can make mistakes.