arXiv cs.LG
5/13/2026
$\xi$-DPO: Direct Preference Optimization via Ratio Reward Margin

Short summary

ξ-DPO reformulates the SimPO preference-optimization objective around a ratio reward margin, ξ, so that SimPO's β and γ hyperparameters no longer need to be tuned. The margin can be read directly from the distribution of initial reward gaps rather than found by trial and error. This simplifies reference-free preference optimization while keeping it theoretically grounded.

  • Addresses SimPO's hyperparameter-tuning burden by reformulating the objective mathematically
  • The ratio reward margin (ξ) can be determined from the data, which avoids repeated tuning cycles
  • Directly applicable to LLM preference optimization and reference-free RLHF workflows
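
For context, the SimPO baseline that the summary refers to scores each response by its length-normalized log-probability, scales the difference between the chosen and rejected scores by β, and subtracts a target margin γ inside a logistic loss. The sketch below shows that baseline loss, plus a helper that collects the initial reward gaps the summary mentions. It does not implement ξ-DPO: this summary does not give the formula for ξ, so none is assumed here. The function names and the toy log-probabilities are illustrative assumptions.

```python
import math


def avg_logprob(token_logprobs):
    """Length-normalized log-probability of one response."""
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta, gamma):
    """SimPO loss for one pair: -log sigmoid(beta * (r_w - r_l) - gamma)."""
    gap = avg_logprob(chosen_logprobs) - avg_logprob(rejected_logprobs)
    z = -(beta * gap - gamma)
    # -log(sigmoid(-z)) equals softplus(z); this form avoids overflow.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def initial_reward_gaps(pairs):
    """Unscaled reward gap for each (chosen, rejected) pair.

    The per-token log-probabilities are assumed to come from the
    policy before any preference training.
    """
    return [avg_logprob(c) - avg_logprob(r) for c, r in pairs]


# Toy per-token log-probabilities (made up for illustration).
pairs = [
    ([-0.5, -0.5], [-1.0, -2.0]),
    ([-1.0, -1.0, -1.0], [-1.0]),
]
gaps = initial_reward_gaps(pairs)
losses = [simpo_loss(c, r, beta=2.0, gamma=0.5) for c, r in pairs]
```

Both β and γ enter the loss as free parameters here, which is the tuning burden the summary says ξ-DPO removes. The `gaps` list is the kind of initial reward-gap data from which, per the summary, the margin ξ can be read.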

Generated with AI, which can make mistakes.
