
Generalized Preference Optimization: A Unified Approach to Offline Alignment

ICML 2024

Google DeepMind | DeepMind | Google DeepMind | Inria MVA

Cited 78 | Views 75
Abstract
Offline preference optimization allows fine-tuning large models directly from offline data, and has proved effective in recent alignment practices. We propose generalized preference optimization (GPO), a family of offline losses parameterized by a general class of convex functions. GPO enables a unified view over preference optimization, encompassing existing algorithms such as DPO, IPO and SLiC as special cases, while naturally introducing new variants. The GPO framework also sheds light on how offline algorithms enforce regularization, through the design of the convex function that defines the loss. Our analysis and experiments reveal the connections and subtle differences between the offline regularization and the KL divergence regularization intended by the canonical RLHF formulation. In all, our results present new algorithmic toolkits and empirical insights to alignment practitioners.
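For readers who prefer code to prose, below is a minimal sketch of the loss family the abstract describes: a pairwise log-likelihood-ratio term passed through a convex function f, with different choices of f recovering DPO-, SLiC-, and IPO-style losses. The tensor names, the beta default, and the exact constants are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

# Convex functions f whose choice recovers known offline losses
# (correspondence as summarized in the paper; exact constants may differ).
CONVEX_FNS = {
    "dpo":  lambda u: F.softplus(-u),                  # logistic: log(1 + exp(-u))
    "slic": lambda u: torch.clamp(1.0 - u, min=0.0),   # hinge: max(0, 1 - u)
    "ipo":  lambda u: (u - 1.0) ** 2,                  # squared loss
}

def gpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta=0.1, variant="dpo"):
    """GPO-style loss on a batch of preference pairs.

    rho  = (log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))
    loss = mean over the batch of f(beta * rho), with f convex.
    """
    rho = (policy_logp_chosen - ref_logp_chosen) - (policy_logp_rejected - ref_logp_rejected)
    return CONVEX_FNS[variant](beta * rho).mean()

# Example: the three variants on the same toy batch of summed sequence log-probs.
if __name__ == "__main__":
    lp_w, lp_l = torch.tensor([-12.3, -9.8]), torch.tensor([-14.1, -10.2])
    ref_w, ref_l = torch.tensor([-12.0, -10.0]), torch.tensor([-13.5, -10.0])
    for v in ("dpo", "slic", "ipo"):
        print(v, gpo_loss(lp_w, lp_l, ref_w, ref_l, variant=v).item())

The only moving part across variants is the convex function applied to beta * rho, which is the unification the abstract refers to.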
Key words
Constraint Optimization, Algorithm Selection, Distributed Algorithms, Group Decision Making


[Key Points]: This paper proposes generalized preference optimization (GPO), a family of offline loss functions parameterized by convex functions that unifies preference optimization approaches, covers existing algorithms as special cases, and naturally introduces new variants.

[Method]: The paper adopts an offline loss framework defined by a convex function, namely GPO, to unify preference optimization.

[Experiments]: Analysis and experiments on GPO reveal how offline algorithms enforce regularization through the convex function that defines the loss, compare this with the KL-divergence regularization of the canonical RLHF formulation, and provide alignment practitioners with new algorithmic tools and empirical insights.