Information Theoretic Limits of Alignment in Large Language Models

1. The Alignment Problem

Reinforcement Learning from Human Feedback (RLHF): Constrained Policy Optimization
The alignment problem aims to align large language models (LLMs) with desired behaviors using constrained policy optimization. The objective is to find a policy that maximizes a reward $r$ while staying close to a reference policy $\pi_{\text{ref}}$, enforced through a KL-divergence budget $\epsilon > 0$. Formally, this is expressed as:

$$\pi_{y|x}^{*} = \underset{\pi_{y|x}}{\arg\max} \ \mathbb{E}_{x \sim \rho_{\mathcal{X}}}\, \mathbb{E}_{y \sim \pi(y|x)} [r(x, y)] \quad \text{s.t.} \quad \mathbb{E}_{x \sim \rho_{\mathcal{X}}}\, \mathrm{KL}\big(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\big) \leq \epsilon$$

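As an aside, this constrained problem is often solved in its KL-regularized (Lagrangian) form with a penalty coefficient $\beta > 0$, introduced here only for illustration; the regularized objective admits the familiar closed-form optimum, a Gibbs reweighting of the reference policy:

$$\pi_{\beta}^{*}(y|x) = \frac{1}{Z_{\beta}(x)}\, \pi_{\text{ref}}(y|x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right), \qquad Z_{\beta}(x) = \sum_{y} \pi_{\text{ref}}(y|x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

Smaller $\beta$ corresponds to a looser KL budget and more aggressive reward optimization.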
Best of n Policy
The “Best of n” policy samples multiple outputs from the reference policy and selects the one with the highest reward. Given $n$ independent samples $Y_1, Y_2, \dots, Y_n$ from $\pi_{\text{ref}}$, the “Best of n” policy is defined as:

$$ Y^{(n)}|X = \underset{i=1, \dots, n}{\arg \max} \ r(X, Y_i) $$

This policy performs alignment at inference time: it optimizes over multiple candidate generations rather than updating the model, which makes it a simple and widely used baseline in RLHF pipelines.
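A minimal Python sketch of best-of-n sampling, assuming hypothetical `sample_from_reference` and `reward` callables (placeholders, not tied to any particular library):

```python
from typing import Callable, List

def best_of_n(
    prompt: str,
    sample_from_reference: Callable[[str], str],  # draws y ~ pi_ref(.|x); hypothetical placeholder
    reward: Callable[[str, str], float],          # proxy reward r(x, y); hypothetical placeholder
    n: int = 16,
) -> str:
    """Draw n i.i.d. completions from the reference policy and return the highest-reward one."""
    candidates: List[str] = [sample_from_reference(prompt) for _ in range(n)]
    scores = [reward(prompt, y) for y in candidates]
    # argmax over the n draws, matching Y^(n) | X = argmax_i r(X, Y_i)
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```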


2. Best of n Policy KL Guarantees

KL divergence bounds quantify how far the “Best of n” policy can drift from the reference policy. Under certain conditions (e.g., a one-to-one reward structure), the KL divergence of the “Best of n” policy satisfies:

$$ \mathrm{KL}(\pi_{r, \text{ref}}^{(n)} \,\|\, \pi_{\text{ref}}) \leq \log(n) - \frac{n - 1}{n} $$

This guarantee shows that the divergence from the reference policy grows only logarithmically in $n$: larger $n$ buys more reward optimization at the price of a slowly growing, explicitly bounded KL divergence.
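As a sketch of where the bound comes from, assume the reward is continuous and one-to-one, so that $U = F(r(X, Y))$ is uniform on $[0, 1]$ under $\pi_{\text{ref}}$, where $F$ is the CDF of the reward under the reference policy. The best-of-$n$ density is then $\pi_{r, \text{ref}}^{(n)}(y|x) = n\, \pi_{\text{ref}}(y|x)\, F(r(x, y))^{n-1}$, and

$$ \mathrm{KL}(\pi_{r, \text{ref}}^{(n)} \,\|\, \pi_{\text{ref}}) = \mathbb{E}_{\pi_{r, \text{ref}}^{(n)}}\big[\log n + (n - 1) \log F(r(X, Y))\big] = \log(n) - \frac{n - 1}{n} $$

since under the best-of-$n$ policy $F(r(X, Y))$ is distributed as the maximum of $n$ independent uniforms, for which $\mathbb{E}[\log U_{(n)}] = -1/n$. In this idealized case the expression holds with equality; in general it is an upper bound.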


3. Reward Guarantees via Transportation Inequalities

Transportation inequalities yield bounds on the expected reward of aligned policies, particularly when the reward distribution is sub-Gaussian. If the reward distribution under the reference policy, $r_{\sharp} \pi_{\text{ref}}$ (the pushforward of $\pi_{\text{ref}}$ by $r$), is sub-Gaussian with variance proxy $\sigma^2_{\text{ref}}$, then for any policy $\pi$ absolutely continuous with respect to $\pi_{\text{ref}}$, the following inequality holds:

$$ \big| \mathbb{E}_{\pi}[r] - \mathbb{E}_{\pi_{\text{ref}}}[r] \big| \leq \sqrt{2 \sigma^2_{\text{ref}}\, \mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})} $$

This inequality ties the reward improvement an aligned policy can achieve over the reference policy directly to its KL divergence from the reference.
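For instance, combining this transportation inequality with the best-of-$n$ KL bound from Section 2 (under the same sub-Gaussian condition) gives an explicit reward guarantee for the best-of-$n$ policy:

$$ \mathbb{E}_{\pi_{r, \text{ref}}^{(n)}}[r] - \mathbb{E}_{\pi_{\text{ref}}}[r] \leq \sqrt{2 \sigma^2_{\text{ref}} \left(\log(n) - \frac{n - 1}{n}\right)} $$

so the achievable reward improvement grows at most on the order of $\sigma_{\text{ref}} \sqrt{2 \log n}$.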


4. Goodhart’s Law: Proxy vs. Golden Rewards

Optimizing a proxy reward $r$ as a stand-in for a “golden” reward $r^*$, which captures ideal human preferences, can lead to unintended consequences. Goodhart’s Law implies that maximizing the proxy may cause deviations from the golden reward as $\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$ grows. Thus, balancing proxy and golden rewards is essential for maintaining alignment.

The additional term in the alignment error when using a proxy is captured by the total variation distance $\mathrm{TV}(\pi_{r, \text{ref}}^{(n)}, \pi_{\text{ref}})$, which scales with $n$ and suggests careful calibration of the alignment model.
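A toy numerical sketch of this effect (purely illustrative: synthetic scalar rewards where the proxy is a noisy version of the golden reward, not any particular reward model):

```python
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_gains(n: int, trials: int = 20_000, noise: float = 1.0):
    """Select with the proxy reward, then compare proxy vs. golden gains over a reference draw."""
    golden = rng.normal(size=(trials, n))                    # golden rewards r*(X, Y_i) for n i.i.d. samples
    proxy = golden + noise * rng.normal(size=(trials, n))    # proxy rewards r = r* + noise
    pick = proxy.argmax(axis=1)                              # best-of-n selects using the proxy
    rows = np.arange(trials)
    reference = 0  # column 0 plays the role of a single draw from pi_ref
    proxy_gain = proxy[rows, pick].mean() - proxy[:, reference].mean()
    golden_gain = golden[rows, pick].mean() - golden[:, reference].mean()
    return proxy_gain, golden_gain

for n in (2, 8, 32, 128):
    proxy_gain, golden_gain = best_of_n_gains(n)
    print(f"n={n:4d}  proxy gain={proxy_gain:.2f}  golden gain={golden_gain:.2f}")
```

The gap between the proxy and golden gains widens as $n$ grows, illustrating how optimizing the proxy overstates the true improvement.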


References

  • Boucheron, S., Lugosi, G., & Massart, P. Concentration Inequalities - A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • Gao, L., Schulman, J., & Hilton, J. Scaling Laws for Reward Model Overoptimization. ICML, 2023.
  • Beirami, A., Agarwal, A., Berant, J., et al. Theoretical Guarantees on the Best-of-n Alignment Policy, 2024.
Youssef Mroueh
Research Staff Member