GRPO Amplifies the Success Rate of a Policy via an Implicit Fixed Point Iteration
In the previous post, we established that GRPO with verifiable rewards can be seen as a weighted contrastive policy optimization, where positive and negative samples are synthetic data sampled from the old policy and labeled via the verifiable reward. In this post, we review our recent preprint that builds on this observation and analyzes the dynamics of GRPO, in particular the probability of success of the verifiable reward under GRPO's optimized policy.
1 GRPO’s Success Rate Recursion
GRPO without Clipping: Equivalent Contrastive Loss Formulation
For $\varepsilon >0$, and for $p\in[0,1]$ define the following weights:
$$ \omega_{+,\varepsilon}(p) = \frac{1-p}{\sqrt{p(1-p) +\varepsilon}} $$ $$ \omega_{-,\varepsilon}(p) = \frac{p}{\sqrt{p(1-p) +\varepsilon}} $$
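As a quick illustration, here is a minimal Python sketch of these two weights (the names `omega_plus` and `omega_minus` and the default $\varepsilon$ are ours, not from the paper):

```python
import math

def omega_plus(p: float, eps: float = 1e-4) -> float:
    # Weight on positive (successful) samples: (1 - p) / sqrt(p(1 - p) + eps).
    return (1.0 - p) / math.sqrt(p * (1.0 - p) + eps)

def omega_minus(p: float, eps: float = 1e-4) -> float:
    # Weight on negative (failed) samples: p / sqrt(p(1 - p) + eps).
    return p / math.sqrt(p * (1.0 - p) + eps)

# Example: a prompt the old policy solves only 10% of the time.
print(omega_plus(0.1), omega_minus(0.1))   # ≈ 3.0 and ≈ 0.33
```

Prompts that the old policy rarely solves receive a large positive weight, while prompts it almost always solves receive a large negative weight.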
Let $\pi_{\theta_{\text{old}}}$ be the old policy and let $p_{\theta_{\text{old}}}(q)$ be the probability of success of the verifiable reward $r$ under $\pi_{\theta_{\text{old}}}$ for a given prompt $q$: $$ p_{\theta_{\text{old}}}(q) = \mathbb{E}_{o\sim \pi_{\theta_{\text{old}}}(\cdot|q)} 1_{r(q,o)=1} $$
We can write GRPO with no clipping as the following contrastive optimization:
$$ \max_{\theta} L(\theta), $$
where $L(\theta)=$
$$ \mathbb{E}_{q}\, \omega_{+,\varepsilon}(p_{\theta_{\text{old}}}(q))\, \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} 1_{r(q,o)=1} - \mathbb{E}_{q}\, \omega_{-,\varepsilon}(p_{\theta_{\text{old}}}(q))\, \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} 1_{r(q,o)=0}$$ $$ - \beta \mathrm{KL} (\pi_{\theta} || \pi_{\mathrm{ref}}). $$
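To make the objective concrete, here is a hedged Monte Carlo sketch of its value (not its gradient) for a single prompt, assuming access to sampling from $\pi_{\theta}$ and to log-probabilities under $\pi_{\theta}$ and $\pi_{\mathrm{ref}}$; the function names, the sampling interface, and the default $\beta$, $\varepsilon$ are illustrative, not from the paper. The outer expectation over prompts $q$ would be a further average of this quantity.

```python
import math

def grpo_objective_estimate(prompt, sample, logp_theta, logp_ref, reward,
                            p_old, beta=0.04, eps=1e-4, n_samples=64):
    """Monte Carlo estimate of the no-clipping contrastive objective, for one prompt.

    sample(prompt)        -> one completion o drawn from pi_theta(.|prompt)
    logp_theta(prompt, o) -> log pi_theta(o|prompt)
    logp_ref(prompt, o)   -> log pi_ref(o|prompt)
    reward(prompt, o)     -> 1 if the verifiable reward accepts o, else 0
    p_old                 -> success rate of the old policy on this prompt
    """
    denom = math.sqrt(p_old * (1.0 - p_old) + eps)
    w_pos = (1.0 - p_old) / denom   # omega_{+,eps}(p_old)
    w_neg = p_old / denom           # omega_{-,eps}(p_old)

    total = 0.0
    for _ in range(n_samples):
        o = sample(prompt)
        r = reward(prompt, o)
        contrastive = w_pos * r - w_neg * (1 - r)
        # log(pi_theta / pi_ref) under samples from pi_theta is an unbiased KL estimator.
        kl_sample = logp_theta(prompt, o) - logp_ref(prompt, o)
        total += contrastive - beta * kl_sample
    return total / n_samples
```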
GRPO Iterations and Optimal Policy
We now drop the optimization over the parameter space and optimize directly over the space of policies. The GRPO iterations can then be written as follows for $n\geq 1$:
$$\pi_{n} = \arg\max_{\pi} L_{n-1}(\pi), $$ where $L_{n-1}(\pi) =$ $$ \mathbb{E}_{q} \left(\omega_{+,\varepsilon}\left(p_{n-1}(q)\right) \mathbb{E}_{o\sim \pi(\cdot|q)} 1_{r(q,o)=1} - \omega_{-,\varepsilon}(p_{n-1}(q)) \mathbb{E}_{o\sim \pi(\cdot|q)} 1_{r(q,o)=0}\right)$$ $$ - \beta \mathrm{KL} (\pi || \pi_{\mathrm{ref}}),$$
and $p_{n-1}(q)$ is the probability of success of the policy $\pi_{n-1}(\cdot|q)$, with $\pi_0 = \pi_{\mathrm{ref}}$.
The optimal policy for $n\geq 1$ can be derived as follows:
$$\pi_{n}(o|q) = \frac{1}{Z_{n-1}(q)}\pi_{\mathrm{ref}}(o|q) \exp \left(\frac{1}{\beta} \left( \omega_{+,\varepsilon}(p_{n-1}(q)) 1_{r(q,o)=1} - \omega_{-,\varepsilon}(p_{n-1}(q)) 1_{r(q,o)=0} \right) \right), $$
where, writing $p_{\mathrm{ref}}(q)$ for the success rate of $\pi_{\mathrm{ref}}(\cdot|q)$,
$$ Z_{n-1}(q) = p_{\mathrm{ref}}(q) \exp \left(\frac{1}{\beta} \omega_{+,\varepsilon}(p_{n-1}(q)) \right) + (1-p_{\mathrm{ref}}(q)) \exp \left(- \frac{1}{\beta} \omega_{-,\varepsilon}(p_{n-1}(q)) \right). $$
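Plugging this closed form into the definition of the success rate gives the next iterate directly. A minimal sketch of that computation, with our own naming and illustrative default values of $\beta$ and $\varepsilon$:

```python
import math

def next_success_rate(p_prev, p_ref, beta=1.0, eps=1e-4):
    """Success rate of pi_n(.|q), given p_{n-1}(q) = p_prev and p_ref(q) = p_ref."""
    denom = math.sqrt(p_prev * (1.0 - p_prev) + eps)
    w_pos = (1.0 - p_prev) / denom   # omega_{+,eps}(p_{n-1}(q))
    w_neg = p_prev / denom           # omega_{-,eps}(p_{n-1}(q))
    # Mass of correct outputs is multiplied by exp(w_pos / beta),
    # mass of incorrect outputs by exp(-w_neg / beta); Z_{n-1}(q) renormalizes.
    num = p_ref * math.exp(w_pos / beta)
    Z = num + (1.0 - p_ref) * math.exp(-w_neg / beta)
    return num / Z
```

For very small $\beta$ the exponentials can overflow in floating point; the logistic form of the recursion derived next only exponentiates a negative argument and avoids this.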
GRPO’s Probability of Success Recursion
Computing now the success rate $p_{n}(q)$ of the policy $\pi_{n}(\cdot|q)$ from this closed form, we obtain the following recursion:
$$ p_{n}(q) = h_{\varepsilon,p_{\mathrm{ref}}(q)}(p_{n-1} (q)) , $$
where $$h_{\varepsilon,p_{\mathrm{ref}}}(p) = \frac{1}{1+ \frac{1-p_{\mathrm{ref}}}{p_{\mathrm{ref}}} \exp \left(-\frac{1}{\beta} \frac{1}{\sqrt{p(1-p) + \varepsilon}}\right) }.$$ This follows from the closed form of $\pi_n$ above together with the identity $\omega_{+,\varepsilon}(p) + \omega_{-,\varepsilon}(p) = \frac{1}{\sqrt{p(1-p)+\varepsilon}}$.
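A minimal sketch of this recursion (again with our own naming and illustrative defaults for $\beta$ and $\varepsilon$); iterating $h$ from $p_0(q) = p_{\mathrm{ref}}(q)$ traces the success-rate dynamics of a single prompt:

```python
import math

def h(p, p_ref, beta=1.0, eps=1e-4):
    # One step of the success-rate recursion: p_n = h(p_{n-1}).
    odds_ref = (1.0 - p_ref) / p_ref
    return 1.0 / (1.0 + odds_ref * math.exp(-1.0 / (beta * math.sqrt(p * (1.0 - p) + eps))))

def iterate_success_rate(p_ref, n_steps=50, beta=1.0, eps=1e-4):
    # pi_0 = pi_ref, so the recursion starts at p_0(q) = p_ref(q).
    p, trajectory = p_ref, [p_ref]
    for _ in range(n_steps):
        p = h(p, p_ref, beta, eps)
        trajectory.append(p)
    return trajectory
```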
2 GRPO Amplifies Success Rate via an Implicit Fixed Point Iteration
We see that the success rate satisfies a fixed point iteration. For each prompt $q$, denote by $p^*(q)$ the limit of this recursion as $n\to \infty$, i.e., the fixed point where the curve of $h_{\varepsilon,p_{\mathrm{ref}}(q)}$ intersects the line $y=p$. We show in the paper that under mild conditions on $\beta$:
- the sequence $p_{n}(q)$ converges locally to the fixed point $p^*(q)$,
- the fixed point success rate $p^*(q)$ is guaranteed to be larger than the reference success rate $p_{\mathrm{ref}}(q)$: $p^*(q)>p_{\mathrm{ref}}(q)$; a toy numerical check of this amplification is sketched just below this list.
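Reusing the sketch above, a quick toy check of the amplification claim (the values of $p_{\mathrm{ref}}$, $\beta$, and $\varepsilon$ are chosen only for illustration):

```python
for p_ref in (0.05, 0.2, 0.5, 0.8):
    p_star = iterate_success_rate(p_ref, n_steps=200, beta=1.0, eps=1e-4)[-1]
    print(f"p_ref = {p_ref:.2f}  ->  p* ≈ {p_star:.3f}")
# For each of these settings the iterates settle above p_ref,
# consistent with the amplification result p*(q) > p_ref(q).
```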
3 Conclusion
In summary, GRPO with verifiable rewards induces a fixed point iteration on the success rate of the policy. Under mild conditions, the limiting success rate is guaranteed to be larger than that of the reference model, and local convergence of the iterates to this limit point is also guaranteed.