GRPO Amplifies the Success Rate of a Policy via an Implicit Fixed Point Iteration

In the previous post, we established that GRPO with verifiable rewards can be seen as a weighted contrastive policy optimization, where positive and negative samples are synthetic data sampled from the old policy and labeled via the verifiable reward. In this post we review our recent preprint that builds on this observation and analyzes the dynamics of GRPO, in particular the probability of success under the verifiable reward of GRPO’s optimized policy.

1 GRPO’s Success Rate Recursion

Equivalent Contrastive Loss Formulation of GRPO without Clipping

For $\varepsilon > 0$ and $p \in [0,1]$, define the following weights:

$$\omega_{+,\varepsilon}(p) = \frac{1-p}{\sqrt{p(1-p)} + \varepsilon}, \qquad \omega_{-,\varepsilon}(p) = \frac{p}{\sqrt{p(1-p)} + \varepsilon}.$$

Let $\pi_{\theta_{\mathrm{old}}}$ be the old policy and let $p_{\theta_{\mathrm{old}}}(q)$ be the probability of success of the verifiable reward $r$ under $\pi_{\theta_{\mathrm{old}}}$ for a given prompt $q$:

$$p_{\theta_{\mathrm{old}}}(q) = \mathbb{E}_{o \sim \pi_{\theta_{\mathrm{old}}}(\cdot|q)}\, \mathbb{1}_{r(q,o)=1}.$$
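These weights are exactly the group-normalized advantages that GRPO assigns to binary verifiable rewards: a correct sample receives $\omega_{+,\varepsilon}(\hat p)$ and an incorrect one receives $-\omega_{-,\varepsilon}(\hat p)$, where $\hat p$ is the empirical success rate of the group. Below is a minimal numerical check of this identity, with hypothetical reward values, assuming the population standard deviation is used in the normalization and that $\varepsilon$ plays the role of the usual small stabilizer.

```python
import numpy as np

def omega_plus(p, eps):
    # Weight on positive (correct) samples.
    return (1.0 - p) / (np.sqrt(p * (1.0 - p)) + eps)

def omega_minus(p, eps):
    # Weight on negative (incorrect) samples.
    return p / (np.sqrt(p * (1.0 - p)) + eps)

# Binary verifiable rewards for one group of completions (hypothetical values).
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
eps = 1e-4

p_hat = rewards.mean()                          # empirical success rate of the group
sigma = rewards.std()                           # equals sqrt(p_hat * (1 - p_hat)) for binary rewards
advantages = (rewards - p_hat) / (sigma + eps)  # GRPO-style group normalization

# Correct samples get +omega_plus(p_hat), incorrect samples get -omega_minus(p_hat).
assert np.allclose(advantages[rewards == 1.0],  omega_plus(p_hat, eps))
assert np.allclose(advantages[rewards == 0.0], -omega_minus(p_hat, eps))
```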

We can write GRPO with no clipping as the following contrastive optimization:

$$\max_{\theta}\, \mathcal{L}(\theta),$$

where

$$\mathcal{L}(\theta) = \mathbb{E}_q\Big[\omega_{+,\varepsilon}\big(p_{\theta_{\mathrm{old}}}(q)\big)\, \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)}\, \mathbb{1}_{r(q,o)=1} - \omega_{-,\varepsilon}\big(p_{\theta_{\mathrm{old}}}(q)\big)\, \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)}\, \mathbb{1}_{r(q,o)=0}\Big] - \beta\, \mathrm{KL}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}).$$
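In practice the inner expectations over $\pi_\theta$ are estimated from a group of completions sampled from $\pi_{\theta_{\mathrm{old}}}$, which (in the absence of clipping) introduces importance ratios $\pi_\theta/\pi_{\theta_{\mathrm{old}}}$. The following is a minimal NumPy sketch of such an estimate of $\mathcal{L}(\theta)$ for a single prompt; all arrays are hypothetical, and the KL penalty is passed in as a separately computed estimate.

```python
import numpy as np

def contrastive_objective_estimate(rewards, logp_theta, logp_old, eps, beta, kl_estimate):
    """Monte Carlo estimate of L(theta) for one prompt, from a group o_i ~ pi_old(.|q).

    rewards     : binary verifiable rewards r(q, o_i) in {0, 1}
    logp_theta  : log pi_theta(o_i | q) for each sampled completion
    logp_old    : log pi_old(o_i | q) for each sampled completion
    kl_estimate : a separately computed estimate of KL(pi_theta || pi_ref)
    """
    p_old = rewards.mean()  # success rate of pi_old on this prompt
    w_pos = (1.0 - p_old) / (np.sqrt(p_old * (1.0 - p_old)) + eps)
    w_neg = p_old / (np.sqrt(p_old * (1.0 - p_old)) + eps)
    # Importance ratios turn expectations under pi_theta into expectations under pi_old.
    ratios = np.exp(logp_theta - logp_old)
    per_sample = ratios * (w_pos * (rewards == 1.0) - w_neg * (rewards == 0.0))
    return per_sample.mean() - beta * kl_estimate
```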

GRPO Iterations and Optimal Policy

We now drop the optimization over the parameter space and optimize directly over the space of policies; the GRPO iterations can then be written as follows for $n \geq 1$:

$$\pi_n = \arg\max_{\pi}\, \mathcal{L}_{n-1}(\pi), \quad \text{where} \quad \mathcal{L}_{n-1}(\pi) = \mathbb{E}_q\Big[\omega_{+,\varepsilon}\big(p_{n-1}(q)\big)\, \mathbb{E}_{o \sim \pi(\cdot|q)}\, \mathbb{1}_{r(q,o)=1} - \omega_{-,\varepsilon}\big(p_{n-1}(q)\big)\, \mathbb{E}_{o \sim \pi(\cdot|q)}\, \mathbb{1}_{r(q,o)=0}\Big] - \beta\, \mathrm{KL}(\pi\,\|\,\pi_{\mathrm{ref}}),$$

and $p_{n-1}(q)$ is the probability of success of the policy $\pi_{n-1}(\cdot|q)$, with $\pi_0 = \pi_{\mathrm{ref}}$.

The optimal policy for $n \geq 1$ can be derived in closed form:

$$\pi_n(o|q) = \frac{1}{Z_{n-1}(q)}\, \pi_{\mathrm{ref}}(o|q)\, \exp\!\Big(\tfrac{1}{\beta}\big(\omega_{+,\varepsilon}(p_{n-1}(q))\, \mathbb{1}_{r(q,o)=1} - \omega_{-,\varepsilon}(p_{n-1}(q))\, \mathbb{1}_{r(q,o)=0}\big)\Big),$$

where

$$Z_{n-1}(q) = p_{\mathrm{ref}}(q)\, \exp\!\Big(\tfrac{1}{\beta}\,\omega_{+,\varepsilon}(p_{n-1}(q))\Big) + \big(1 - p_{\mathrm{ref}}(q)\big)\, \exp\!\Big(-\tfrac{1}{\beta}\,\omega_{-,\varepsilon}(p_{n-1}(q))\Big).$$
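The closed form says that $\pi_n$ is simply $\pi_{\mathrm{ref}}$ tilted up on correct outputs and down on incorrect ones, with $Z_{n-1}(q)$ restoring normalization. A small sketch on a toy discrete output space (all probabilities and rewards hypothetical) illustrates this and checks the expression for $Z_{n-1}(q)$:

```python
import numpy as np

# Toy discrete output space for one prompt q (hypothetical probabilities and rewards).
pi_ref = np.array([0.05, 0.15, 0.30, 0.50])  # pi_ref(o | q) over 4 candidate outputs
correct = np.array([1, 0, 1, 0])             # r(q, o) for each output
p_ref = pi_ref[correct == 1].sum()           # reference success rate = 0.35

beta, eps = 0.1, 1e-4
p_prev = p_ref                               # p_0(q) = p_ref(q) since pi_0 = pi_ref

w_pos = (1.0 - p_prev) / (np.sqrt(p_prev * (1.0 - p_prev)) + eps)
w_neg = p_prev / (np.sqrt(p_prev * (1.0 - p_prev)) + eps)

# pi_1 tilts pi_ref up on correct outputs and down on incorrect ones.
tilt = np.where(correct == 1, np.exp(w_pos / beta), np.exp(-w_neg / beta))
Z = (pi_ref * tilt).sum()
pi_1 = pi_ref * tilt / Z

# Z matches its closed form, and pi_1 puts more mass on correct outputs than pi_ref.
Z_closed = p_ref * np.exp(w_pos / beta) + (1.0 - p_ref) * np.exp(-w_neg / beta)
assert np.isclose(Z, Z_closed)
print(pi_1[correct == 1].sum())              # p_1(q), larger than p_ref(q)
```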

GRPO’s Probability of Success Recursion

Computing the success rate $p_n(q)$ of the policy $\pi_n(\cdot|q)$, and denoting by $p_{\mathrm{ref}}(q)$ the success rate of $\pi_{\mathrm{ref}}(\cdot|q)$, we obtain the following recursion:

$$p_n(q) = h_{\varepsilon,\, p_{\mathrm{ref}}(q)}\big(p_{n-1}(q)\big),$$

where

$$h_{\varepsilon,\, p_{\mathrm{ref}}}(p) = \frac{1}{1 + \frac{1 - p_{\mathrm{ref}}}{p_{\mathrm{ref}}}\, \exp\!\Big(-\tfrac{1}{\beta}\, \frac{1}{\sqrt{p(1-p)} + \varepsilon}\Big)}.$$
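The recursion is straightforward to simulate. The sketch below iterates $p_n(q) = h_{\varepsilon,\, p_{\mathrm{ref}}(q)}(p_{n-1}(q))$ for a single prompt, starting from $p_0(q) = p_{\mathrm{ref}}(q)$; the values of $p_{\mathrm{ref}}$, $\beta$ and $\varepsilon$ are hypothetical.

```python
import numpy as np

def h(p, p_ref, beta, eps):
    """The recursion map h_{eps, p_ref}: success rate of pi_n as a function of p_{n-1}."""
    odds_ref = (1.0 - p_ref) / p_ref
    return 1.0 / (1.0 + odds_ref * np.exp(-1.0 / (beta * (np.sqrt(p * (1.0 - p)) + eps))))

p_ref, beta, eps = 0.2, 1.0, 1e-4
p = p_ref                                 # p_0(q) = p_ref(q)
for n in range(1, 11):
    p = h(p, p_ref, beta, eps)
    print(f"n = {n:2d}   p_n = {p:.6f}")  # the iterates approach the fixed point p*(q)
```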

[Figure: curves of the map $h_{\varepsilon,\, p_{\mathrm{ref}}}$ together with the line $y = p$; the fixed point is at their intersection.]

2 GRPO Amplifies Success Rate via an Implicit Fixed Point Iteration

We see that the success rate satisfies a fixed point iteration. For each prompt $q$, denote by $p^*(q)$ the limit of this recursion as $n \to \infty$ (the point in the curves above where the function $h_{\varepsilon,\, p_{\mathrm{ref}}}$ intersects the line $y = p$). We show in the paper that, under mild conditions on $\beta$:

  • the sequence $p_n(q)$ converges locally to the fixed point $p^*(q)$;
  • the fixed-point success rate $p^*(q)$ is guaranteed to be larger than the reference success rate $p_{\mathrm{ref}}(q)$: $p^*(q) > p_{\mathrm{ref}}(q)$ (see the numerical sketch after this list).
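As a numerical illustration of both properties, the sketch below runs the recursion to (approximate) convergence on a hypothetical grid of $(p_{\mathrm{ref}}, \beta)$ values and estimates $|h'_{\varepsilon,\, p_{\mathrm{ref}}}(p^*)|$ with a finite difference; a value below 1 is consistent with local convergence, and $p^* > p_{\mathrm{ref}}$ with amplification.

```python
import numpy as np

def h(p, p_ref, beta, eps=1e-4):
    p = np.clip(p, 0.0, 1.0)  # guard against tiny excursions outside [0, 1] when probing h'
    odds_ref = (1.0 - p_ref) / p_ref
    return 1.0 / (1.0 + odds_ref * np.exp(-1.0 / (beta * (np.sqrt(p * (1.0 - p)) + eps))))

def fixed_point(p_ref, beta, n_iters=500):
    p = p_ref
    for _ in range(n_iters):
        p = h(p, p_ref, beta)
    return p

# Hypothetical grid; beta is taken large enough for the mild conditions to plausibly hold.
for p_ref in [0.1, 0.3, 0.5, 0.7, 0.9]:
    for beta in [0.5, 1.0, 2.0]:
        p_star = fixed_point(p_ref, beta)
        d = 1e-5
        h_prime = (h(p_star + d, p_ref, beta) - h(p_star - d, p_ref, beta)) / (2 * d)
        print(f"p_ref={p_ref:.1f}  beta={beta:.1f}  ->  p*={p_star:.4f}  "
              f"p*>p_ref: {p_star > p_ref}  |h'(p*)|={abs(h_prime):.3f}")
```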


3 Conclusion

In summary, the iterations of GRPO with verifiable rewards induce a fixed point iteration on the success rate of the policy. Under mild conditions on $\beta$, the limiting success rate is guaranteed to amplify the success rate of the reference model, and local convergence of the iterates to this limit point is also guaranteed.

Youssef Mroueh
Research Staff Member