GRPO with Verifiable (Binary) Rewards Is an Adaptive Weighted Contrastive Loss

1. Group Relative Policy Optimization (GRPO)

The goal of this short blog is to understand GRPO, which was used successfully to train the DeepSeek models. We will limit our analysis to binary rewards, or what the Tulu authors call RLVR (Reinforcement Learning with Verifiable Rewards).

Understanding GRPO with Rule-Based (Binary) Rewards

GRPO has been successfully used in DeepSeek (V3, Math, and R1), especially with rule-based rewards (verifiable rewards, i.e., binary rewards); the goal of this note is to understand it. GRPO optimizes the following objective:

$$\max_{\theta}\; \mathbb{E}_{q}\, \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) \;-\; \beta\, \mathrm{KL}(\pi_{\theta}\,\|\,\pi_{ref})$$

where the advantage function $A(o)$ is defined as follows:

$$A(o) = \frac{r(o) - \mu}{\sigma}$$

where $\mu = \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, r(o)$ and $\sigma = \sqrt{\mathrm{Var}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, r(o)}$.

And

$$f(x, y) = \min\big(x\, y,\; \mathrm{clip}(x,\, 1-\varepsilon,\, 1+\varepsilon)\, y\big)$$

Note that in our context $x = \frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)} > 0$, while the advantage $A(o)$ can be positive or negative. Hence if $A(o) > 0$ we have:

$$f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) = A(o)\, \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right)$$

and if $A(o) < 0$:

$$f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) = A(o)\, \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right)$$
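To make this case analysis concrete, here is a minimal numerical check (a sketch, not code from any of the referenced papers; the function names and the value of `eps` are illustrative choices) that the clipped objective indeed reduces to the two expressions above for any positive ratio:

```python
# Sketch: verify that min(x*A, clip(x, 1-eps, 1+eps)*A) equals
# A*min(x, 1+eps) when A > 0 and A*max(x, 1-eps) when A < 0.
import random

def f(x, A, eps=0.2):
    clipped = min(max(x, 1.0 - eps), 1.0 + eps)  # clip(x, 1-eps, 1+eps)
    return min(x * A, clipped * A)

def f_case(x, A, eps=0.2):
    # closed-form case analysis derived above
    return A * min(x, 1.0 + eps) if A > 0 else A * max(x, 1.0 - eps)

for _ in range(10_000):
    x = random.uniform(0.01, 3.0)   # importance ratio pi_theta / pi_theta_old > 0
    A = random.uniform(-2.0, 2.0)   # signed advantage
    assert abs(f(x, A) - f_case(x, A)) < 1e-12
print("case analysis verified")
```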

We will assume that we have a rule-based reward that evaluates the correctness of a reasoning trace or the execution of code, meaning that $r(o) \in \{0, 1\}$. We denote by $p := p_{\theta_{old}}(q) = \mathbb{P}_{o\sim \pi_{\theta_{old}}(\cdot|q)}(r(o) = 1)$ the probability of success. Hence, for the mean and standard deviation of this Bernoulli random variable, we have $\mu = p$ and $\sigma = \sqrt{p(1-p)}$. We assume here that $0 < p < 1$ so as not to have to deal with singularities. Replacing the mean and standard deviation in the advantage function gives:

$$A(o) = \begin{cases} \dfrac{1-p}{\sqrt{p(1-p)}} & \text{if } r(o) = 1, \\[2mm] \dfrac{-p}{\sqrt{p(1-p)}} & \text{if } r(o) = 0, \end{cases}$$

which simplifies to:

$$A(o) = \begin{cases} \sqrt{\dfrac{1-p}{p}} & \text{if } r(o) = 1, \\[2mm] -\sqrt{\dfrac{p}{1-p}} & \text{if } r(o) = 0. \end{cases}$$
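As a quick sanity check (a sketch with illustrative values of $p$, not taken from the post), the snippet below verifies that standardizing a binary reward with the Bernoulli mean and standard deviation gives exactly these two closed forms:

```python
# Sketch: (r - mu) / sigma with mu = p and sigma = sqrt(p*(1-p)) matches
# sqrt((1-p)/p) for r = 1 and -sqrt(p/(1-p)) for r = 0.
import math

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    adv_success = (1.0 - mu) / sigma   # r(o) = 1
    adv_failure = (0.0 - mu) / sigma   # r(o) = 0
    assert math.isclose(adv_success, math.sqrt((1.0 - p) / p))
    assert math.isclose(adv_failure, -math.sqrt(p / (1.0 - p)))
    print(f"p={p:.2f}  A(r=1)={adv_success:+.3f}  A(r=0)={adv_failure:+.3f}")
```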

[Figure: plot of $\sqrt{(1-p)/p}$ and $\sqrt{p/(1-p)}$ as functions of the success probability $p$.]

Hence we have:

$$\mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right)
= \sqrt{\frac{1-p}{p}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right) 1_{r(o)=1}
\;-\; \sqrt{\frac{p}{1-p}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right) 1_{r(o)=0}$$

and hence the overall objective is obtained by taking the expectation over $q$ (note that $p = p_{\theta_{old}}(q)$ depends on $q$):

$$\mathbb{E}_{q}\, \sqrt{\frac{1-p_{\theta_{old}}(q)}{p_{\theta_{old}}(q)}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right) 1_{r(o)=1}
\;-\; \mathbb{E}_{q}\, \sqrt{\frac{p_{\theta_{old}}(q)}{1-p_{\theta_{old}}(q)}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right) 1_{r(o)=0}
\;-\; \beta\, \mathrm{KL}(\pi_{\theta}\,\|\,\pi_{ref})$$

The full derivation is available here.
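The identity can also be checked numerically on a single group of sampled outputs. The sketch below (illustrative function names, a hand-picked group of binary rewards, and the group success rate used as $p$; not an implementation from the referenced papers) computes the per-question surrogate both with the standardized advantage and in the weighted contrastive form, and the two agree:

```python
# Sketch of the per-question GRPO surrogate with binary rewards, written two ways:
#   (a) standardized-advantage form:  average of f(ratio, A)
#   (b) weighted contrastive form:    w_plus * min(ratio, 1+eps) on successes
#                                   - w_minus * max(ratio, 1-eps) on failures
import math
import random

def grpo_surrogate(ratios, rewards, eps=0.2):
    p = sum(rewards) / len(rewards)        # group success rate (assumes 0 < p < 1)
    sigma = math.sqrt(p * (1.0 - p))       # Bernoulli standard deviation
    total = 0.0
    for x, r in zip(ratios, rewards):
        A = (r - p) / sigma                # standardized (group-relative) advantage
        clipped = min(max(x, 1.0 - eps), 1.0 + eps)
        total += min(x * A, clipped * A)
    return total / len(ratios)

def contrastive_form(ratios, rewards, eps=0.2):
    p = sum(rewards) / len(rewards)
    w_plus, w_minus = math.sqrt((1.0 - p) / p), math.sqrt(p / (1.0 - p))
    pos = sum(min(x, 1.0 + eps) for x, r in zip(ratios, rewards) if r == 1)
    neg = sum(max(x, 1.0 - eps) for x, r in zip(ratios, rewards) if r == 0)
    return (w_plus * pos - w_minus * neg) / len(ratios)

random.seed(0)
rewards = [1, 0, 1, 1, 0, 0, 0, 1]                    # verifiable (binary) rewards for one group
ratios = [random.uniform(0.5, 1.5) for _ in rewards]  # importance ratios pi_theta / pi_theta_old
print(grpo_surrogate(ratios, rewards), contrastive_form(ratios, rewards))  # identical values
```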

Loss Interpretation

We see that GRPO is effectively a weighted contrastive loss: correct outputs are reinforced with weight $\sqrt{\frac{1-p}{p}}$ and incorrect outputs are penalized with weight $\sqrt{\frac{p}{1-p}}$, where $p$ is the probability of success of the old policy.

From the weight plots we see that:

  • if the success probability of the old policy is high (say $p > 0.5$), the weight on successful outputs is low, since the old policy is already good, while the weight on failing outputs is high, so they are penalized more heavily;
  • if the success probability of the old policy is low (say $p < 0.5$), the weight on successful outputs is high, since we want to reinforce those rarer successes, while failing outputs are still penalized, but with a small weight (see the numerical illustration below).
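As a numerical illustration of these two regimes (a small sketch, not from the original post), the snippet below prints both weights for a few values of $p$: for $p = 0.9$ the failure weight is $3$ while the success weight is roughly $0.33$, and the situation is reversed for $p = 0.1$.

```python
# Sketch: the GRPO success and failure weights as a function of the old
# policy's success probability p.
import math

print(f"{'p':>5}  {'success weight':>15}  {'failure weight':>15}")
for p in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    w_plus = math.sqrt((1.0 - p) / p)    # weight on correct outputs
    w_minus = math.sqrt(p / (1.0 - p))   # weight on wrong outputs
    print(f"{p:>5.2f}  {w_plus:>15.3f}  {w_minus:>15.3f}")
```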

More observations due to clipping:

  • for correct outputs, the contribution to the objective is constant (the ratio is clipped at $1+\varepsilon$) whenever $\pi_{\theta}(o|q) \ge (1+\varepsilon)\, \pi_{\theta_{old}}(o|q)$;
  • for wrong outputs, the contribution is constant (the ratio is clipped at $1-\varepsilon$) whenever $\pi_{\theta}(o|q) \le (1-\varepsilon)\, \pi_{\theta_{old}}(o|q)$, as illustrated in the short sketch below.
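A short sketch of this saturation effect (with an arbitrary $\varepsilon = 0.2$ and unit-magnitude advantages as stand-ins): once the ratio leaves the trust region, the clipped term is constant in the ratio, so these outputs contribute no further gradient signal.

```python
# Sketch: beyond the clipping thresholds, f(x, A) no longer depends on x.
def f(x, A, eps=0.2):
    clipped = min(max(x, 1.0 - eps), 1.0 + eps)
    return min(x * A, clipped * A)

A_pos, A_neg = 1.0, -1.0  # stand-ins for a positive / negative advantage
print([round(f(x, A_pos), 3) for x in (1.2, 1.5, 2.0, 5.0)])   # constant 1.2 (correct outputs)
print([round(f(x, A_neg), 3) for x in (0.8, 0.5, 0.2, 0.05)])  # constant -0.8 (wrong outputs)
```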

2. Conclusion

In summary, the standardized reward (the advantage function) used in GRPO results in an interesting adaptive weighted contrastive loss: if the probability of success of the old policy is high, wrong answers are penalized more than correct answers are reinforced; if the probability of success of the old policy is low, correct answers are reinforced more than wrong answers are penalized.

References

  • Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).
  • Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).
  • Lambert, Nathan, et al. “TULU 3: Pushing Frontiers in Open Language Model Post-Training.” arXiv preprint arXiv:2411.15124 (2024).
Youssef Mroueh, Research Staff Member