GRPO with Verifiable (Binary) Rewards Is an Adaptive Weighted Contrastive Loss

1. Group Relative Policy Optimization (GRPO)

The goal of this short blog is to understand GRPO, which was used successfully to train the DeepSeek models. We will limit our analysis to binary rewards, or what the Tulu authors call RLVR (Reinforcement Learning with Verifiable Rewards).

Understanding GRPO with Rule-Based (Binary) Rewards

GRPO has been successfully used in DeepSeek (V3, Math, and R1), especially with rule-based rewards (verifiable rewards, i.e., binary rewards); the goal of this note is to understand it. GRPO optimizes the following objective:

$$\max_{\theta}\; \mathbb{E}_{q}\, \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) \;-\; \beta\, \mathrm{KL}(\pi_{\theta}\,\|\,\pi_{ref})$$

where the advantage function $A(o)$ is defined as follows:

$$A(o) = \frac{r(o) - \mu}{\sigma}$$

where $\mu = \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, r(o)$ and $\sigma = \sqrt{\mathrm{Var}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, r(o)}$.

And

$$f(x, y) = \min\big(x\, y,\; \mathrm{clip}(x,\, 1-\varepsilon,\, 1+\varepsilon)\, y\big)$$

Note that in our context $x = \frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)} > 0$, while the advantage $A(o)$ can be positive or negative. Hence if $A(o) > 0$ we have:

$$f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) = A(o)\, \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right)$$

and if $A(o) < 0$:

$$f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right) = A(o)\, \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right)$$
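To make this case analysis concrete, here is a minimal numerical check (a sketch, not code from any of the referenced papers; the function names and the value of `eps` are illustrative choices) that the clipped objective indeed reduces to the two expressions above for any positive ratio:

```python
# Sketch: verify that min(x*A, clip(x, 1-eps, 1+eps)*A) equals
# A*min(x, 1+eps) when A > 0 and A*max(x, 1-eps) when A < 0.
import random

def f(x, A, eps=0.2):
    clipped = min(max(x, 1.0 - eps), 1.0 + eps)  # clip(x, 1-eps, 1+eps)
    return min(x * A, clipped * A)

def f_case(x, A, eps=0.2):
    # closed-form case analysis derived above
    return A * min(x, 1.0 + eps) if A > 0 else A * max(x, 1.0 - eps)

for _ in range(10_000):
    x = random.uniform(0.01, 3.0)   # importance ratio pi_theta / pi_theta_old > 0
    A = random.uniform(-2.0, 2.0)   # signed advantage
    assert abs(f(x, A) - f_case(x, A)) < 1e-12
print("case analysis verified")
```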

We will assume that we have a rule-based reward that evaluates the correctness of a reasoning trace or the execution of code, meaning that $r(o) \in \{0, 1\}$. We denote by $p := p_{\theta_{old}}(q) = \mathbb{P}_{o\sim \pi_{\theta_{old}}(\cdot|q)}(r(o) = 1)$ the probability of success. Hence, for the mean and standard deviation of this Bernoulli random variable, we have $\mu = p$ and $\sigma = \sqrt{p(1-p)}$. We assume here that $0 < p < 1$ so as not to have to deal with singularities. Replacing the mean and standard deviation in the advantage function gives:

$$A(o) = \begin{cases} \dfrac{1-p}{\sqrt{p(1-p)}} & \text{if } r(o) = 1, \\[2mm] \dfrac{-p}{\sqrt{p(1-p)}} & \text{if } r(o) = 0, \end{cases}$$

which simplifies to:

$$A(o) = \begin{cases} \sqrt{\dfrac{1-p}{p}} & \text{if } r(o) = 1, \\[2mm] -\sqrt{\dfrac{p}{1-p}} & \text{if } r(o) = 0. \end{cases}$$
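As a quick sanity check (a sketch with illustrative values of $p$, not taken from the post), the snippet below verifies that standardizing a binary reward with the Bernoulli mean and standard deviation gives exactly these two closed forms:

```python
# Sketch: (r - mu) / sigma with mu = p and sigma = sqrt(p*(1-p)) matches
# sqrt((1-p)/p) for r = 1 and -sqrt(p/(1-p)) for r = 0.
import math

for p in [0.1, 0.25, 0.5, 0.75, 0.9]:
    mu, sigma = p, math.sqrt(p * (1.0 - p))
    adv_success = (1.0 - mu) / sigma   # r(o) = 1
    adv_failure = (0.0 - mu) / sigma   # r(o) = 0
    assert math.isclose(adv_success, math.sqrt((1.0 - p) / p))
    assert math.isclose(adv_failure, -math.sqrt(p / (1.0 - p)))
    print(f"p={p:.2f}  A(r=1)={adv_success:+.3f}  A(r=0)={adv_failure:+.3f}")
```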

[Figure: plot of $\sqrt{(1-p)/p}$ and $\sqrt{p/(1-p)}$ as functions of the success probability $p$.]

Hence we have:

$$\mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)}\, f\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, A(o)\right)
= \sqrt{\frac{1-p}{p}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right) 1_{r(o)=1}
\;-\; \sqrt{\frac{p}{1-p}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right) 1_{r(o)=0}$$

and hence the overall objective is obtained by taking the expectation over $q$ (note that $p = p_{\theta_{old}}(q)$ depends on $q$):

$$\mathbb{E}_{q}\, \sqrt{\frac{1-p_{\theta_{old}}(q)}{p_{\theta_{old}}(q)}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \min\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1+\varepsilon\right) 1_{r(o)=1}
\;-\; \mathbb{E}_{q}\, \sqrt{\frac{p_{\theta_{old}}(q)}{1-p_{\theta_{old}}(q)}}\; \mathbb{E}_{o\sim \pi_{\theta_{old}}(\cdot|q)} \max\!\left(\frac{\pi_{\theta}(o|q)}{\pi_{\theta_{old}}(o|q)},\, 1-\varepsilon\right) 1_{r(o)=0}
\;-\; \beta\, \mathrm{KL}(\pi_{\theta}\,\|\,\pi_{ref})$$

The full derivation is available here.
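The identity can also be checked numerically on a single group of sampled outputs. The sketch below (illustrative function names, a hand-picked group of binary rewards, and the group success rate used as $p$; not an implementation from the referenced papers) computes the per-question surrogate both with the standardized advantage and in the weighted contrastive form, and the two agree:

```python
# Sketch of the per-question GRPO surrogate with binary rewards, written two ways:
#   (a) standardized-advantage form:  average of f(ratio, A)
#   (b) weighted contrastive form:    w_plus * min(ratio, 1+eps) on successes
#                                   - w_minus * max(ratio, 1-eps) on failures
import math
import random

def grpo_surrogate(ratios, rewards, eps=0.2):
    p = sum(rewards) / len(rewards)        # group success rate (assumes 0 < p < 1)
    sigma = math.sqrt(p * (1.0 - p))       # Bernoulli standard deviation
    total = 0.0
    for x, r in zip(ratios, rewards):
        A = (r - p) / sigma                # standardized (group-relative) advantage
        clipped = min(max(x, 1.0 - eps), 1.0 + eps)
        total += min(x * A, clipped * A)
    return total / len(ratios)

def contrastive_form(ratios, rewards, eps=0.2):
    p = sum(rewards) / len(rewards)
    w_plus, w_minus = math.sqrt((1.0 - p) / p), math.sqrt(p / (1.0 - p))
    pos = sum(min(x, 1.0 + eps) for x, r in zip(ratios, rewards) if r == 1)
    neg = sum(max(x, 1.0 - eps) for x, r in zip(ratios, rewards) if r == 0)
    return (w_plus * pos - w_minus * neg) / len(ratios)

random.seed(0)
rewards = [1, 0, 1, 1, 0, 0, 0, 1]                    # verifiable (binary) rewards for one group
ratios = [random.uniform(0.5, 1.5) for _ in rewards]  # importance ratios pi_theta / pi_theta_old
print(grpo_surrogate(ratios, rewards), contrastive_form(ratios, rewards))  # identical values
```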

Loss Interpretation

We see that GRPO is effectively a weighted contrastive loss: correct outputs are reinforced with weight $\sqrt{\frac{1-p}{p}}$ and incorrect outputs are penalized with weight $\sqrt{\frac{p}{1-p}}$, where $p$ is the probability of success of the old policy.

From the weight plots we see that:

  • if the success probability of the old policy is high (say $p > 0.5$), the weight on successful outputs is low, since the old policy is already good, while the weight on failing outputs is high, so they are penalized more heavily;
  • if the success probability of the old policy is low (say $p < 0.5$), the weight on successful outputs is high, since we want to reinforce those rarer successes, while failing outputs are still penalized, but with a small weight (see the numerical illustration below).
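As a numerical illustration of these two regimes (a small sketch, not from the original post), the snippet below prints both weights for a few values of $p$: for $p = 0.9$ the failure weight is $3$ while the success weight is roughly $0.33$, and the situation is reversed for $p = 0.1$.

```python
# Sketch: the GRPO success and failure weights as a function of the old
# policy's success probability p.
import math

print(f"{'p':>5}  {'success weight':>15}  {'failure weight':>15}")
for p in [0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]:
    w_plus = math.sqrt((1.0 - p) / p)    # weight on correct outputs
    w_minus = math.sqrt(p / (1.0 - p))   # weight on wrong outputs
    print(f"{p:>5.2f}  {w_plus:>15.3f}  {w_minus:>15.3f}")
```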

More observations due to clipping:

  • for correct outputs, the contribution to the objective is constant (the ratio is clipped at $1+\varepsilon$) whenever $\pi_{\theta}(o|q) \ge (1+\varepsilon)\, \pi_{\theta_{old}}(o|q)$;
  • for wrong outputs, the contribution is constant (the ratio is clipped at $1-\varepsilon$) whenever $\pi_{\theta}(o|q) \le (1-\varepsilon)\, \pi_{\theta_{old}}(o|q)$, as illustrated in the short sketch below.
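A short sketch of this saturation effect (with an arbitrary $\varepsilon = 0.2$ and unit-magnitude advantages as stand-ins): once the ratio leaves the trust region, the clipped term is constant in the ratio, so these outputs contribute no further gradient signal.

```python
# Sketch: beyond the clipping thresholds, f(x, A) no longer depends on x.
def f(x, A, eps=0.2):
    clipped = min(max(x, 1.0 - eps), 1.0 + eps)
    return min(x * A, clipped * A)

A_pos, A_neg = 1.0, -1.0  # stand-ins for a positive / negative advantage
print([round(f(x, A_pos), 3) for x in (1.2, 1.5, 2.0, 5.0)])   # constant 1.2 (correct outputs)
print([round(f(x, A_neg), 3) for x in (0.8, 0.5, 0.2, 0.05)])  # constant -0.8 (wrong outputs)
```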

2. Conclusion

In summary, the standardized reward (the advantage function) used in GRPO results in an interesting adaptive weighted contrastive loss: if the probability of success of the old policy is high, wrong answers are penalized more than correct answers are reinforced; if the probability of success of the old policy is low, correct answers are reinforced more than wrong answers are penalized.

References

  • Guo, Daya, et al. “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.” arXiv preprint arXiv:2501.12948 (2025).
  • Shao, Zhihong, et al. “Deepseekmath: Pushing the limits of mathematical reasoning in open language models.” arXiv preprint arXiv:2402.03300 (2024).
  • Lambert, Nathan, et al. “TULU 3: Pushing Frontiers in Open Language Model Post-Training.” arXiv preprint arXiv:2411.15124 (2024).
Youssef Mroueh, Research Staff Member