Posts

GRPO Amplifies the Success Rate of A Policy via An implicit Fixed Point Iteration

In the previous post, we established that GRPO with verifiable rewards can be seen as a weighted contastive policy optimization, where positive and negative samples are synthetic data sampled from the old policy and labeled via the verifiable reward.

Last updated on Mar 17, 2025 3 min read

GRPO with Verifiable (Binary) Rewards Is an Adaptive Weighted Contrastive Loss

1. Grouped Reward Policy Optimization The goal of this short blog is to understand GRPO that was used successfully to train Deepseek models. We will limit our analysis to binary rewards or what Tulu authors calls RLVR (Reinforcement learning with Verifiable Rewards.

Last updated on Feb 3, 2025 4 min read