Author: Tianle Wang @ CityU Miao Lab.

Introduction

[1] demonstrated that reinforcement learning with verifiable reward using a single training example (1-shot RLVR) effectively enhances the mathematical reasoning capabilities of large language models (LLMs). Their empirical analysis reveals that the policy gradient loss plays a pivotal role in this effectiveness. Notably, they also observed an unexpected phenomenon: Qwen2.5-Math-1.5B's performance on MATH500 improves by 27.4% through entropy regularization alone, without any outcome reward. As shown in Table 5 (rows 0 and 10) of their paper, applying the entropy loss in isolation yields a gain of about 25%. In this blog, we investigate the source of this surprising improvement.


Settings

We use the code provided by the original paper [1].

Training

The setup is the same as in [1], except that only the entropy loss is retained: the coefficients of the policy loss, the KL loss, and weight decay are all set to 0.0. The rollout temperature during training is set to 0.6 in vLLM. We use the same single training query for one-shot RL as [1].
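For concreteness, the sketch below shows what an entropy-only objective looks like in PyTorch. This is a minimal illustration rather than the exact implementation released with [1]; the function name, tensor shapes, and the entropy coefficient value are placeholders.

```python
import torch
import torch.nn.functional as F

def entropy_only_loss(logits: torch.Tensor,
                      response_mask: torch.Tensor,
                      entropy_coeff: float = 1e-3) -> torch.Tensor:
    """Entropy-only training objective.

    With the policy-gradient and KL coefficients set to 0.0, the only term
    left is an entropy bonus on the tokens the policy generated itself.

    logits:        (batch, seq_len, vocab) policy logits on rollout tokens
    response_mask: (batch, seq_len) 1 for generated tokens, 0 for prompt/pad
    entropy_coeff: placeholder value, not the coefficient used in [1]
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    token_entropy = -(probs * log_probs).sum(dim=-1)          # (batch, seq_len)
    mean_entropy = (token_entropy * response_mask).sum() / response_mask.sum()
    # The optimizer minimizes the loss, so maximizing entropy means
    # returning its negative, scaled by the entropy coefficient.
    return -entropy_coeff * mean_entropy
```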

Evaluation

Evaluation is performed on MATH500, with temperature set to 0, following [1].
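As a rough illustration of this evaluation setup (not the script released with [1]; the checkpoint path, prompt template, top_p, and max_tokens below are placeholders), greedy decoding on MATH500 with vLLM can be sketched as follows. The temperature=0.6 variant is the one we compare against in Analysis 2 below.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path and generation length; not values from [1].
llm = LLM(model="checkpoints/qwen2.5-math-1.5b-entropy-only/global_step_60")

greedy = SamplingParams(temperature=0.0, max_tokens=3072)               # main evaluation setting
explore = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=3072)  # used in Analysis 2

prompts = ["<MATH500 problem wrapped in the evaluation prompt template>"]
for params in (greedy, explore):
    outputs = llm.generate(prompts, params)
    # Extracted answers would then be checked against the reference solutions.
    print(outputs[0].outputs[0].text)
```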

Analysis

1. Is the result reproducible?

We conducted systematic replication experiments under identical settings to verify the reproducibility of the reported findings. The results on MATH500 are summarized below:

[Figure: MATH500 accuracy over training steps with entropy loss only.]

At global_step=0, the base model achieves an accuracy of 39.8%, and at global_step=60 the model reaches its best performance of 63.8%. These results match the values reported in Table 5 (rows 0 and 10) of the original study, confirming the reproducibility of the core finding.

2. Is the improvement due to the temperature used during evaluation?

While the original study [1] employed deterministic inference with temperature=0, we observed that in many practical scenarios models exhibit stronger performance at a higher evaluation temperature (e.g., 0.6). Our initial hypothesis was that the performance gains attributed to entropy regularization might be temperature-dependent, given entropy's inherent connection to exploration.

However, evaluation on the MATH500 test set revealed a different pattern: at different stages of training, the model achieves better results under the deterministic setting (temperature=0) than under the exploratory setting (temperature=0.6):