Hi, thanks for your contribution to the community!
I have also reproduced your experiments, but using GRPO. However, when starting RL training from Qwen2.5-math-base-7b, I noticed that most responses are in program-of-thought style.
Here are some solution examples on MATH500:
To solve the problem of rotating the complex number $z = 2 + \sqrt{2} - (3 + 3\sqrt{2})i$ around the complex number $c = 2 - 3i$ by $\frac{\pi}{4}$ counter-clockwise, we can follow these steps:

1. Translate the complex number $z$ so that $c$ becomes the origin. This is done by subtracting $c$ from $z$.
2. Rotate the translated complex number by $\frac{\pi}{4}$ counter-clockwise. This is done by multiplying the translated complex number by $e^{i\frac{\pi}{4}} = \frac{1+i}{\sqrt{2}}$.
3. Translate the result back by adding $c$.
Let's break this down step-by-step and implement it in Python using sympy.
```python
import sympy as sp

# Define the complex numbers z and c
z = 2 + sp.sqrt(2) - (3 + 3*sp.sqrt(2))*sp.I
c = 2 - 3*sp.I

# Translate z so that c becomes the origin
z_translated = z - c

# Rotate the translated complex number by pi/4 counter-clockwise
rotation_factor = sp.exp(sp.I*sp.pi/4)
z_rotated = z_translated * rotation_factor

# Translate the result back by adding c
w = z_rotated + c

# Simplify the result
w_simplified = sp.simplify(w)
print(w_simplified)
```
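(As a quick sanity check of the example above: the three steps collapse into the single rotation formula

$$w = c + e^{i\frac{\pi}{4}}(z - c) = 2 - 3i + \frac{1+i}{\sqrt{2}}\cdot\sqrt{2}\,(1 - 3i) = 2 - 3i + (4 - 2i) = 6 - 5i,$$

so the sympy script should print `6 - 5*I` after simplification.)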
And another one:
I think this pattern may hinder the model from developing self-reflection or self-correction. May I know if you have observed similar issues?

Best regards,
Fangkai

Yes, we observed similar phenomena, which we also mentioned in our Notion:

“We observe a length decrease in the initial stage because we found the Qwen2.5-Math-7B base model tended to generate both language and code in the response, resulting in lengthy outputs. This default pattern is quickly discouraged throughout RL and the model learns to output in a more appropriate format, and then the length starts to increase regularly. After just a few training steps, we also experienced the "aha moment" described in the DeepSeek-R1 paper — the emergence of self reflection in the model's responses.”