Hi, thanks for your contribution to the community!
I have also reproduced your experiments, but using GRPO. However, when starting RL training from Qwen2.5-math-base-7b, I noticed that most responses are in program-of-thought style.
Here are some solution examples on MATH500:
To solve the problem of rotating the complex number $z = 2 + \sqrt{2} - (3 + 3\sqrt{2})i$ around the complex number $c = 2 - 3i$ by $\frac{\pi}{4}$ counter-clockwise, we can follow these steps:

1. Translate the complex number $z$ so that $c$ becomes the origin. This is done by subtracting $c$ from $z$.
2. Rotate the translated complex number by $\frac{\pi}{4}$ counter-clockwise. This is done by multiplying the translated complex number by $e^{i\frac{\pi}{4}} = \frac{1+i}{\sqrt{2}}$.
3. Translate the result back by adding $c$.
Let's break this down step-by-step and implement it in Python using sympy.
```python
import sympy as sp

# Define the complex numbers z and c
z = 2 + sp.sqrt(2) - (3 + 3*sp.sqrt(2))*sp.I
c = 2 - 3*sp.I

# Translate z so that c becomes the origin
z_translated = z - c

# Rotate the translated complex number by pi/4 counter-clockwise
rotation_factor = sp.exp(sp.I*sp.pi/4)
z_rotated = z_translated * rotation_factor

# Translate the result back by adding c
w = z_rotated + c

# Simplify the result
w_simplified = sp.simplify(w)
print(w_simplified)
```
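(As a quick sanity check of the example above: the three steps collapse into the single rotation formula

$$w = c + e^{i\frac{\pi}{4}}(z - c) = 2 - 3i + \frac{1+i}{\sqrt{2}}\cdot\sqrt{2}\,(1 - 3i) = 2 - 3i + (4 - 2i) = 6 - 5i,$$

so the sympy script should print `6 - 5*I` after simplification.)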
And another one:
I think this pattern may hinder the model from developing self-reflection or self-correction. May I know if you have observed similar issues?

Best regards,
Fangkai

Yes, we observed similar phenomena, which we also mentioned in our Notion:

“We observe a length decrease in the initial stage because we found the Qwen2.5-Math-7B base model tended to generate both language and code in the response, resulting in lengthy outputs. This default pattern is quickly discouraged throughout RL and the model learns to output in a more appropriate format, and then the length starts to increase regularly. After just a few training steps, we also experienced the "aha moment" described in the DeepSeek-R1 paper — the emergence of self reflection in the model's responses.”