PoT style response? #11

Open
SparkJiao opened this issue Jan 27, 2025 · 1 comment

@SparkJiao

Hi, I appreciate your contribution to the community!

I have also reproduced your experiments, but using GRPO. However, when RL training starts from Qwen2.5-math-base-7b, I noticed that most responses are in program-of-thought (PoT) style.

Here are some solution examples on MATH500:

To solve the problem of rotating the complex number \( z = 2 + \sqrt{2} - (3 + 3\sqrt{2})i \) around the complex number \( c = 2 - 3i \) by \(\frac{\pi}{4}\) counter-clockwise, we can follow these steps:

  1. Translate the complex number \( z \) so that \( c \) becomes the origin. This is done by subtracting \( c \) from \( z \).
  2. Rotate the translated complex number by \(\frac{\pi}{4}\) counter-clockwise. This is done by multiplying the translated complex number by \( e^{i\frac{\pi}{4}} = \frac{1+i}{\sqrt{2}} \).
  3. Translate the result back by adding \( c \).

Let's break this down step-by-step and implement it in Python using sympy.

import sympy as sp

# Define the complex numbers z and c
z = 2 + sp.sqrt(2) - (3 + 3*sp.sqrt(2))*sp.I
c = 2 - 3*sp.I

# Translate z so that c becomes the origin
z_translated = z - c

# Rotate the translated complex number by pi/4 counter-clockwise
rotation_factor = sp.exp(sp.I * sp.pi / 4)
z_rotated = z_translated * rotation_factor

# Translate the result back by adding c
w = z_rotated + c

# Simplify the result
w_simplified = sp.simplify(w)
print(w_simplified)
2 + (-1)**(1/4)*sqrt(2) + (-1)**(1/4)*sqrt(2)*(3 + 3*I) - 3*I

The output from the code is a bit complex, so let's simplify it further manually. The expression we have is:

\[ w = 2 + (-1)^{1/4} \sqrt{2} + (-1)^{1/4} \sqrt{2} (3 + 3i) - 3i \]

We know that \((-1)^{1/4} = \frac{1+i}{\sqrt{2}}\). Substituting this in, we get:

\[ w = 2 + \frac{1+i}{\sqrt{2}} \sqrt{2} + \frac{1+i}{\sqrt{2}} \sqrt{2} (3 + 3i) - 3i \]
\[ w = 2 + (1+i) + (1+i)(3+3i) - 3i \]
\[ w = 2 + 1 + i + (1+i)(3+3i) - 3i \]
\[ w = 3 + i + (3 + 3i + 3i + 3i^2) - 3i \]
\[ w = 3 + i + (3 + 6i - 3) - 3i \]
\[ w = 3 + i + 6i - 3i \]
\[ w = 3 + 4i \]

So the final answer is:

\[ \boxed{3 + 4i} \]
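
As a cross-check on this sample (not part of the model's response), here is a minimal sketch that evaluates the same three steps numerically with Python's standard `cmath` module. The helper name `rotate_about` is hypothetical; `z` and `c` mirror the sample. The value it computes is close to 6 - 5i, which differs from the boxed 3 + 4i above.

```python
import cmath
import math


def rotate_about(z: complex, c: complex, angle: float) -> complex:
    """Rotate z counter-clockwise by `angle` radians around the point c.

    Hypothetical helper: translate so c is the origin, rotate, translate back.
    """
    return (z - c) * cmath.exp(1j * angle) + c


# Same inputs as the sample response above.
z = complex(2 + math.sqrt(2), -(3 + 3 * math.sqrt(2)))
c = complex(2, -3)

w = rotate_about(z, c, math.pi / 4)
print(w)
```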

And another one:

To solve the problem \(1 - 2 + 3 - 4 + 5 - \dots + 99 - 100\), we can break it down into a series of steps.

  1. Observe the pattern: The series alternates between positive and negative terms.
  2. Group the terms in pairs: \((1 - 2) + (3 - 4) + (5 - 6) + \dots + (99 - 100)\).
  3. Each pair \((n - (n+1))\) evaluates to \(-1\).
  4. Since there are 100 terms, there are 50 pairs.
  5. Therefore, the sum of all pairs is \(50 \times (-1) = -50\).

Let's implement this in Python to verify our reasoning.

# Calculate the sum of the series 1 - 2 + 3 - 4 + 5 - ... + 99 - 100
total_sum = sum(i if i % 2 != 0 else -i for i in range(1, 101))
print(total_sum)
-50

The sum of the series \(1 - 2 + 3 - 4 + 5 - \dots + 99 - 100\) is \(\boxed{-50}\).

I think this pattern may hinder the model from developing self-reflection or self-correction. Have you observed similar issues?
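
To put a number on "most responses", a rough heuristic like the one below could flag PoT-style outputs in a batch of sampled responses. This is only a sketch: the marker patterns and the names `looks_like_pot` and `pot_fraction` are assumptions for illustration, not part of any training codebase, and the two sample strings are toy inputs.

```python
import re

# Hypothetical markers of Python code inside a response; tune for your own data.
CODE_PATTERNS = [
    r"^\s*import \w+",
    r"^\s*from \w+ import ",
    r"\bprint\(",
    r"```python",
]
_CODE_RE = re.compile("|".join(CODE_PATTERNS), flags=re.MULTILINE)


def looks_like_pot(response: str) -> bool:
    """Return True if the response contains any of the code markers."""
    return _CODE_RE.search(response) is not None


def pot_fraction(responses: list[str]) -> float:
    """Fraction of responses flagged as program-of-thought style."""
    if not responses:
        return 0.0
    return sum(looks_like_pot(r) for r in responses) / len(responses)


# Toy inputs: one response with code, one with plain reasoning.
samples = [
    "Let's implement this in Python.\nimport sympy as sp\nprint(sp.sqrt(2))",
    "Each pair sums to -1 and there are 50 pairs, so the answer is -50.",
]
print(pot_fraction(samples))
```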

Best regards,
Fangkai

@Zeng-WH
Collaborator

Zeng-WH commented Jan 27, 2025

Yes, we observed similar phenomena, which we also mentioned in our Notion:

“We observe a length decrease in the initial stage because we found the Qwen2.5-Math-7B base model tended to generate both language and code in the response, resulting in lengthy outputs. This default pattern is quickly discouraged throughout RL and the model learns to output in a more appropriate format, and then the length starts to increase regularly. After just a few training steps, we also experienced the "aha moment" described in the DeepSeek-R1 paper — the emergence of self reflection in the model's responses.”

https://hkust-nlp.notion.site/simplerl-reason
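
One way to look for the "length decreases, then increases" pattern in your own run is to average response length per training step. The sketch below assumes you have a hypothetical `logs` mapping from step number to sampled responses; whitespace splitting stands in for a real tokenizer, and the placeholder strings are not real measurements.

```python
from statistics import mean


def mean_length_by_step(logs: dict[int, list[str]]) -> dict[int, float]:
    """Mean whitespace-token count of the responses sampled at each step.

    `logs` is a hypothetical structure: {training_step: [response, ...]}.
    Steps with no responses are skipped.
    """
    return {
        step: float(mean(len(r.split()) for r in responses))
        for step, responses in sorted(logs.items())
        if responses
    }


# Placeholder responses of known length, only to show the call shape.
logs = {
    0: ["tok " * 40],
    10: ["tok " * 20],
    20: ["tok " * 30],
}
for step, length in mean_length_by_step(logs).items():
    print(step, length)
```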
