[Bug Report] InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus" #500
Comments
Yes, I think that is a reasonable thing to consider adding for v5. @rodrigodelazcano thoughts?
The same appears to be the case for InvertedPendulumEnv.
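For InvertedPendulumEnv the issue is analogous: the per-step reward is a constant 1.0, including on the step that ends the episode. The sketch below is a pure-Python model of that reward, not Gymnasium code. It assumes the v4 termination rule is "any non-finite observation, or pole angle `obs[1]` beyond 0.2 rad", and `is_terminated`, `reward_v4` and `reward_fixed` are hypothetical helpers written for illustration.

```python
import math


def is_terminated(obs: list[float]) -> bool:
    """Hypothetical helper modelling the assumed InvertedPendulum-v4 rule:
    terminate if any entry is non-finite or |pole angle| (obs[1]) > 0.2 rad."""
    return (not all(math.isfinite(v) for v in obs)) or abs(obs[1]) > 0.2


def reward_v4(obs: list[float]) -> float:
    # v4 behaviour: a constant 1.0, even on the terminating step.
    return 1.0


def reward_fixed(obs: list[float]) -> float:
    # Proposed behaviour: the bonus is paid only if the step does not terminate.
    return 0.0 if is_terminated(obs) else 1.0


upright = [0.0, 0.05, 0.0, 0.0]  # cart position, pole angle, two velocities
fallen = [0.0, 0.30, 0.0, 0.0]   # pole angle past the assumed 0.2 rad limit

print(reward_v4(fallen), reward_fixed(fallen))  # 1.0 0.0
```

On non-terminating steps both versions agree; they differ only on the single terminating step.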
That is a good catch. I agree with @pseudo-rnd-thoughts. This should be added to v5, since v4 only moved to the new mujoco bindings and this reward error also exists in older versions.
(Title changed: InvertedDoublePendulumEnv always gives "alive_bonus" → InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus")
Great, I will add it to the v5 change list.
Here is some code verifying the bugs.
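A minimal way to see the InvertedDoublePendulumEnv bug without running MuJoCo is to model the step reward directly. This sketch assumes the v4 `step()` computes `dist_penalty = 0.01 * x**2 + (y - 2) ** 2` on the pole-tip position, `vel_penalty = 1e-3 * v1**2 + 5e-3 * v2**2`, a hard-coded alive bonus of 10, and terminates when `y <= 1`; the function names are hypothetical and this is not the library's code.

```python
ALIVE_BONUS = 10.0  # assumed value of the hard-coded alive_bonus in v4


def penalties(x: float, y: float, v1: float, v2: float) -> float:
    # Distance penalty on the pole tip (x, y) plus a velocity penalty.
    dist_penalty = 0.01 * x**2 + (y - 2) ** 2
    vel_penalty = 1e-3 * v1**2 + 5e-3 * v2**2
    return dist_penalty + vel_penalty


def step_v4(x: float, y: float, v1: float, v2: float) -> tuple[float, bool]:
    # Modelled v4 logic: the bonus is always added, even when terminating.
    terminated = y <= 1
    return ALIVE_BONUS - penalties(x, y, v1, v2), terminated


def step_fixed(x: float, y: float, v1: float, v2: float) -> tuple[float, bool]:
    # Proposed logic: no alive bonus on the terminating step.
    terminated = y <= 1
    bonus = 0.0 if terminated else ALIVE_BONUS
    return bonus - penalties(x, y, v1, v2), terminated


# The tip has dropped below y = 1, so this step terminates the episode.
r_old, done = step_v4(0.0, 0.9, 0.0, 0.0)
r_new, _ = step_fixed(0.0, 0.9, 0.0, 0.0)
print(done, round(r_old, 2), round(r_new, 2))  # True 8.79 -1.21
```

Under the modelled v4 logic, falling over still earns a large positive reward; with the fix, the terminating step only carries the penalties.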
This is a massive reward difference.
The 10-point difference in episodic return only occurs if the episode terminates (which, after some training, does not happen regardless of the reward function). The best policy across all the cases reached the same return (~9360); with the fixed reward function it is just possible to get there more consistently. Note: I have double-checked the source code, and nothing is wrong there.
No, your understanding of the change in the reward function is correct.
Then why the ~4000-point difference? To me, if the agents were already reaching the optimal result, the difference should be 10 points on average.
Because on some runs with the old reward function, the agent is not able to learn how to "escape" an unbalanced state. The optimal results are identical with both reward functions (since the "optimal" policy would never reach an unbalanced state).
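The 10-point bound discussed above can be checked with a small model: scoring the same trajectory under both reward functions changes the return by exactly one alive bonus if the episode terminates, and not at all if it is truncated. So a gap of thousands of points has to come from the agents learning different policies, not from re-scoring identical behaviour. `episode_return` is a hypothetical helper, the per-step penalties are an input rather than simulated, and an alive bonus of 10 is assumed.

```python
def episode_return(step_penalties: list[float], terminated: bool,
                   fixed: bool, alive_bonus: float = 10.0) -> float:
    """Sum of per-step rewards for one episode (hypothetical model).

    step_penalties: distance + velocity penalty of each step.
    terminated: True if the last step ended the episode by termination
                (as opposed to truncation at the time limit).
    fixed: score with the proposed reward (no bonus on the terminating step).
    """
    total = 0.0
    last = len(step_penalties) - 1
    for t, p in enumerate(step_penalties):
        no_bonus = fixed and terminated and t == last
        total += (0.0 if no_bonus else alive_bonus) - p
    return total


# One trajectory of 50 steps with zero penalties that ends by termination.
pens = [0.0] * 50
print(episode_return(pens, True, fixed=False))  # 500.0
print(episode_return(pens, True, fixed=True))   # 490.0

# A truncated episode is scored identically by both reward functions.
same = episode_return(pens, False, fixed=False) == episode_return(pens, False, fixed=True)
print(same)  # True
```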
Wow, it is amazing that changing just that variable causes such a massive change in performance.
Describe the bug
Gymnasium/gymnasium/envs/mujoco/inverted_double_pendulum_v4.py
Line 155 in c4f67b9
This is given regardless of whether the step is terminating or not. Shouldn't it only be given on non-terminating steps?
Code example
No response
System info
No response
Additional context
No response
Checklist