
[Bug Report] InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus" #500

Closed
Kallinteris-Andreas opened this issue May 11, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@Kallinteris-Andreas
Collaborator

Kallinteris-Andreas commented May 11, 2023

Describe the bug

The alive_bonus is awarded regardless of whether the step is terminating or not.

Shouldn't it be

 alive_bonus = 10 * (not terminated)
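
For context, here is a sketch of the reward term in question, paraphrasing the env's step() computation rather than quoting the exact Gymnasium source (the penalty arguments and names are assumed):

def double_pendulum_reward(y, dist_penalty, vel_penalty):
    # Sketch only: y stands for the pendulum tip height, dist_penalty and
    # vel_penalty stand in for the env's distance and velocity penalty terms.
    terminated = bool(y <= 1)

    # current v4 behaviour: the bonus is paid unconditionally,
    # even on the step where the pendulum falls over
    reward_v4 = 10 - dist_penalty - vel_penalty

    # proposed behaviour: withhold the bonus on the terminating step
    alive_bonus = 10 * (not terminated)
    reward_fixed = alive_bonus - dist_penalty - vel_penalty

    return reward_v4, reward_fixed, terminated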

Code example

No response

System info

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@Kallinteris-Andreas Kallinteris-Andreas added the bug Something isn't working label May 11, 2023
@pseudo-rnd-thoughts
Member

Yes, I think that is a reasonable thing to consider adding for v5. @rodrigodelazcano thoughts?

@Kallinteris-Andreas
Collaborator Author


The same appears to be the case for InvertedPendulumEnv

@rodrigodelazcano
Member

That is a good catch. I agree with @pseudo-rnd-thoughts. This should be fixed in v5, since v4 only updated to the new mujoco bindings and this reward error is present in older versions as well.

@Kallinteris-Andreas Kallinteris-Andreas changed the title [Bug Report] InvertedDoublePendulumEnv always gives "alive_bonus" [Bug Report] InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus" May 11, 2023
@Kallinteris-Andreas
Collaborator Author

Great, I will add it to the v5 change list.

@Kallinteris-Andreas
Collaborator Author

Here is some code verifying the bug in both environments:

>>> import gymnasium
>>> env = gymnasium.make('InvertedPendulum-v4')
>>> env.reset()
(array([-0.00114481,  0.00315834, -0.00689603, -0.00764207]), {})
>>> env.step([1])
(array([ 0.0052199 , -0.01239018,  0.32425438, -0.76226102]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.02474693, -0.05746427,  0.65169342, -1.48966764]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.05732965, -0.13159401,  0.97709572, -2.21890001]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.10286945, -0.23521879,  1.29895519, -2.96571882]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.16112042, -0.36907483,  1.6112052 , -3.72861976]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.23148975, -0.53346372,  1.902614  , -4.48774083]), 1.0, True, False, {})
>>> env = gymnasium.make('InvertedDoublePendulum-v4')
>>> env.reset()
(array([-0.05209413, -0.03106399, -0.05757982,  0.9995174 ,  0.99834091,
       -0.00319314, -0.10766195,  0.09683618,  0.        ,  0.        ,
        0.        ]), {})
>>> env.step([1])
(array([ 7.67962813e-04, -1.44606909e-01,  8.22320453e-02,  9.89489182e-01,
        9.96613210e-01,  2.11193186e+00, -4.45134196e+00,  5.49346477e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00]), 9.17710405832815, False, False, {})
>>> env.step([1])
(array([ 0.15545075, -0.44371616,  0.43168352,  0.89616738,  0.90202513,
        3.9987776 , -7.74866516,  7.9830774 ,  0.        ,  0.        ,
        0.        ]), 8.877556821859912, False, False, {})
>>> env.step([1])
(array([ 0.39199627, -0.77051144,  0.69186829,  0.63742617,  0.72202373,
        5.38530052, -8.71215195,  3.8089483 ,  0.        ,  0.        ,
        0.        ]), 8.807853136081622, True, False, {})
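
The same check as a standalone script (a sketch: step with a constant action until the episode ends and print the reward returned on the terminating step, which with the bug still includes the alive bonus):

import gymnasium

for env_id in ("InvertedPendulum-v4", "InvertedDoublePendulum-v4"):
    env = gymnasium.make(env_id)
    env.reset(seed=0)
    terminated = truncated = False
    while not (terminated or truncated):
        _, reward, terminated, truncated, _ = env.step([1])
    # With the bug, the alive bonus is still part of this reward even
    # though terminated is True.
    print(env_id, "reward on terminating step:", reward)
    env.close()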

@Kallinteris-Andreas
Collaborator Author

Kallinteris-Andreas commented May 13, 2023

[Figure: Pendulum-v4-vs-v5 learning curves]
Here is the v4 vs v5 comparison for InvertedPendulum (the only difference in v5 is the reward fix).
As expected, the v5 version has a faster learning transient.

@Kallinteris-Andreas
Collaborator Author


  • v4 is the current v4 version
  • v4-fixed is the current v4 version, with the reward_alive fixed
  • v5 is the current v4 version, with the reward_alive fixed and the observation fix (#228)
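
For anyone who wants to approximate the "v4-fixed" variant without patching the package, here is a minimal wrapper sketch (the class name is mine; the bonus values come from the transcripts above: 1 for InvertedPendulum, 10 for InvertedDoublePendulum):

import gymnasium

class WithheldAliveBonus(gymnasium.Wrapper):
    """Subtract the (erroneously paid) alive bonus on the terminating step."""

    def __init__(self, env, alive_bonus):
        super().__init__(env)
        self.alive_bonus = alive_bonus

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            reward -= self.alive_bonus  # undo the bonus paid on the falling step
        return obs, reward, terminated, truncated, info

# usage:
env = WithheldAliveBonus(gymnasium.make("InvertedDoublePendulum-v4"), alive_bonus=10)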

@pseudo-rnd-thoughts
Member

This is a massive reward difference.
To me, this shouldn't explain the performance difference: the only change is that alive_bonus = 0 when terminated=True, so I would expect the episode reward to be only about 10 points lower.
@Kallinteris-Andreas Am I misunderstanding something?

@Kallinteris-Andreas
Collaborator Author

The ~10-point drop in episodic reward happens only if the episode terminates (which stops happening after some training, regardless of the reward function).

The best policy in all cases reached the same return (~9360); it is just that, with the fixed reward function, the agent gets there more consistently.

Note: I have double-checked the source code; nothing is wrong there.

@pseudo-rnd-thoughts
Member

That doesn't explain the ~4000-point increase shown in the plots above.

To me, the only change to the reward function is that reward_alive=0 when terminated=True. Have I misunderstood something?

@Kallinteris-Andreas
Collaborator Author

Kallinteris-Andreas commented May 15, 2023

No, your understanding of the change to the reward function is correct.

@pseudo-rnd-thoughts
Member

Then why the ~4000-point difference? To me, if the agents were already collecting the optimal result, then the difference should be about 10 points on average.

@Kallinteris-Andreas
Collaborator Author

Because on some runs with the old reward function, the agent never learns how to "escape" an unbalanced state, so those runs end with much lower returns.

The optimal results are identical with both reward functions (since the "optimal" policy would never be unbalanced).
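
To put rough numbers on that (a sketch; the per-step value is inferred from the ~9360 best return reported above and the usual 1000-step time limit):

per_step = 9.36          # roughly what one balanced step is worth (inferred)
print(1000 * per_step)   # ~9360: a run that stays balanced for the whole episode
print(500 * per_step)    # ~4680: a run that falls halfway through; the gap is
                         # thousands of points, far more than the 10-point bonus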

@pseudo-rnd-thoughts
Member

Wow, that is amazing if purely changing that variable causes such a massive change in performance.
