
[Bug Report] InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus" #500

Closed
Kallinteris-Andreas opened this issue May 11, 2023 · 14 comments
Labels
bug Something isn't working

Comments

@Kallinteris-Andreas
Collaborator

Kallinteris-Andreas commented May 11, 2023

Describe the bug

The alive_bonus is awarded regardless of whether the step is terminating or not.

Shouldn't it be

 alive_bonus = 10 * (not terminated)
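
For context, here is a sketch of the reward term in question, paraphrasing the env's step() computation rather than quoting the exact Gymnasium source (the penalty arguments and names are assumed):

def double_pendulum_reward(y, dist_penalty, vel_penalty):
    # Sketch only: y stands for the pendulum tip height, dist_penalty and
    # vel_penalty stand in for the env's distance and velocity penalty terms.
    terminated = bool(y <= 1)

    # current v4 behaviour: the bonus is paid unconditionally,
    # even on the step where the pendulum falls over
    reward_v4 = 10 - dist_penalty - vel_penalty

    # proposed behaviour: withhold the bonus on the terminating step
    alive_bonus = 10 * (not terminated)
    reward_fixed = alive_bonus - dist_penalty - vel_penalty

    return reward_v4, reward_fixed, terminated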

Code example

No response

System info

No response

Additional context

No response

Checklist

  • I have checked that there is no similar issue in the repo
@Kallinteris-Andreas Kallinteris-Andreas added the bug Something isn't working label May 11, 2023
@pseudo-rnd-thoughts
Member

Yes, I think that is a reasonable thing to consider adding for v5. @rodrigodelazcano thoughts?

@Kallinteris-Andreas
Collaborator Author


The same appears to be the case for InvertedPendulumEnv

@rodrigodelazcano
Member

That is a good catch. I agree with @pseudo-rnd-thoughts. This should be fixed in v5, since v4 only updated to the new mujoco bindings and this reward error is present in older versions as well.

@Kallinteris-Andreas Kallinteris-Andreas changed the title [Bug Report] InvertedDoublePendulumEnv always gives "alive_bonus" [Bug Report] InvertedDoublePendulumEnv and InvertedPendulumEnv always gives "alive_bonus" May 11, 2023
@Kallinteris-Andreas
Collaborator Author

Great, I will add it to the v5 change list.

@Kallinteris-Andreas
Collaborator Author

Here is some code verifying the bug in both environments:

>>> import gymnasium
>>> env = gymnasium.make('InvertedPendulum-v4')
>>> env.reset()
(array([-0.00114481,  0.00315834, -0.00689603, -0.00764207]), {})
>>> env.step([1])
(array([ 0.0052199 , -0.01239018,  0.32425438, -0.76226102]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.02474693, -0.05746427,  0.65169342, -1.48966764]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.05732965, -0.13159401,  0.97709572, -2.21890001]), 1.0, False, False, {})
>>> env.step([1])
(array([ 0.10286945, -0.23521879,  1.29895519, -2.96571882]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.16112042, -0.36907483,  1.6112052 , -3.72861976]), 1.0, True, False, {})
>>> env.step([1])
(array([ 0.23148975, -0.53346372,  1.902614  , -4.48774083]), 1.0, True, False, {})
>>> env = gymnasium.make('InvertedDoublePendulum-v4')
>>> env.reset()
(array([-0.05209413, -0.03106399, -0.05757982,  0.9995174 ,  0.99834091,
       -0.00319314, -0.10766195,  0.09683618,  0.        ,  0.        ,
        0.        ]), {})
>>> env.step([1])
(array([ 7.67962813e-04, -1.44606909e-01,  8.22320453e-02,  9.89489182e-01,
        9.96613210e-01,  2.11193186e+00, -4.45134196e+00,  5.49346477e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00]), 9.17710405832815, False, False, {})
>>> env.step([1])
(array([ 0.15545075, -0.44371616,  0.43168352,  0.89616738,  0.90202513,
        3.9987776 , -7.74866516,  7.9830774 ,  0.        ,  0.        ,
        0.        ]), 8.877556821859912, False, False, {})
>>> env.step([1])
(array([ 0.39199627, -0.77051144,  0.69186829,  0.63742617,  0.72202373,
        5.38530052, -8.71215195,  3.8089483 ,  0.        ,  0.        ,
        0.        ]), 8.807853136081622, True, False, {})
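
The same check as a standalone script (a sketch: step with a constant action until the episode ends and print the reward returned on the terminating step, which with the bug still includes the alive bonus):

import gymnasium

for env_id in ("InvertedPendulum-v4", "InvertedDoublePendulum-v4"):
    env = gymnasium.make(env_id)
    env.reset(seed=0)
    terminated = truncated = False
    while not (terminated or truncated):
        _, reward, terminated, truncated, _ = env.step([1])
    # With the bug, the alive bonus is still part of this reward even
    # though terminated is True.
    print(env_id, "reward on terminating step:", reward)
    env.close()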

@Kallinteris-Andreas
Collaborator Author

Kallinteris-Andreas commented May 13, 2023

[Figure: Pendulum-v4-vs-v5 learning curves]
Here is the v4 vs v5 comparison for InvertedPendulum (the only difference in v5 is the reward fix).
As expected, the v5 version has a faster learning transient.

@Kallinteris-Andreas
Collaborator Author


  • v4 is the current v4 version
  • v4-fixed is the current v4 version, with the reward_alive fixed
  • v5 is the current v4 version, with the reward_alive fixed and the observation fix (#228)
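
For anyone who wants to approximate the "v4-fixed" variant without patching the package, here is a minimal wrapper sketch (the class name is mine; the bonus values come from the transcripts above: 1 for InvertedPendulum, 10 for InvertedDoublePendulum):

import gymnasium

class WithheldAliveBonus(gymnasium.Wrapper):
    """Subtract the (erroneously paid) alive bonus on the terminating step."""

    def __init__(self, env, alive_bonus):
        super().__init__(env)
        self.alive_bonus = alive_bonus

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            reward -= self.alive_bonus  # undo the bonus paid on the falling step
        return obs, reward, terminated, truncated, info

# usage:
env = WithheldAliveBonus(gymnasium.make("InvertedDoublePendulum-v4"), alive_bonus=10)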

@pseudo-rnd-thoughts
Member

This is a massive reward difference.
To me, this shouldn't explain the performance difference: the only change is that alive_bonus = 0 when terminated=True, so I would expect the episode reward to be only about 10 points lower.
@Kallinteris-Andreas Am I misunderstanding something?

@Kallinteris-Andreas
Collaborator Author

The ~10-point drop in episodic reward happens only if the episode terminates (which stops happening after some training, regardless of the reward function).

The best policy in all cases reached the same return (~9360); it is just that, with the fixed reward function, the agent gets there more consistently.

Note: I have double-checked the source code; nothing is wrong there.

@pseudo-rnd-thoughts
Member

That doesn't explain the ~4000-point increase shown in the plots above.

To me, the only change to the reward function is that reward_alive=0 when terminated=True. Have I misunderstood something?

@Kallinteris-Andreas
Collaborator Author

Kallinteris-Andreas commented May 15, 2023

No, your understanding of the change to the reward function is correct.

@pseudo-rnd-thoughts
Member

Then why the ~4000-point difference? To me, if the agents were already collecting the optimal result, then the difference should be about 10 points on average.

@Kallinteris-Andreas
Collaborator Author

Because on some runs with the old reward function, the agent never learns how to "escape" an unbalanced state, so those runs end with much lower returns.

The optimal results are identical with both reward functions (since the "optimal" policy would never be unbalanced).
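
To put rough numbers on that (a sketch; the per-step value is inferred from the ~9360 best return reported above and the usual 1000-step time limit):

per_step = 9.36          # roughly what one balanced step is worth (inferred)
print(1000 * per_step)   # ~9360: a run that stays balanced for the whole episode
print(500 * per_step)    # ~4680: a run that falls halfway through; the gap is
                         # thousands of points, far more than the 10-point bonus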

@pseudo-rnd-thoughts
Member

Wow, that is amazing if purely changing that variable causes such a massive change in performance.
