[Bug Report] `CartPole`'s reward function constantly returns 1 (even when it falls) #790
Comments
Good question and analysis.
The reason I suggest fixing this "bug" is that even though it is easily solvable by existing learning algorithms, it may not be solvable by a "future learning algorithm", and this would cause the developer of that "future learning algorithm" to reject it during the prototyping phase, and not evaluate it further, after it fails in a "simple" environment.
Sorry, I'm not sure what you are suggesting we should do here. Do you want to create a CartPole v2 that is very different from v0 or v1?
To summarize a bit
That is fair, but I think the second aspect is just as important: the agent is meant to maximise the total reward over time, and since the agent is terminated after falling, it is not an optimal strategy to fall immediately.
I don't think it's correct to say that the reward function is wrong. The current reward function still forces an agent to learn to balance the pole in order to maximize return, which is the discounted sum of rewards. That's what an RL algorithm does. Now, I agree that the reward defined in CartPole is silly, but only because empirically (in my experience), it makes Q-learning much more difficult than it needs to be. The Q values for good actions are effectively unbounded (subject only to the time limit and the discount), so the variance is quite high. However, I agree with @pseudo-rnd-thoughts that the goal should be to accurately reflect the original CartPole problem defined by Sutton and Barto. That's why I was surprised when I looked at the original C code and found that it defines a reward of -1 for failure and 0 otherwise (which is also what I found to be a good reward through experimentation). Not sure what should be done with this information, but there you go.
@pseudo-rnd-thoughts, other than an aversion to making a new revision, is there another reason you oppose fixing this bug?
Yes, I think the version change is my largest worry, but if we change to 0 and -1 rewards, then to me this is a completely different environment compared to the current implementation. Changing only the final reward makes more sense, and we could do this with a simple parameter for the environment, but experimentally I think you have shown this makes minimal difference.
I just want to point out that with the current setup you effectively get a "reward of 50" for staying up 50 steps, since the per-step rewards accumulate over the episode.
This is not the case. You never get a "reward of 50" for staying up 50 steps. Instead you only get a reward of 1 for any step, even the terminated steps. Then you stuff all those steps into your memory and repeatedly train on them, always setting the 'y_true' in your loss function to 1 more than the next state's Q value, except in terminal cases where the target-network predictions are zeroed out, so there you always set 'y_true' == 1. It's a terrible setup, but it's still possible to learn a good policy. Ultimately, states that are 1 step away from termination (example: if you go right you will terminate) will have their Q values approach 2. This is because the terminated state will end up, over time, having an average Q value of 1, and so the state before termination will be asked to predict its next state's value, which we know is 1. Then we add 1 reward, and so the 'y_true' passed to the loss function over and over is 2. You can repeat the logic and see that states 2 steps away approach 3, and so on.
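For concreteness, here is a minimal sketch of the target computation described above, written in plain NumPy (the function and variable names are illustrative, not taken from any particular library):

```python
import numpy as np

def dqn_targets(rewards, next_q_values, terminated, gamma=0.99):
    """Compute the 'y_true' targets for a batch of replayed transitions.

    rewards:       (N,) per-step rewards -- always 1.0 in CartPole-v1,
                   even on the terminating step.
    next_q_values: (N,) max_a Q_target(s', a) from the target network.
    terminated:    (N,) bools, True where the episode ended on that step.
    """
    # Target-network predictions are zeroed out for terminal transitions,
    # so terminal targets collapse to the reward alone (i.e. 1.0 here).
    bootstrap = np.where(terminated, 0.0, gamma * next_q_values)
    return rewards + bootstrap

# With reward == 1 everywhere, terminal targets are exactly 1, states one step
# from termination drift toward 1 + gamma * 1 ~= 2, two steps away toward ~= 3, etc.
```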
I think @Kallinteris-Andreas's suggestion option 1 is good, but I would add to it. Changing this setting also reduces training time, because we then store the correct 'x' and 'theta' values in the state used for training, and not the previous state's x and theta, which is what it does now by default. Additionally, changing the terminated reward to -1 helps to reduce training cycles that do not contribute at all to learning the policy, which is what happens when the terminated state is also given a reward of 1. Here's why: the initial Q values of a randomly initialized network are near zero. With the default reward of 1 for everything, even terminal states will have Q values that approach 1 over time. All of the initial training pushes both "good" middle-ish states and terminal states up to Q >= 1, and only above 1 do the Q values start to separate and drive any actual learning.
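As a rough sketch of how one could experiment with the -1 terminal reward without editing the environment source (assuming the standard Gymnasium wrapper and step API; the wrapper name here is made up):

```python
import gymnasium as gym

class TerminalPenaltyWrapper(gym.Wrapper):
    """Give -1 on the terminating step and leave all other rewards unchanged."""

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        if terminated:
            reward = -1.0  # the fall itself is now explicitly penalized
        return obs, reward, terminated, truncated, info

env = TerminalPenaltyWrapper(gym.make("CartPole-v1"))
```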
One other note: https://github.com/marek-robak/Double-cartpole-custom-gym-env-for-reinforcement-learning uses a different reward approach: "The reward function consists of linear dependence of the cart distance from the center of the available space and a -50 penalty for premature termination of the episode." I don't think we should go this direction because it changes the problem too much. I think we should stay as close as possible to the current reward scheme, because it is actually an interesting problem to dissect: "how is it possible to learn a policy when the only rewards you get are from the termination and nothing else?" And interestingly, with enough training, the policies will eventually keep the pole up and in the middle nearly perfectly. So the termination cases become boundaries that the rest of the non-terminal cases smooth over.
@pseudo-rnd-thoughts I have made my arguments on why it should be fixed; it is primarily that reference environments like CartPole should be correct, and a few more people have weighed in with additional analysis of the consequences of the bug. If you believe it is not worth fixing, please close this issue and label it as such.
I think my preferred solution would be to add a parameter for this. Does that work for you, @Kallinteris-Andreas?
I would add that I should have time next week to work on it; or if you prefer to start it, go for it.
Describe the bug
The `CartPole` environment provides `reward==1` when the pole "stands" and `reward==1` when the pole has "fallen". The old gym documentation mentioned that this was the behavior, and so does the current documentation, indicating that this is the desired behavior, but I can find no evidence that this was the design goal.

The argument could be made that "state-of-the-art algorithms (e.g. `DQN`, `A2C`, `PPO`) can easily solve this environment anyway, so why bother?". It is true that the environment is very easy to solve, but it is still important to examine why, and what the consequences are for these learning algorithms and potential future ones.

The reason the algorithms are able to learn a policy, despite the fact that the environment is effectively reward-free, is the sampling bias introduced by the number of actions taken per given policy. If the agent has a "good policy", then on average it will be alive longer than if it has a "bad policy", and this will cause the "good policy" to be sampled more often on average by the `train()` subroutine of the learning algorithm. Therefore a "good policy" will be reinforced more than a "bad policy" (instead of the "good policy" being reinforced and the "bad policy" being reduced, which is what normally happens in RL).

This would mean that RL algorithms that are not affected by this sampling bias (or are much less affected by it) would not be able to learn a policy for the `CartPole-v1` environment. And the CartPole environment is where many RL algorithms are tested during the prototyping phase (since it is probably assumed that if a learning algorithm cannot solve CartPole, it is probably a dud). Therefore, an algorithm that is less affected by this sampling bias might fail the prototyping phase because of the wrong implementation of `Gymnasium/CartPole`.
Quick Performance Analysis

- `DQN` seems to benefit significantly from fixing this bug.
- `A2C` seems to benefit slightly from fixing this bug.
- `PPO` does not seem to be affected by this bug, as it can already easily solve the problem.

Note: The shaded area is the `(min, max)` episode length of each algorithm.

Suggestion
Suggestion 1 (the one I recommend)
Fix this bug and update `CartPole` to `v2`, and add an argument to recreate the old behavior.
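A rough sketch of what this could look like for users (the `CartPole-v2` ID and the `legacy_reward` argument name are hypothetical, used only for illustration):

```python
import gymnasium as gym

# Hypothetical corrected environment: reward 1 while balancing,
# a different reward (e.g. 0 or -1) on the terminating step.
env = gym.make("CartPole-v2")

# Hypothetical opt-in flag that recreates the old v0/v1 behavior
# (a constant reward of 1 on every step, including termination).
legacy_env = gym.make("CartPole-v2", legacy_reward=True)
```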
Suggestion 2
Do not fix this bug, but update the description to clearly indicate that this is a "reward-free" variant of the cart pole environment.
Additional references
- Old `openai/gym` related issues: openai/gym#1682, openai/gym#704, openai/gym#21.
- This issue has existed since `gym=0.0.1` (the hello world commit by gdb).
- It is the same issue that the `MuJoCo/Pendulum` environments had: #500 and #526.
- The original CartPole implementation by Sutton did not have a constant reward function, nor did the paper.
Code example
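A minimal sketch that reproduces the reported behavior with the standard Gymnasium API:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
env.reset(seed=0)

terminated = truncated = False
while not (terminated or truncated):
    # Always push the cart to the right so the pole falls quickly.
    obs, reward, terminated, truncated, info = env.step(1)

# Prints "True 1.0": the reward is still 1 on the step where the pole falls.
print(terminated, reward)
env.close()
```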
Additional context
Gymnasium/gymnasium/envs/classic_control/cartpole.py, lines 188 to 203 (at commit 34872e9)
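For readers without the permalink at hand, the referenced lines implement reward logic roughly like the sketch below (a paraphrase for illustration, not a verbatim copy of the file):

```python
def cartpole_reward_sketch(x, theta, x_threshold, theta_threshold, steps_beyond_terminated):
    """Paraphrase of the reward logic in the referenced lines (illustrative, not verbatim)."""
    terminated = bool(
        x < -x_threshold
        or x > x_threshold
        or theta < -theta_threshold
        or theta > theta_threshold
    )

    if not terminated:
        reward = 1.0
    elif steps_beyond_terminated is None:
        # The pole has just fallen: this terminating step *also* returns 1.0,
        # which is exactly the behavior this issue is about.
        steps_beyond_terminated = 0
        reward = 1.0
    else:
        # Stepping again after termination: the environment warns and returns 0.0.
        steps_beyond_terminated += 1
        reward = 0.0

    return reward, terminated, steps_beyond_terminated
```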
Checklist