[Question] Wrong scaled_action for continuous actions in _sample_action()
#1269

❓ Question

The following code samples an action for an off-policy algorithm. As the comments indicate, the continuous actions obtained in line 395 should have already been scaled by tanh, which puts them in the range (-1, 1). However, in line 399, the action is scaled again, which makes the valid action space even smaller. Is it a potential bug or just my misunderstanding?

stable-baselines3/stable_baselines3/common/off_policy_algorithm.py
Lines 388 to 399 in 5aa6e7d
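Roughly, the referenced continuous-action branch does the following (a paraphrase for orientation only; the exact code at 5aa6e7d may differ slightly):

```python
# Paraphrased sketch of the continuous-action branch of _sample_action()
# (not the verbatim source; np is numpy, spaces is gym.spaces,
# and self is the off-policy algorithm instance).

# The policy output is assumed to be squashed by tanh, then rescaled to
# [low, high] inside predict() -- roughly line 395 per the question
unscaled_action, _ = self.predict(self._last_obs, deterministic=False)

if isinstance(self.action_space, spaces.Box):
    # Map the action from [low, high] back to [-1, 1] -- roughly line 399 per the question
    scaled_action = self.policy.scale_action(unscaled_action)

    # Add noise for exploration; the result stays in [-1, 1]
    if action_noise is not None:
        scaled_action = np.clip(scaled_action + action_noise(), -1, 1)

    # The [-1, 1] action is stored in the replay buffer,
    # while the rescaled one is what the environment actually receives
    buffer_action = scaled_action
    action = self.policy.unscale_action(scaled_action)
```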
Below I attach a modified version to demonstrate my idea.

Comments

Hello,
I think the comment is misleading.

stable-baselines3/stable_baselines3/common/policies.py
Lines 347 to 354 in 30a1984

so the scaling afterward does scale the action back to [-1, 1] for storing it in the replay buffer. I would be happy to receive a PR that updates the comment ;)

EDIT: the comment is right (it describes the assumption being made), but it is misleading.
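For reference, the block referenced in policies.py does something along these lines (a paraphrase, not the exact source at 30a1984): when the policy squashes its output with tanh, predict() is where the [-1, 1] action gets rescaled to the environment's [low, high] bounds.

```python
# Paraphrased sketch of the post-processing in BasePolicy.predict()
# (assumed structure, not verbatim; np is numpy, spaces is gym.spaces).
if isinstance(self.action_space, spaces.Box):
    if self.squash_output:
        # The network output is already in [-1, 1] (tanh), so only rescale
        # it to the environment bounds [low, high]
        actions = self.unscale_action(actions)
    else:
        # Actions could be on an arbitrary scale -> clip to stay in bounds
        actions = np.clip(actions, self.action_space.low, self.action_space.high)
```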
Thank you for the clear explanation! I would be happy to open a PR once I come up with a better comment.
I wonder why we need to store the unscaled action in the replay buffer instead of the final action actually taken in the environment.
We need to store the action sampled from the underlying action distribution. For TD3/DDPG, which don't rely on a probability distribution, you will have issues when you want to add noise to it or when you update the q-values with actions that cannot come from the actor (see vwxyzjn/cleanrl#279 for the kind of issues you may run into). Of course, you could do the scaling everywhere in the gradient update, but this is bug-prone, and having everything in [-1, 1] makes things simpler.
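As a rough illustration (assumed TD3-style target policy smoothing; the variable names are made up for the example and are not SB3's exact code), keeping buffer actions in [-1, 1] makes the critic update trivial to write:

```python
import torch as th

# Illustrative sketch only: with replay-buffer actions kept in [-1, 1],
# target policy smoothing is a single symmetric clamp.
noise = (th.randn_like(buffer_actions) * target_policy_noise).clamp(-noise_clip, noise_clip)
next_actions = (actor_target(next_observations) + noise).clamp(-1.0, 1.0)
# If the buffer instead held actions in [low, high], the noise scale and the
# clipping bounds would have to be handled per action dimension.
```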
Hello @araffin, I'm curious: why use the following code to scale the action?
How would you do it otherwise?
I think I have understood the reason for the above approach. The scaling is based on min-max normalization and applies a linear change, so the range of the action can be limited to [-1, 1]. Thank you for your reply.
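For completeness, a small standalone sketch of that linear min-max mapping (the helper names mirror the ones discussed above, but this is not the library code itself):

```python
import numpy as np

def scale_action(action, low, high):
    """Linearly map an action from [low, high] to [-1, 1] (min-max normalization)."""
    return 2.0 * (action - low) / (high - low) - 1.0

def unscale_action(scaled_action, low, high):
    """Inverse mapping: from [-1, 1] back to [low, high]."""
    return low + 0.5 * (scaled_action + 1.0) * (high - low)

# Example with a Box(low=0.0, high=2.0) action space
low, high = np.array([0.0]), np.array([2.0])
assert np.allclose(scale_action(np.array([1.0]), low, high), 0.0)
assert np.allclose(unscale_action(np.array([-1.0]), low, high), 0.0)
```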