
[Question] Wrong scaled_action for continuous actions in _sample_action()? #1269

Open · 4 tasks done
KiwiXR opened this issue Jan 10, 2023 · 7 comments
Labels: documentation, question

KiwiXR commented Jan 10, 2023

❓ Question

The following code samples an action for an off-policy algorithm. As the comments indicate, the continuous action obtained in line 395 should already have been scaled by tanh, which puts it in the range (-1, 1). However, in line 399 the action is scaled again, which makes the valid action space even smaller. Is this a potential bug, or just my misunderstanding?

    if self.num_timesteps < learning_starts and not (self.use_sde and self.use_sde_at_warmup):
        # Warmup phase
        unscaled_action = np.array([self.action_space.sample() for _ in range(n_envs)])
    else:
        # Note: when using continuous actions,
        # we assume that the policy uses tanh to scale the action
        # We use non-deterministic action in the case of SAC, for TD3, it does not matter
        unscaled_action, _ = self.predict(self._last_obs, deterministic=False)

    # Rescale the action from [low, high] to [-1, 1]
    if isinstance(self.action_space, spaces.Box):
        scaled_action = self.policy.scale_action(unscaled_action)

Below I attach a modified version to demonstrate my idea.

    else:
        # Note: when using continuous actions,
        # we assume that the policy uses tanh to scale the action
        # We use non-deterministic action in the case of SAC, for TD3, it does not matter
        unscaled_action, _ = self.predict(self._last_obs, deterministic=False)
        # unscale in the continuous case, as the prediction is scaled
        if isinstance(self.action_space, gym.spaces.Box):
            unscaled_action = self.policy.unscale_action(unscaled_action)

    # Rescale the action from [low, high] to [-1, 1]
    if isinstance(self.action_space, gym.spaces.Box):
        scaled_action = self.policy.scale_action(unscaled_action)

Checklist

  • I have checked that there is no similar issue in the repo
  • I have read the documentation
  • If code there is, it is minimal and working
  • If code there is, it is formatted using the markdown code blocks for both code and stack traces.
KiwiXR added the question label on Jan 10, 2023
araffin (Member) commented Jan 10, 2023

Hello,

As the comments indicate, the continuous actions obtained in line 395 should have already been scaled by tanh, which puts them in the range (-1, 1).

I think the comment is misleading.
The output of predict() is in [low, high] (that is why the name is unscaled_action); the unscaling is done only when squashing is used (which is the case for all off-policy algorithms):

    if isinstance(self.action_space, spaces.Box):
        if self.squash_output:
            # Rescale to proper domain when using squashing
            actions = self.unscale_action(actions)
        else:
            # Actions could be on arbitrary scale, so clip the actions to avoid
            # out of bound error (e.g. if sampling from a Gaussian distribution)
            actions = np.clip(actions, self.action_space.low, self.action_space.high)

So the scaling afterwards does map the action back to [-1, 1] before it is stored in the replay buffer.
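
To make the flow concrete, here is a minimal sketch of the round trip (an illustration only, not the SB3 source; the 1-D Box with low=-2 and high=2 is hypothetical):

    import numpy as np

    low, high = np.array([-2.0]), np.array([2.0])

    def scale_action(action):
        # [low, high] -> [-1, 1], same linear map as policy.scale_action
        return 2.0 * ((action - low) / (high - low)) - 1.0

    def unscale_action(scaled_action):
        # [-1, 1] -> [low, high], the inverse linear map
        return low + 0.5 * (scaled_action + 1.0) * (high - low)

    squashed = np.tanh(np.array([1.5]))   # raw actor output, always in (-1, 1)
    unscaled = unscale_action(squashed)   # what predict() returns: in [low, high]
    scaled = scale_action(unscaled)       # what _sample_action() stores in the buffer
    assert np.allclose(scaled, squashed)  # round trip recovers the squashed action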

I would be happy to receive a PR that updates the comment ;)

EDIT: the comment is right (it states the assumption being made), but it is misleading.

araffin added the documentation label on Jan 10, 2023
KiwiXR (Author) commented Jan 11, 2023

Thank you for the clear explanation! I will be happy to open a PR once I come up with a better comment.

AmorWwQ commented Feb 8, 2023

I wonder why we need to store the scaled action in the replay buffer instead of the final (unscaled) action actually taken in the environment.

araffin (Member) commented Feb 8, 2023

I wonder why we need to store the scaled action in the replay buffer instead of the final (unscaled) action actually taken in the environment.

We need to store the action sampled from the underlying action distribution.
For SAC, as it uses a tanh, the scaled/unscaled actions are the same most of the time (when the bounds are symmetric and in [-1, 1]).
If you store the unscaled action (between [low, high]), then you will have issues when doing the gradient update (for instance, you will compute the likelihood of sampling such an action, and if low < -1 or high > 1, it is zero because SAC only outputs actions in [-1, 1]).

For TD3/DDPG, which don't rely on a probability distribution, you will have issues when you want to add noise to the action or when you update the Q-values with actions that cannot come from the actor (see vwxyzjn/cleanrl#279 for the kind of issues you may run into).

Of course, you could do the scaling everywhere in the gradient update, but this is bug-prone, and having everything in [-1, 1] makes things simpler.
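
To illustrate the SAC case with a toy example (a sketch for illustration only, assuming a hypothetical Box with low=-2 and high=2; not SB3 code):

    import numpy as np

    low, high = -2.0, 2.0                 # hypothetical Box bounds
    env_action = 1.8                      # action on the environment scale
    buffer_action = 2.0 * ((env_action - low) / (high - low)) - 1.0  # 0.9, in [-1, 1]

    # A tanh-squashed actor can only produce values strictly inside (-1, 1):
    print(np.tanh(10.0))                  # ~0.99999998, never 1.8
    # so the likelihood of the env-scale action 1.8 under the squashed Gaussian
    # is zero, whereas the normalized action 0.9 is a valid sample to learn from.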

wagh311 commented Apr 3, 2023

Hello @araffin, I'm curious: why use the following code to scale the action?

    return 2.0 * ((action - low) / (high - low)) - 1.0

In particular, why multiply by 2 and subtract 1?

araffin (Member) commented Apr 3, 2023

In particular, why multiply by 2 and subtract 1?

How would you do it otherwise?

wagh311 commented Apr 3, 2023

I think I understand the reason for the above approach now. The scaling is a min-max normalization followed by a linear change, which limits the range of the action to [-1, 1]. Thank you for your reply.
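
Spelling out the arithmetic with example numbers (low=0, high=10 and action=7.5 are made up for illustration):

    low, high, action = 0.0, 10.0, 7.5    # hypothetical bounds and action
    frac = (action - low) / (high - low)  # min-max normalization: 0.75, in [0, 1]
    scaled = 2.0 * frac - 1.0             # stretch to [0, 2], shift to [-1, 1]: 0.5
    # inverse map back to [low, high]:
    recovered = low + 0.5 * (scaled + 1.0) * (high - low)  # 7.5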
