
PPO timeout proper handling #198

Open · Tracked by #206
Howuhh opened this issue Jun 10, 2022 · 12 comments

Howuhh (Contributor) commented Jun 10, 2022

Hi! I'm a bit puzzled as to how a timeout could be handled correctly in your implementation of PPO (well, this is relevant for all variants). I am especially surprised by envpool, because it seems like it does not return the last real state at all (unlike the gym vector envs, which do return it in info).

Were there any ideas on how to do it right? I'd like to make a PR with that fix, but I haven't figured out how yet. The buffer in PPO is a bit more confusing than in off-policy algos, where there is a clear (s, s', d) structure.

vwxyzjn (Owner) commented Jun 10, 2022

Proper timeout handling is definitely an interesting issue I want to look into further. I think the best resource is DLR-RM/stable-baselines3#658. Happy to take a look into this together :)

If you are submitting a fix PR, I highly suggest doing it just on ppo.py to show a proof-of-concept before applying the changes to other PPO variants.

Howuhh (Contributor, Author) commented Jun 11, 2022

Thanks for the link! I'll take a look and see what I can come up with.

Howuhh (Contributor, Author) commented Jun 11, 2022

After some thinking and sketching on a piece of paper, it seems to me that it could be solved this way (just a proposal for now):

# define buffers here: states, actions, rewards, dones, values, logprobs
# Initial obs and info
state, prev_info = torch.Tensor(envs.reset()).to(device), {}

for step in range(0, args.num_steps):
    with torch.no_grad():
        action, logprob, _, value = agent.get_action_and_value(state)
    next_state, reward, done, info = envs.step(action.cpu().numpy())

    # full transition on step T
    states[step] = state
    actions[step] = action
    rewards[step] = torch.tensor(reward).to(device).view(-1)
    dones[step] = torch.Tensor(done).to(device)
    # also PPO stuff for step T
    logprobs[step] = logprob
    values[step] = value.flatten()

    # Here we should check for a timeout on the previous step
    if step > 0 and torch.any(dones[step - 1]):
        # if the previous step ended in a timeout, then we should
        # 1. set dones[step - 1][env_id] to False, as it was not a real done
        # 2. set values[step][env_id] to V(prev_info[env_id]["terminal_observation"]),
        #    as that is the real next state for the previous step
        for env_id in range(args.num_envs):
            timeout = prev_info[env_id].get("TimeLimit.truncated", False)
            if timeout:
                terminal_state = torch.tensor(
                    prev_info[env_id]["terminal_observation"], dtype=torch.float32, device=device
                )
                # Set done to False, as it was a timeout
                dones[step - 1, env_id] = 0
                # Set value to V(terminal_state), not V(state), since state is the
                # first observation after the reset on the previous step
                with torch.no_grad():
                    values[step, env_id] = agent.get_value(terminal_state).flatten()

    state, prev_info = torch.Tensor(next_state).to(device), info

This small rearrangement would, of course, require separately handling the last state after num_steps, since we need it to bootstrap the last transition, but it could be done in a similar way (see the sketch below). The GAE computation would then be completely unchanged!
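
A rough, untested sketch of what I mean for the bootstrap after the loop (reusing the names above; state is the observation left over after the rollout):

# bootstrap value for the transition at step num_steps - 1
with torch.no_grad():
    next_value = agent.get_value(state).flatten()
# if the last step was a timeout, state is already the post-reset observation,
# so next_value should instead be computed from info[env_id]["terminal_observation"],
# exactly as inside the loop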

The main point here is that we never use value[step] to compute anything for state[step], only for state[step - 1]:

r[step - 1] + gamma * not_done[step - 1] * value[step]

So it seems to me that we can safely swap it in place here. Does that sound reasonable to you?

vwxyzjn (Owner) commented Jun 12, 2022

Thanks for the sketch. I think it's a little tricky, though, so I wrote down what the variables look like below (link)

[image: sketch of the rollout variables omitted]

I think the correct implementation is consistent with https://github.com/DLR-RM/stable-baselines3/pull/658/files#diff-384b5f21f2bed58d1d6e64da04a42fee52f353fcec38bf410338524336657bd8R205

            # Handle timeout by bootstrapping with value function
            # see GitHub issue #633
            for idx, done_ in enumerate(dones):
                if (
                    done_
                    and infos[idx].get("terminal_observation") is not None
                    and infos[idx].get("TimeLimit.truncated", False)
                ):
                    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
                    with th.no_grad():
                        terminal_value = self.policy.predict_values(terminal_obs)[0]
                    rewards[idx] += self.gamma * terminal_value

Is this equivalent to your implementation? I didn't quite get the reason behind step - 1 in the sketch.
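
For reference, a rough and untested sketch of how the same idea could be dropped into ppo.py's rollout loop, assuming the old gym API's TimeLimit.truncated / terminal_observation info keys discussed above (to be applied after rewards[step] has been filled):

for idx, d in enumerate(done):
    if d and info[idx].get("TimeLimit.truncated", False):
        terminal_obs = torch.Tensor(info[idx]["terminal_observation"]).to(device)
        with torch.no_grad():
            terminal_value = agent.get_value(terminal_obs.unsqueeze(0))
        # fold V(terminal_obs) into the reward only for truncated episodes
        rewards[step, idx] += args.gamma * terminal_value.item()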

Next week will be pretty busy, but I will try to respond.

CC @araffin

Howuhh (Contributor, Author) commented Jun 12, 2022

After thinking a bit more, it seems that my implementation is not correct, because it will not correctly account for the done flag in GAE (the last_gae_lam in example 2 from @Miffyli in DLR-RM/stable-baselines3#633), while the delta computation will be fine, so your version is more correct.
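
To spell out where it breaks, a rough sketch reusing the buffer names from my proposal above (last-step bootstrap omitted):

# GAE under the "done on the same step" convention from my sketch. If dones[t] was
# zeroed at a truncation, nextnonterminal stays 1.0 in the recursion as well, so
# lastgaelam keeps accumulating across the episode boundary and advantage leaks in
# from the next episode. The delta itself is still fine, because values[t + 1] was
# swapped for V(terminal_obs).
lastgaelam = 0
for t in reversed(range(args.num_steps - 1)):
    nextnonterminal = 1.0 - dones[t]
    delta = rewards[t] + args.gamma * nextnonterminal * values[t + 1] - values[t]
    advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam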

I don't like the fact that we're implicitly changing the rewards, though, since in other PPO variants they could be used for something else :(

I will still try to explain the thinking behind step - 1. Suppose we made 3 steps:
$s_0, r_0, d_0, v(s_0), \; s_1, r_1, d_1, v(s_1), \; s_2, r_2, d_2, v(s_2)$, and $d_1$ is done and a timeout.

We have two variants of the value target for step 1:

  1. $r_1$, since $d_1$ is True (incorrect)
  2. $r_1 + \gamma v(s_2)$ (correct, but $s_2$ here will be the first state after the reset of the next episode, so we need to correct it too)

Because we will add $v(s_2)$ to the buffer on step 2, we should check whether $d_1$ (this is where step - 1 comes from) is a timeout, and if it is, change:

  1. $d_1$ -> False
  2. $v(s_2)$ -> $v(s_{2,\text{terminal}})$

Then the value estimation will be correct:
$r_1 + \gamma v(s_{2,\text{terminal}})$

Actually, you do the same thing, since in your code the done flag is added to the buffer in the next iteration of the loop (which is where step - 1 comes from). To be honest, it confuses me every time, since it is not intuitive that we add something to the buffer (the done flag, specifically) from the previous step (and why?). In my example (and in my own code) I add done on the same step, rather than using

dones[step] = next_done

For now I think your proposal is better; I will try to make a PR and do some runs on the gym MuJoCo environments.

vwxyzjn (Owner) commented Jun 12, 2022

I don't like the fact that we're implicitly changing the rewards, though, since in other PPO variants they could be used for something else :(

That's a good point. One way we could address this is to create another storage variable, terminal_values, with the same shape as values, and use it during the return and advantage calculation, for example like the sketch below.
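
Roughly, an untested sketch reusing ppo.py's GAE loop (filling terminal_values during the rollout is omitted here):

# extra storage: V(terminal_obs) for truncated envs at each step, zero elsewhere
terminal_values = torch.zeros((args.num_steps, args.num_envs)).to(device)

# GAE, with the truncation bootstrap added at return/advantage time
# instead of modifying rewards in place
with torch.no_grad():
    next_value = agent.get_value(next_obs).reshape(1, -1)
    advantages = torch.zeros_like(rewards).to(device)
    lastgaelam = 0
    for t in reversed(range(args.num_steps)):
        if t == args.num_steps - 1:
            nextnonterminal = 1.0 - next_done
            nextvalues = next_value
        else:
            nextnonterminal = 1.0 - dones[t + 1]
            nextvalues = values[t + 1]
        bootstrapped_reward = rewards[t] + args.gamma * terminal_values[t]
        delta = bootstrapped_reward + args.gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + args.gamma * args.gae_lambda * nextnonterminal * lastgaelam
    returns = advantages + values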

Also a quick note: the new gym API for truncation is coming soon (openai/gym#2752). Ideally we should be prototyping using the new API.
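
For reference, under the new API truncation comes back explicitly from step() instead of via info:

# new gym/gymnasium step API: truncation is separate from termination
obs, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated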

I will still try to explain the thinking behind ...

One issue with the example you provided is storage: $v(s_2)$ is also useful and needs to be stored somewhere.

Actually, you do the same thing, since in your code the done flag is added to the buffer in the next iteration of the loop (which is where step - 1 comes from). To be honest, it confuses me every time, since it is not intuitive,

next_obs and next_done serve a very intricate role. Please see the "vectorized architecture" section in https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for a detailed explanation.
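
Paraphrasing ppo.py's rollout loop, next_obs and next_done carry the result of the previous env.step into the next iteration:

next_obs = torch.Tensor(envs.reset()).to(device)
next_done = torch.zeros(args.num_envs).to(device)
for step in range(0, args.num_steps):
    obs[step] = next_obs      # the state we act from at step t
    dones[step] = next_done   # the done flag produced by the *previous* env.step
    with torch.no_grad():
        action, logprob, _, value = agent.get_action_and_value(next_obs)
        values[step] = value.flatten()
    actions[step] = action
    logprobs[step] = logprob
    next_obs, reward, done, info = envs.step(action.cpu().numpy())
    rewards[step] = torch.tensor(reward).to(device).view(-1)
    next_obs, next_done = torch.Tensor(next_obs).to(device), torch.Tensor(done).to(device)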


vwxyzjn (Owner) commented Jun 12, 2022

Here is a more complete walkthrough

[walkthrough image omitted]

FelipeMartins96 (Contributor) commented:

Found this issue as I was wondering whether the isaacgym PPO variant handles timeouts; it does not, right? I will check the code more thoroughly tomorrow and try to find a solution for my use case.

Additionally, I remember having issues with how isaacgymenvs handles timeouts and resets; I think my env diverges a bit from theirs, so I'm not sure whether it handles their case but not mine.

vwxyzjn (Owner) commented Mar 23, 2023

Yeah, the isaacgym PPO variant does not deal with truncation properly… I sort of lost interest in it because properly handling it did not seem to result in a significant performance difference. See sail-sg/envpool#194 (comment).

FelipeMartins96 (Contributor) commented:

@vwxyzjn did you notice any significant overhead by handling truncations when benchmarking?

vwxyzjn (Owner) commented Mar 24, 2023

Not really significant overhead.

gmarkkula commented Apr 10, 2024

Just chiming in here to agree with this comment in a related thread: I have found that proper timeout handling can make quite a difference. I thought I'd share my experience and solution here in case it's useful to others, and to encourage getting it implemented in CleanRL. (In relation to the discussions above, I am not sure about the best way to do this, though.)

Anyway: I have a vectorised environment that I was getting good results with from SB3, but to speed things up I wanted to move both the environment and the RL to the GPU, building from CleanRL's ppo_continuous_action_isaacgym.py. However, the CleanRL PPO did not replicate the good SB3 PPO results. After much head-scratching, I discovered that disabling the timeout handling in SB3 (by no longer reporting truncations and not passing terminal observations) got SB3 stuck in the same suboptimum as CleanRL, and after adding bootstrapping at truncation to the CleanRL implementation I now get results with CleanRL very similar to the good ones I was getting with SB3 out of the box.

For reference, in my environment the agent can incur a lot of effort penalties early on while it's trying to learn how to get positive rewards, so early on rewards will be mostly negative. My understanding of what was happening is that, without truncation bootstrapping, just giving up and doing nothing becomes a very strong suboptimum, because every now and then the agent still gets a random (from its Markov perspective) treat of reaching the episode truncation, where without bootstrapping the reached state seems to have about zero value, which is much better than the negative values incurred while trying to learn. The value function learning also becomes noisy and slow, because there is this weird zero value that happens unpredictably every now and then. (Apologies if this is obvious to others; it took me a while to suss it out.)

I based my bootstrap implementation on the SB3 one, as referenced in Costa's comment above, changing this line:

            next_obs, rewards[step], next_done, info = envs.step(action)

to:

            next_obs, rewards[step], terminations, truncations, info = envs.step(action)
            next_done = (terminations | truncations).long()
            # value of the true final observation for every env (only used where truncated)
            with torch.no_grad():
                terminal_values = agent.critic(info['final_observation']).squeeze()
            # bootstrap with V(final_observation) only where the episode was truncated
            rewards[step] = torch.where(
                truncations, rewards[step] + args.gamma * terminal_values, rewards[step])

This approach directly modifies the rewards, as discussed in the thread above, but that doesn't matter to me in this case.

Note that I am using my own environment here, not an IsaacGym environment, so the info['final_observation'] is created by me in my environment class. Based on my (possibly imperfect?) interpretation of gymnasium.vector.VectorEnv, I am passing this as the observation tensor (same shape as next_obs), but with the information from before the terminated/truncated environments were reset.
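
For what it's worth, a hypothetical sketch of that construction (the _physics_step and _reset_done_envs helpers are placeholders for my env internals, not a real API):

# a batched custom env exposing info["final_observation"]: same shape as the regular
# observation tensor, holding the observations from *before* auto-reset
def step(self, actions):
    obs, rewards, terminations, truncations, info = self._physics_step(actions)  # placeholder
    done = terminations | truncations
    # keep the pre-reset observations for the envs that just finished
    info["final_observation"] = obs.clone()
    # auto-reset only the finished sub-environments; obs now holds post-reset states
    if done.any():
        obs[done] = self._reset_done_envs(done)  # placeholder
    return obs, rewards, terminations, truncations, info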
