Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Proper TimeLimit/Infinite Horizon Handling for On-Policy algorithm #633

Closed
1 task done
araffin opened this issue Oct 28, 2021 · 6 comments · Fixed by #658
Closed
1 task done
Labels
enhancement New feature or request help wanted Help from contributors is welcomed

Comments

@araffin
Copy link
Member

araffin commented Oct 28, 2021

🚀 Feature

Same as #284 but for on-policy algorithms.
The current workaround is to use a TimeFeatureWrapper (cf. zoo).

### Checklist

  • I have checked that there is no similar issue in the repo (required)
@araffin araffin added enhancement New feature or request help wanted Help from contributors is welcomed labels Oct 28, 2021
@araffin
Copy link
Member Author

araffin commented Nov 5, 2021

@zhihanyang2022

after thinking about it, the fix is maybe quite easy to implement: replace reward by reward + gamma * next_value when there is a timeout.

@Miffyli
Copy link
Collaborator

Miffyli commented Nov 5, 2021

Linking as a related paper on this same topic. There are some experiments we could try to replicate (e.g., in Walker2D the results seem to change quite a bit): https://arxiv.org/abs/1712.00378

after thinking about it, the fix is maybe quite easy to implement: replace reward by reward + gamma * next_value when there is a timeout.

Hmm staring at the lines below I am not sure if this would work out due to GAE: if reward = reward + gamma * next_value, then the bootstrapping happens in the first line but not on the second line. To me it sounds like we need a proper if-else clause somewhere (but then we need to pass the terminal observation along...).

delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam

One alternative could be to also store terminal observations as part of the trajectory with dummy actions, in which case we do not have to resort to new trickery code. This could also be potentially used to clean the naming hassle with dones (e.g., get rid of the episode_starts), and simplify code.

@araffin
Copy link
Member Author

araffin commented Nov 5, 2021

then the bootstrapping happens in the first line but not on the second line

you are refering to L380 and 381?
why not? (and I meant to do reward = reward + gamma * next_value when filling the buffer, as in https://github.com/leggedrobotics/rsl_rl/blob/master/rsl_rl/algorithms/ppo.py#L108 )

@Miffyli
Copy link
Collaborator

Miffyli commented Nov 5, 2021

you are refering to L380 and 381?

Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.

1) Current setup when termination is encountered in next step

delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step]
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam

2) Ideal setup where timeouts are handled correctly (next state is timeout termination)

delta = self.rewards[step] + self.gamma * next_values * next_non_terminal - self.values[step] <--- bootstrap despite step is done
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam <--- avoid leaking from next episode

3) Changing timeout reward to reward + next_value * gamma

delta = (reward + next_value * self.gamma) + self.gamma * next_values * next_non_terminal - self.values[step] <--- same as in above example
last_gae_lam = delta + self.gamma * self.gae_lambda * next_non_terminal * last_gae_lam <--- avoid leaking from next episode

@araffin
Copy link
Member Author

araffin commented Nov 5, 2021

Yup, and I understood your idea. Turns out I was wrong ^^'. I only noticed it now that I tried to type it out.

so the conclusion is that my proposed hack is valid :p?

@Miffyli
Copy link
Collaborator

Miffyli commented Nov 5, 2021

so the conclusion is that my proposed hack is valid :p?

Yup, at least in this part of the code ^^. I would still rethink the whole process through carefully, as "hacks" like this often break something (and sadly it is hard to test).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Help from contributors is welcomed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants