
Paper and implementation are different. #6

Closed
jcwleo opened this issue Oct 29, 2018 · 9 comments

jcwleo commented Oct 29, 2018

In the paper,
[image from the paper: the reward normalization involving a sum of discounted rewards]
but the implementation just uses the reward, not the sum of discounted rewards.
Why is it different?

yburda commented Oct 29, 2018

We are normalizing the reward here:

rews = self.rollout.buf_rews / np.sqrt(self.rff_rms.var)

The normalization is by the running std of a sum of discounted rewards, here:
class RewardForwardFilter(object):
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems

One caveat is that for convenience we do the discounting backwards in time rather than forwards (it's convenient because at any moment the past is fully available and the future is yet to come).
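
For concreteness, here is a minimal sketch of how those two pieces fit together, assuming the RewardForwardFilter class quoted above is defined and using a hypothetical RunningStd class as a simplified stand-in for whatever running-statistics tracker sits behind self.rff_rms (the shapes and numbers below are made up, this is not the repository's exact code):

import numpy as np

# Hypothetical stand-in for the running-statistics tracker: keeps a running
# variance via the standard parallel-variance (Chan et al.) update.
class RunningStd(object):
    def __init__(self):
        self.mean, self.var, self.count = 0.0, 1.0, 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(), x.var(), x.size
        delta = batch_mean - self.mean
        tot = self.count + batch_count
        new_mean = self.mean + delta * batch_count / tot
        m2 = (self.var * self.count + batch_var * batch_count
              + delta ** 2 * self.count * batch_count / tot)
        self.mean, self.var, self.count = new_mean, m2 / tot, tot

rff = RewardForwardFilter(gamma=0.99)    # class quoted above
rff_rms = RunningStd()

buf_rews = np.random.rand(128, 8)        # made-up rollout: (timesteps, parallel envs)

# Feed each timestep's rewards through the filter to get discounted accumulations,
# then update the running variance with them.
rffs = np.array([rff.update(r) for r in buf_rews])
rff_rms.update(rffs.ravel())

# Normalize the raw per-step rewards by the running std of those accumulations,
# mirroring the line quoted earlier: buf_rews / np.sqrt(rff_rms.var).
rews = buf_rews / np.sqrt(rff_rms.var)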

jcwleo commented Oct 29, 2018

@yburda Thank you for the reply, but I already know that code. RewardForwardFilter.rewems is None, so RewardForwardFilter.update just returns the input reward.
Am I thinking about this wrong?

yburda commented Oct 30, 2018

Thank you for pointing this out. We will update the paper (we reported results with a version of the code very similar to the published one, so the code is representative).

jcwleo commented Oct 31, 2018

@yburda Did you mean that you did not use the sum of discounted rewards?

yburda commented Oct 31, 2018

Yes.

jcwleo commented Oct 31, 2018

@yburda Thank you very much! :)

jcwleo closed this as completed Oct 31, 2018

yburda commented Nov 1, 2018

Upon thinking about it a bit longer: RewardForwardFilter.rewems is None only the first time you call update. After that, update assigns something to it:

self.rewems = rews

And for all future calls to update, it's not None anymore.
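
A tiny illustration with made-up numbers, assuming the RewardForwardFilter class quoted above is in scope: only the very first update returns the raw rewards; every later call returns the discounted accumulation.

import numpy as np

rff = RewardForwardFilter(gamma=0.99)

# First call: rewems is None, so the raw rewards come straight back.
print(rff.update(np.array([1.0, 2.0])))   # [1.   2.  ]

# Second call: 0.99 * [1, 2] + [1, 2] -> the discounted accumulation.
print(rff.update(np.array([1.0, 2.0])))   # [1.99 3.98]

# Third call: 0.99 * [1.99, 3.98] + [0, 0].
print(rff.update(np.array([0.0, 0.0])))   # [1.9701 3.9402]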

Sorry for the temporary confusion.

yburda reopened this Nov 1, 2018

jcwleo commented Nov 1, 2018

@yburda Oh I was so stupid... Thank you for letting me know.

@alirezakazemipour

We are normalizing the reward here:

rews = self.rollout.buf_rews / np.sqrt(self.rff_rms.var)

The normalization is by the running std of a sum of discounted rewards, here:

class RewardForwardFilter(object):
    def __init__(self, gamma):
        self.rewems = None
        self.gamma = gamma

    def update(self, rews):
        if self.rewems is None:
            self.rewems = rews
        else:
            self.rewems = self.rewems * self.gamma + rews
        return self.rewems

One caveat is that for convenience we do the discounting backward in time rather than forwards (it's convenient because at any moment the past is fully available and the future is yet to come).

Sir, I think the code is not right; it should have been self.I.buf_rews_int.T[::-1], as 4kasha has mentioned.
Also, "it's convenient because at any moment the past is fully available and the future is yet to come" does not make sense here, since all of the intrinsic rewards in a rollout have already been collected, so the future is available as well.
