Question about VPG implementation #141

Closed
djjh opened this issue Apr 16, 2019 · 4 comments

djjh commented Apr 16, 2019

Well, two questions really, about these lines:

```python
# Policy gradient step
sess.run(train_pi, feed_dict=inputs)
# Value function learning
for _ in range(train_v_iters):
    sess.run(train_v, feed_dict=inputs)
```

  1. Is the order between updating the value function estimator and the policy all that important?
  2. Why do we need an inner loop for training the value function estimator when the input data is not changing? (My guess would be that it avoids the error you would get from the alternative of simply increasing the learning rate.)
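
For context, here is a minimal sketch of how `train_pi` and `train_v` are typically constructed in a TF1-style VPG implementation. The network sizes, placeholder names, and learning rates below are illustrative assumptions, not copied from the repo.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode API, as in the snippet above

# Placeholders for one batch of on-policy data (dimensions are arbitrary here).
obs_ph = tf.placeholder(tf.float32, shape=(None, 4))   # observations
act_ph = tf.placeholder(tf.int32, shape=(None,))       # actions taken
adv_ph = tf.placeholder(tf.float32, shape=(None,))     # advantage estimates
ret_ph = tf.placeholder(tf.float32, shape=(None,))     # empirical rewards-to-go

# Tiny policy and value networks.
logits = tf.layers.dense(tf.layers.dense(obs_ph, 64, tf.tanh), 2)
logp = tf.reduce_sum(tf.one_hot(act_ph, 2) * tf.nn.log_softmax(logits), axis=1)
v = tf.squeeze(tf.layers.dense(tf.layers.dense(obs_ph, 64, tf.tanh), 1), axis=1)

# Losses: the policy-gradient surrogate and the mean-squared value error.
pi_loss = -tf.reduce_mean(logp * adv_ph)
v_loss = tf.reduce_mean((ret_ph - v) ** 2)

# The two training ops run by sess.run in the lines quoted above.
train_pi = tf.train.AdamOptimizer(learning_rate=3e-4).minimize(pi_loss)
train_v = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(v_loss)
```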

jachiam commented Apr 17, 2019

  1. The order in this particular case doesn't matter at all, because the policy and value function share no parameters.

  2. The inner loop is there to make more progress on the value-learning optimization problem (fitting a map from states to the empirical reward-to-go collected at this iteration) than a single gradient step alone would. If you took a single step with a large learning rate instead, you would probably land at the wrong parameters, because the loss is not linear in the parameters, so multiple steps of gradient descent help. (A toy illustration is sketched below.)

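To make point 2 concrete, here is a toy, self-contained illustration (a made-up 1-D non-linear loss, nothing from the repo) of why many small gradient steps land much closer to a minimizer than one step with a proportionally larger learning rate:

```python
import numpy as np

# Toy non-quadratic loss in a single parameter, with minima at w = +/- 1.
def loss(w):
    return (w ** 2 - 1.0) ** 2

def grad(w):
    return 4.0 * w * (w ** 2 - 1.0)

w_multi = 3.0
for _ in range(80):                         # many small steps (the inner loop)
    w_multi -= 1e-2 * grad(w_multi)

w_single = 3.0 - 80 * 1e-2 * grad(3.0)      # one step with an 80x larger rate

print(loss(w_multi))   # ~0: the iterates settle near the minimum at w = 1
print(loss(w_single))  # enormous: the single big step overshoots badly
```
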
jachiam closed this as completed Apr 17, 2019

djjh commented Apr 18, 2019

Thanks!


djjh commented Apr 20, 2019

Quick follow-up question about #2: is the same logic not applied to solving the policy optimization problem because the policy loss isn't meant to converge? Could more iterations be useful for any other reason?


rojas70 commented Jun 14, 2020

Given GAE, you need the best possible value estimate V for the current policy pi in order to compute your advantage function, so all of that computation needs to be consistent with the GAE equations.
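
For reference, here is a minimal numpy sketch of the GAE-lambda advantage computation this comment refers to (the gamma and lambda defaults are illustrative), which makes explicit how the advantage estimates depend on the value estimates V:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.97):
    """Compute GAE-lambda advantages for one trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), including a bootstrap value for the final state
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # TD residuals
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                           # discounted sum of future deltas
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# Example: a 3-step trajectory with a bootstrap value for the final state.
print(gae_advantages(np.array([1.0, 0.0, 1.0]),
                     np.array([0.5, 0.4, 0.6, 0.2])))
```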
