Some problems after reading the paper and code #37

Open
CodingNovice7 opened this issue Mar 1, 2022 · 0 comments
After reading your paper and the open-source code, I have three questions. If it's convenient, I hope you can help me answer them.

  1. It seems that the reward in the dataset is only used to generate the return-to-go in get_batch. Although rewards are passed in during training and evaluation, I don't see them being used anywhere inside the network. Is the only function of the reward to generate the return-to-go? (See the first sketch after this list for how I understand that computation.)
  2. When setting up the environment, you need to specify a target_return, and I don't understand what its function is. It seems that even if it is larger than the largest return-to-go in the existing dataset, the experiment can still succeed. In other words, I would like to know what impact target_return has on the network. (See the second sketch after this list for how I read the evaluation-time conditioning.)
  3. During evaluation, each target_return is evaluated for 100 episodes. According to my understanding, the evaluation results should keep improving; that is, the reward obtained during evaluation should get better and better over the episodes. However, the results I obtained are not like this. What is the reason?

I hope you can resolve my doubts at your convenience. Thank you very much! Good luck!
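For reference, here is a minimal sketch of how I currently understand the return-to-go computation used in get_batch (assuming an undiscounted cumulative sum of future rewards; the helper name `discount_cumsum` follows the repo, and the example values are my own):

```python
import numpy as np

def discount_cumsum(rewards, gamma=1.0):
    # Return-to-go: for each timestep t, the (discounted) sum of rewards from t to the end.
    rtg = np.zeros_like(rewards, dtype=np.float64)
    rtg[-1] = rewards[-1]
    for t in reversed(range(len(rewards) - 1)):
        rtg[t] = rewards[t] + gamma * rtg[t + 1]
    return rtg

rewards = np.array([1.0, 0.0, 2.0, 1.0])
print(discount_cumsum(rewards))  # [4. 3. 3. 1.]
```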
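And here is how I currently read the evaluation-time use of target_return, which is the basis for question 2. This is only an illustration of the conditioning logic: `get_action`, `env`, and the `scale` value are placeholders, and the real evaluation loop also feeds the full state/action/timestep history to the model.

```python
def rollout_with_target(env, get_action, target_return, scale=1000.0, max_steps=1000):
    # Sketch: the model is conditioned on a remaining target return, which is
    # decremented by the reward actually received at each step.
    state = env.reset()
    remaining_target = target_return / scale
    episode_return = 0.0
    for _ in range(max_steps):
        action = get_action(state, remaining_target)   # condition on remaining target
        state, reward, done, _ = env.step(action)
        episode_return += reward
        remaining_target -= reward / scale             # subtract the reward obtained so far
        if done:
            break
    return episode_return
```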