After reading your paper and open-source code, I have three questions. If it's convenient, I hope you can help me answer them.
It seems that the reward in the dataset is only used to generate the return-to-go in get_batch. Although reward is passed as an input during training and evaluation, I cannot see where it is actually used inside the network. Is the only role of reward to generate the return-to-go?
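To make the first question concrete, here is a minimal sketch of what I understand "generating return-to-go from reward" to mean. The function name `returns_to_go` and the `gamma` parameter are my own illustrative assumptions, not names taken from the repository:

```python
def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at step t: the (optionally discounted) sum of rewards from t to the end.

    Hypothetical helper for illustration only; not the repository's implementation.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards, accumulating the future reward.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


print(returns_to_go([1.0, 0.0, 2.0]))  # [3.0, 2.0, 2.0]
```

If this is right, the per-step reward only reaches the network indirectly, through these return-to-go values.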
When setting up the environment, a target_return has to be chosen, and I don't understand what it does. It seems that even if it is larger than the largest return-to-go in the existing dataset, the final experiment can still succeed. In other words, what effect does target_return have on the network?
During evaluation, each target_return is evaluated for 100 episodes. My understanding is that the evaluation results should keep improving, that is, the reward obtained in the evaluation stage should get better and better. However, the results I got do not show this. What is the reason?
I hope you can answer these questions at your convenience. Thank you very much! Good luck!