Zero score on Freeway #23
Comments
Strengthening the relevance of @emailweixu's reproducibility issue: here are my performance results on Freeway, 4 seeds. All 4 seeds obtained a score of 0 by the end of training; however, 1 seed did manage to reach 21.5 reward at some points during training. I used the provided train.sh script (so 4 GPUs), with the following modifications to fit my setup: "--object_store_memory 100000000000" and "--num_cpus 80", which should not impact performance. This is related to issue #21, which points out another reproducibility issue; see issue #21 for potential reasons. Best,
@rPortelas Actually, I have reasons to believe that a zero score on Freeway is expected. If you play Freeway yourself, you can see that it requires consistent exploration in one direction (UP) for many steps in order to get any reward. However, in the current implementation of EfficientZero, the behavior policy is a stochastic policy based on the MCTS result. And at the beginning of training, the policy from MCTS is close to uniform, given how EfficientZero is initialized (i.e., zero initialization of the last layer of the prediction nets), which makes it very hard to consistently go UP. Other algorithms such as CURL or SPR use a greedy policy (coupled with a noisy net) and are more likely to have consistent exploration behavior.
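For a rough sense of scale (my own back-of-the-envelope illustration, not a number from either paper), assuming Freeway's minimal action set of 3 actions and a behavior policy that stays close to uniform early in training:

```python
# Back-of-the-envelope sketch: under a near-uniform policy over Freeway's 3 minimal
# actions (NOOP, UP, DOWN), the probability of picking UP at k consecutive decision
# points decays as (1/3)**k, so a full crossing that needs many consistent UP moves
# is essentially never sampled by chance.
for k in (10, 20, 50):
    print(f"P(UP for {k} consecutive steps) = {(1 / 3) ** k:.3g}")
```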
@emailweixu It is true that Freeway is challenging in terms of exploration; however, in both the EfficientZero paper and the original MuZero paper (see Table S1 in the appendix), non-zero performance is reported. So we should be able to reproduce it.
@rPortelas I know both EfficientZero and MuZero reported reasonable performance on Freeway. The original MuZero is not open-sourced, so I cannot re-run the experiments and cannot know for sure. But since it trained on many more frames (20B frames), it is more likely to obtain reward through random exploration. Furthermore, the original MuZero paper didn't describe how the weights of the models are initialized; it is possible that a non-zero initialization of the last prediction layer can get some reward (non-zero initialization makes the initial policy not uniformly random). In fact, I did try non-zero initialization with EfficientZero (changing init_zero from True to False): it did get some reward during training, but the final performance is still much lower than the reported number. Zero initialization, however, is explicitly described by EfficientZero in Appendix A.1.
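As a minimal illustration of why zero initialization yields a uniform initial policy (my own sketch, not the repo's actual network code):

```python
import torch
import torch.nn as nn

# Sketch: if the last layer of the policy head is zero-initialized (as init_zero=True
# does, per Appendix A.1 of the EfficientZero paper), every action logit is 0 for any
# hidden state, so the initial policy prior fed to MCTS is exactly uniform.
num_actions = 3  # Freeway's minimal action set
policy_head = nn.Linear(64, num_actions)
nn.init.zeros_(policy_head.weight)
nn.init.zeros_(policy_head.bias)

hidden = torch.randn(1, 64)            # arbitrary hidden state
logits = policy_head(hidden)           # all zeros
print(torch.softmax(logits, dim=-1))   # tensor([[0.3333, 0.3333, 0.3333]])
```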
Thanks for the discussion!
@rPortelas did you try the "raw" version you mentioned in #21 on Freeway?
I tried to run the code for Atari Freeway using the following command with the default settings in the code:
I tried two seeds, 0 and 1. Based on the tensorboard curves, the algorithm seems to receive no reward at all during training. Both workers.ori_reward and Train_statistics.target_value_prefix_mean are constantly zero from beginning to end.
From train_test_log, seed 0 got positive reward (~7.5) at step 0, but then no reward at all after that. Seed 1 also got ~7.5 reward at step 0; after that, half of the evaluations got 0 and the other half got 21.34.
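For reference, the logged scalars can be inspected directly from the event files with a short script (a minimal sketch using TensorBoard's EventAccumulator; the log directory below is a placeholder, and the tag names are the ones quoted above):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Placeholder path: point this at the run's TensorBoard event directory.
logdir = "path/to/run/event/dir"
acc = EventAccumulator(logdir)
acc.Reload()

for tag in ("workers.ori_reward", "Train_statistics.target_value_prefix_mean"):
    values = [event.value for event in acc.Scalars(tag)]
    print(tag, "min:", min(values), "max:", max(values))
```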
I wonder whether I did something wrong.
Thanks
Wei