reproduce the result of CrazyClimber #21

yueyang130 · 2022-04-06T13:48:16Z

The paper show three runnings of CrazyClimber as below. It seems stable and of high performance.

However, when I rerun the code three times using the given command. I got the following results. The runnings broke off before training finished, but it still revealed that the reproduced result is far lower by an order of magnitude than what the paper reported.

Could you give some possible explanations why there is a huge difference?

# running 1
[2022-04-06 02:47:47,242][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 643.75     (max: 1000.0    , min:300.0     , std: 178.42628029525247)
[2022-04-06 03:54:07,033][train_test][INFO][log.py>_log] ==> #10083      Test Mean Score of CrazyClimberNoFrameskip-v4: 3071.875   (max: 3400.0    , min:2100.0    , std: 383.4338070319309)
[2022-04-06 04:55:49,719][train_test][INFO][log.py>_log] ==> #22478      Test Mean Score of CrazyClimberNoFrameskip-v4: 634.375    (max: 1300.0    , min:300.0     , std: 249.51124097923926)
[2022-04-06 05:57:18,634][train_test][INFO][log.py>_log] ==> #34576      Test Mean Score of CrazyClimberNoFrameskip-v4: 3378.125   (max: 5700.0    , min:2700.0    , std: 736.4332857598168)
[2022-04-06 06:53:54,973][train_test][INFO][log.py>_log] ==> #46427      Test Mean Score of CrazyClimberNoFrameskip-v4: 3065.625   (max: 6800.0    , min:2300.0    , std: 761.0064778797878)
[2022-04-06 07:55:10,556][train_test][INFO][log.py>_log] ==> #57917      Test Mean Score of CrazyClimberNoFrameskip-v4: 8534.375   (max: 11700.0   , min:4000.0    , std: 2101.5781830269843)
[2022-04-06 08:51:19,116][train_test][INFO][log.py>_log] ==> #69158      Test Mean Score of CrazyClimberNoFrameskip-v4: 6453.125   (max: 12100.0   , min:3100.0    , std: 2302.849372923683)
[2022-04-06 09:47:16,916][train_test][INFO][log.py>_log] ==> #80299      Test Mean Score of CrazyClimberNoFrameskip-v4: 7246.875   (max: 10300.0   , min:4900.0    , std: 1379.5344266726365)

# running 2
[2022-04-05 11:39:55,952][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 600.0      (max: 900.0     , min:300.0     , std: 152.0690632574555)
[2022-04-05 12:46:50,652][train_test][INFO][log.py>_log] ==> #10015      Test Mean Score of CrazyClimberNoFrameskip-v4: 393.75     (max: 800.0     , min:100.0     , std: 132.13984069916233)
[2022-04-05 13:41:18,799][train_test][INFO][log.py>_log] ==> #20080      Test Mean Score of CrazyClimberNoFrameskip-v4: 3634.375   (max: 5100.0    , min:1900.0    , std: 705.6067313844164)
[2022-04-05 14:31:57,618][train_test][INFO][log.py>_log] ==> #30013      Test Mean Score of CrazyClimberNoFrameskip-v4: 9634.375   (max: 13100.0   , min:3500.0    , std: 2230.1358387719347)
[2022-04-05 15:35:49,472][train_test][INFO][log.py>_log] ==> #40039      Test Mean Score of CrazyClimberNoFrameskip-v4: 6434.375   (max: 10000.0   , min:3500.0    , std: 1384.875755934445)

# running 3
[2022-04-05 04:43:53,277][train_test][INFO][log.py>_log] ==> #0          Test Mean Score of CrazyClimberNoFrameskip-v4: 615.625    (max: 1000.0    , min:400.0     , std: 146.00807982779583)
[2022-04-05 05:52:29,495][train_test][INFO][log.py>_log] ==> #10033      Test Mean Score of CrazyClimberNoFrameskip-v4: 950.0      (max: 1500.0    , min:700.0     , std: 264.5751311064591)
[2022-04-05 06:55:41,171][train_test][INFO][log.py>_log] ==> #20077      Test Mean Score of CrazyClimberNoFrameskip-v4: 2771.875   (max: 3800.0    , min:1000.0    , std: 850.086162912325)
[2022-04-05 07:58:09,925][train_test][INFO][log.py>_log] ==> #30064      Test Mean Score of CrazyClimberNoFrameskip-v4: 4550.0     (max: 7300.0    , min:2300.0    , std: 1228.3118496538248)
[2022-04-05 08:59:21,098][train_test][INFO][log.py>_log] ==> #40001      Test Mean Score of CrazyClimberNoFrameskip-v4: 9137.5     (max: 12500.0   , min:5000.0    , std: 1880.2842205368847)
[2022-04-05 10:00:18,121][train_test][INFO][log.py>_log] ==> #50019      Test Mean Score of CrazyClimberNoFrameskip-v4: 10393.75   (max: 24000.0   , min:5700.0    , std: 3386.1794012574114)

The text was updated successfully, but these errors were encountered:

yueyang130 · 2022-04-06T13:49:32Z

Additionally, I use the test bash script to evaluate the model checkpoint saved in my running 1. The result is as below

[2022-04-06 13:47:41,957][test][INFO][main.py><module>] ==> Test Mean Score: 9603.125 (max: 15300.0, min: 4100.0)
[2022-04-06 13:47:41,958][test][INFO][main.py><module>] ==> Test Std Score: 3000.5712346776572

rPortelas · 2022-05-04T15:09:11Z

Hello,

First of all, thank you for releasing this much-needed open-source MuZero implementation :).

Strengthening the relevance of this reproducibility issue

Here are my performance results on CrazyClimber, 4 seeds:

I confirm that running the provided training scripts does not lead to results (around 10K mean perf) comparable to those obtained in the paper (around 84k mean perf).
Note that I performed the following modifications to fit my setup: I used "--object_store_memory 100000000000" and "--num_cpus 80", which should not impact performance (right ?).

Potential reasons

One potential reason for this could be that the current train.sh and Atari/Muzero config files differ slightly from those used in the paper. @authors, could it be the case ?

Or maybe it is the version of PyTorch/Python that @yueyang130 and I are using that causes these differences ?
On my side I ran experiments with PyTorch 1.11.0 and Python 3.7.13.

It could also be bad luck, but now that we have more data points it seems unlikely. A similar reproducibility issue has also been raised for Freeway (#23).

Any ideas ? (I'm willing to perform additional experiments if needed)

Hoping to help :)

Best,
Rémy

yix081 · 2022-06-15T05:16:44Z

I would love to be connected.

I got these results for the crazy climber by training with 5 MCTS simulations

`[2022-06-14 16:57:12,911][train_test][INFO][log.py>_log] ==> #0 Test Mean Score of CrazyClimberNoFrameskip-v4: 568.75 (max: 1000.0 , min:300.0 , std: 142.38482187368146)

[2022-06-14 18:17:05,152][train_test][INFO][log.py>_log] ==> #10003 Test Mean Score of CrazyClimberNoFrameskip-v4: 2765.625 (max: 3500.0 , min:1900.0 , std: 485.8043426884943)

[2022-06-14 19:20:19,497][train_test][INFO][log.py>_log] ==> #20001 Test Mean Score of CrazyClimberNoFrameskip-v4: 1968.75 (max: 2300.0 , min:1400.0 , std: 189.4688298903015)

[2022-06-14 20:20:17,917][train_test][INFO][log.py>_log] ==> #30009 Test Mean Score of CrazyClimberNoFrameskip-v4: 2121.875 (max: 3000.0 , min:800.0 , std: 548.1413908609712)

[2022-06-14 21:20:22,081][train_test][INFO][log.py>_log] ==> #40034 Test Mean Score of CrazyClimberNoFrameskip-v4: 2400.0 (max: 3300.0 , min:900.0 , std: 765.6696415556777)

[2022-06-14 22:20:22,987][train_test][INFO][log.py>_log] ==> #50046 Test Mean Score of CrazyClimberNoFrameskip-v4: 17384.375 (max: 26200.0 , min:10300.0 , std: 3134.197402745239)

[2022-06-14 23:19:36,779][train_test][INFO][log.py>_log] ==> #60039 Test Mean Score of CrazyClimberNoFrameskip-v4: 21493.75 (max: 29000.0 , min:17100.0 , std: 4227.952925175492)

[2022-06-15 00:18:15,167][train_test][INFO][log.py>_log] ==> #70050 Test Mean Score of CrazyClimberNoFrameskip-v4: 44531.25 (max: 70900.0 , min:11400.0 , std: 14378.900286096292)

[2022-06-15 01:15:28,391][train_test][INFO][log.py>_log] ==> #80031 Test Mean Score of CrazyClimberNoFrameskip-v4: 62137.5 (max: 75800.0 , min:29100.0 , std: 11427.946830030318)

[2022-06-15 02:11:17,348][train_test][INFO][log.py>_log] ==> #90055 Test Mean Score of CrazyClimberNoFrameskip-v4: 36909.375 (max: 55400.0 , min:19900.0 , std: 7831.122180720653)

[2022-06-15 03:06:37,899][train_test][INFO][log.py>_log] ==> #100033 Test Mean Score of CrazyClimberNoFrameskip-v4: 61681.25 (max: 75800.0 , min:27200.0 , std: 14602.106815028439)

[2022-06-15 04:02:12,701][train_test][INFO][log.py>_log] ==> #110069 Test Mean Score of CrazyClimberNoFrameskip-v4: 58618.75 (max: 73500.0 , min:37800.0 , std: 13647.560164274784)

[2022-06-15 04:53:40,941][train_test][INFO][main.py>] ==> #120000 Test Mean Score of CrazyClimberNoFrameskip-v4: 83828.125 (max: 118100.0 , min:21100.0 , std: 29290.103217373184)`

rPortelas · 2022-06-15T15:06:54Z

Hello @yix081,

The results you are showing are similar to those reported by authors (well played!). Did you obtain them using the current version of the codebase, without additional modifications ? If yes could you share additional details, e.g. your python package versions ('conda list' if you use conda) please ?

On my side since I could not reach these scores, I contacted the authors who kindly shared a previous version of their codebase (before the clean up and open source release). With this "raw" code I was able to reproduce their results. By slowly updating this raw version to the current open-source release, I found that there is probably some bug in the current reanalyze.py, although I can't find it.

yix081 · 2022-06-15T16:39:51Z

@rPortelas I changed mcts_num_simulations, so technically I didn't reproduce the original results.

I can help to debug with "raw" code. I will contact the authors today for the code, and @rPortelas do you want to work together to debug reanalyze.py? Why do you think it is due to reanalyze.py?

szrlee · 2022-06-16T09:33:43Z

@yix081 Could you post your modifications here? Thank you!

yix081 · 2022-06-16T18:00:26Z

@szrlee
The only modification is mcts_num_simulations, and it doesn't work for other games (e.g. will hurt performance). It is very game-specific.

This is an accidental finding while I try different mcts_num_simulations. I don't encourage you to try that. I think the "raw" code @rPortelas mentioned is a better way to go.

rPortelas · 2022-06-16T19:46:43Z

Hello @szrlee @yix081 ,

Yes I would be happy to share my modifications, and collaborate on improving it :)
I contacted the authors by email to get their permission to share it (since they are based on their raw codebase). @YeWR

Meanwhile, I will also do additional experiments to see if I can reproduce results from other environments than CrazyClimber.

szrlee · 2022-06-22T07:57:13Z

@rPortelas @yix081 Thank you! Very interested in the reproducibility issue. Is there anything to share about the debugging in the current reanalyze.py?

rPortelas · 2022-06-22T13:55:57Z

Hello @szrlee @yix081,

I contacted the authors: they would like to look at this script by themselves (ASAP) to find the exact bug instead of releasing a half-baked reanalyze script.

Which means I cannot share my modifications on this issue, sorry.

dw-Zhao · 2022-06-26T05:55:02Z

Hello @szrlee @yix081 @rPortelas ,
Can you reproduce the results of Up N Down with the code in Github? I only get 3297.8, which is far away from 22717.5 written in the paper. Actually, I also cannot get as good results as the paper in some other games, but the gap is not big as CrazyClimber and UpNDown. (Seed=0)
Game Paper Real Run
Qbert 15458.1 11515.6
Assault 1436.3 1101.1
MsPacman 1465.0 839.4
Pong 20.6 14.2
Breakout 432.8 384.2
Asterix 18421.9 3237.5
UpNDown 22717.5 3297.8
CreazyClimber 98640.2 13584.3

yix081 · 2022-06-27T16:02:06Z

@dw-Zhao Confirmed with @rPortelas. We both can get 5k on average for UpNDown.

One thing we found is that you need to be careful with kornia version and maybe the OpenCV version too. Although we don't know the exactly right version to pick. Still investigating.

@dw-Zhao I suggest you and other people open new issues to get potential support from authors.

szrlee · 2022-08-01T08:55:44Z

Any update on this issue? Thanks

rPortelas mentioned this issue May 4, 2022

Zero score on Freeway #23

Open

vladisai mentioned this issue Oct 14, 2022

Cannot reproduce Breakout results #32

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

reproduce the result of CrazyClimber #21

reproduce the result of CrazyClimber #21

yueyang130 commented Apr 6, 2022

yueyang130 commented Apr 6, 2022

rPortelas commented May 4, 2022

yix081 commented Jun 15, 2022

rPortelas commented Jun 15, 2022

yix081 commented Jun 15, 2022 •

edited

Loading

szrlee commented Jun 16, 2022 •

edited

Loading

yix081 commented Jun 16, 2022

rPortelas commented Jun 16, 2022 •

edited

Loading

szrlee commented Jun 22, 2022

rPortelas commented Jun 22, 2022

dw-Zhao commented Jun 26, 2022 •

edited

Loading

yix081 commented Jun 27, 2022

szrlee commented Aug 1, 2022

reproduce the result of CrazyClimber #21

reproduce the result of CrazyClimber #21

Comments

yueyang130 commented Apr 6, 2022

yueyang130 commented Apr 6, 2022

rPortelas commented May 4, 2022

Strengthening the relevance of this reproducibility issue

Potential reasons

yix081 commented Jun 15, 2022

rPortelas commented Jun 15, 2022

yix081 commented Jun 15, 2022 • edited Loading

szrlee commented Jun 16, 2022 • edited Loading

yix081 commented Jun 16, 2022

rPortelas commented Jun 16, 2022 • edited Loading

szrlee commented Jun 22, 2022

rPortelas commented Jun 22, 2022

dw-Zhao commented Jun 26, 2022 • edited Loading

yix081 commented Jun 27, 2022

szrlee commented Aug 1, 2022

yix081 commented Jun 15, 2022 •

edited

Loading

szrlee commented Jun 16, 2022 •

edited

Loading

rPortelas commented Jun 16, 2022 •

edited

Loading

dw-Zhao commented Jun 26, 2022 •

edited

Loading