
[Bug] Checkpointing the buffer #188

Closed
belerico opened this issue Jan 15, 2024 · 7 comments · Fixed by #193
Labels
bug Something isn't working

Comments

@belerico
Member

belerico commented Jan 15, 2024

Hi @Disastorm, I've copied your new question here so that we can keep the other issue closed:

@belerico
Hey, just wondering how this buffer checkpointing works? I have

buffer:
  size: 1000000
  checkpoint: True

When resuming, it no longer does the pretraining buffer steps, but I noticed the buffer files never get updated: the last-modified date is still from when the first training started. Is this a problem? The files I'm referring to are the .memmap files. I see now that it doesn't keep creating them for each run when checkpoint = True, so I assumed it would be using the ones from the previous run, but their modification date isn't changing at all. Is the buffer stored inside the checkpoint file itself? The checkpoint's file size still looks pretty similar to when running with checkpoint: False, I think.

@belerico belerico changed the title from "Checkpointing the buffer" to "[Question] Checkpointing the buffer" Jan 15, 2024
@belerico belerico added the "question (Further information is requested)" label Jan 15, 2024
@Disastorm

Disastorm commented Jan 15, 2024

Thanks.
I'm guessing the buffer is perhaps not the same as the membuffer? It seems like it's an action history buffer or something like that.
However, resuming does behave strangely with this enabled. It ends up using the memory buffer from the previous run (although those membuffer files never seem to be updated, so I'm not really sure what they're for). The first resumption works fine, but upon the second cancellation the membuffer files from the previous run (the ones being used) look like they get deleted, so when trying to resume a second time there are no membuffer files and it results in an error. It's also a bit confusing, because what you end up with is something like:

  • first run folder/membuffer
  • second run checkpoint references first run folder's membuffer
  • third run references second run's folder+checkpoint which then references the first run's folder+membuffer.

In the case where buffer.checkpoint is False, each run just creates its own new membuffer, does the pretraining to fill up the action buffer (or whatever it is), and everything works fine.

@michele-milesi
Member

Hi @Disastorm,
I will try to provide some clarity:

  1. The first run contains the memory-mapped buffer (.memmap files in the memmap_buffer folder).
  2. The second run, resuming from 1., does not create any memmap files because the buffer instantiated from the checkpoint references the files in the first run's memmap_buffer folder.
  3. The third run loads the checkpoint from the second one (i.e., 2.). The buffer stored in the checkpoint (from the second run) recursively references the memmap files in the log directory of the first run.
    The buffer of the second run is saved in the checkpoint, but it references the files of the first run.
    No other memmap files are created when resuming from a checkpoint (if buffer.checkpoint=True).

This means that the memmap_buffer folder of the first run must NOT be deleted.
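
To illustrate the mechanism (a minimal sketch with plain numpy, not SheepRL's actual buffer code; the path and shape below are made up): a memory-mapped array is just a view over a file on disk, so a checkpoint only needs to remember the path, dtype, and shape to reattach to the same data when resuming. This may also explain the unchanged last-modified date you observed: writes that go through the mapping are not always reflected in the file's timestamp until the data is flushed.

import os
import numpy as np

os.makedirs("memmap_buffer", exist_ok=True)

# First run: create the on-disk buffer (small shape, just for the example).
buf = np.memmap("memmap_buffer/observations.memmap",
                dtype=np.float32, mode="w+", shape=(1024, 64))
buf[0] = 1.0   # writes go straight to the file on disk
buf.flush()    # persist the data explicitly

# What a checkpoint needs to store: a reference, not a copy of the data.
meta = {"path": "memmap_buffer/observations.memmap",
        "dtype": "float32", "shape": (1024, 64)}

# Resumed run: reattach to the SAME file instead of creating a new one.
resumed = np.memmap(meta["path"], dtype=meta["dtype"],
                    mode="r+", shape=meta["shape"])
assert resumed[0, 0] == 1.0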

I also tested it and it works (on Linux); I can restart an experiment multiple times:

  1. First run
  2. Second run resumed from 1.
  3. Third run resumed from 2.
  4. Fourth run resumed from 3.

(screenshot: multiple_resume_from)
This is the result of the test I carried out.
In particular, what I have done is the following:

  1. I used the command python sheeprl.py exp=dreamer_v3_100k_ms_pacman checkpoint.every=100 for the first run.
  2. Then I stopped the process (ctrl+C) when I saw that there was at least one checkpoint.
  3. I resumed the training with the command python sheeprl.py exp=dreamer_v3_100k_ms_pacman checkpoint.every=100 checkpoint.resume_from=/path/to/first/run/checkpoint.ckpt (you must include the .ckpt file in the path).
  4. I repeated points 2 and 3 by resuming from a checkpoint of the last run (from the second run, then from the third one).

I understand that the resume_from_checkpoint logic is convoluted and can be a bit confusing, sorry for that. I hope it is clearer now.
We will try to make some changes to simplify this process.

@Disastorm

Thanks, yeah, that's basically the same behavior I saw. It's just that something was triggering the auto-delete of the memmap files from the first run. I don't really know what caused it, but it seemed to be related to pressing Ctrl+C on the run; I'm not sure if I messed up a setting somewhere, or if it's related to running on Windows, etc. Anyway, I'm just using buffer.checkpoint = False for now, so you can close this issue, but I wanted to mention that there may be some trigger somewhere that auto-deletes the memmap files from the first run when you Ctrl+C one of the later runs.

@belerico
Member Author

belerico commented Jan 17, 2024

Hey @Disastorm, we do indeed have a problem with memory-mapped arrays on Windows: can you try out this branch, please?

@belerico belerico added the "bug (Something isn't working)" label and removed the "question (Further information is requested)" label Jan 17, 2024
@belerico belerico changed the title from "[Question] Checkpointing the buffer" to "[Bug] Checkpointing the buffer" Jan 17, 2024
@Disastorm

Disastorm commented Jan 18, 2024

Oh I see, thanks. I'll try that out whenever I train a new model; the one I'm currently running already has buffer.checkpoint = False, so I can't try it on this one. Or are you saying that even when checkpoint = False the memmap is not working properly, and I should use that branch regardless?

By the way, separate question: is there a way to set exploration in DreamerV3? Would I adjust ent_coef, or do I need to use one of those other things like the Plan2Explore configs? (I don't know what Plan2Explore is.)

@belerico
Member Author

Oh I see, thanks. I'll try that out whenever I train a new model; the one I'm currently running already has buffer.checkpoint = False, so I can't try it on this one. Or are you saying that even when checkpoint = False the memmap is not working properly, and I should use that branch regardless?

Nope, the memmap is working properly; the problem arises when you checkpoint the buffer and try to resume multiple times. In that particular case the memmap buffer on Windows gets deleted. If you can try that new branch so we are sure it fixes your problem, then we can close the issue.
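
As a purely hypothetical illustration of how such a deletion can happen (this is not SheepRL's actual implementation; the real fix lives in the branch mentioned above and in #193): if a buffer object removes its backing .memmap files during cleanup, then a resumed run that only reattached to the first run's files will delete them on exit, and the next resume finds nothing to map.

import os
import numpy as np

os.makedirs("memmap_buffer", exist_ok=True)

class MemmapBufferSketch:
    """Hypothetical buffer wrapper; the cleanup below shows the bug pattern."""

    def __init__(self, path, shape, existing=False):
        self.path = path
        mode = "r+" if existing else "w+"
        self.data = np.memmap(path, dtype=np.float32, mode=mode, shape=shape)

    def close(self):
        # Bug pattern: unconditionally deleting the backing file on cleanup.
        # A run that merely REUSED the first run's file destroys it here,
        # so any later resume that references the same path will fail.
        del self.data
        if os.path.exists(self.path):
            os.remove(self.path)

# Run 1 creates the file and exits without deleting it.
run1 = MemmapBufferSketch("memmap_buffer/obs.memmap", shape=(1000, 4))
del run1.data
# Run 2 resumes from the checkpoint and reattaches to run 1's file...
run2 = MemmapBufferSketch("memmap_buffer/obs.memmap", shape=(1000, 4), existing=True)
run2.close()  # ...but its cleanup removes the shared file: run 3 cannot resume.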

By the way, separate question: is there a way to set exploration in DreamerV3? Would I adjust ent_coef, or do I need to use one of those other things like the Plan2Explore configs? (I don't know what Plan2Explore is.)

I will open a new issue with that question to keep things in order.

@Disastorm

Confirmed, this is fixed.
