
[Bug] Checkpointing the buffer #188

Closed
belerico opened this issue Jan 15, 2024 · 7 comments · Fixed by #193
Labels
bug Something isn't working

Comments

@belerico
Member

belerico commented Jan 15, 2024

Hi @Disastorm, I've copied your new question here so that we can keep the other issue closed:

@belerico
Hey, just wondering how this buffer checkpointing works? I have

buffer:
  size: 1000000
  checkpoint: True

When resuming, it no longer does the pretraining buffer steps, but I noticed the buffer files never get updated: the last-modified date is still from when the first training started. Is this a problem? The files I'm referring to are the .memmap files. I see now that it doesn't keep creating them for each run when checkpoint = True, so I assumed it would be using the ones from the previous run, but their modification date isn't changing at all. Is the buffer stored inside the checkpoint file itself? The checkpoint's file size still looks pretty similar to when running with checkpoint: False, I think.

@belerico belerico changed the title from "Checkpointing the buffer" to "[Question] Checkpointing the buffer" Jan 15, 2024
@belerico belerico added the "question (Further information is requested)" label Jan 15, 2024
@Disastorm

Disastorm commented Jan 15, 2024

Thanks.
I'm guessing the buffer is perhaps not the same as the membuffer? It seems like it's an action history buffer or something like that.
However, resuming does behave strangely with this enabled. It ends up using the memory buffer from the previous run (although those membuffer files never seem to be updated, so I'm not really sure what they're for). The first resumption works fine, but upon the second cancellation the membuffer files from the previous run (the ones being used) look like they get deleted, so when trying to resume a second time there are no membuffer files and it results in an error. It's also a bit confusing, because what you end up with is something like:

  • first run folder/membuffer
  • second run checkpoint references first run folder's membuffer
  • third run references second run's folder+checkpoint which then references the first run's folder+membuffer.

In the case where buffer.checkpoint is False, each run just creates its own new membuffer, does the pretraining to fill up the action buffer (or whatever it is), and everything works fine.

@michele-milesi
Member

Hi @Disastorm,
I will try to provide some clarity:

  1. The first run contains the memory-mapped buffer (.memmap files in the memmap_buffer folder).
  2. The second run, resuming from 1., does not create any memmap files because the buffer instantiated from the checkpoint references the files in the first run's memmap_buffer folder.
  3. The third run loads the checkpoint from the second one (i.e., 2.). The buffer stored in the checkpoint (from the second run) recursively references the memmap files in the log directory of the first run.
    The buffer of the second run is saved in the checkpoint, but it references the files of the first run.
    No other memmap files are created when resuming from a checkpoint (if buffer.checkpoint=True).

This means that the memmap_buffer folder of the first run must NOT be deleted.
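
To illustrate the mechanism (a minimal sketch with plain numpy, not SheepRL's actual buffer code; the path and shape below are made up): a memory-mapped array is just a view over a file on disk, so a checkpoint only needs to remember the path, dtype, and shape to reattach to the same data when resuming. This may also explain the unchanged last-modified date you observed: writes that go through the mapping are not always reflected in the file's timestamp until the data is flushed.

import os
import numpy as np

os.makedirs("memmap_buffer", exist_ok=True)

# First run: create the on-disk buffer (small shape, just for the example).
buf = np.memmap("memmap_buffer/observations.memmap",
                dtype=np.float32, mode="w+", shape=(1024, 64))
buf[0] = 1.0   # writes go straight to the file on disk
buf.flush()    # persist the data explicitly

# What a checkpoint needs to store: a reference, not a copy of the data.
meta = {"path": "memmap_buffer/observations.memmap",
        "dtype": "float32", "shape": (1024, 64)}

# Resumed run: reattach to the SAME file instead of creating a new one.
resumed = np.memmap(meta["path"], dtype=meta["dtype"],
                    mode="r+", shape=meta["shape"])
assert resumed[0, 0] == 1.0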

I also tested it and it works (on Linux); I can restart an experiment multiple times:

  1. First run
  2. Second run resumed from 1.
  3. Third run resumed from 2.
  4. Fourth run resumed from 3.

(screenshot: multiple_resume_from)
This is the result of the test I carried out.
In particular, what I have done is the following:

  1. I used the command python sheeprl.py exp=dreamer_v3_100k_ms_pacman checkpoint.every=100 for the first run.
  2. Then I stopped the process (ctrl+C) when I saw that there was at least one checkpoint.
  3. I resumed the training with the command python sheeprl.py exp=dreamer_v3_100k_ms_pacman checkpoint.every=100 checkpoint.resume_from=/path/to/first/run/checkpoint.ckpt (you must include the .ckpt file in the path).
  4. I repeated points 2 and 3 by resuming from a checkpoint of the last run (from the second run, then from the third one).

I understand that the resume_from_checkpoint logic is convoluted and can be a bit confusing, sorry for that. I hope it is clearer now.
We will try to make some changes to simplify this process.

@Disastorm

Thanks, yeah, that's basically the same behavior I saw. It's just that something was triggering the auto-delete of the memmap files from the first run. I don't really know what caused it, but it seemed to be related to pressing Ctrl+C on the run; I'm not sure if I messed up a setting somewhere, or if it's related to running on Windows, etc. Anyway, I'm just using buffer.checkpoint = False for now, so you can close this issue, but I wanted to mention that there may be some trigger somewhere that auto-deletes the memmap files from the first run when you Ctrl+C one of the later runs.

@belerico
Member Author

belerico commented Jan 17, 2024

Hey @Disastorm, we do indeed have a problem with memory-mapped arrays on Windows: can you try out this branch, please?

@belerico belerico added the "bug (Something isn't working)" label and removed the "question (Further information is requested)" label Jan 17, 2024
@belerico belerico changed the title from "[Question] Checkpointing the buffer" to "[Bug] Checkpointing the buffer" Jan 17, 2024
@Disastorm

Disastorm commented Jan 18, 2024

Oh I see, thanks. I'll try that out whenever I train a new model; the one I'm currently running already has buffer.checkpoint = False, so I can't try it on this one. Or are you saying that even when checkpoint = False the memmap is not working properly, and I should use that branch regardless?

By the way, separate question: is there a way to set exploration in DreamerV3? Would I adjust ent_coef, or do I need to use one of those other things like the Plan2Explore configs? (I don't know what Plan2Explore is.)

@belerico
Member Author

Oh I see, thanks. I'll try that out whenever I train a new model; the one I'm currently running already has buffer.checkpoint = False, so I can't try it on this one. Or are you saying that even when checkpoint = False the memmap is not working properly, and I should use that branch regardless?

Nope, the memmap is working properly; the problem arises when you checkpoint the buffer and try to resume multiple times. In that particular case the memmap buffer on Windows gets deleted. If you can try that new branch so we are sure it fixes your problem, then we can close the issue.
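
As a purely hypothetical illustration of how such a deletion can happen (this is not SheepRL's actual implementation; the real fix lives in the branch mentioned above and in #193): if a buffer object removes its backing .memmap files during cleanup, then a resumed run that only reattached to the first run's files will delete them on exit, and the next resume finds nothing to map.

import os
import numpy as np

os.makedirs("memmap_buffer", exist_ok=True)

class MemmapBufferSketch:
    """Hypothetical buffer wrapper; the cleanup below shows the bug pattern."""

    def __init__(self, path, shape, existing=False):
        self.path = path
        mode = "r+" if existing else "w+"
        self.data = np.memmap(path, dtype=np.float32, mode=mode, shape=shape)

    def close(self):
        # Bug pattern: unconditionally deleting the backing file on cleanup.
        # A run that merely REUSED the first run's file destroys it here,
        # so any later resume that references the same path will fail.
        del self.data
        if os.path.exists(self.path):
            os.remove(self.path)

# Run 1 creates the file and exits without deleting it.
run1 = MemmapBufferSketch("memmap_buffer/obs.memmap", shape=(1000, 4))
del run1.data
# Run 2 resumes from the checkpoint and reattaches to run 1's file...
run2 = MemmapBufferSketch("memmap_buffer/obs.memmap", shape=(1000, 4), existing=True)
run2.close()  # ...but its cleanup removes the shared file: run 3 cannot resume.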

By the way, separate question: is there a way to set exploration in DreamerV3? Would I adjust ent_coef, or do I need to use one of those other things like the Plan2Explore configs? (I don't know what Plan2Explore is.)

I will open a new issue with that question to keep things in order.

@Disastorm

Confirmed, this is fixed.
