-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make GCS-style checkpointing configurable instead deciding it on is_gcs_path function output #710
Comments
Curious for more context on why you want this. The only thing this really controls is whether the COMMIT_SUCCESS.txt file is written or atomic rename is used to indicate atomic save success, so are you just wanting COMMIT_SUCCESS.txt file in non-GCS contexts? I'll add that atomic rename on GCS is supposed to be supported as of recently, so I was actually planning to get rid of COMMIT_SUCCESS.txt behavior entirely and replace everything with atomic rename. Having two different ways of indicating whether a checkpoint is complete just creates compatibility problems and adds more complexity for us and users. |
Yes, exactly. Writing straight to final directory and marking writing as "finished" with
I understand your concern regarding complexity and compability, but there is a few extremely popular storage backends (which orbax might want to support in the future, i hope so at least) that doesn't have atomic rename/atomic rename, like Amazon S3 or any S3-compatible storages. And GCS right now. If you're really commited to end with only one of behaviors (tmp dir with atomic rename or COMMIT_SUCCESS.txt file), is there any problem to stick to only COMMIT_SUCCESS.txt behavior? This would automatically support both storage types but i might be missing some in-depth context on why that's is not universal. |
Hi, apologies for the delayed response. I'm sympathetic to this request - I think we can allow this behavior to be configurable. However, it isn't so easy to provide, unfortunately, because it's not just the So I'm not that keen on plumbing this option to all of those places, especially since we could imagine another FR along the lines of "enable marking commit success by writing to an external database". Probably the way we would want to go is to provide some sort of Concern with going with COMMIT_SUCCESS.txt globally is for legacy compatibility - it becomes difficult to distinguish a checkpoint written successfully before the change from a checkpoint written unsuccessfully after the change. Probably it's not impossible though, so we can consider that approach. I don't think we'll have time to prioritize this right away, but it'll be on our TODO list. |
Ok, thank you! |
Hi, i'm using s3fs which has very similar restrictions as GCS (no atomic moves, for example).
To enable GCS-style checkpointing I'm patching orbax which seems very ugly
Instead, I would prefer to have a parameter in
ocp.CheckpointManagerOptions
that would enable this checkpointing style.Its default value may be False and in
ocp.CheckpointManager.__init__
there could beself._gcs_style_checkpointing = gcs_style_checkpointing or is_gcs_path(directory)
which would be totally backward compatible.The text was updated successfully, but these errors were encountered: