-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: unmanaged experiment checkpoint storage path #9625
Conversation
✅ Deploy Preview for determined-ui ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #9625 +/- ##
=======================================
Coverage 53.31% 53.32%
=======================================
Files 1254 1254
Lines 152657 152666 +9
Branches 3243 3242 -1
=======================================
+ Hits 81395 81402 +7
- Misses 71110 71112 +2
Partials 152 152
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of it looks great, left some questions on minor things (potentially a bug?)
or str(pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints") | ||
) | ||
|
||
if type(checkpoint_storage) == str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible someone passes in checkpoint_storage
or configures it in the DefaultConfig
as a string, right? so this would effectively override the passed in value?
this snippet might be cleaner as:
checkpoint_storage = checkpoint_storage or defaulted_config.checkpoint_storage
if checkpoint_storage is None:
checkpoint_storage = {
"type": "directory",
"container_path": str(
pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints"
),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be wrong, but A or B
only equals to B when A is False/None/empty, right?
Also the checkpoint_storage
parameter and defaulted_config.checkpoint_storage
are both string, so it needs to be wrapped too before putting into the config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i meant that it's possible for checkpoint storage to be passed in as a string but it might not be directory
type, i think?
looks like we have some logic in _core_context_v2.py
that builds a storage manager from checkpoint_storage
, and we accept string types that can be gs/s3 types:
def _get_storage_manager(
checkpoint_storage: Optional[Union[str, Dict[str, Any]]]
) -> Optional[storage.StorageManager]:
if checkpoint_storage is None:
return None
if isinstance(checkpoint_storage, str):
return storage.from_string(checkpoint_storage)
maybe this codepath doesn't get called? i haven't read the code super thoroughly.
i thought what we were trying to do here is set checkpoint_storage
to {"type": "directory", "container_path": checkpoint_storage}
only in the default case when it wasn't passed in. but if it is passed in, either directly as a parameter or in defaulted_config
, then i'm not sure we can assume it's type=directory and the container_path
.
at least that's what i was thinking when i commented. if it should always be directory/container_path, then ignore me 🙃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thanks for the explanation!
I noticed that we used to only parse non-cloud storage to shared_fs
, both in storage.from_string and _shortcut_to_config, is there a way to distinguish between type shared_fs
and directory
?
Upon testing, I don't think there is going to be any difference in terms of WebUI display.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, yeah there's no way to distinguish between "file://" for shared fs vs directory if passed in. i think we can determine it by on vs off cluster.
i think there's two issues here though: 1) how do we parse a file://
string when it's passed in, and 2) what do we default to when nothing is passed in. for 1) on-cluster should parse as shared fs, off as directory. for 2, we should try cluster storage if on cluster, then default to shared_fs, if off cluster we should default to directory type
like:
if checkpoint_storage is a string type:
if on_cluster:
parse as shared_fs
else:
parse as directory
if checkpoint_storage is None:
path = pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints"
if on_cluster:
# try on cluster storage, then shared_fs
storage = info.trial._config.get("checkpoint_storage")
if not storage:
storage = shared_fs(path)
else:
directory(path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback!
So for on-cluster experiments, setting checkpoint storage is handled inside _make_v2_context, and we probably don't need to change the logic there.
Then for off-cluster experiments, we can explicitly set the checkpoint storage to an object, so it will be the same type either in config and in context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
web portion looks great!
storage.shared._shortcut_to_config(checkpoint_storage, False) # type: ignore | ||
if type(checkpoint_storage) == str | ||
else checkpoint_storage | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- i'd rather not change
_shortcut_to_config
because this if on/off cluster logic is pretty specific to detached mode, at least for the time being (i.e. directory type can be used on cluster, too). we should put more thought into how to parsefile://
strings outside of detached mode. so for now could we just not use that util method and instantiate the dict object directly here? - i don't think we need to parse checkpoint strings into dicts here, 'cause it seems like
_make_v2_context
already does that down the line.
maybe i'm missing something but i think this could be:
checkpoint_storage = checkpoint_storage or config.checkpoint_storage
if checkpoint_storage is None:
checkpoint_storage = {
"type": "directory",
"container_path": str(pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints")
}
seems a bit cleaner IMO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we can do this because checkpoint_storage or config.checkpoint_storage
can be string, and the checkpoint storage field in the experiment config needs to be an object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we cannot distinguish between type share_fs
and directory
from string, either default everything to share_fs
as we do now, or choose default by on/off cluster sounds reasonable to me. If an user really wants to specify share_fs
or directory
type, the best option would be to pass in the whole object instead of just the string.
One thing that's a bit confusing to me is that we do not allow checkpoint storage object to have type share_fs
here, though this shouldn't affect detached mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the confusion, this was trickier than i thought. thanks for pushing through with this!
Ticket
MD-222
Description
Fix the default checkpoint storage path for unmanaged experiments.
Test Plan
Creating an unmanaged experiment following the instruction in the attached ticket
Navigate to checkpoint tab at the experiment details page
Open checkpoint modal, and the checkpoint location should be a correct local address instead of somewhere on the cloud
Checklist
docs/release-notes/
See Release Note for details.