Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: unmanaged experiment checkpoint storage path #9625

Merged
merged 12 commits into from
Jul 23, 2024
Merged

Conversation

gt2345
Copy link
Contributor

@gt2345 gt2345 commented Jul 9, 2024

Ticket

MD-222

Description

Fix the default checkpoint storage path for unmanaged experiments.

Test Plan

Creating an unmanaged experiment following the instruction in the attached ticket
Navigate to checkpoint tab at the experiment details page
Open checkpoint modal, and the checkpoint location should be a correct local address instead of somewhere on the cloud

Screenshot 2024-07-09 at 4 23 18 PM

Checklist

  • Changes have been manually QA'd
  • New features have been approved by the corresponding PM
  • User-facing API changes have the "User-facing API Change" label
  • Release notes have been added as a separate file under docs/release-notes/
    See Release Note for details.
  • Licenses have been included for new code which was copied and/or modified from any external code

@gt2345 gt2345 requested review from a team as code owners July 9, 2024 21:25
@gt2345 gt2345 requested review from hkang1 and jgongd July 9, 2024 21:25
@cla-bot cla-bot bot added the cla-signed label Jul 9, 2024
@gt2345 gt2345 requested a review from ioga July 9, 2024 21:25
Copy link

netlify bot commented Jul 9, 2024

Deploy Preview for determined-ui ready!

Name Link
🔨 Latest commit a2461f9
🔍 Latest deploy log https://app.netlify.com/sites/determined-ui/deploys/669712b6ed534b00086d78eb
😎 Deploy Preview https://deploy-preview-9625--determined-ui.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@gt2345 gt2345 requested review from azhou-determined and removed request for jgongd July 9, 2024 21:26
Copy link

codecov bot commented Jul 9, 2024

Codecov Report

Attention: Patch coverage is 35.71429% with 9 lines in your changes missing coverage. Please review.

Project coverage is 53.32%. Comparing base (4b3a100) to head (a2461f9).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9625   +/-   ##
=======================================
  Coverage   53.31%   53.32%           
=======================================
  Files        1254     1254           
  Lines      152657   152666    +9     
  Branches     3243     3242    -1     
=======================================
+ Hits        81395    81402    +7     
- Misses      71110    71112    +2     
  Partials      152      152           
Flag Coverage Δ
backend 44.75% <ø> (+<0.01%) ⬆️
harness 72.84% <50.00%> (ø)
web 51.55% <33.33%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
webui/react/src/ioTypes.ts 90.99% <100.00%> (+0.02%) ⬆️
webui/react/src/types.ts 99.69% <100.00%> (+<0.01%) ⬆️
harness/determined/common/storage/shared.py 82.22% <50.00%> (ø)
webui/react/src/services/decoder.ts 21.60% <0.00%> (-0.03%) ⬇️
webui/react/src/components/CheckpointModal.tsx 77.82% <22.22%> (-1.25%) ⬇️

... and 6 files with indirect coverage changes

Copy link
Contributor

@hkang1 hkang1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of it looks great, left some questions on minor things (potentially a bug?)

webui/react/src/services/decoder.ts Show resolved Hide resolved
webui/react/src/components/CheckpointModal.tsx Outdated Show resolved Hide resolved
webui/react/src/components/CheckpointModal.tsx Outdated Show resolved Hide resolved
or str(pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints")
)

if type(checkpoint_storage) == str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's possible someone passes in checkpoint_storage or configures it in the DefaultConfig as a string, right? so this would effectively override the passed in value?

this snippet might be cleaner as:

    checkpoint_storage = checkpoint_storage or defaulted_config.checkpoint_storage

    if checkpoint_storage is None:
        checkpoint_storage = {
            "type": "directory",
            "container_path": str(
                pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints"
            ),
        }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but A or B only equals to B when A is False/None/empty, right?
Also the checkpoint_storage parameter and defaulted_config.checkpoint_storage are both string, so it needs to be wrapped too before putting into the config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant that it's possible for checkpoint storage to be passed in as a string but it might not be directory type, i think?

looks like we have some logic in _core_context_v2.py that builds a storage manager from checkpoint_storage, and we accept string types that can be gs/s3 types:

def _get_storage_manager(
    checkpoint_storage: Optional[Union[str, Dict[str, Any]]]
) -> Optional[storage.StorageManager]:
    if checkpoint_storage is None:
        return None
    if isinstance(checkpoint_storage, str):
        return storage.from_string(checkpoint_storage)

maybe this codepath doesn't get called? i haven't read the code super thoroughly.

i thought what we were trying to do here is set checkpoint_storage to {"type": "directory", "container_path": checkpoint_storage} only in the default case when it wasn't passed in. but if it is passed in, either directly as a parameter or in defaulted_config, then i'm not sure we can assume it's type=directory and the container_path.

at least that's what i was thinking when i commented. if it should always be directory/container_path, then ignore me 🙃

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the explanation!
I noticed that we used to only parse non-cloud storage to shared_fs, both in storage.from_string and _shortcut_to_config, is there a way to distinguish between type shared_fs and directory?
Upon testing, I don't think there is going to be any difference in terms of WebUI display.

Copy link
Contributor

@azhou-determined azhou-determined Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, yeah there's no way to distinguish between "file://" for shared fs vs directory if passed in. i think we can determine it by on vs off cluster.

i think there's two issues here though: 1) how do we parse a file:// string when it's passed in, and 2) what do we default to when nothing is passed in. for 1) on-cluster should parse as shared fs, off as directory. for 2, we should try cluster storage if on cluster, then default to shared_fs, if off cluster we should default to directory type

like:

if checkpoint_storage is a string type:
    if on_cluster:
        parse as shared_fs
    else:
        parse as directory
if checkpoint_storage is None:
    path = pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints"
    if on_cluster:
        # try on cluster storage, then shared_fs
        storage = info.trial._config.get("checkpoint_storage")
        if not storage:
            storage = shared_fs(path)
    else:
        directory(path)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback!
So for on-cluster experiments, setting checkpoint storage is handled inside _make_v2_context, and we probably don't need to change the logic there.
Then for off-cluster experiments, we can explicitly set the checkpoint storage to an object, so it will be the same type either in config and in context.

Copy link
Contributor

@hkang1 hkang1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

web portion looks great!

storage.shared._shortcut_to_config(checkpoint_storage, False) # type: ignore
if type(checkpoint_storage) == str
else checkpoint_storage
)
Copy link
Contributor

@azhou-determined azhou-determined Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. i'd rather not change _shortcut_to_config because this if on/off cluster logic is pretty specific to detached mode, at least for the time being (i.e. directory type can be used on cluster, too). we should put more thought into how to parse file:// strings outside of detached mode. so for now could we just not use that util method and instantiate the dict object directly here?
  2. i don't think we need to parse checkpoint strings into dicts here, 'cause it seems like _make_v2_context already does that down the line.

maybe i'm missing something but i think this could be:

checkpoint_storage = checkpoint_storage or config.checkpoint_storage
if checkpoint_storage is None:
    checkpoint_storage = {
        "type": "directory",
        "container_path": str(pathlib.Path(appdirs.user_data_dir("determined")) / "checkpoints")
    }

seems a bit cleaner IMO

Copy link
Contributor Author

@gt2345 gt2345 Jul 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can do this because checkpoint_storage or config.checkpoint_storage can be string, and the checkpoint storage field in the experiment config needs to be an object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we cannot distinguish between type share_fs and directory from string, either default everything to share_fs as we do now, or choose default by on/off cluster sounds reasonable to me. If an user really wants to specify share_fs or directory type, the best option would be to pass in the whole object instead of just the string.
One thing that's a bit confusing to me is that we do not allow checkpoint storage object to have type share_fs here, though this shouldn't affect detached mode.

Copy link
Contributor

@azhou-determined azhou-determined left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the confusion, this was trickier than i thought. thanks for pushing through with this!

@gt2345 gt2345 merged commit 8776e35 into main Jul 23, 2024
85 of 98 checks passed
@gt2345 gt2345 deleted the gt/222-unmanaged-ckp branch July 23, 2024 17:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants