fix hf checkpointer #1489

Merged 3 commits on Aug 27, 2024

@milocress commented Aug 27, 2024

Detach then reattach the mlflow logger process

Testing

Tested manually in an interactive session:

```
$ mcli interactive --cluster r1z2 --gpus 1 --hours 24 --image mosaicml/llm-foundry:2.3.1_cu121-latest
# cd llm-foundry
# pip install -e .[gpu] --no-deps
# pip install composer==0.24.0
```

With foundry main:

```
  File "/usr/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
```

With milo-irene/fix-hf-checkpointer:

```
Uploading /tmp/tmpesp7mxuq/mlflow_save_0/model/model-00001-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████| 0.98G/0.98G [00:05<00:00, 205MiB/s]
Uploading artifacts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.32it/s]
2024-08-27 18:06:50,145: rank0[61228][MainThread]: INFO: composer.loggers.mlflow_logger: Successfully created model version 1 for model datasets.iamroot.ift-meta-llama-3-8b-ohuja8
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]████████| 0.98G/0.98G [00:05<00:00, 142MiB/s]
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
```
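
For readers unfamiliar with the failure mode: the traceback comes from the spawn start method pickling the object handed to the child process, and a `weakref` reference cannot be pickled. The snippet below is a minimal, standalone sketch of the general detach-then-reattach pattern, not the code in this PR. The names `detached`, `_Process`, `logger`, and `handle` are hypothetical stand-ins, and a `types.SimpleNamespace` plays the role of the logger.

```python
import contextlib
import pickle
import types
import weakref


class _Process:
    """Hypothetical placeholder for whatever the unpicklable handle points at."""


@contextlib.contextmanager
def detached(obj, attr):
    """Temporarily set obj.<attr> to None and restore it on exit."""
    saved = getattr(obj, attr)
    setattr(obj, attr, None)
    try:
        yield obj
    finally:
        # Reattach even if serialization raised.
        setattr(obj, attr, saved)


proc = _Process()
# Stand-in for a logger that holds a weak reference (assumption for illustration).
logger = types.SimpleNamespace(name="demo-logger", handle=weakref.ref(proc))

# Pickling with the weak reference attached fails with TypeError.
try:
    pickle.dumps(logger)
    pickled_with_handle = True
except TypeError:
    pickled_with_handle = False

# Detach the handle, serialize, then let the context manager reattach it.
with detached(logger, "handle"):
    payload = pickle.dumps(logger)

clone = pickle.loads(payload)
```

After the `with` block, `logger.handle` is the original weak reference again, while `clone` (what a child process would receive in this sketch) carries `None` in its place.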

@milocress requested a review from a team as a code owner August 27, 2024 17:49

@snarayan21 left a comment

lgtm!

@milocress enabled auto-merge (squash) August 27, 2024 18:16
@milocress merged commit abdf7cf into mosaicml:main Aug 27, 2024
9 checks passed