fix hf checkpointer #1489

milocress · 2024-08-27T17:49:24Z

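Detach the MLflow logger process before the checkpointer spawns its child process, then reattach it afterward. This avoids the pickling error shown in the traceback below.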
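As a rough illustration of the detach-then-reattach idea (not the actual diff in this PR), here is a minimal sketch. The `logger` namespace, the `Monitor` class, and the `detached` helper are all hypothetical stand-ins. A weak reference plays the role of the unpicklable handle, and whether the real logger holds its process this way is an assumption.

```python
import pickle
import weakref
from contextlib import contextmanager
from types import SimpleNamespace


class Monitor:
    """Hypothetical stand-in for whatever the logger's process handle refers to."""


@contextmanager
def detached(obj, attr):
    """Temporarily set obj.<attr> to None and restore it on exit."""
    saved = getattr(obj, attr)
    setattr(obj, attr, None)
    try:
        yield obj
    finally:
        # Reattach even if the body raised.
        setattr(obj, attr, saved)


monitor = Monitor()
# Hypothetical logger: the weakref makes it unpicklable, as in the traceback.
logger = SimpleNamespace(name="fake-logger", monitor_process=weakref.ref(monitor))

try:
    pickle.dumps(logger)
    pickles_while_attached = True
except (TypeError, pickle.PicklingError):
    pickles_while_attached = False

# With the handle detached, the object can be pickled (which is what
# multiprocessing's spawn start method does with the process object).
with detached(logger, "monitor_process"):
    payload = pickle.dumps(logger)

# After the block, the original handle is back in place.
reattached = logger.monitor_process() is monitor
```

Using a context manager keeps the reattach step in a `finally` clause, so the handle is restored even if spawning fails partway through.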

Testing

Tested manually in an interactive session:

$ mcli interactive --cluster r1z2 --gpus 1 --hours 24 --image mosaicml/llm-foundry:2.3.1_cu121-latest
# cd llm-foundry
# pip install -e .[gpu] --no-deps
# pip install composer==0.24.0

With foundry main (fails):

  File "/usr/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.11/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/usr/lib/python3.11/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object

With milo-irene/fix-hf-checkpointer (succeeds):

Uploading /tmp/tmpesp7mxuq/mlflow_save_0/model/model-00001-of-00003.safetensors: 100%|███████████████████████████████████████████████████████████████| 0.98G/0.98G [00:05<00:00, 205MiB/s]
Uploading artifacts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.32it/s]
2024-08-27 18:06:50,145: rank0[61228][MainThread]: INFO: composer.loggers.mlflow_logger: Successfully created model version 1 for model datasets.iamroot.ift-meta-llama-3-8b-ohuja8
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]████████| 0.98G/0.98G [00:05<00:00, 142MiB/s]
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.

snarayan21

lgtm!

fix hf checkpointer?

594bb20

milocress requested a review from a team as a code owner August 27, 2024 17:49

fix

ff60aeb

milocress requested a review from irenedea August 27, 2024 18:12

snarayan21 approved these changes Aug 27, 2024

milocress enabled auto-merge (squash) August 27, 2024 18:16

ignore monitor process type

615d89d

milocress merged commit abdf7cf into mosaicml:main Aug 27, 2024
9 checks passed
