SIGSEGV: Segmentation Fault Memory error while checkpointing with transformers trainer on a v5-litepod-8 Google Cloud TPU #6620
Comments
Seems like it crashed in xla/torch_xla/csrc/tensor_methods.cpp, lines 354 to 370 (at commit cb4983e).
@will-cromar can you take a look?
@alanwaketan can you please also have a look here?
@alanwaketan do you normally use the HuggingFace Trainer? I tried to reproduce your crash on v4-8 with …
I do believe the normal torch.save should be compatible with FSDP. cc @jonb377, who is our ckpt expert.
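For reference, checkpointing on TPU typically goes through torch_xla.core.xla_model.save rather than plain torch.save (as far as I know, this is also what the HF Trainer's TPU save path uses). A minimal sketch of that pattern, with illustrative model/optimizer/file names, looks like this:

```python
# Minimal sketch of the XLA checkpoint-save pattern (names are illustrative).
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(16, 16).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

state = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
# xm.save moves XLA tensors to CPU and, by default, only writes from the
# master ordinal, so replicas don't race on the same file.
xm.save(state, "checkpoint.pt")
```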
Yea, I do. All the Llama and Gemma work is done with the HF trainer. But I don't recall us hitting this issue before.
Okay, I just scanned through the script and it looks like it has nothing to do with SPMD, @jonb377. It's probably just simple DP… I have no idea why this would crash, but we probably won't be able to spend too much time debugging it, given that mp is about to be deprecated.
It also crashes with Phi-2, and even SD, tested on a TPU v4-8 :(
Do you use DP or FSDP?
Hi @alanwaketan, I think it is highly related to the …
Hi. I encountered the exact same issue as you did; even the vmem numbers are exactly the same, and I tested with a different LLM with …
Hello @shub-kris, I encountered a similar issue and have fixed it in huggingface/transformers#31264. Could you check if your issue has been resolved?
🐛 Bug
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
root@t1v-n-108b165f-w-0:/workspace# /usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
To Reproduce
Create and SSH into a Google Cloud TPU VM:
Install the packages
Run the test-transformers-trainer.py with
export PJRT_DEVICE=TPU
python test-transformers-trainer.py --save_steps 100 --no_gradient_checkpointing
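The test-transformers-trainer.py file itself is not included in this thread, so the following is only a hypothetical sketch of a minimal script exercising the same path (HF Trainer with periodic checkpointing); the model, dataset, and argument handling are assumptions, not the original code:

```python
# Hypothetical minimal reproduction script (the real test-transformers-trainer.py
# is not shown in this issue); model, dataset, and flags are illustrative.
import argparse
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

parser = argparse.ArgumentParser()
parser.add_argument("--save_steps", type=int, default=100)
parser.add_argument("--no_gradient_checkpointing", action="store_true")
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128,
                    padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

dataset = dataset.map(tokenize, batched=True,
                      remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    save_strategy="steps",
    save_steps=args.save_steps,  # the reported crash happens at checkpoint time
    gradient_checkpointing=not args.no_gradient_checkpointing,
    logging_steps=10,
)

Trainer(model=model, args=training_args, train_dataset=dataset).train()
```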
Entire Stack Trace
Expected behavior
The code should save the checkpoints successfully.
Environment