SIGSEGV: Segmentation Fault Memory error while checkpointing with transformers trainer on a v5-litepod-8 Google Cloud TPU #6620
Comments
Seems like it crashed in xla/torch_xla/csrc/tensor_methods.cpp, lines 354 to 370 (at commit cb4983e).
@will-cromar can you take a look?
@alanwaketan can you please also have a look here?
@alanwaketan do you normally use the HuggingFace Trainer? I tried to reproduce your crash on v4-8 with …
I do believe the normal torch.save should be compatible with FSDP. cc @jonb377, who is our ckpt expert.
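For reference, checkpointing on TPU typically goes through torch_xla.core.xla_model.save rather than plain torch.save (as far as I know, this is also what the HF Trainer's TPU save path uses). A minimal sketch of that pattern, with illustrative model/optimizer/file names, looks like this:

```python
# Minimal sketch of the XLA checkpoint-save pattern (names are illustrative).
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
model = torch.nn.Linear(16, 16).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

state = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
# xm.save moves XLA tensors to CPU and, by default, only writes from the
# master ordinal, so replicas don't race on the same file.
xm.save(state, "checkpoint.pt")
```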
Yea, I do. All the Llama and Gemma work is done with the HF trainer. But I don't recall us hitting this issue before.
Okay, I just scanned through the script and it looks like it has nothing to do with SPMD, @jonb377. It's probably just simple DP… I have no idea why this would crash, but we probably won't be able to spend too much time debugging it, given that mp is about to be deprecated.
It also crashes with Phi-2, and even SD, tested on a TPU v4-8 :(
Do you use DP or FSDP?
Hi @alanwaketan, I think it is highly related to the …
Hi. I encountered the exact same issue as you did; even the vmem numbers are exactly the same, and I tested with a different LLM with …
Hello @shub-kris, I encountered a similar issue and have fixed it in huggingface/transformers#31264. Could you check if your issue has been resolved?
🐛 Bug
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
root@t1v-n-108b165f-w-0:/workspace# /usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
To Reproduce
Create and SSH into a Google Cloud TPU VM:
Install the packages
Run the test-transformers-trainer.py with
export PJRT_DEVICE=TPU
python test-transformers-trainer.py --save_steps 100 --no_gradient_checkpointing
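The test-transformers-trainer.py file itself is not included in this thread, so the following is only a hypothetical sketch of a minimal script exercising the same path (HF Trainer with periodic checkpointing); the model, dataset, and argument handling are assumptions, not the original code:

```python
# Hypothetical minimal reproduction script (the real test-transformers-trainer.py
# is not shown in this issue); model, dataset, and flags are illustrative.
import argparse
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

parser = argparse.ArgumentParser()
parser.add_argument("--save_steps", type=int, default=100)
parser.add_argument("--no_gradient_checkpointing", action="store_true")
args = parser.parse_args()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=128,
                    padding="max_length")
    out["labels"] = out["input_ids"].copy()
    return out

dataset = dataset.map(tokenize, batched=True,
                      remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    save_strategy="steps",
    save_steps=args.save_steps,  # the reported crash happens at checkpoint time
    gradient_checkpointing=not args.no_gradient_checkpointing,
    logging_steps=10,
)

Trainer(model=model, args=training_args, train_dataset=dataset).train()
```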
Entire Stack Trace
Expected behavior
The code should save the checkpoints successfully.
Environment