[Train] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43 during checkpoint reporting
#41137
Comments
Here's what we tried:
@justinvyu What version of pyarrow are you seeing this with? We have also been running into this error with pyarrow 11.0-13.0. (Not with Ray Train, but with other Ray jobs.)
These experiments were done with the latest version as of 11/14/23, which was pyarrow 14. @ykoyfman Have you also been seeing that the failure is flaky, but once it happens, every subsequent pyarrow filesystem call gives the same error? What other Ray jobs are you running? And do you use pyarrow directly, or is pyarrow used implicitly in the implementation of Ray (e.g., Ray Data)?
Thanks @justinvyu, this is good to know!
Yes, this is the behavior we observed as well. We have a variety of workloads distributed via Ray actors and tasks, using a combination of direct pyarrow and ray.data; this error primarily occurs with direct pyarrow access.
I am encountering a potentially related error when running a Ray Tune job with Lightning; the failure is raised when the checkpoint is reported from on_train_epoch_end:

(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/lightning/_lightning_utils.py", line 270, in on_train_epoch_end
(TunerInternal pid=2285)     train.report(metrics=metrics, checkpoint=checkpoint)
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 644, in wrapper
(TunerInternal pid=2285)     return fn(*args, **kwargs)
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 706, in report
(TunerInternal pid=2285)     _get_session().report(metrics, checkpoint=checkpoint)
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 417, in report
(TunerInternal pid=2285)     persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 558, in persist_current_checkpoint
(TunerInternal pid=2285)     _pyarrow_fs_copy_files(
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files
(TunerInternal pid=2285)     return pyarrow.fs.copy_files(
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/pyarrow/fs.py", line 272, in copy_files
(TunerInternal pid=2285)     _copy_files_selector(source_fs, source_sel,
(TunerInternal pid=2285)   File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector
(TunerInternal pid=2285)   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: When uploading part for key 'helm-asha-3/TorchTrainer_2024-02-14_16-02-39/TorchTrainer_7f103_00001_1_data_mute_prob=0.0592,data_swap_prob=0.1236,data_waveform_prob=0.6707,model_learning_rate=0.0707,model__2024-02-14_16-03-03/checkpoint_000004/checkpoint.ckpt' in bucket 'aframe-test': AWS Error ACCESS_DENIED during UploadPart operation

What is the quickest way to work around this, and is this being investigated? It is mentioned that directly using ...
@EthanMarx Sorry for missing this. Take a look at this section of the user guide: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html#fsspec-filesystems
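For reference, here is a minimal sketch of that workaround, assuming the s3fs package is installed; the bucket and experiment names are placeholders. The idea is to wrap an fsspec filesystem in a pyarrow handler and hand it to RunConfig, so that checkpoint uploads go through fsspec instead of pyarrow's native S3 filesystem.

```python
# Sketch of the fsspec-filesystem workaround described in the persistent-storage
# user guide. Assumes s3fs is installed; "my-bucket/my-experiment" is a placeholder.
import pyarrow.fs
import s3fs

from ray.train import RunConfig

# Wrap the fsspec S3 filesystem so Ray Train uses it for checkpoint uploads
# instead of pyarrow's native S3FileSystem (credentials come from the environment).
s3 = s3fs.S3FileSystem()
custom_fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3))

run_config = RunConfig(
    # When a custom filesystem is passed, the storage path is given without
    # the "s3://" scheme: just the bucket name and prefix.
    storage_path="my-bucket/my-experiment",
    storage_filesystem=custom_fs,
)
# Pass run_config to your Trainer/Tuner as usual.
```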
I think this may be fixed in pyarrow 15+, as seen from my investigation here: https://github.com/anyscale/product/issues/25536#issuecomment-2067338008
Closing for now.
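As a quick way to check whether your environment already has the version suggested above, a small sketch (assuming only that the packaging library is available):

```python
# Print the installed pyarrow version and flag anything older than 15.0.0,
# the version range the comment above suggests may include the fix.
import pyarrow
from packaging.version import Version

if Version(pyarrow.__version__) < Version("15.0.0"):
    print(f"pyarrow {pyarrow.__version__} installed; consider: pip install 'pyarrow>=15.0.0'")
else:
    print(f"pyarrow {pyarrow.__version__} installed; should include the fix discussed above")
```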
What happened + What you expected to happen
This pyarrow bug shows up flakily when doing the existence check for the empty marker file that Ray Train uses to determine whether all nodes are writing to shared storage. The failure is intermittent and occurs with S3 storage using the default pyarrow filesystem.
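For context, a minimal sketch of the kind of call that fails; this is not the reproduction script (still TODO below), and the bucket and key are placeholders. The existence check amounts to a HeadObject request issued through pyarrow's native S3 filesystem:

```python
# Not the reproduction script; the URI below is a placeholder standing in for
# the marker file that Ray Train checks on shared storage. The existence check
# is a HeadObject request made through pyarrow's native S3 filesystem.
import pyarrow.fs

fs, path = pyarrow.fs.FileSystem.from_uri("s3://my-bucket/my-experiment/marker-file")
# The flaky "AWS Error NETWORK_CONNECTION ... curlCode: 43" is reported from calls like this one.
info = fs.get_file_info(path)
print(info.type)  # FileType.File or FileType.NotFound when the call succeeds
```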
Versions / Dependencies
Ray 2.7, 2.8
Reproduction script
TODO
Issue Severity
High: It blocks me from completing my task.