[Train] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43 during checkpoint reporting #41137

Closed
justinvyu opened this issue Nov 14, 2023 · 9 comments
Assignees: justinvyu
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

What happened + What you expected to happen

This pyarrow bug shows up flakily during the existence check for the empty .validate_storage_marker file, which is used to determine whether all nodes are writing to a shared storage location.

The failure happens flakily for S3 storage with the default pyarrow filesystem.

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/session.py", line 587, in new_report
    persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 549, in persist_current_checkpoint
    self._check_validation_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 508, in _check_validation_file
    if not _exists_at_fs_path(fs=self.storage_filesystem, fs_path=valid_file):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 255, in _exists_at_fs_path
    valid = fs.get_file_info(fs_path)
  File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When getting information for key 'euser_w4ej7snhlu6itki2li9gdl5j1k/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker' in bucket 'endpoints-fine-tuning-artifacts-staging': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument
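
For context, the failing check boils down to a single get_file_info call on the pyarrow S3 filesystem. A minimal sketch of that code path, adapted from the traceback above (the helper name _exists_at_fs_path comes from ray/train/_internal/storage.py; the bucket and key below are illustrative):

```python
import pyarrow.fs

def _exists_at_fs_path(fs: pyarrow.fs.FileSystem, fs_path: str) -> bool:
    # This get_file_info call (a HeadObject request for S3) is what
    # intermittently raises "AWS Error NETWORK_CONNECTION ... curlCode: 43".
    info = fs.get_file_info(fs_path)
    return info.type != pyarrow.fs.FileType.NotFound

fs = pyarrow.fs.S3FileSystem()
marker = "my-bucket/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker"
print(_exists_at_fs_path(fs, marker))
```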

Versions / Dependencies

Ray 2.7, 2.8

Reproduction script

TODO

Issue Severity

High: It blocks me from completing my task.

@justinvyu justinvyu added the bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), and train (Ray Train Related Issue) labels on Nov 14, 2023
@justinvyu justinvyu self-assigned this Nov 14, 2023
@justinvyu justinvyu changed the title to [Train] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43 during checkpoint reporting on Nov 14, 2023
@justinvyu
Contributor Author

justinvyu commented Nov 15, 2023

@justinvyu
Contributor Author

justinvyu commented Nov 17, 2023

Here's what we tried:

  1. Patch the storage context to write the marker file with some contents. Our hypothesis was that the empty file was causing issues. [Rejected ❌]
  2. Re-initialize the filesystem with storage.storage_filesystem = pyarrow.fs.S3FileSystem(). Our hypothesis was that the pyarrow filesystem pickled on the head node (on AWS) is incompatible on the worker node. The issue was still reproducible even after creating a new pyarrow.fs.S3FileSystem() instance in the debug session (see the sketch after this list). [Rejected ❌]
  3. Try different S3 operations on the pyarrow filesystem. Once it fails with this error, any other command run in the same process (via a debug session) results in the same error.
  4. Use s3fs instead of the pyarrow S3 filesystem. This worked in 5/5 runs, so it can be used as a workaround. [Workaround ✅]
  5. Increase the max attempts of the pyarrow S3 filesystem (also in the sketch after this list), but this just led to the command hanging while retrying and failing with the same error. The hypothesis was that maybe it doesn't get enough retries. [Rejected ❌]
  6. Patch the Ray code to set the pyarrow S3 filesystem logging level, but this didn't work. TODO: We are putting a pin on this until pyarrow 15 is released with an environment variable that can help debug the problem (see GH-35260: [C++][Python][R] Allow users to adjust S3 log level by environment variable, apache/arrow#38267).
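
For reference, a rough sketch of what experiments 2 and 5 above looked like; pyarrow.fs.AwsStandardS3RetryStrategy and the retry_strategy argument exist in recent pyarrow versions, but the values here are illustrative:

```python
import pyarrow.fs

# Experiment 2: build a fresh S3 filesystem on the worker instead of reusing
# the instance that was pickled on the head node.
# Experiment 5: give that filesystem a larger retry budget.
fs = pyarrow.fs.S3FileSystem(
    retry_strategy=pyarrow.fs.AwsStandardS3RetryStrategy(max_attempts=10)
)

# Re-running the failing call on the fresh filesystem still raised the same
# NETWORK_CONNECTION error (or just hung while retrying with more attempts).
fs.get_file_info(
    "my-bucket/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker"
)
```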

@justinvyu justinvyu added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label on Nov 17, 2023
@ykoyfman

ykoyfman commented Dec 4, 2023

@justinvyu What version of pyarrow are you seeing this with? We have also been running into this error with pyarrow 11.0-13.0. (Not with Ray Train, but with other Ray jobs.)

@justinvyu
Contributor Author

These experiments were done with the latest version as of 11/14/23, which was pyarrow 14.

@ykoyfman Have you also been seeing that the failure is flaky, but once it happens, every subsequent pyarrow fs call will give the same error?

What other Ray jobs are you running? And do you use pyarrow directly, or is pyarrow used implicitly in the implementation of Ray (e.g., Ray Data)?

@ykoyfman

ykoyfman commented Dec 5, 2023

> pyarrow 14.

Thanks @justinvyu - this is good to know!

> Have you also been seeing that the failure is flaky, but once it happens, every subsequent pyarrow fs call will give the same error?

Yes, this is the behavior we observed as well.

We have a variety of workloads distributed via Ray actors and tasks, using a combination of both direct pyarrow and ray.data; this error primarily occurs with direct pyarrow access.

@justinvyu
Contributor Author

cc @ericl @kouroshHakha

@EthanMarx

EthanMarx commented Feb 14, 2024

I am encountering a potentially related error when running a Tune job. This also happens very flakily: it affects some portion of my trials, most of which were successfully saving checkpoints up until the point of failure.

File "/usr/local/lib/python3.10/site-packages/ray/train/lightning/_lightning_utils.py", line 270, in on_train_epoch_end      
(TunerInternal pid=2285)     train.report(metrics=metrics, checkpoint=checkpoint)                                                                       
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 644, in wrapper                          
(TunerInternal pid=2285)     return fn(*args, **kwargs)                                                                                                 
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 706, in report                           
(TunerInternal pid=2285)     _get_session().report(metrics, checkpoint=checkpoint)                                                                      
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 417, in report                           
(TunerInternal pid=2285)     persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)                                                 
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 558, in persist_current_checkpoint       
(TunerInternal pid=2285)     _pyarrow_fs_copy_files(                                                                                                    
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files           
(TunerInternal pid=2285)     return pyarrow.fs.copy_files(                                                                                              
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/pyarrow/fs.py", line 272, in copy_files                                        
(TunerInternal pid=2285)     _copy_files_selector(source_fs, source_sel,                                                                                
(TunerInternal pid=2285)   File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector                                                       
(TunerInternal pid=2285)   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status  
OSError: When uploading part for key 'helm-asha-3/TorchTrainer_2024-02-14_16-02-39/TorchTrainer_7f103_00001_1_data_mute_prob=0.0592,data_swap_prob=0.1236,data_waveform_prob=0.6707,model_learning_rate=0.0707,model__2024-02-14_16-03-03/checkpoint_000004/checkpoint.ckpt' in bucket 'aframe-test': AWS Error ACCESS_DENIED during UploadPart operation

What is the quickest way to work around this, and is this being investigated? It is mentioned above that directly using s3fs resolves the problem. How can I set up my Tune run to do so? Thanks

@justinvyu
Contributor Author

@EthanMarx Sorry for missing this -- take a look at this section of the user guide: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html#fsspec-filesystems
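
For example, following that guide, you can wrap s3fs in a pyarrow-compatible filesystem and pass it to the run via RunConfig(storage_filesystem=...). A minimal sketch, assuming the s3fs package is installed and with an illustrative bucket/path:

```python
import pyarrow.fs
import s3fs
from ray import train

# Wrap the fsspec-based S3 implementation so Ray Train/Tune uses it in place
# of pyarrow's native S3FileSystem.
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3fs.S3FileSystem()))

run_config = train.RunConfig(
    storage_filesystem=fs,
    storage_path="my-bucket/experiment-results",  # bucket/prefix, no "s3://" scheme
    name="my_experiment",
)
# The same RunConfig can be passed to tune.Tuner(..., run_config=run_config).
```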

@justinvyu
Contributor Author

I think this may be fixed in pyarrow 15+ as seen from my investigation here: https://github.com/anyscale/product/issues/25536#issuecomment-2067338008

Closing for now.
