[Train] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43 during checkpoint reporting #41137

Closed
justinvyu opened this issue Nov 14, 2023 · 9 comments
Assignees: justinvyu
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), train (Ray Train Related Issue)

Comments

@justinvyu
Contributor

What happened + What you expected to happen

This pyarrow bug shows up flakily during the existence check for the empty .validate_storage_marker file, which is used to determine whether all nodes are writing to a shared storage location.

The failure happens flakily for S3 storage with the default pyarrow filesystem.

  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/session.py", line 587, in new_report
    persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 549, in persist_current_checkpoint
    self._check_validation_file()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 508, in _check_validation_file
    if not _exists_at_fs_path(fs=self.storage_filesystem, fs_path=valid_file):
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 255, in _exists_at_fs_path
    valid = fs.get_file_info(fs_path)
  File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When getting information for key 'euser_w4ej7snhlu6itki2li9gdl5j1k/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker' in bucket 'endpoints-fine-tuning-artifacts-staging': AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43, A libcurl function was given a bad argument
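
For context, the failing check boils down to a single get_file_info call on the pyarrow S3 filesystem. A minimal sketch of that code path, adapted from the traceback above (the helper name _exists_at_fs_path comes from ray/train/_internal/storage.py; the bucket and key below are illustrative):

```python
import pyarrow.fs

def _exists_at_fs_path(fs: pyarrow.fs.FileSystem, fs_path: str) -> bool:
    # This get_file_info call (a HeadObject request for S3) is what
    # intermittently raises "AWS Error NETWORK_CONNECTION ... curlCode: 43".
    info = fs.get_file_info(fs_path)
    return info.type != pyarrow.fs.FileType.NotFound

fs = pyarrow.fs.S3FileSystem()
marker = "my-bucket/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker"
print(_exists_at_fs_path(fs, marker))
```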

Versions / Dependencies

Ray 2.7, 2.8

Reproduction script

TODO

Issue Severity

High: It blocks me from completing my task.

@justinvyu justinvyu added the bug (Something that is supposed to be working; but isn't), P0 (Issues that should be fixed in short order), and train (Ray Train Related Issue) labels on Nov 14, 2023
@justinvyu justinvyu self-assigned this Nov 14, 2023
@justinvyu justinvyu changed the title to [Train] AWS Error NETWORK_CONNECTION during HeadObject operation: curlCode: 43 during checkpoint reporting on Nov 14, 2023
@justinvyu
Contributor Author

justinvyu commented Nov 15, 2023

@justinvyu
Contributor Author

justinvyu commented Nov 17, 2023

Here's what we tried:

  1. Patch the storage context to write the marker file with some contents. Our hypothesis was that the empty file was causing issues. [Rejected ❌]
  2. Re-initialize the filesystem with storage.storage_filesystem = pyarrow.fs.S3FileSystem(). Our hypothesis was that the pyarrow filesystem pickled on the head node (on AWS) is incompatible on the worker node. The issue was still reproducible even after creating a new pyarrow.fs.S3FileSystem() instance in the debug session (see the sketch after this list). [Rejected ❌]
  3. Try different S3 operations on the pyarrow filesystem. Once it fails with this error, any other command run in the same process (via a debug session) results in the same error.
  4. Use s3fs instead of the pyarrow S3 filesystem. This worked in 5/5 runs, so it can be used as a workaround. [Workaround ✅]
  5. Increase the max attempts of the pyarrow S3 filesystem (also in the sketch after this list), but this just led to the command hanging while retrying and failing with the same error. The hypothesis was that maybe it doesn't get enough retries. [Rejected ❌]
  6. Patch the Ray code to set the pyarrow S3 filesystem logging level, but this didn't work. TODO: We are putting a pin on this until pyarrow 15 is released with an environment variable that can help debug the problem (see GH-35260: [C++][Python][R] Allow users to adjust S3 log level by environment variable, apache/arrow#38267).
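
For reference, a rough sketch of what experiments 2 and 5 above looked like; pyarrow.fs.AwsStandardS3RetryStrategy and the retry_strategy argument exist in recent pyarrow versions, but the values here are illustrative:

```python
import pyarrow.fs

# Experiment 2: build a fresh S3 filesystem on the worker instead of reusing
# the instance that was pickled on the head node.
# Experiment 5: give that filesystem a larger retry budget.
fs = pyarrow.fs.S3FileSystem(
    retry_strategy=pyarrow.fs.AwsStandardS3RetryStrategy(max_attempts=10)
)

# Re-running the failing call on the fresh filesystem still raised the same
# NETWORK_CONNECTION error (or just hung while retrying with more attempts).
fs.get_file_info(
    "my-bucket/TorchTrainer_2023-11-14_07-12-43/.validate_storage_marker"
)
```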

@justinvyu justinvyu added the P1 (Issue that should be fixed within a few weeks) label and removed the P0 (Issues that should be fixed in short order) label on Nov 17, 2023
@ykoyfman

ykoyfman commented Dec 4, 2023

@justinvyu What version of pyarrow are you seeing this with? We have also been running into this error with pyarrow 11.0-13.0. (Not with Ray Train, but with other Ray jobs.)

@justinvyu
Contributor Author

These experiments were done with the latest version as of 11/14/23, which was pyarrow 14.

@ykoyfman Have you also been seeing that the failure is flaky, but once it happens, every subsequent pyarrow fs call will give the same error?

What other Ray jobs are you running? And do you use pyarrow directly, or is pyarrow used implicitly in the implementation of Ray (e.g., Ray Data)?

@ykoyfman

ykoyfman commented Dec 5, 2023

> pyarrow 14.

Thanks @justinvyu - this is good to know!

> Have you also been seeing that the failure is flaky, but once it happens, every subsequent pyarrow fs call will give the same error?

Yes, this is the behavior we observed as well.

We have a variety of workloads distributed via Ray actors and tasks, using a combination of both direct pyarrow and ray.data; this error primarily occurs with direct pyarrow access.

@justinvyu
Contributor Author

cc @ericl @kouroshHakha

@EthanMarx

EthanMarx commented Feb 14, 2024

I am encountering a potentially related error when running a Tune job. This also happens very flakily: it affects some portion of my trials, most of which were successfully saving checkpoints up until the point of failure.

File "/usr/local/lib/python3.10/site-packages/ray/train/lightning/_lightning_utils.py", line 270, in on_train_epoch_end      
(TunerInternal pid=2285)     train.report(metrics=metrics, checkpoint=checkpoint)                                                                       
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 644, in wrapper                          
(TunerInternal pid=2285)     return fn(*args, **kwargs)                                                                                                 
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 706, in report                           
(TunerInternal pid=2285)     _get_session().report(metrics, checkpoint=checkpoint)                                                                      
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/session.py", line 417, in report                           
(TunerInternal pid=2285)     persisted_checkpoint = self.storage.persist_current_checkpoint(checkpoint)                                                 
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 558, in persist_current_checkpoint       
(TunerInternal pid=2285)     _pyarrow_fs_copy_files(                                                                                                    
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/ray/train/_internal/storage.py", line 110, in _pyarrow_fs_copy_files           
(TunerInternal pid=2285)     return pyarrow.fs.copy_files(                                                                                              
(TunerInternal pid=2285)   File "/usr/local/lib/python3.10/site-packages/pyarrow/fs.py", line 272, in copy_files                                        
(TunerInternal pid=2285)     _copy_files_selector(source_fs, source_sel,                                                                                
(TunerInternal pid=2285)   File "pyarrow/_fs.pyx", line 1627, in pyarrow._fs._copy_files_selector                                                       
(TunerInternal pid=2285)   File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status  
OSError: When uploading part for key 'helm-asha-3/TorchTrainer_2024-02-14_16-02-39/TorchTrainer_7f103_00001_1_data_mute_prob=0.0592,data_swap_prob=0.1236,data_waveform_prob=0.6707,model_learning_rate=0.0707,model__2024-02-14_16-03-03/checkpoint_000004/checkpoint.ckpt' in bucket 'aframe-test': AWS Error ACCESS_DENIED during UploadPart operation

What is the quickest way to work around this, and is this being investigated? It is mentioned above that directly using s3fs resolves the problem. How can I set up my Tune run to do so? Thanks

@justinvyu
Contributor Author

@EthanMarx Sorry for missing this -- take a look at this section of the user guide: https://docs.ray.io/en/latest/train/user-guides/persistent-storage.html#fsspec-filesystems
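
For example, following that guide, you can wrap s3fs in a pyarrow-compatible filesystem and pass it to the run via RunConfig(storage_filesystem=...). A minimal sketch, assuming the s3fs package is installed and with an illustrative bucket/path:

```python
import pyarrow.fs
import s3fs
from ray import train

# Wrap the fsspec-based S3 implementation so Ray Train/Tune uses it in place
# of pyarrow's native S3FileSystem.
fs = pyarrow.fs.PyFileSystem(pyarrow.fs.FSSpecHandler(s3fs.S3FileSystem()))

run_config = train.RunConfig(
    storage_filesystem=fs,
    storage_path="my-bucket/experiment-results",  # bucket/prefix, no "s3://" scheme
    name="my_experiment",
)
# The same RunConfig can be passed to tune.Tuner(..., run_config=run_config).
```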

@justinvyu
Contributor Author

I think this may be fixed in pyarrow 15+ as seen from my investigation here: https://github.com/anyscale/product/issues/25536#issuecomment-2067338008

Closing for now.
