Pytorch w/ FastFile mode -- Downloading input data takes unexpectedly long #4616

epirussky · 2024-04-25T18:29:06Z


I'm using the PyTorch estimator with FastFile mode (my dataset is ~15 TB). The job starts and then spends hours on "Downloading input data..." before timing out. Given that it's FastFile mode, shouldn't the "Downloading" phase be quick, since the entire dataset is no longer downloaded upfront (as it is in File mode)?

Note: When I run this exact code on a smaller dataset without FastFile, everything works just fine.

import time

from sagemaker.inputs import TrainingInput
from sagemaker.pytorch import PyTorch

instance_type = "ml.g4dn.12xlarge"
role = ssm_credentials["role"]  # ssm_credentials is defined elsewhere (not shown)

estimator = PyTorch(
    role=role,
    instance_count=1,
    instance_type=instance_type,
    py_version="py310",
    framework_version="2.2.0",
    entry_point="main.py",
    source_dir="./",
    max_run=5 * 23 * 60 * 60,  # maximum runtime, in seconds
    distribution={"torch_distributed": {"enabled": True}},
    input_mode="FastFile",
)

# Launch the training job
filepath = "s3://skybluez/traindata/"
start_time = time.time()
estimator.fit(TrainingInput(filepath, input_mode="FastFile"))

The job fails with the following output:

python ./launch_estimator.py 
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/xdg-ubuntu/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/skyler/.config/sagemaker/config.yaml
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-04-24-23-19-58-354
2024-04-24 23:21:35 Starting - Starting the training job...
2024-04-24 23:22:05 Starting - Preparing the instances for training......
2024-04-24 23:22:48 Downloading - Downloading input data.................................................          .........................................................................................................                    ....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
2024-04-25 02:12:57 Failed - Training job failed
..Traceback (most recent call last):
  File "/home/skyler/Desktop/detr_fusion/./launch_estimator.py", line 88, in <module>
    main()
  File "/home/skyler/Desktop/detr_fusion/./launch_estimator.py", line 60, in main
    estimator.fit(inputs=TrainingInput(
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 5737, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 7880, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 7933, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job pytorch-training-2024-04-24-23-19-58-354: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
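
To put a number on "hours", the timestamps in the log above can be subtracted directly. The sketch below uses only the standard library for that; the helper names `elapsed` and `phase_durations` are made up for illustration. The second helper assumes a response dict shaped like the output of the SageMaker `DescribeTrainingJob` API (its `SecondaryStatusTransitions` list), which might help show where the time goes for a job like this one.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed(start, end):
    """Return the timedelta between two timestamps copied from the job log."""
    return datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)


def phase_durations(description):
    """Sum the seconds spent in each secondary status.

    `description` is assumed to be a DescribeTrainingJob response dict.
    Phases without an EndTime (still in progress) are skipped.
    """
    durations = {}
    for transition in description.get("SecondaryStatusTransitions", []):
        end_time = transition.get("EndTime")
        if end_time is None:
            continue
        seconds = (end_time - transition["StartTime"]).total_seconds()
        status = transition["Status"]
        durations[status] = durations.get(status, 0.0) + seconds
    return durations


# Timestamps copied from the log above.
download_started = "2024-04-24 23:22:48"  # "Downloading - Downloading input data"
job_failed = "2024-04-25 02:12:57"        # "Failed - Training job failed"
print(elapsed(download_started, job_failed))  # prints 2:50:09

# Hypothetical usage against the real job (requires boto3 and AWS credentials):
# import boto3
# desc = boto3.client("sagemaker").describe_training_job(
#     TrainingJobName="pytorch-training-2024-04-24-23-19-58-354"
# )
# print(phase_durations(desc))
```

So the job sat in the "Downloading" status for about 2 hours 50 minutes before it was marked as failed.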

SageMaker Python SDK version: 2.212.0
PyTorch framework version: 2.2.0
Python version: 3.10
Custom Docker image (Y/N): N
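
One thing that may be worth checking alongside the details above is how many objects sit under the S3 prefix, not just the ~15 TB total. This is only a diagnostic sketch: it assumes, without confirmation in this thread, that FastFile startup has to enumerate the objects under the prefix, so a very large number of small files could matter. `summarize_listing` and `list_pages` are hypothetical helper names; the boto3 `list_objects_v2` paginator is the only AWS call used, and it is imported lazily so the counting logic can be run without AWS access.

```python
def summarize_listing(pages):
    """Count objects and total bytes across list_objects_v2 result pages."""
    count = 0
    total_bytes = 0
    for page in pages:
        # A page with no matching keys has no "Contents" entry.
        for obj in page.get("Contents", []):
            count += 1
            total_bytes += obj["Size"]
    return count, total_bytes


def list_pages(bucket, prefix):
    """Yield list_objects_v2 pages for a prefix (requires boto3 and credentials)."""
    import boto3  # imported here so summarize_listing works without boto3

    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    return paginator.paginate(Bucket=bucket, Prefix=prefix)


# Hypothetical usage with the bucket and prefix from the post:
# count, total_bytes = summarize_listing(list_pages("skybluez", "traindata/"))
# print(f"{count} objects, {total_bytes / 1e12:.2f} TB")
```

Listing a very large prefix can itself take a while, which would be a hint in the same direction.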
