I'm using the PyTorch estimator with FastFile mode (my dataset is ~15 TB). The job starts and then spends hours on "Downloading input data..." before timing out. Given that it's FastFile mode, shouldn't the "Downloading" phase be quick, since the entire dataset is no longer downloaded upfront (as it is in File mode)?
Note: When I run this exact code on a smaller dataset without FastFile, everything works just fine.
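For context, here is a minimal sketch of what a FastFile channel looks like in the `InputDataConfig` of the underlying CreateTrainingJob request. It is not the code from `launch_estimator.py`: the helper `fastfile_channel`, the bucket, the prefix, and the channel name are all hypothetical placeholders. In the SageMaker Python SDK the same setting is normally expressed as `TrainingInput(s3_data=..., input_mode="FastFile")`.

```python
def fastfile_channel(s3_uri: str, channel_name: str = "training") -> dict:
    """Build one InputDataConfig entry that requests FastFile mode.

    Hypothetical helper for illustration; the returned dict mirrors the
    Channel structure of the CreateTrainingJob API.
    """
    if not s3_uri.startswith("s3://"):
        raise ValueError("FastFile mode streams from S3, so the URI must start with s3://")
    return {
        "ChannelName": channel_name,
        # "File" copies the whole dataset before training starts;
        # "FastFile" is meant to stream objects on demand instead.
        "InputMode": "FastFile",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Placeholder URI, not the real dataset location.
channel = fastfile_channel("s3://example-bucket/example-prefix/")
print(channel["InputMode"])
```

With this configuration I would expect the "Downloading" phase to be short, since nothing should be copied to the instance before training starts.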
python ./launch_estimator.py
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/xdg-ubuntu/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/skyler/.config/sagemaker/config.yaml
INFO:sagemaker.image_uris:image_uri is not presented, retrieving image_uri based on instance_type, framework etc.
INFO:sagemaker:Creating training-job with name: pytorch-training-2024-04-24-23-19-58-354
2024-04-24 23:21:35 Starting - Starting the training job...
2024-04-24 23:22:05 Starting - Preparing the instances for training......
2024-04-24 23:22:48 Downloading - Downloading input data................................................. ......................................................................................................... ....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
2024-04-25 02:12:57 Failed - Training job failed
..Traceback (most recent call last):
  File "/home/skyler/Desktop/detr_fusion/./launch_estimator.py", line 88, in <module>
    main()
  File "/home/skyler/Desktop/detr_fusion/./launch_estimator.py", line 60, in main
    estimator.fit(inputs=TrainingInput(
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/workflow/pipeline_context.py", line 346, in wrapper
    return run_func(*args, **kwargs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/estimator.py", line 1341, in fit
    self.latest_training_job.wait(logs=logs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/estimator.py", line 2677, in wait
    self.sagemaker_session.logs_for_job(self.job_name, wait=True, log_type=logs)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 5737, in logs_for_job
    _logs_for_job(self, job_name, wait, poll, log_type, timeout)
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 7880, in _logs_for_job
    _check_job_status(job_name, description, "TrainingJobStatus")
  File "/home/skyler/leoai/lib/python3.10/site-packages/sagemaker/session.py", line 7933, in _check_job_status
    raise exceptions.UnexpectedStatusException(
sagemaker.exceptions.UnexpectedStatusException: Error for Training job pytorch-training-2024-04-24-23-19-58-354: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
The error is the `UnexpectedStatusException` traceback shown above.
SageMaker Python SDK version: 2.212.0
PyTorch framework version: 2.2.0
Python version: 3.10
Custom Docker image (Y/N): N