help: worker hangs while downloading many files (multi GB) in deadline cloud SMF #546
Comments
Hello, thanks for cutting this issue. Could you provide more details about the environment where you saw this behavior?
Hey @RaiaN, thanks for reporting your issue here. Let's get to the bottom of it... I see you are downloading files to …. Does your workflow attempt to clean up these downloaded files? If not, it is possible that the worker's file system is filling up to the point where there is no disk space available, which would disrupt the worker. The worker emits host metrics in its logs, for example:
You can get to the worker logs for a task that has been run by right-clicking on the task and clicking View worker logs, as shown in the screenshot below. Would you be able to look for the disk metrics and report them here to help (dis)prove my theory?
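As a quick way to sanity-check the disk-space theory directly on the worker host, something like the following sketch could be run there or added to a diagnostic step. It is only an illustration: the drive letter is a placeholder for whatever volume the files are actually being downloaded to.

```python
# Minimal sketch: report free disk space on the worker host.
# The path below is a placeholder; point it at the drive the downloads go to.
import shutil

def report_free_space(path: str) -> None:
    usage = shutil.disk_usage(path)
    print(f"{path}: {usage.free / 1024**3:.1f} GiB free of {usage.total / 1024**3:.1f} GiB")

if __name__ == "__main__":
    report_free_space("C:\\")  # placeholder drive letter
```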
Hey @RaiaN, have you had a chance to look into the theory about running out of disk space?
Hey @jusiskin @ttblanchard Thank you for the response. This is CMF, Windows machines; each has 60 vCPU cores, 64 GB RAM, an RTX 4090, and 500 GB of disk space.
Yes, there is plenty of disk space. The S3 directory that each worker downloads is under 2 GB.
This shouldn't be the issue. I can confirm it is not, as there is plenty of disk space on each worker node (at least 300 GB is free).
When I logged in to the render node I noticed that disk activity was almost non-existent, though the rendering job logs were simply saying … The workaround I've found was this: add --quiet to the aws s3 cp command.
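For reference, here is roughly what that workaround looks like when the copy is driven from a Python step script. This is only a sketch, with placeholder bucket, prefix, and destination values; the only meaningful change from the hanging case is the added --quiet flag.

```python
# Sketch of a step script invoking the AWS CLI with the --quiet workaround.
# The bucket URI and destination directory are placeholders.
import subprocess

def download_inputs(bucket_uri: str, dest_dir: str) -> None:
    subprocess.run(
        [
            "aws", "s3", "cp", bucket_uri, dest_dir,
            "--recursive",
            "--quiet",  # suppress per-file progress output; avoids the hang
        ],
        check=True,
    )

if __name__ == "__main__":
    download_inputs("s3://example-bucket/assets/", r"D:\persistent-cache\assets")
```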
Yes, the download location is outside the session directory. We want it to be somewhat persistent.
Ok, found logs that might be useful:
Worker logs (time interval from 2025-02-13 13:20:21 to 2025-02-13 13:32:21):
Managed to reproduce it again. Using --quiet works as a workaround. First worker task logs:
Second worker task logs:
Worker logs (first):
Worker logs (second):
Out of curiosity I've replaced the existing steps with just one that directly calls aws s3 cp.
Without --quiet:
Download step still going on (for ~23 hours):
Interesting find about the --quiet flag. In this issue, you've mentioned seeing output like:
I'm wondering how many lines like this you are seeing in the task log. The last timestamp you've pasted here 3 hours ago shows yesterday's date. If that was a recent log message, then it appears the worker agent is still catching up on uploading the logs to CloudWatch.

For a bit of context here, the worker agent synchronizes the standard output from the running command to CloudWatch Logs. It currently buffers the logs in memory while they are waiting to be synchronized to CloudWatch. If the volume of log output is large enough, the physical RAM on the worker host can fill up. Depending on the OS configuration of your worker host, it will go to swap (disk-backed memory), which would degrade performance like you are seeing.

To help get an idea of the volume of logs we're dealing with, you could run the …
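To make the mechanism described above a bit more concrete, here is a simplified sketch of the general pattern: process output is queued in memory and drained to CloudWatch Logs in batches via boto3's put_log_events. This is not the worker agent's actual implementation, and the log group/stream names are placeholders; it only illustrates why output produced faster than it can be uploaded makes the in-memory backlog (and RAM usage) grow.

```python
# Simplified illustration (not the worker agent's code) of buffering command
# output in memory and draining it to CloudWatch Logs in batches.
import time
from collections import deque

import boto3

logs = boto3.client("logs")
buffer: deque = deque()  # grows unbounded if lines arrive faster than they upload

def enqueue(line: str) -> None:
    """Called for every line the running command writes to stdout."""
    buffer.append({"timestamp": int(time.time() * 1000), "message": line})

def flush(log_group: str, log_stream: str, batch_size: int = 1000) -> None:
    """Upload one batch; each call is a single PutLogEvents request."""
    if not buffer:
        return
    batch = [buffer.popleft() for _ in range(min(batch_size, len(buffer)))]
    logs.put_log_events(
        logGroupName=log_group,   # placeholder names
        logStreamName=log_stream,
        logEvents=batch,
    )
```

In this framing, --quiet would help simply because it removes the very chatty per-file output from stdout, so far fewer events ever enter the buffer.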
Yes, there are NO new lines in the logs.
I don't believe this is the case because
Ahh! That might be the issue here, probably due to some kind of upload buffer size being used internally? Let me check the file size of this operation:
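The actual measurement isn't captured in this copy of the thread; as an illustration only, one way to total the size and file count of the downloaded data on the worker is a small script like this (the root path is a placeholder):

```python
# Sketch: total size and file count of a downloaded directory tree.
# The root path is a placeholder for the actual download location.
import os

def measure(root: str) -> tuple[int, int]:
    total_bytes, file_count = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
            file_count += 1
    return total_bytes, file_count

if __name__ == "__main__":
    size, count = measure(r"D:\persistent-cache\assets")  # placeholder path
    print(f"{count} files, {size / 1024**3:.2f} GiB")
```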
Ok, the download procedure definitely doesn't use the swap file. Used RAM is also < 15%.
Another possibility is that your AWS account's service quota limit for CloudWatch Logs PutLogEvents requests is being hit.
If you have a fleet of many workers, each with a high volume of logs from this … If the jobs are still running or were running within the last 3 hours, you can confirm this by: …
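The confirmation steps referenced above weren't captured here. As a loose illustration of the failure mode being described, the sketch below shows a log uploader retrying a throttled PutLogEvents call with exponential backoff; while it is stuck retrying, new output from the running command keeps accumulating in memory. The group/stream names are placeholders.

```python
# Sketch: retrying a throttled PutLogEvents call with exponential backoff.
# Log group/stream names are placeholders; this is illustrative only.
import time

import boto3
from botocore.exceptions import ClientError

logs = boto3.client("logs")

def put_with_backoff(events, log_group="/placeholder/group", log_stream="placeholder-stream"):
    delay = 1.0
    while True:
        try:
            return logs.put_log_events(
                logGroupName=log_group,
                logStreamName=log_stream,
                logEvents=events,
            )
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            # Throttled: back off and retry. Meanwhile, new log lines from the
            # command keep piling up in the uploader's in-memory buffer.
            time.sleep(delay)
            delay = min(delay * 2, 30.0)
```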
@jusiskin Re quotas, it is 5000/second indeed. However, what is not clear to me is how the Deadline worker can just be idle for so long and not do any checks w.r.t. progress, i.e. a timeout?
Nah, this happens on a single worker at a time; the other ones are idle.
Describe the bug
Using this simple code to download data from S3 during a single step (deadline job bundle):
Command execution hangs at:
Expected Behaviour
Command finishes successfully. The same command, when run from PowerShell without the Deadline Cloud worker context, simply doesn't hang.
Current Behaviour
The following command hangs when run under the Deadline Cloud worker context (a Python script as a step in a Deadline job bundle):
aws s3 cp <A> <B> --recursive
Command execution hangs:
Reproduction Steps
aws s3 cp <A> <B> --recursive
+ use an S3 bucket with at least 2 GB of data in it
+ many, many small files (like 1000 or 2000 files)

Environment
At minimum:
Job bundle template: