Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

help: worker hangs while downloading many files (multi GB) in deadline cloud SMF #546

Open
RaiaN opened this issue Feb 13, 2025 · 15 comments
Labels
help wanted Extra attention is needed needs triage A new issue that needs a first look question Further information is requested response-requested A response from the contributor has been requested.

Comments

@RaiaN
Copy link

RaiaN commented Feb 13, 2025

Describe the bug

Using this simple code to download data from S3 during single step (deadline job bundle)

 command = [
        "aws", "s3", "cp",
        s3_path,
        target_dir, "--recursive"
    ]

    try:
        print_log(f"Downloading data from s3 {s3_path} to {target_dir}")
        result = subprocess.run(
            command,
            check=True,
            shell=True,
            capture_output=True,
            text=True
        )
        print_log(f"Command output:\n{result.stdout}")
        if result.stderr:
            print_log(f"Command errors:\n{result.stderr}")
    except subprocess.CalledProcessError as e:
        print_log(f"Error running command: {e}")
        print_log(f"Command output:\n{e.stdout}")
        print_log(f"Command errors:\n{e.stderr}")
        return False
    except FileNotFoundError as fne:
        print_log(f"Failed to download data from s3. Check the path: {s3_path}")
        print_log(f"Error: {fne}")
        return False

Command execution hangs at:

2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining

Expected Behaviour

Command finishes successfully same command but when run from Powershell without deadline cloud worker context simply doesn't hang.

Current Behaviour

The following command hangs when run in under deadline cloud worker context (python script as step in deadline job bundle)
aws s3 cp <A> <B> --recursive

Command execution hangs:

2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.6 MiB/s) with 33 file(s) remaining
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining

Reproduction Steps

aws s3 cp <A> <B> --recursive + use s3 bucket with at least 2GB of data in it + many many small files (like 1000 or 2000 files).

Environment

At minimum:

  1. Operating system - Windows
  2. Python
  3. Latest version of deadline cloud worker (python library)

Job bundle template:

steps:
- name: Render Scene
  script:
    actions:
      onRun:
        command: python
        args: ['{{Param.LauncherFile}}']
@RaiaN RaiaN added bug Something isn't working needs triage A new issue that needs a first look labels Feb 13, 2025
@ttblanchard
Copy link
Contributor

Hello, thanks for cutting this issue. Could you provide more details about the environment where you saw this behavior?

  • Is this a Customer Managed Fleet (CMF) or Service Managed Fleet (SMF)?
  • Can you confirm that there is enough space on disk for the downloaded files?

@jusiskin
Copy link
Contributor

Hey @RaiaN,

Thanks for reporting your issue here. Let's get to the bottom of it...

I see you are downloading files to target_dir. I suspect that this is a location somewhere on the worker file-system that is outside of the session working directory (docs), but could you confirm this?

Does your workflow attempt to clean up these downloaded files? If not then it is possible that the worker's file-system is filling up to the point where there is no disk space available and disrupting the worker.

The worker emits host metrics in its logs, for example:

{"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "4.5", "memory-total-bytes": "8214212608", "memory-used-bytes": "505384960", "memory-used-percent": "6.2", "swap-used-bytes": "0", "total-disk-bytes": "268355743744", "total-disk-used-bytes": "5139046400", "total-disk-used-percent": "0.0", "user-disk-available-bytes": "263216697344", "network-sent-bytes-per-second": "2355", "network-recv-bytes-per-second": "2446", "disk-read-bytes-per-second": "7730150", "disk-write-bytes-per-second": "776064" }

You can get to the worker logs for a task that has been run by right-clicking on the task and clicking View worker logs as shown in the screenshot below:

view worker logs for task

Would you be able to look for the disk metrics and report them here to help (dis)prove my theory?

@jusiskin jusiskin added the help wanted Extra attention is needed label Feb 13, 2025
@jusiskin jusiskin changed the title Bug: aws s3 cp hangs for multi GB downloads (many small files) and when run from under deadline cloud worker account help: aws s3 cp hangs for multi GB downloads (many small files) and when run from under deadline cloud worker account Feb 13, 2025
@jusiskin jusiskin added question Further information is requested response-requested A response from the contributor has been requested. and removed bug Something isn't working labels Feb 13, 2025
@jusiskin jusiskin changed the title help: aws s3 cp hangs for multi GB downloads (many small files) and when run from under deadline cloud worker account help: worker hangs while downloading many files (multi GB) in deadline cloud SMF Feb 13, 2025
@jusiskin
Copy link
Contributor

Hey @RaiaN, have you had a chance to look into the theory about running out of disk space?

@RaiaN
Copy link
Author

RaiaN commented Feb 17, 2025

Hey @jusiskin @ttblanchard Thank you for response.

This is CMF, Windows Machines, each is 60 vCPU cores, 64 GB RAM, RTX 4090, 500 GB disk space.

Can you confirm that there is enough space on disk for the downloaded files?

Yes, there is plenty of disk space. S3 directory that each worker downloads is under 2GB.

Does your workflow attempt to clean up these downloaded files? If not then it is possible that the worker's file-system is filling up to the point where there is no disk space available and disrupting the worker.

This shouldn't be the issue. I confirm that this is not the issue as there is plenty of disk space on each worker node (at least 300 GB is free)

The worker emits host metrics in its logs, for example:

When I logged in into the render node I've noticed that disk activity is almost non existent though Rendering jobs logs were simply saying
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining

Workaround I've found was this: add --quiet flag to supress all aws s3 cp output and it worked! Does it mean some kind of connection issue that the worker node was having?

@RaiaN
Copy link
Author

RaiaN commented Feb 17, 2025

I suspect that this is a location somewhere on the worker file-system that is outside of the session working directory (docs), but could you confirm this?

Yes, the download location is outside session directory. We want it to be somewhat persistent.

@RaiaN
Copy link
Author

RaiaN commented Feb 17, 2025

Ok, found logs that might be useful:

2025/02/13 12:49:15+00:00 Completed 1.1 GiB/1.1 GiB (53.3 MiB/s) with 36 file(s) remaining
2025/02/13 13:22:57+00:00 Canceling subprocess 5644 via notify then terminate method at 2025-02-13T13:22:57Z.

Worker logs (time interval from 2025-02-13 13:20:21 to 2025-02-13 13:32:21:

[2025-02-13 13:20:21,604][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 0.0 memory-total-bytes 64384258048 memory-used-bytes 7195947008 memory-used-percent 11.2 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 338751631360 total-disk-used-percent 0.4 user-disk-available-bytes 573806215168 network-sent-bytes-per-second 6270 network-recv-bytes-per-second 4733 disk-read-bytes-per-second 176426863 disk-write-bytes-per-second 1026510532
[2025-02-13 13:20:32,622][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=721f792f-de4a-4d23-9e0e-1fe7b6b302e7
[2025-02-13 13:20:47,783][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=10fad4e5-28c9-4b55-b2c7-e9511b433039
[2025-02-13 13:21:02,935][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=69d85037-c21f-4a16-a7d5-aa28469194d8
[2025-02-13 13:21:18,659][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=45892dda-ae30-4958-be5c-8e103482fc54
[2025-02-13 13:21:21,632][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 0.0 memory-total-bytes 64384258048 memory-used-bytes 7728291840 memory-used-percent 12.0 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 338846040064 total-disk-used-percent 0.4 user-disk-available-bytes 573711806464 network-sent-bytes-per-second 10321 network-recv-bytes-per-second 54722 disk-read-bytes-per-second 176840917 disk-write-bytes-per-second 1028781440
[2025-02-13 13:21:33,834][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=4ad0f02d-8015-44f4-b2b2-1be6a955ee5e
[2025-02-13 13:21:48,983][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=7f6f2f17-192f-4ce0-9a55-8706b2a15910
[2025-02-13 13:22:04,140][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=0637aba5-5ee2-40c2-a02b-d59856bd31ab
[2025-02-13 13:22:19,300][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=0627d8ed-aa3a-4324-9cb3-bd1c8fd21596
[2025-02-13 13:22:21,646][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.2 memory-total-bytes 64384258048 memory-used-bytes 7154647040 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339004522496 total-disk-used-percent 0.4 user-disk-available-bytes 573553324032 network-sent-bytes-per-second 6767 network-recv-bytes-per-second 4387 disk-read-bytes-per-second 177062110 disk-write-bytes-per-second 1033199258
[2025-02-13 13:22:34,470][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=0d6bd5c8-4290-4b79-bb13-e19a29263dfe
[2025-02-13 13:22:49,645][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=6f1bd23e-7c33-49f4-b65f-8b8ff6f64b23
[2025-02-13 13:23:04,786][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=556ff5b6-54ad-4bd6-b456-f65c2761fa37
[2025-02-13 13:23:19,943][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e9c6e695-b1f7-47f2-af0b-a34265cc305e
[2025-02-13 13:23:21,670][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.6 memory-total-bytes 64384258048 memory-used-bytes 7135158272 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339004649472 total-disk-used-percent 0.4 user-disk-available-bytes 573553197056 network-sent-bytes-per-second 3384 network-recv-bytes-per-second 4361 disk-read-bytes-per-second 177066240 disk-write-bytes-per-second 1033251277
[2025-02-13 13:23:35,122][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e03b33cb-f979-4455-87b3-0a621534c45d
[2025-02-13 13:23:50,466][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=c2dc9e37-5499-4f67-815a-28f79637cd7f
[2025-02-13 13:24:05,627][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=7e965081-b557-4fca-9317-330be5be8038
[2025-02-13 13:24:20,794][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=542ffd5d-b61f-49ab-9711-f82f21f3baa0
[2025-02-13 13:24:21,692][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.6 memory-total-bytes 64384258048 memory-used-bytes 7132065792 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339004710912 total-disk-used-percent 0.4 user-disk-available-bytes 573553135616 network-sent-bytes-per-second 6850 network-recv-bytes-per-second 4424 disk-read-bytes-per-second 177067059 disk-write-bytes-per-second 1033274633
[2025-02-13 13:24:35,957][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=c61672d2-8bad-4da5-8c43-5de644b97574
[2025-02-13 13:24:51,112][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=80ed25d1-9b60-49ef-962e-cb99c673c47e
[2025-02-13 13:25:06,273][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=ec3707fb-6864-4b1c-8d0e-93dc687cba6a
[2025-02-13 13:25:21,422][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=032210b2-d360-4cc0-a5a4-45df77675c22
[2025-02-13 13:25:21,708][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.5 memory-total-bytes 64384258048 memory-used-bytes 7124414464 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339004706816 total-disk-used-percent 0.4 user-disk-available-bytes 573553139712 network-sent-bytes-per-second 3228 network-recv-bytes-per-second 4198 disk-read-bytes-per-second 177067059 disk-write-bytes-per-second 1033292774
[2025-02-13 13:25:36,608][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=cf9001a1-3a66-4c09-a7de-3cd3791a0ed9
[2025-02-13 13:25:51,748][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=3876d092-1676-4bc5-bc1a-8818ee5d5968
[2025-02-13 13:26:06,899][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e66d912c-60bb-4fea-9ba4-107089c69d1a
[2025-02-13 13:26:21,722][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.3 memory-total-bytes 64384258048 memory-used-bytes 7131783168 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339004772352 total-disk-used-percent 0.4 user-disk-available-bytes 573553074176 network-sent-bytes-per-second 6712 network-recv-bytes-per-second 4273 disk-read-bytes-per-second 177068151 disk-write-bytes-per-second 1033316471
[2025-02-13 13:26:22,048][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=49193ab5-86f1-4cb6-8c0a-4b30ae01039f
[2025-02-13 13:26:37,216][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=21349995-e46e-4e9f-bf87-78bb7c59b12d
[2025-02-13 13:26:52,374][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=01059f60-a244-4c6e-b0b2-9949bc1ae918
[2025-02-13 13:27:07,537][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=d98fd1a7-9c9f-42a4-9c57-3016d76a5969
[2025-02-13 13:27:21,739][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.4 memory-total-bytes 64384258048 memory-used-bytes 7127707648 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003326464 total-disk-used-percent 0.4 user-disk-available-bytes 573554520064 network-sent-bytes-per-second 3195 network-recv-bytes-per-second 4157 disk-read-bytes-per-second 177070609 disk-write-bytes-per-second 1033361408
[2025-02-13 13:27:22,697][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e3489b63-eba8-42b8-ba05-fff379126ab2
[2025-02-13 13:27:37,876][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=7a2405cc-ef2c-41e7-ad71-f6c484a397b9
[2025-02-13 13:27:53,031][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=afda3598-b6bb-498e-b5b9-57f70d6381ae
[2025-02-13 13:28:08,192][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=3a3fa295-9da4-4532-9c3e-ab1dcd12fb0c
[2025-02-13 13:28:21,752][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.3 memory-total-bytes 64384258048 memory-used-bytes 7135391744 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003256832 total-disk-used-percent 0.4 user-disk-available-bytes 573554589696 network-sent-bytes-per-second 6769 network-recv-bytes-per-second 4339 disk-read-bytes-per-second 177073886 disk-write-bytes-per-second 1033433728
[2025-02-13 13:28:23,369][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=24a5189a-efdd-4658-bc43-c46970f28c12
[2025-02-13 13:28:38,529][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=c50be7b6-be34-4666-b317-b6f8685f01be
[2025-02-13 13:28:53,697][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=f7b51a20-d230-4a62-b4bf-5c62c4955245
[2025-02-13 13:29:08,850][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=6d68cbb9-cc43-4e83-8d10-b924668dce17
[2025-02-13 13:29:21,775][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.3 memory-total-bytes 64384258048 memory-used-bytes 7136669696 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003322368 total-disk-used-percent 0.4 user-disk-available-bytes 573554524160 network-sent-bytes-per-second 3215 network-recv-bytes-per-second 4191 disk-read-bytes-per-second 177074705 disk-write-bytes-per-second 1033452365
[2025-02-13 13:29:24,297][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=196daccc-5fc9-47e0-93e4-4aa7e72f1fc5
[2025-02-13 13:29:39,477][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=d6bedb64-4613-4ce5-94d7-608a44e49251
[2025-02-13 13:29:54,627][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e509179c-5013-40b8-b7da-aff5780dbd1e
[2025-02-13 13:30:09,791][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=077f6f4b-8b30-4c7c-8ca0-4cf4ab6a1302
[2025-02-13 13:30:21,802][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.5 memory-total-bytes 64384258048 memory-used-bytes 7142764544 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003449344 total-disk-used-percent 0.4 user-disk-available-bytes 573554397184 network-sent-bytes-per-second 6754 network-recv-bytes-per-second 4349 disk-read-bytes-per-second 177076070 disk-write-bytes-per-second 1033483750
[2025-02-13 13:30:24,934][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e64b47e1-e2f5-4287-a406-ab5879579eca
[2025-02-13 13:30:40,107][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=c4a15bb9-e18a-4cab-bc1c-6a271ebe7846
[2025-02-13 13:30:55,265][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=844d41b7-e308-4d52-913d-ddd021d33dc0
[2025-02-13 13:31:10,443][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=e3036081-8bff-4418-a5e0-32e5c45df373
[2025-02-13 13:31:21,825][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.3 memory-total-bytes 64384258048 memory-used-bytes 7143792640 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003510784 total-disk-used-percent 0.4 user-disk-available-bytes 573554335744 network-sent-bytes-per-second 3273 network-recv-bytes-per-second 4331 disk-read-bytes-per-second 177076890 disk-write-bytes-per-second 1033503812
[2025-02-13 13:31:25,610][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=a95d207f-4de0-4673-9a21-521e6b030243
[2025-02-13 13:31:40,780][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=5cec304e-f242-439b-abec-e70345237cde
[2025-02-13 13:31:55,928][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=a559217b-92f7-43a1-afa4-b1650224315c
[2025-02-13 13:32:11,141][INFO    ] 📥 API.Resp 📥 [deadline:UpdateWorkerSchedule](200) params=(Duplicate removed, see previous response) request_id=84ed1574-1607-4959-a1ba-a0374891e273
[2025-02-13 13:32:21,836][INFO    ] 📊 Metrics.System 📊 cpu-usage-percent 8.5 memory-total-bytes 64384258048 memory-used-bytes 7137468416 memory-used-percent 11.1 swap-used-bytes 0 total-disk-bytes 912557846528 total-disk-used-bytes 339003211776 total-disk-used-percent 0.4 user-disk-available-bytes 573554634752 network-sent-bytes-per-second 6803 network-recv-bytes-per-second 4391 disk-read-bytes-per-second 177077581 disk-write-bytes-per-second 1033589709

@RaiaN
Copy link
Author

RaiaN commented Feb 17, 2025

Managed to reproduced it again. Using --quiet works as a workaround.

Frist worker task logs:

2025/02/17 14:35:04+00:00 Completed 1.1 GiB/1.1 GiB (90.0 MiB/s) with 32 file(s) remaining

Second worker task logs:

2025/02/17 14:37:28+00:00 Completed 620.4 MiB/2.1 GiB (49.6 MiB/s) with 770 file(s) remaining

Worker logs (first):

2025/02/17 14:34:44+00:00 {"level": "INFO", "ti": "🔷", "type": "Session", "subtype": "Info", "session_id": "session-331c087770a14743bbda65eb2339f539", "message": "Fully warmed Job Entity Cache", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-788328ed2cf744489b0ba8649afcc545"}
2025/02/17 14:34:45+00:00 {"level": "INFO", "ti": "🟢", "type": "Action", "subtype": "Start", "session_id": "session-331c087770a14743bbda65eb2339f539", "action_id": "sessionaction-331c087770a14743bbda65eb2339f539-0", "kind": "JobAttachSyncInput", "message": "Action started.", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-788328ed2cf744489b0ba8649afcc545"}
2025/02/17 14:34:45+00:00 {"level": "INFO", "ti": "📤", "type": "API", "subtype": "Req", "operation": "sts:GetCallerIdentity", "params": "*REDACTED*", "request_url": "*REDACTED*"}
2025/02/17 14:34:45+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "sts:GetCallerIdentity", "status_code": 200, "params": "*REDACTED*", "request_id": "d7c53cb9-f9cd-4a00-9c17-aef479b0021e"}
2025/02/17 14:34:47+00:00 {"level": "INFO", "ti": "🟣", "type": "Action", "subtype": "End", "session_id": "session-331c087770a14743bbda65eb2339f539", "action_id": "sessionaction-331c087770a14743bbda65eb2339f539-0", "kind": "JobAttachSyncInput", "message": "Action complete.", "status": "SUCCEEDED", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-788328ed2cf744489b0ba8649afcc545"}
2025/02/17 14:34:47+00:00 {"level": "INFO", "ti": "🟢", "type": "Action", "subtype": "Start", "session_id": "session-331c087770a14743bbda65eb2339f539", "action_id": "sessionaction-331c087770a14743bbda65eb2339f539-1", "kind": "TaskRun", "message": "Action started.", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-788328ed2cf744489b0ba8649afcc545", "step_id": "step-a4385ecf0d8b47cdb7af7d3da88847f1", "task_id": "task-a4385ecf0d8b47cdb7af7d3da88847f1-0"}
2025/02/17 14:34:48+00:00 {"level": "INFO", "ti": "🔷", "type": "Session", "subtype": "Runtime", "session_id": "session-331c087770a14743bbda65eb2339f539", "message": "Command started as pid: 7368", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-788328ed2cf744489b0ba8649afcc545"}
2025/02/17 14:34:59+00:00 {"level": "INFO", "ti": "📤", "type": "API", "subtype": "Req", "operation": "deadline:UpdateWorkerSchedule", "params": {"updatedSessionActions": {"sessionaction-331c087770a14743bbda65eb2339f539-0": {"startedAt": "2025-02-17T14:34:45.021985Z", "completedStatus": "SUCCEEDED", "endedAt": "2025-02-17T14:34:47.986226Z"}, "sessionaction-331c087770a14743bbda65eb2339f539-1": {"startedAt": "2025-02-17T14:34:47.995608Z", "updatedAt": "2025-02-17T14:34:48.017177Z"}}}, "request_url": "https://scheduling.deadline.us-east-1.amazonaws.com/2023-10-12/farms/farm-094170923091425c871369126a0cfa40/fleets/fleet-b08812c9245e47e39feb9fe711b11ca8/workers/worker-29c325576b1b4ab79997eae95257c401/schedule", "resource": {"farm-id": "farm-094170923091425c871369126a0cfa40", "fleet-id": "fleet-b08812c9245e47e39feb9fe711b11ca8", "worker-id": "worker-29c325576b1b4ab79997eae95257c401"}}
2025/02/17 14:34:59+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": {"assignedSessions": {"session-331c087770a14743bbda65eb2339f539": {"queueId": "queue-263a848e8e4e4f84af8cb5491b753147", "jobId": "job-788328ed2cf744489b0ba8649afcc545", "sessionActions": [{"sessionActionId": "sessionaction-331c087770a14743bbda65eb2339f539-1", "definition": {"taskRun": {"taskId": "task-a4385ecf0d8b47cdb7af7d3da88847f1-0", "stepId": "step-a4385ecf0d8b47cdb7af7d3da88847f1", "parameters": "*REDACTED*"}}}], "logConfiguration": {"logDriver": "awslogs", "options": {"logGroupName": "/aws/deadline/farm-094170923091425c871369126a0cfa40/queue-263a848e8e4e4f84af8cb5491b753147", "logStreamName": "session-331c087770a14743bbda65eb2339f539"}, "parameters": {}}}}, "cancelSessionActions": {}, "updateIntervalSeconds": 15}, "request_id": "a809b957-2eaf-43f6-9be7-a208ce0edcf6"}
2025/02/17 14:35:15+00:00 {"level": "INFO", "ti": "📤", "type": "API", "subtype": "Req", "operation": "deadline:UpdateWorkerSchedule", "params": {"updatedSessionActions": {}}, "request_url": "https://scheduling.deadline.us-east-1.amazonaws.com/2023-10-12/farms/farm-094170923091425c871369126a0cfa40/fleets/fleet-b08812c9245e47e39feb9fe711b11ca8/workers/worker-29c325576b1b4ab79997eae95257c401/schedule", "resource": {"farm-id": "farm-094170923091425c871369126a0cfa40", "fleet-id": "fleet-b08812c9245e47e39feb9fe711b11ca8", "worker-id": "worker-29c325576b1b4ab79997eae95257c401"}}
2025/02/17 14:35:15+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "037e435c-0615-4b0b-91af-0e68f0e1514b"}
2025/02/17 14:35:30+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "8a953883-5731-41c6-989b-7a87c4de2456"}
2025/02/17 14:35:42+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "0.0", "memory-total-bytes": "64384258048", "memory-used-bytes": "7795363840", "memory-used-percent": "12.1", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "342666813440", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "569891033088", "network-sent-bytes-per-second": "59236", "network-recv-bytes-per-second": "20187149", "disk-read-bytes-per-second": "340620587", "disk-write-bytes-per-second": "1959594428"}
2025/02/17 14:35:46+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "11eb6853-cc6b-472d-867d-4ec54ff6df27"}
2025/02/17 14:36:01+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "3571eb62-8d5c-4769-b37c-1dc90fba81f4"}
2025/02/17 14:36:16+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "7c357193-666a-4bf6-ad31-5487ff4d46ba"}
2025/02/17 14:36:32+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "2d2f88a6-5cb6-4a47-bc50-224dbe595c84"}
2025/02/17 14:36:42+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "7.5", "memory-total-bytes": "64384258048", "memory-used-bytes": "7621877760", "memory-used-percent": "11.8", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "342666817536", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "569891028992", "network-sent-bytes-per-second": "633", "network-recv-bytes-per-second": "1211", "disk-read-bytes-per-second": "340706876", "disk-write-bytes-per-second": "1962605005"}
2025/02/17 14:36:47+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "88a45435-365e-4ed3-bbc5-8a2fe7a70f18"}
2025/02/17 14:37:03+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "600bc585-030e-482d-b04e-b837fbc11194"}
2025/02/17 14:37:18+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "23e17430-cb83-4421-98b2-bd425c441608"}
2025/02/17 14:37:33+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "c27ca539-4453-4887-b7c9-dc76f21849ca"}
2025/02/17 14:37:42+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "7.5", "memory-total-bytes": "64384258048", "memory-used-bytes": "7620362240", "memory-used-percent": "11.8", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "342666817536", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "569891028992", "network-sent-bytes-per-second": "763", "network-recv-bytes-per-second": "4811", "disk-read-bytes-per-second": "340712337", "disk-write-bytes-per-second": "1962659294"}
2025/02/17 14:37:49+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "c6e43231-7cff-45a9-b3ed-13ba70f1db4d"}
2025/02/17 14:38:04+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "ba7a6b68-4e63-470c-9159-b2651d79eb5f"}
2025/02/17 14:38:20+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "9794d70d-d481-44dd-8d5e-36b8ece1d9dd"}
2025/02/17 14:38:35+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "ee7c3791-074d-4029-9f4f-29a056600d0b"}
2025/02/17 14:38:42+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "12.4", "memory-total-bytes": "64384258048", "memory-used-bytes": "7618764800", "memory-used-percent": "11.8", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "342666817536", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "569891028992", "network-sent-bytes-per-second": "661", "network-recv-bytes-per-second": "1298", "disk-read-bytes-per-second": "340712337", "disk-write-bytes-per-second": "1962671795"}
2025/02/17 14:38:50+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "5fce78e4-7c85-499c-b33e-a5c78a05a847"}
2025/02/17 14:39:06+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "c55e45d3-b165-4cb8-8b54-1c0db9b2f15d"}
2025/02/17 14:39:21+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "6460581e-020e-4438-a7a2-4bae9b6620fe"}
2025/02/17 14:39:37+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "47e29b97-86b9-44b0-a65f-a967e9ef1b89"}
2025/02/17 14:39:42+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "0.0", "memory-total-bytes": "64384258048", "memory-used-bytes": "7621111808", "memory-used-percent": "11.8", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "342666821632", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "569891024896", "network-sent-bytes-per-second": "792", "network-recv-bytes-per-second": "4926", "disk-read-bytes-per-second": "340712337", "disk-write-bytes-per-second": "1962684339"}

Worker logs (second):

2025/02/17 14:36:49+00:00 {"level": "INFO", "ti": "🟢", "type": "Action", "subtype": "Start", "session_id": "session-84478c4fe47e40b8977fb17f58f154be", "action_id": "sessionaction-84478c4fe47e40b8977fb17f58f154be-1", "kind": "TaskRun", "message": "Action started.", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-afb13af6a8f241db811df23e35724493", "step_id": "step-76a1c6b80f5f4343acd637f69336df42", "task_id": "task-76a1c6b80f5f4343acd637f69336df42-0"}
2025/02/17 14:36:49+00:00 {"level": "INFO", "ti": "🔷", "type": "Session", "subtype": "Runtime", "session_id": "session-84478c4fe47e40b8977fb17f58f154be", "message": "Command started as pid: 2780", "queue_id": "queue-263a848e8e4e4f84af8cb5491b753147", "job_id": "job-afb13af6a8f241db811df23e35724493"}
2025/02/17 14:37:00+00:00 {"level": "INFO", "ti": "📤", "type": "API", "subtype": "Req", "operation": "deadline:UpdateWorkerSchedule", "params": {"updatedSessionActions": {"sessionaction-84478c4fe47e40b8977fb17f58f154be-0": {"startedAt": "2025-02-17T14:36:46.423784Z", "completedStatus": "SUCCEEDED", "endedAt": "2025-02-17T14:36:49.367645Z"}, "sessionaction-84478c4fe47e40b8977fb17f58f154be-1": {"startedAt": "2025-02-17T14:36:49.440493Z", "updatedAt": "2025-02-17T14:36:49.463997Z"}}}, "request_url": "https://scheduling.deadline.us-east-1.amazonaws.com/2023-10-12/farms/farm-094170923091425c871369126a0cfa40/fleets/fleet-b08812c9245e47e39feb9fe711b11ca8/workers/worker-cfb1e5ce843445c78cf04b97381fc418/schedule", "resource": {"farm-id": "farm-094170923091425c871369126a0cfa40", "fleet-id": "fleet-b08812c9245e47e39feb9fe711b11ca8", "worker-id": "worker-cfb1e5ce843445c78cf04b97381fc418"}}
2025/02/17 14:37:01+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": {"assignedSessions": {"session-84478c4fe47e40b8977fb17f58f154be": {"queueId": "queue-263a848e8e4e4f84af8cb5491b753147", "jobId": "job-afb13af6a8f241db811df23e35724493", "sessionActions": [{"sessionActionId": "sessionaction-84478c4fe47e40b8977fb17f58f154be-1", "definition": {"taskRun": {"taskId": "task-76a1c6b80f5f4343acd637f69336df42-0", "stepId": "step-76a1c6b80f5f4343acd637f69336df42", "parameters": "*REDACTED*"}}}], "logConfiguration": {"logDriver": "awslogs", "options": {"logGroupName": "/aws/deadline/farm-094170923091425c871369126a0cfa40/queue-263a848e8e4e4f84af8cb5491b753147", "logStreamName": "session-84478c4fe47e40b8977fb17f58f154be"}, "parameters": {}}}}, "cancelSessionActions": {}, "updateIntervalSeconds": 15}, "request_id": "b1b4c2dc-4099-482d-9847-199bdab15668"}
2025/02/17 14:37:16+00:00 {"level": "INFO", "ti": "📤", "type": "API", "subtype": "Req", "operation": "deadline:UpdateWorkerSchedule", "params": {"updatedSessionActions": {}}, "request_url": "https://scheduling.deadline.us-east-1.amazonaws.com/2023-10-12/farms/farm-094170923091425c871369126a0cfa40/fleets/fleet-b08812c9245e47e39feb9fe711b11ca8/workers/worker-cfb1e5ce843445c78cf04b97381fc418/schedule", "resource": {"farm-id": "farm-094170923091425c871369126a0cfa40", "fleet-id": "fleet-b08812c9245e47e39feb9fe711b11ca8", "worker-id": "worker-cfb1e5ce843445c78cf04b97381fc418"}}
2025/02/17 14:37:16+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "f83b4e2e-6a15-4bdb-8866-0f061d14a437"}
2025/02/17 14:37:26+00:00 {"level": "INFO", "ti": "📊", "type": "Metrics", "subtype": "System", "cpu-usage-percent": "7.6", "memory-total-bytes": "64384258048", "memory-used-bytes": "5465219072", "memory-used-percent": "8.5", "swap-used-bytes": "0", "total-disk-bytes": "912557846528", "total-disk-used-bytes": "344673226752", "total-disk-used-percent": "0.4", "user-disk-available-bytes": "567884619776", "network-sent-bytes-per-second": "155911", "network-recv-bytes-per-second": "39786425", "disk-read-bytes-per-second": "295033532", "disk-write-bytes-per-second": "1544184644"}
2025/02/17 14:37:32+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "9ffc9335-8118-47fe-8a0b-9e35dfd7fc1f"}
2025/02/17 14:37:47+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "2df86f35-762b-4bd8-81d5-f8aaefbeb1cd"}
2025/02/17 14:38:02+00:00 {"level": "INFO", "ti": "📥", "type": "API", "subtype": "Resp", "operation": "deadline:UpdateWorkerSchedule", "status_code": 200, "params": "(Duplicate removed, see previous response)", "request_id": "b2062b64-92c4-49a3-8b5d-4ac7ff7393fa"}

@RaiaN
Copy link
Author

RaiaN commented Feb 17, 2025

Our of curiosity I've replaced existing steps with just one to directly call aws s3 cp:

steps:
- name: Render Scene
  script:
    actions:
      onRun:
        command: aws
        args: ['s3', 'cp', 's3://BUCKET_NAME/BUCKET_SCENE', 'C:/LOCAL_SCENES', '--recursive']
        cancelation:
          mode: NOTIFY_THEN_TERMINATE
          notifyPeriodInSeconds: 2

Without --quiet -> hangs
With --quiet -> does not hang

@RaiaN
Copy link
Author

RaiaN commented Feb 18, 2025

Download step still going on (for ~23 hours):

2025/02/17 14:35:04+00:00 Completed 1.1 GiB/1.1 GiB (90.0 MiB/s) with 32 file(s) remaining 

@jusiskin
Copy link
Contributor

jusiskin commented Feb 18, 2025

Interesting find about the --quiet work-around. I'm wondering if the volume of log output produced by aws s3 cp --recursive ... without the --quiet flag is causing the issue.

In this issue, you've mentioned seeing output like:

2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.6 MiB/s) with 33 file(s) remaining
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining
2025/02/13 16:45:14+00:00 Completed 1.1 GiB/1.1 GiB (100.5 MiB/s) with 33 file(s) remaining

I'm wondering how many lines you see like this you are seeing in the task log. The last timestamp you've pasted here 3 hours ago shows yesterday's date. If that was a recent log message then it appears the worker agent is still catching up to upload the logs to CloudWatch.

For a bit of context here, the worker agent synchronizes the standard output from the running command to CloudWatch logs. It currently buffers the logs in memory while they are waiting be synchronized to CloudWatch. If the volume of log output is large enough, the physical RAM on the worker host can fill up. Depending on the OS configuration of your worker host, it will go to swap (disk-backed memory) which would degrade performance like you are seeing.

To help get an idea of the volume of logs we're dealing with, you could run the aws s3 cp ... command locally and pipe the standard output to a file. You can then check the size of the resulting file and also the number of lines of text, that would help confirm this theory.

@RaiaN
Copy link
Author

RaiaN commented Feb 18, 2025

@jusiskin

I'm wondering how many lines you see like this you are seeing in the task log. The last timestamp you've pasted here 3 hours ago shows yesterday's date.

Yes, there are NO new lines in the logs.

If the volume of log output is large enough, the physical RAM on the worker host can fill up

I don't believe this is the case because memory-used-percent is always < 20%.

the worker agent synchronizes the standard output from the running command to CloudWatch logs

Ahh! That might be the issue here, probably due to some kind of upload buffer size being used internally?

Let me check the file size of this operation: pipe standard output to a file.

@RaiaN
Copy link
Author

RaiaN commented Feb 18, 2025

Ok, std output file size is under 736KB. I believe there is some sort of buffering issue when uploading logs to CloudWatch.

The download procedure definitely doesn't go use swap file. Used RAM is also < 15%.

@jusiskin
Copy link
Contributor

Another possibility is that your AWS account's service quota limit for PutLogEvents TPS is being reached. There is an account-wide limit for how many PutLogEvents API requests can be made per second for a given account/region (docs). At the time of writing this, it states:

5000 transactions per second per account per Region You can request an increase to the per-second throttling quota by using the Service Quotas service.

If you have a fleet of many workers, each with a high volume of logs from this aws s3 cp operation, then they may all be trying to upload these logs at the same time. If this is the case, they may get throttled and log upload will take longer.

If the jobs are still running or were running within the last 3 hours, you can confirm this by:

  1. Login to the AWS Console for the relevant AWS account
  2. Go the the Service Quotas console
  3. Click AWS Services on the navigation pane on the left-hand side
  4. Type Amazon CloudWatch Logs in the search bar and click the result with the same name to be taken to the service quotas for CloudWatch logs
  5. Type PutLogEvents in the search bar and click PutLogEvents throttle limit in transactions per second
  6. The service quota detail page shows a graph showing the account's rate of PutLogEvents requests for the last 3 hours.

@RaiaN
Copy link
Author

RaiaN commented Feb 18, 2025

@jusiskin Re quotas, it is 5000 / second indeed.

However, what is not clear to me is how so Deadline worker can just be idle for so long and don't do any checks wrt progress aka Timeout?

If you have a fleet of many workers

Nah, this happens at a single worker at a time, other ones are idle

@RaiaN
Copy link
Author

RaiaN commented Feb 18, 2025

Image
I doubt we hit the quota

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed needs triage A new issue that needs a first look question Further information is requested response-requested A response from the contributor has been requested.
Projects
None yet
Development

No branches or pull requests

3 participants