
Windows-based AMI hangs with high CPU usage on user-data exec #4453

Open
SerhiiMy opened this issue Dec 11, 2024 · 2 comments

Comments

SerhiiMy commented Dec 11, 2024

Summary

We have DEV and STG ECS clusters running on Windows-based EC2 instances, which are started and stopped automatically on a schedule or on user request (a simple Lambda scales the ASG, ECS tasks, etc. in and out). From time to time the EC2 instances get stuck with high CPU usage while executing the user-data script.

Description

For approximately 8-10 months we have been struggling with an issue where an EC2 instance starts after a manual startup request and then hangs at ~60% CPU usage while executing a simple user-data script (UserScript.ps1). See the attached screenshots: powershell_top_cpu, process_explorer, cloudwatch.

UserScript.ps1 content:

Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
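Since the hang occurs inside this script, one possible mitigation (a sketch only, not verified against this AMI; the 10-minute timeout is an arbitrary assumption) is to run the agent initialization in a background job with a deadline, so user-data cannot block indefinitely:

```powershell
# Sketch: run ECS agent initialization in a job with a timeout so a hang in
# Initialize-ECSAgent cannot block user-data forever. Timeout is an assumption.
Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")

$job = Start-Job -ScriptBlock {
    Import-Module ECSTools
    Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
}

# Wait-Job returns nothing on timeout, so a falsy result means we hit the deadline.
if (-not (Wait-Job -Job $job -Timeout 600)) {
    Stop-Job -Job $job
    # Leave a breadcrumb so a hung initialization is visible after boot.
    Set-Content -Path "C:\ProgramData\Amazon\ECS\log\userdata-timeout.txt" -Value (Get-Date)
}
Receive-Job -Job $job
Remove-Job -Job $job -Force
```

This does not fix the root cause; it only bounds the hang and leaves evidence for later debugging.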

This issue happens sporadically and we are not able to reproduce it in a controlled way (by running the start/stop environment procedure multiple times).
It is not possible to connect to the EC2 instance with SSM during the high load. However, if we connect via RDP, terminate the CPU-consuming powershell process (PID 2260 in the screenshot), and re-run the same lines from UserScript.ps1 in an additional cmd window, the instance comes up and connects to the ECS service as expected (the EC2 instance ID appears under ECS -> Cluster -> Infrastructure -> Container instances).
Moreover, if we simply terminate the problematic instance, the ASG creates a new one that starts serving as expected.
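The manual recovery described above can be sketched as a script for on-call use. Selecting the top-CPU powershell.exe process is an assumption based on the screenshots; the PID should be verified (e.g. in Process Explorer) before killing anything:

```powershell
# Sketch of the manual recovery: kill the hung user-data powershell process,
# then re-run the agent initialization. Choosing the highest-CPU powershell.exe
# is an assumption from the screenshots; verify the PID first.
$hung = Get-Process -Name powershell |
        Where-Object { $_.Id -ne $PID } |
        Sort-Object CPU -Descending |
        Select-Object -First 1
if ($hung) { Stop-Process -Id $hung.Id -Force }

Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
```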

The last messages from ECSTools log:

2024-12-11T08:00:04Z - [INFO]:Network adapter found with mac 00-15-5D-0F-82-58 on interface 2
2024-12-11T08:00:04Z - [INFO]:Getting subnet info from docker...
2024-12-11T08:00:04Z - [DEBUG]:Docker nat network config is: []

2024-12-11T08:00:13Z - [INFO]:Docker subnet: 172.27.224.0/20 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Docker gateway: 172.27.224.1 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Getting net ip address
2024-12-11T08:00:14Z - [INFO]:IP address not found. 
Name                           Value                                                                                   
----                           -----                                                                                   
PrefixLength                   32                                                                                      
IPAddress                      169.254.170.2                                                                           
InterfaceIndex                 2                                                                                       

2024-12-11T08:00:14Z - [INFO]:Creating new virtual network adapter ip...
2024-12-11T08:00:14Z - [INFO]:Virtual network adapter ip created: 
2024-12-11T08:00:14Z - [INFO]:Waiting for it to become available on the device...

Please pay extra attention to the line "Virtual network adapter ip created: ": no IP address is printed in this message.

What we tried:

  • Increased the instance size from t3.* to c7i.large
  • Used the Core image instead of the Full Windows image
  • Tried a few previous AMI versions

Expected Behavior

ECS Agent service starts and connects to ECS service.

Observed Behavior

The EC2 instance gets stuck with high CPU usage, never joins the ECS cluster, and just keeps running.

Environment Details

AMI id: ami-0a47ff78b42b54ea3 (Windows_Server-2019-English-Full-ECS_Optimized-2024.11.13, Core version is also affected)

docker info:

Client:
 Version:    25.0.6.m
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 6
 Server Version: 25.0.6
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Microsoft Windows Server Version 1809 (OS Build 17763.6532)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.906GiB
 Name: EC2AMAZ-C43Q8DO
 ID: a6731a3f-f657-45e9-9c98-72bb7497f825
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

curl http://localhost:51678/v1/metadata while the instance is stuck:

PS C:\Users\Administrator> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing
Invoke-WebRequest : Unable to connect to the remote server
...

After manual re-exec of UserScript.ps1:

C:\Users\Administrator> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing

StatusCode        : 200
StatusDescription : OK
Content           : {"Cluster":"my-cluster-name-c1c68ba3-stage","ContainerInstanceArn":"arn:aws:ecs:us-east-1:123456789
                    40:container-instance/my-cluster-name-c1c68ba3-stage/056de432474e4fbc8a453e519d75e417","Vers...
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 245
                    Content-Type: application/json
                    Date: Wed, 11 Dec 2024 12:32:16 GMT

                    {"Cluster":"my-cluster-name-c1c68ba3-stage","ContainerInstanceArn":"arn:aws:ecs:us-east...
Forms             :
Headers           : {[Content-Length, 245], [Content-Type, application/json], [Date, Wed, 11 Dec 2024 12:32:16 GMT]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        :
RawContentLength  : 245

Supporting Log Snippets

All the logs from C:\ProgramData\Amazon\ECS\log:
ECS_log.zip


mcregan23 commented Dec 19, 2024

Hey @SerhiiMy,

We'll be shipping a change in the January 2025 AMIs that will prevent this condition from triggering. There was a case in the ECSTools module where an infinite loop could occur if the virtual network adapter IP could never be created. The change won't fix the underlying issue directly, but it will allow the module to exit with an error and let other logs be emitted, giving more information on what is causing it.

I will take a look at your attached zip and make a follow-up comment if I notice anything worth mentioning.

SerhiiMy (Author) commented:

@mcregan23, thank you very much for the update. If the issue is reproduced after the AMI update, I'll get back to this case with additional details.
