Summary
We have DEV and STG ECS clusters running on Windows-based EC2 instances, which start and stop automatically on a schedule or by user request (a simple Lambda scales all the resources in and out: ASG, ECS tasks, etc.). From time to time the EC2 instances get stuck with high CPU usage while executing the user-data script.
Description
For approximately 8-10 months we have been struggling with an issue where the EC2 instance starts and then hangs at ~60% CPU usage after a manual startup request, while executing a simple user-data script (UserScript.ps1) step (see the screenshots).
This issue happens sporadically and we are not able to reproduce it in a controlled way (by running the start/stop environment procedure multiple times).
It's not possible to connect to the EC2 instance with SSM during the high load. If we instead connect via RDP, terminate the CPU-consuming PowerShell process (pid 2260 in the screenshot), and re-run the same lines from UserScript.ps1 in an additional cmd window, the instance comes up and connects to the ECS service as expected (we can see the EC2 id in ECS -> Cluster -> Infrastructure -> Container instances).
Moreover, if we just terminate the problematic EC2 instance, the ASG creates a new instance, which starts serving as expected.
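Since terminating a stuck instance lets the ASG replace it with a healthy one, that workaround could in principle be automated. The following is only a minimal sketch of the decision logic, assuming the caller has already collected the launch time of each ASG instance and the set of EC2 ids registered as container instances in the cluster. The function name `find_stuck_instances`, the 15-minute grace period, and the instance ids in the example are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_instances(launch_times, registered_ids, now,
                         grace=timedelta(minutes=15)):
    """Return ids of instances that look stuck, sorted for stable output.

    launch_times   -- dict of instance id -> timezone-aware launch datetime
    registered_ids -- set of instance ids already registered with the cluster
    now            -- current timezone-aware datetime
    grace          -- how long an instance may take to register (assumed value)
    """
    return sorted(
        instance_id
        for instance_id, launched in launch_times.items()
        if instance_id not in registered_ids and now - launched > grace
    )


# Example with made-up instance ids.
now = datetime(2024, 12, 11, 8, 30, tzinfo=timezone.utc)
launch_times = {
    "i-aaa": datetime(2024, 12, 11, 8, 0, tzinfo=timezone.utc),   # old, unregistered
    "i-bbb": datetime(2024, 12, 11, 8, 0, tzinfo=timezone.utc),   # old, registered
    "i-ccc": datetime(2024, 12, 11, 8, 25, tzinfo=timezone.utc),  # still in grace period
}
stuck = find_stuck_instances(launch_times, {"i-bbb"}, now)
```

An instance is only flagged once it has been running longer than the grace period without registering, so a slow but healthy boot is not terminated prematurely. How the flagged instances are then terminated or marked unhealthy is left out of this sketch.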
The last messages from ECSTools log:
2024-12-11T08:00:04Z - [INFO]:Network adapter found with mac 00-15-5D-0F-82-58 on interface 2
2024-12-11T08:00:04Z - [INFO]:Getting subnet info from docker...
2024-12-11T08:00:04Z - [DEBUG]:Docker nat network config is: []
2024-12-11T08:00:13Z - [INFO]:Docker subnet: 172.27.224.0/20 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Docker gateway: 172.27.224.1 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Getting net ip address
2024-12-11T08:00:14Z - [INFO]:IP address not found.
Name           Value
----           -----
PrefixLength   32
IPAddress      169.254.170.2
InterfaceIndex 2
2024-12-11T08:00:14Z - [INFO]:Creating new virtual network adapter ip...
2024-12-11T08:00:14Z - [INFO]:Virtual network adapter ip created:
2024-12-11T08:00:14Z - [INFO]:Waiting for it to become available on the device...
Please pay extra attention to the message "Virtual network adapter ip created: ": it contains no IP address.
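To check other instances for the same signature, the log can be scanned for a "Virtual network adapter ip created:" line with nothing after the colon. This is a rough sketch only: the helper name `has_empty_adapter_ip` is hypothetical, and it assumes that a healthy run prints a value on the same line after the colon, which the logs above do not confirm.

```python
import re

# Matches the message and captures whatever follows the colon on that line.
CREATED_RE = re.compile(r"Virtual network adapter ip created:(.*)$")


def has_empty_adapter_ip(log_text: str) -> bool:
    """Return True if any 'ip created' line carries no value after the colon."""
    for line in log_text.splitlines():
        match = CREATED_RE.search(line)
        if match and not match.group(1).strip():
            return True
    return False


sample = (
    "2024-12-11T08:00:14Z - [INFO]:Creating new virtual network adapter ip...\n"
    "2024-12-11T08:00:14Z - [INFO]:Virtual network adapter ip created: \n"
    "2024-12-11T08:00:14Z - [INFO]:Waiting for it to become available on the device...\n"
)
found = has_empty_adapter_ip(sample)
```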
What we tried:
Increased the instance size from t3.* to c7i.large
Used the Core image instead of the Full Windows image
Tried a few previous AMI versions
Expected Behavior
ECS Agent service starts and connects to ECS service.
Observed Behavior
The EC2 instance is stuck with high CPU usage, never joins the ECS cluster, and just keeps running.
Environment Details
AMI id: ami-0a47ff78b42b54ea3 (Windows_Server-2019-English-Full-ECS_Optimized-2024.11.13, Core version is also affected)
docker info:
Client:
 Version: 25.0.6.m
 Context: default
 Debug Mode: false
Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 6
 Server Version: 25.0.6
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Microsoft Windows Server Version 1809 (OS Build 17763.6532)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.906GiB
 Name: EC2AMAZ-C43Q8DO
 ID: a6731a3f-f657-45e9-9c98-72bb7497f825
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
curl http://localhost:51678/v1/metadata while the instance is stuck:
PS C:\Users\Administrator> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing
Invoke-WebRequest : Unable to connect to the remote server
...
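The same check against the agent endpoint can be scripted so that "agent not reachable" is reported as a value instead of an exception. This is a minimal sketch using only the Python standard library. The URL is the one queried above, while the function name `agent_metadata` and the 3-second timeout are assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request


def agent_metadata(url="http://localhost:51678/v1/metadata", timeout=3.0):
    """Return the parsed JSON from the agent endpoint, or None if unreachable.

    None covers connection failures, timeouts, HTTP errors, and bodies that
    are not valid JSON.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        return None


# On a stuck instance this is expected to yield None, matching the
# "Unable to connect to the remote server" error shown above.
metadata = agent_metadata(timeout=1.0)
```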
We'll be shipping a change in the January 2025 AMIs that will prevent this condition from triggering. There was a case in the ECSTools module that could cause an infinite loop if the virtual network adapter IP could never be created. The change we are making won't fix the issue directly; however, it will allow the module to exit with an error and allow other logs to be emitted, giving more information about what is causing the error.
I will take a look at your attached zip and make a follow-up comment if I notice anything worth mentioning.
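The general idea behind the change described in this comment, replacing an unbounded wait with one that gives up and raises an error, can be illustrated as follows. This is not the ECSTools code (that module is PowerShell); it is a hedged Python sketch of a bounded retry loop, and the names `wait_for` and `AdapterIpTimeout` and the default of 30 attempts are hypothetical.

```python
import time


class AdapterIpTimeout(RuntimeError):
    """Raised when the awaited condition never becomes true."""


def wait_for(check, attempts=30, delay_seconds=1.0, sleep=time.sleep):
    """Call check() until it returns a truthy value or attempts run out.

    Returns (result, attempt_number) on success and raises AdapterIpTimeout
    instead of looping forever on failure.
    """
    for attempt in range(1, attempts + 1):
        result = check()
        if result:
            return result, attempt
        if attempt < attempts:
            sleep(delay_seconds)
    raise AdapterIpTimeout(f"condition not met after {attempts} attempts")


# Example: the value appears on the third poll (sleep is stubbed out).
answers = iter([None, None, "169.254.170.2"])
outcome = wait_for(lambda: next(answers), sleep=lambda seconds: None)
```

With a bounded loop, a failure surfaces as an error that can be logged, instead of a process that spins at high CPU indefinitely.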
Supporting Log Snippets
All the logs from C:\ProgramData\Amazon\ECS\log:
ECS_log.zip