
Windows-based AMI hangs with high CPU usage on user-data exec #4453

Open
SerhiiMy opened this issue Dec 11, 2024 · 2 comments

Comments

SerhiiMy commented Dec 11, 2024

Summary

We have DEV and STG ECS clusters running on Windows-based EC2 instances, which are started and stopped automatically on a schedule or on user request (a simple Lambda scales the ASG, ECS tasks, etc. in and out). From time to time the EC2 instances get stuck with high CPU usage while executing the user-data script.

Description

For approximately 8-10 months we have been struggling with an issue where an EC2 instance starts after a manual startup request and then hangs at ~60% CPU usage while executing a simple user-data script (UserScript.ps1). See the attached screenshots: powershell_top_cpu, process_explorer, cloudwatch.

UserScript.ps1 content:

Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
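Since the hang occurs inside this script, one possible mitigation (a sketch only, not verified against this AMI; the 10-minute timeout is an arbitrary assumption) is to run the agent initialization in a background job with a deadline, so user-data cannot block indefinitely:

```powershell
# Sketch: run ECS agent initialization in a job with a timeout so a hang in
# Initialize-ECSAgent cannot block user-data forever. Timeout is an assumption.
Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")

$job = Start-Job -ScriptBlock {
    Import-Module ECSTools
    Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
}

# Wait-Job returns nothing on timeout, so a falsy result means we hit the deadline.
if (-not (Wait-Job -Job $job -Timeout 600)) {
    Stop-Job -Job $job
    # Leave a breadcrumb so a hung initialization is visible after boot.
    Set-Content -Path "C:\ProgramData\Amazon\ECS\log\userdata-timeout.txt" -Value (Get-Date)
}
Receive-Job -Job $job
Remove-Job -Job $job -Force
```

This does not fix the root cause; it only bounds the hang and leaves evidence for later debugging.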

This issue happens sporadically and we are not able to reproduce it in a controlled way (by running the start/stop environment procedure multiple times).
It is not possible to connect to the EC2 instance with SSM during the high load. However, if we connect via RDP, terminate the CPU-consuming powershell process (PID 2260 in the screenshot), and re-run the same lines from UserScript.ps1 in an additional cmd window, the instance comes up and connects to the ECS service as expected (the EC2 instance ID appears under ECS -> Cluster -> Infrastructure -> Container instances).
Moreover, if we simply terminate the problematic instance, the ASG creates a new one that starts serving as expected.
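The manual recovery described above can be sketched as a script for on-call use. Selecting the top-CPU powershell.exe process is an assumption based on the screenshots; the PID should be verified (e.g. in Process Explorer) before killing anything:

```powershell
# Sketch of the manual recovery: kill the hung user-data powershell process,
# then re-run the agent initialization. Choosing the highest-CPU powershell.exe
# is an assumption from the screenshots; verify the PID first.
$hung = Get-Process -Name powershell |
        Where-Object { $_.Id -ne $PID } |
        Sort-Object CPU -Descending |
        Select-Object -First 1
if ($hung) { Stop-Process -Id $hung.Id -Force }

Import-Module ECSTools
[Environment]::SetEnvironmentVariable("ECS_ENABLE_AWSLOGS_EXECUTIONROLE_OVERRIDE", $TRUE, "Machine")
Initialize-ECSAgent -Cluster my-cluster-name-c1c68ba3-stage -EnableTaskIAMRole -LoggingDrivers '["json-file","awslogs"]'
```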

The last messages from ECSTools log:

2024-12-11T08:00:04Z - [INFO]:Network adapter found with mac 00-15-5D-0F-82-58 on interface 2
2024-12-11T08:00:04Z - [INFO]:Getting subnet info from docker...
2024-12-11T08:00:04Z - [DEBUG]:Docker nat network config is: []

2024-12-11T08:00:13Z - [INFO]:Docker subnet: 172.27.224.0/20 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Docker gateway: 172.27.224.1 on attempt 29
2024-12-11T08:00:13Z - [INFO]:Getting net ip address
2024-12-11T08:00:14Z - [INFO]:IP address not found. 
Name                           Value                                                                                   
----                           -----                                                                                   
PrefixLength                   32                                                                                      
IPAddress                      169.254.170.2                                                                           
InterfaceIndex                 2                                                                                       

2024-12-11T08:00:14Z - [INFO]:Creating new virtual network adapter ip...
2024-12-11T08:00:14Z - [INFO]:Virtual network adapter ip created: 
2024-12-11T08:00:14Z - [INFO]:Waiting for it to become available on the device...

Please pay extra attention to the line "Virtual network adapter ip created: ": no IP address is printed in this message.

What we tried:

  • Increased the instance size from t3.* to c7i.large
  • Used the Core image instead of the Full Windows image
  • Tried a few previous AMI versions

Expected Behavior

ECS Agent service starts and connects to ECS service.

Observed Behavior

The EC2 instance gets stuck with high CPU usage, never joins the ECS cluster, and just keeps running.

Environment Details

AMI id: ami-0a47ff78b42b54ea3 (Windows_Server-2019-English-Full-ECS_Optimized-2024.11.13, Core version is also affected)

docker info:

Client:
 Version:    25.0.6.m
 Context:    default
 Debug Mode: false

Server:
 Containers: 0
  Running: 0
  Paused: 0
  Stopped: 0
 Images: 6
 Server Version: 25.0.6
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics internal l2bridge l2tunnel nat null overlay private transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Microsoft Windows Server Version 1809 (OS Build 17763.6532)
 OSType: windows
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.906GiB
 Name: EC2AMAZ-C43Q8DO
 ID: a6731a3f-f657-45e9-9c98-72bb7497f825
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

curl http://localhost:51678/v1/metadata while the instance is stuck:

PS C:\Users\Administrator> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing
Invoke-WebRequest : Unable to connect to the remote server
...

After manual re-exec of UserScript.ps1:

C:\Users\Administrator> Invoke-WebRequest -Uri http://localhost:51678/v1/metadata -UseBasicParsing

StatusCode        : 200
StatusDescription : OK
Content           : {"Cluster":"my-cluster-name-c1c68ba3-stage","ContainerInstanceArn":"arn:aws:ecs:us-east-1:123456789
                    40:container-instance/my-cluster-name-c1c68ba3-stage/056de432474e4fbc8a453e519d75e417","Vers...
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 245
                    Content-Type: application/json
                    Date: Wed, 11 Dec 2024 12:32:16 GMT

                    {"Cluster":"my-cluster-name-c1c68ba3-stage","ContainerInstanceArn":"arn:aws:ecs:us-east...
Forms             :
Headers           : {[Content-Length, 245], [Content-Type, application/json], [Date, Wed, 11 Dec 2024 12:32:16 GMT]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        :
RawContentLength  : 245

Supporting Log Snippets

All the logs from C:\ProgramData\Amazon\ECS\log:
ECS_log.zip


mcregan23 commented Dec 19, 2024

Hey @SerhiiMy,

We'll be shipping a change in the January 2025 AMIs that will prevent this condition from triggering. There was a case in the ECSTools module where an infinite loop could occur if the virtual network adapter IP could never be created. The change won't fix the underlying issue directly, but it will allow the module to exit with an error and let other logs be emitted, giving more information on what is causing it.

I will take a look at your attached zip and make a follow-up comment if I notice anything worth mentioning.

SerhiiMy (Author) commented:

@mcregan23, thank you very much for the update. If the issue is reproduced after the AMI update, I'll get back to this case with additional details.
