Bug: Envoy Proxy segfault when nofile ulimit reached #181
Comments
This is an explicit Is it possible for you to get the value of:
before your test begins? |
I did find a couple docs that talk about specifying ulimits in task definitions:
Once we determine what the existing values are, we can try updating the task definitions and see whether the problem continues. |
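One way to capture the existing values is to read them from inside the container before the test starts. The following is a minimal sketch, assuming Python is available in the container; the helper name `nofile_limits` is hypothetical. It reports the limits of the process that runs it, so it has to run in the same container (and ideally under the same user) as Envoy to be meaningful.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = nofile_limits()
# A hard limit equal to resource.RLIM_INFINITY means "unlimited".
print(f"nofile soft={soft} hard={hard}")
```

The soft value is the one the process actually runs into; the hard value is the ceiling the soft limit can be raised to without extra privileges.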
The existing nofile ulimit was the default of 1024. I already raised my ulimit to stop the crash. However, I do not think it is right for Envoy to crash with a segfault when the limit is reached. For example, if Nginx reaches this ulimit you get a timeout or connection reset for that individual request and an error in the logs, but the server stays up and other requests are still handled (up to the limit). My expectation is that the software should handle this edge case gracefully and begin dropping requests while still maintaining a baseline level of traffic, rather than blowing up completely and terminating. I'm not sure whether this crash is coming from Envoy core or from an App Mesh specific addition, though. |
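For context on what "reaching the limit" looks like to a process: on Linux, a call that needs a new descriptor fails with `EMFILE` once the soft nofile limit is hit, and what happens next is up to the software. The sketch below is not Envoy or Nginx code. It is an illustration that temporarily lowers the soft limit for the current process, opens descriptors until the kernel refuses, and returns the resulting errno. The function name `exhaust_descriptors` and the limit of 64 are assumptions chosen for the example.

```python
import errno
import os
import resource


def exhaust_descriptors(soft_limit=64):
    """Lower the soft nofile limit, open descriptors until the kernel
    refuses, and return the errno of the failure."""
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may not exceed the hard limit.
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno
    finally:
        # Release everything and restore the original limit.
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


code = exhaust_descriptors()
print(errno.errorcode[code])
```

A server that treats this error as a per-request failure keeps running at the limit; one that treats it as fatal terminates, which is the difference being discussed here.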
Maybe we're running out of memory during the panic? |
This is a deliberate decision made in Envoy. Until we decide to change that, or whether we should change that, what's the suggested recourse in this situation? |
Yeah since |
Okay, based on @mattklein123's "Crash Early and Crash Often" article on Medium, I'm guessing this crash might be intended behavior in Envoy. However, the crash seems poorly documented, and crashing like that is not typical behavior for AWS related services, so App Mesh users are just going to think that something is wrong with the service. I'd recommend we add a troubleshooting section to the App Mesh documentation that mentions this crash specifically and includes instructions on how to raise ulimits on EC2, ECS, and EKS.
@nathanpeck I will add a FAQ entry on this topic as it comes up occasionally. The TL;DR is that this is not something we are going to change, as gracefully handling FD exhaustion can lead to unexpected behavior in terms of not handling HC connections, other admin tasks, etc. It's much better to control Envoy resource usage with all of the other knobs we have (circuit breakers, overload manager, etc.). Anyway, I will add a FAQ entry.
Thanks Matt! That makes a lot of sense |
Summary
When Envoy hits the nofile ulimit it crashes with a segfault
Steps to Reproduce
Run the Envoy proxy sidecar in an ECS task that has the default ulimits, with stats and DogStatsD turned on and X-Ray turned off. Send a large amount of concurrent traffic, such that there are enough open sockets to exhaust the file descriptors.
Are you currently working around this issue?
Raise the ulimits to allow more file descriptors
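On ECS, the limit is set per container through the `ulimits` list in the container definition, where each entry has a `name`, a `softLimit`, and a `hardLimit`. The following is a minimal sketch of adding a nofile entry to a container definition before registering the task definition. The helper `with_nofile_ulimit`, the container name `envoy`, and the value 65535 are assumptions for illustration, not values taken from this issue.

```python
import json


def with_nofile_ulimit(container_def, soft, hard):
    """Return a copy of an ECS container definition with its nofile ulimit set,
    replacing any existing nofile entry and keeping other ulimits."""
    updated = dict(container_def)
    others = [
        u for u in container_def.get("ulimits", []) if u.get("name") != "nofile"
    ]
    updated["ulimits"] = others + [
        {"name": "nofile", "softLimit": soft, "hardLimit": hard}
    ]
    return updated


# Hypothetical container name and limit values.
envoy = with_nofile_ulimit({"name": "envoy"}, soft=65535, hard=65535)
print(json.dumps(envoy, indent=2))
```

The resulting JSON fragment belongs under `containerDefinitions` in the task definition; the right limit depends on the expected number of concurrent connections.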
Additional context