
Bug: Envoy Proxy segfault when nofile ulimit reached #181

Closed · nathanpeck opened this issue Mar 27, 2020 · 9 comments
Labels: Bug (Something isn't working)

nathanpeck (Member) commented Mar 27, 2020

Summary
When Envoy hits the nofile ulimit, it crashes with a segfault.

Steps to Reproduce
Run the Envoy proxy as a sidecar in an ECS task that has the default ulimits, with stats and DogStatsD turned on and X-Ray turned off. Send a large amount of concurrent traffic, so that there are enough open sockets to exhaust the available file descriptors.

Are you currently working around this issue?
Raise the ulimits to allow more file descriptors

Additional context

1585346044212,[2020-03-27 21:54:04.212][81][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][77][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][73][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,"[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Aborted, suspect faulting address 0x53900000001"
1585346044212,[2020-03-27 21:54:04.212][70][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][69][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][66][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][58][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][63][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,"[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x0"
1585346044212,[2020-03-27 21:54:04.212][83][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][79][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][61][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][55][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][54][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][75][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 01eac90b9c3ca11fb28aee8e2f6a39df16c87508/1.13.1/Clean/RELEASE/BoringSSL
1585346044212,[2020-03-27 21:54:04.212][59][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[symbolize_elf.inc : 951] RAW: /proc/self/task/1/maps: errno=24
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #0: [0x7f6bc97247e0]
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #1: [0x5585ae5cf4e5]
1585346044212,[2020-03-27 21:54:04.212][67][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 01eac90b9c3ca11fb28aee8e2f6a39df16c87508/1.13.1/Clean/RELEASE/BoringSSL
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #0: [0x7f6bc97247e0]
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #2: [0x5585ae5cd63b]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #3: [0x5585ae5cbece]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #4: [0x5585ae293cb6]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #5: [0x5585ae7b5435]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #1: [0x5585ae5cf4e5]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #2: [0x5585ae5cd63b]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #3: [0x5585ae5cbece]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #4: [0x5585ae293cb6]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #5: [0x5585ae7b5435]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #6: [0x7f6bc971a40b]
1585346044213,[symbolize_elf.inc : 951] RAW: /proc/self/task/1/maps: errno=24
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #6: [0x7f6bc971a40b]
nathanpeck added the Bug label on Mar 27, 2020
abaptiste commented

This is an explicit panic in Envoy when it fails to accept an incoming connection.

Is it possible for you to get the value of:

$ cat /proc/sys/fs/file-max

before your test begins?
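
As a side note, the per-process limit that the Envoy listener actually hits is usually much lower than the system-wide file-max. Assuming Envoy runs as PID 1 inside its own container (which is typical for the ECS sidecar), something like this should show it:

$ ulimit -n
$ grep 'open files' /proc/1/limits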

abaptiste commented

I did find a couple of docs that talk about specifying ulimits in task definitions.

Once we determine what the existing values are, we can try updating the task definitions and see whether the problem continues.
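
For illustration, the nofile limit can be set per container in the task definition (inside the Envoy container's entry under containerDefinitions); the 65536 values below are just an example, not a tuned recommendation:

"ulimits": [
    {
        "name": "nofile",
        "softLimit": 65536,
        "hardLimit": 65536
    }
]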

nathanpeck (Member, Author) commented Mar 30, 2020

The existing nofile ulimit was the default of 1024. I have already raised my ulimit to stop the crash.

However, I do not think it is right for Envoy to crash with a segfault when the limit is reached. For example, in NGINX, if this ulimit is reached you get a timeout or connection reset for that individual request and an error in the logs, but the server stays up and other requests are still handled (up to the limit). My expectation is that the software should handle this edge case gracefully and start dropping requests while still maintaining a baseline level of traffic, rather than blowing up completely and terminating.

I'm not sure whether this crash is coming from Envoy core or from an App Mesh-specific addition, though.

lavignes commented

Maybe we're running out of memory during the panic?

abaptiste commented

https://github.com/envoyproxy/envoy/blob/master/source/common/network/listener_impl.cc#L76-L80

This is a deliberate decision made in Envoy.

Until we decide whether or not to change that, what's the suggested recourse in this situation?

lavignes commented

Yeah, since PANIC calls abort(), it's kind of undefined behavior for the other threads. The main thread probably starts unwinding the stack and calling destructors (leading the other threads to potentially segfault).

nathanpeck (Member, Author) commented Mar 31, 2020

Okay, based on @mattklein123's "Crash Early and Crash Often" article on Medium, I'm guessing that this crash might be intended behavior in Envoy.

However, the crash seems poorly documented, and crashing like that is not typical behavior for AWS-related services, so App Mesh users are just going to think that there is something wrong with the service. I'd recommend we add a troubleshooting section to the App Mesh documentation that mentions this specific crash and includes instructions on how to raise ulimits on EC2, ECS, and EKS.

mattklein123 commented

@nathanpeck I will add a FAQ entry on this topic, as it comes up occasionally. The TL;DR is that this is not something we are going to change: gracefully handling FD exhaustion can lead to unexpected behavior, such as failing to handle health check (HC) connections, other admin tasks, etc. It's much better to control Envoy resource usage with all of the other knobs we have (circuit breakers, the overload manager, etc.). Anyway, I will add a FAQ entry.
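
For illustration only, this is the kind of per-cluster knob being referred to. The circuit_breakers fields below come from Envoy's cluster API, the values are arbitrary, and in App Mesh the Envoy configuration is generated by the control plane, so treat this as a sketch rather than something to paste into a bootstrap:

# circuit breaker thresholds on an upstream cluster definition
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024
    max_pending_requests: 1024
    max_requests: 1024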

nathanpeck (Member, Author) commented

Thanks Matt! That makes a lot of sense.
