
Bug: Envoy Proxy segfault when nofile ulimit reached #181

Closed · nathanpeck opened this issue Mar 27, 2020 · 9 comments
Labels: Bug (Something isn't working)

nathanpeck (Member) commented Mar 27, 2020

Summary
When Envoy hits the nofile ulimit, it crashes with a segfault.

Steps to Reproduce
Run the Envoy proxy as a sidecar in an ECS task that has the default ulimits, with stats and DogStatsD turned on and X-Ray turned off. Send a large amount of concurrent traffic, so that there are enough open sockets to exhaust the available file descriptors.

Are you currently working around this issue?
Raise the ulimits to allow more file descriptors

Additional context

1585346044212,[2020-03-27 21:54:04.212][81][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][77][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][73][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,"[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Aborted, suspect faulting address 0x53900000001"
1585346044212,[2020-03-27 21:54:04.212][70][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][69][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][66][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][58][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][63][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,"[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:83] Caught Segmentation fault, suspect faulting address 0x0"
1585346044212,[2020-03-27 21:54:04.212][83][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][79][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][61][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][55][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][54][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][75][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 01eac90b9c3ca11fb28aee8e2f6a39df16c87508/1.13.1/Clean/RELEASE/BoringSSL
1585346044212,[2020-03-27 21:54:04.212][59][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[symbolize_elf.inc : 951] RAW: /proc/self/task/1/maps: errno=24
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:70] Backtrace (use tools/stack_decode.py to get line numbers):
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #0: [0x7f6bc97247e0]
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #1: [0x5585ae5cf4e5]
1585346044212,[2020-03-27 21:54:04.212][67][critical][assert] [source/common/network/listener_impl.cc:79] panic: listener accept failure: Too many open files
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:71] Envoy version: 01eac90b9c3ca11fb28aee8e2f6a39df16c87508/1.13.1/Clean/RELEASE/BoringSSL
1585346044212,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #0: [0x7f6bc97247e0]
1585346044212,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #2: [0x5585ae5cd63b]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #3: [0x5585ae5cbece]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #4: [0x5585ae293cb6]
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #5: [0x5585ae7b5435]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #1: [0x5585ae5cf4e5]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #2: [0x5585ae5cd63b]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #3: [0x5585ae5cbece]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #4: [0x5585ae293cb6]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #5: [0x5585ae7b5435]
1585346044213,[2020-03-27 21:54:04.212][81][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #6: [0x7f6bc971a40b]
1585346044213,[symbolize_elf.inc : 951] RAW: /proc/self/task/1/maps: errno=24
1585346044213,[2020-03-27 21:54:04.212][77][critical][backtrace] [bazel-out/k8-opt/bin/source/server/_virtual_includes/backtrace_lib/server/backtrace.h:77] #6: [0x7f6bc971a40b]
nathanpeck added the Bug label on Mar 27, 2020
abaptiste commented

This is an explicit panic in Envoy when it fails to accept an incoming connection.

Is it possible for you to get the value of:

$ cat /proc/sys/fs/file-max

before your test begins?
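
As a side note, the per-process limit that the Envoy listener actually hits is usually much lower than the system-wide file-max. Assuming Envoy runs as PID 1 inside its own container (which is typical for the ECS sidecar), something like this should show it:

$ ulimit -n
$ grep 'open files' /proc/1/limits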

abaptiste commented

I did find a couple of docs that talk about specifying ulimits in task definitions.

Once we determine what the existing values are, we can try updating the task definitions and see whether the problem continues.
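
For illustration, the nofile limit can be set per container in the task definition (inside the Envoy container's entry under containerDefinitions); the 65536 values below are just an example, not a tuned recommendation:

"ulimits": [
    {
        "name": "nofile",
        "softLimit": 65536,
        "hardLimit": 65536
    }
]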

nathanpeck (Member, Author) commented Mar 30, 2020

The existing nofile ulimit was the default of 1024. I have already raised my ulimit to stop the crash.

However, I do not think it is right for Envoy to crash with a segfault when the limit is reached. For example, in NGINX, if this ulimit is reached you get a timeout or connection reset for that individual request and an error in the logs, but the server stays up and other requests are still handled (up to the limit). My expectation is that the software should handle this edge case gracefully and start dropping requests while still maintaining a baseline level of traffic, rather than blowing up completely and terminating.

I'm not sure whether this crash is coming from Envoy core or from an App Mesh-specific addition, though.

lavignes commented

Maybe we're running out of memory during the panic?

abaptiste commented

https://github.com/envoyproxy/envoy/blob/master/source/common/network/listener_impl.cc#L76-L80

This is a deliberate decision made in Envoy.

Until we decide whether or not to change that, what's the suggested recourse in this situation?

lavignes commented

Yeah, since PANIC calls abort(), it's kind of undefined behavior for the other threads. The main thread probably starts unwinding the stack and calling destructors (leading the other threads to potentially segfault).

nathanpeck (Member, Author) commented Mar 31, 2020

Okay, based on @mattklein123's "Crash Early and Crash Often" article on Medium, I'm guessing that this crash might be intended behavior in Envoy.

However, the crash seems poorly documented, and crashing like that is not typical behavior for AWS-related services, so App Mesh users are just going to think that there is something wrong with the service. I'd recommend we add a troubleshooting section to the App Mesh documentation that mentions this specific crash and includes instructions on how to raise ulimits on EC2, ECS, and EKS.

mattklein123 commented

@nathanpeck I will add a FAQ entry on this topic, as it comes up occasionally. The TL;DR is that this is not something we are going to change: gracefully handling FD exhaustion can lead to unexpected behavior, such as failing to handle health check (HC) connections, other admin tasks, etc. It's much better to control Envoy resource usage with all of the other knobs we have (circuit breakers, the overload manager, etc.). Anyway, I will add a FAQ entry.
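
For illustration only, this is the kind of per-cluster knob being referred to. The circuit_breakers fields below come from Envoy's cluster API, the values are arbitrary, and in App Mesh the Envoy configuration is generated by the control plane, so treat this as a sketch rather than something to paste into a bootstrap:

# circuit breaker thresholds on an upstream cluster definition
circuit_breakers:
  thresholds:
  - priority: DEFAULT
    max_connections: 1024
    max_pending_requests: 1024
    max_requests: 1024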

nathanpeck (Member, Author) commented

Thanks Matt! That makes a lot of sense.
