
segfault due to opentracing #2222

Closed
trnl opened this issue Mar 20, 2018 · 15 comments · Fixed by #2676
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@trnl
Contributor

trnl commented Mar 20, 2018

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):
no

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):
nginx, opentracing, zipkin, segfault

Is this a BUG REPORT or FEATURE REQUEST? (choose one): BUG REPORT

NGINX Ingress controller version: 0.10.2

Kubernetes version (use kubectl version):

> k version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.6", GitCommit:"4bc5e7f9a6c25dc4c03d4d656f2cefd21540e28c", GitTreeState:"clean", BuildDate:"2017-09-15T22:14:38Z", GoVersion:"go1.9", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.12", GitCommit:"3bda299a6414b4866f179921610d6738206a18fe", GitTreeState:"clean", BuildDate:"2017-12-29T08:39:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Cloud provider or hardware configuration: 8 CPU, 32GB RAM, VMWare vRealize
  • OS (e.g. from /etc/os-release):
> cat  /etc/os-release
NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.3"
  • Kernel (e.g. uname -a):
> uname -a
Linux ... 3.10.0-693.11.1.el7.x86_64 #1 SMP Fri Oct 27 05:39:05 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    Docker, Kubernetes
  • Others:
    nginx-ingress resource limits
          requests:
            cpu: 500m
            memory: 1Gi
          limits:
            cpu: 1000m
            memory: 2Gi

nginx-ingress configmap values

  enable-opentracing: "true"
  zipkin-collector-host: zipkin.tools
  zipkin-collector-port: '80'
  zipkin-service-name: ingress-controller
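
For context, these keys are set through the controller's ConfigMap. A minimal sketch of the full manifest, assuming a ConfigMap named nginx-configuration in the ingress-nginx namespace (both names are deployment-specific assumptions; the data values are the ones above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration    # assumed name, match your deployment
  namespace: ingress-nginx     # assumed namespace
data:
  enable-opentracing: "true"
  zipkin-collector-host: zipkin.tools
  zipkin-collector-port: "80"
  zipkin-service-name: ingress-controller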

What happened:
worker process exiting due to segfault

What you expected to happen:
no segfaults, no worker restarts

How to reproduce it (as minimally and precisely as possible):
It happens roughly every 2 seconds:

{"log":"2018/03/20 18:34:54 [alert] 23#23: worker process 11075 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:54.142872722Z"}
{"log":"2018/03/20 18:34:54 [alert] 23#23: worker process 11109 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:54.517182591Z"}
{"log":"2018/03/20 18:34:57 [alert] 23#23: worker process 11143 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:57.923391335Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11178 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.194986234Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11212 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.557907914Z"}
{"log":"2018/03/20 18:34:58 [alert] 23#23: worker process 11246 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:58.816366263Z"}
{"log":"2018/03/20 18:34:59 [alert] 23#23: worker process 11280 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:34:59.19434845Z"}
{"log":"2018/03/20 18:35:02 [alert] 23#23: worker process 11314 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:02.943432758Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11349 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.206377447Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11383 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.56327926Z"}
{"log":"2018/03/20 18:35:03 [alert] 23#23: worker process 11417 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:03.81832701Z"}
{"log":"2018/03/20 18:35:08 [alert] 23#23: worker process 11451 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:08.92230984Z"}
{"log":"2018/03/20 18:35:08 [alert] 23#23: worker process 11486 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:08.979320448Z"}
{"log":"2018/03/20 18:35:13 [alert] 23#23: worker process 11520 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:13.231301619Z"}
{"log":"2018/03/20 18:35:13 [alert] 23#23: worker process 11555 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:13.791207004Z"}
{"log":"2018/03/20 18:35:14 [alert] 23#23: worker process 11589 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:14.156415018Z"}
{"log":"2018/03/20 18:35:14 [alert] 23#23: worker process 11623 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:14.5254932Z"}
{"log":"2018/03/20 18:35:18 [alert] 23#23: worker process 11657 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:18.272065288Z"}
{"log":"2018/03/20 18:35:18 [alert] 23#23: worker process 11692 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:18.53148589Z"}
{"log":"2018/03/20 18:35:19 [alert] 23#23: worker process 11726 exited on signal 11 (core dumped)\n","stream":"stderr","time":"2018-03-20T18:35:19.517368566Z"}

Anything else we need to know:

> tail -f nginx-ingress-controller-3228907275-p158b_kube-system_nginx-ingress-controller-45af43a95ce0ff5989dfe5783d5457214952edc83dc5de1e6745b436c7c5e23b.log | grep 'opentracing' -C3
2018/03/20 17:51:09 [debug] 23719#23719: *680808 http script set $service_name
2018/03/20 17:51:09 [debug] 23719#23719: *680808 post rewrite phase: 4
2018/03/20 17:51:09 [debug] 23719#23719: *680808 generic phase: 5
2018/03/20 17:51:09 [debug] 23719#23719: *680808 extraced opentracing span context from request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 starting opentracing request span for 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 starting opentracing location span for \"/\"(00007FE651FDDD00) in request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 injecting opentracing span context from request 00007FE65CCFD050
2018/03/20 17:51:09 [debug] 23719#23719: *680808 adding opentracing header \"x-b3-traceid:09e429b7d83c3090\" in request 00007FE65CCFD050
2018/03/20 17:51:09 [alert] 23#23: worker process 23719 exited on signal 11 (core dumped)
2018/03/20 17:51:09 [debug] 23753#23753: epoll add event: fd:13 op:1 ev:00002001
2018/03/20 17:51:09 [debug] 23753#23753: epoll add event: fd:14 op:1 ev:00002001

Analysis of the core dump shows that the issue is in the opentracing insert_header block:

#0  0x0000561139f929ec in ngx_list_push ()
#1  0x00007f7a1adbb6ab in ngx_opentracing::inject_span_context(opentracing::v1::Tracer const&, ngx_http_request_s*, opentracing::v1::SpanContext const&) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#2  0x00007f7a1adbcf81 in ngx_opentracing::OpenTracingRequestInstrumentor::OpenTracingRequestInstrumentor(ngx_http_request_s*, ngx_http_core_loc_conf_s*, ngx_opentracing::opentracing_loc_conf_t*) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#3  0x00007f7a1adbdce6 in ngx_opentracing::on_enter_block(ngx_http_request_s*) () from /etc/nginx/modules/ngx_http_opentracing_module.so
#4  0x000056113a097a33 in ngx_http_core_generic_phase ()
#5  0x000056113a093645 in ngx_http_core_run_phases ()
#6  0x000056113a08f03e in ngx_http_process_request ()
#7  0x000056113a09068c in ngx_http_process_request_line ()
#8  0x000056113a0a6be1 in ngx_epoll_process_events.lto_priv ()
#9  0x000056113a0c45df in ngx_process_events_and_timers ()
#10 0x000056113a0a8405 in ngx_worker_process_cycle ()
#11 0x000056113a0be07f in ngx_spawn_process ()
#12 0x000056113a0a9715 in ngx_master_process_cycle ()
#13 0x0000561139f8a654 in main ()
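
For readers not familiar with the nginx internals in that trace: ngx_list_push is how a header entry is appended to r->headers_in.headers. A minimal sketch of that general pattern in the stock nginx C API (illustrative only; this is not the module's actual source, and the helper name is made up):

#include <ngx_config.h>
#include <ngx_core.h>
#include <ngx_http.h>

/* Hypothetical helper showing the header-injection pattern implied by the
 * backtrace: a new ngx_table_elt_t is pushed onto r->headers_in.headers
 * (the ngx_list_push call is where the segfault is reported). */
static ngx_int_t
add_request_header(ngx_http_request_t *r, ngx_str_t key, ngx_str_t value)
{
    ngx_table_elt_t *h;

    h = ngx_list_push(&r->headers_in.headers);
    if (h == NULL) {
        return NGX_ERROR;
    }

    h->hash = 1;
    h->key = key;
    h->value = value;
    h->lowcase_key = key.data;  /* assumes the key is already lowercase */

    return NGX_OK;
}
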
@aledbf
Member

aledbf commented Mar 20, 2018

@rnburn can you help with this?

@aledbf added the kind/bug label Mar 20, 2018
@rnburn
Contributor

rnburn commented Mar 20, 2018

Yep. Will take a look.

@aledbf
Member

aledbf commented Mar 20, 2018

@trnl thank you for the detailed report.

@trnl
Contributor Author

trnl commented Mar 20, 2018

On another cluster I can't reproduce it, and the only difference I noticed is the kernel version:

> uname -a
Linux ... 3.10.0-693.2.2.el7.x86_64 #1 SMP Sat Sep 9 03:55:24 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

@trnl
Contributor Author

trnl commented Mar 24, 2018

@rnburn, is there any additional information I can provide?

@rnburn
Contributor

rnburn commented Mar 26, 2018

@trnl - The information is good. I can tell it's crashing here. I haven't had a chance to investigate further yet, but I'll try to fix it this week.

@rnburn
Contributor

rnburn commented Mar 28, 2018

I looked into this and I couldn't find any nginx documentation that says you can't modify headers_in like this line of the code does, but I inquired on nginx-devel to see if there's anything I'm missing.

A smaller, isolated example that reproduces the problem, or a core dump, could help narrow it down.

@aledbf
Member

aledbf commented Mar 28, 2018

@aledbf
Member

aledbf commented Apr 13, 2018

@rnburn any update on this?

@rnburn
Contributor

rnburn commented Apr 13, 2018

Hey @aledbf - I have plans to put in a fix based off of this answer I got on the nginx mailing list:
http://mailman.nginx.org/pipermail/nginx-devel/2018-March/011008.html

@rnburn
Contributor

rnburn commented May 17, 2018

@aledbf - I put in opentracing-contrib/nginx-opentracing#37 with a fix for this.

The fix required some reorganization of the module. I removed the vendor-specific modules in favor of loading the tracers dynamically with opentracing_load_tracer from a JSON representation of their configuration.

I can put in a PR to update ingress-nginx to do this.

I think this approach should work better for you, since you can use an embedded YAML configuration for the tracer without needing any code to handle each tracer's specific options.
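
For illustration, the opentracing_load_tracer directive mentioned above points nginx at a tracer plugin and a JSON config file. A rough sketch using the zipkin settings from this issue (the plugin path, config file path, and key names are assumptions, not confirmed in this thread):

# nginx.conf, http block; both paths are assumptions
opentracing_load_tracer /usr/local/lib/libzipkin_opentracing_plugin.so /etc/nginx/zipkin-config.json;

# /etc/nginx/zipkin-config.json; key names assumed, values from the ConfigMap above
{
  "service_name": "ingress-controller",
  "collector_host": "zipkin.tools",
  "collector_port": 80
}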

@aledbf
Member

aledbf commented May 17, 2018

@rnburn thank you for the update, the new json file makes sense.

> I can put in a PR to update ingress-nginx to do this.

Sure. First let me update the nginx image (I am updating the modules) (#2537)

@aledbf
Member

aledbf commented Jun 21, 2018

@rnburn what's the format of the new JSON configuration for zipkin?

@rnburn
Contributor

rnburn commented Jun 21, 2018

@aledbf
Member

aledbf commented Jun 21, 2018

@trnl can you test quay.io/aledbf/nginx-ingress-controller:0.382 ?
It contains the fix for the bug you reported.
Thanks in advance
