Envoy crash when testing HTTP/3 upgrade #18160

Closed
YaoZengzeng opened this issue Sep 17, 2021 · 7 comments · Fixed by #18694
Labels: area/quic, bug, quic-mvp (Required for QUIC MVP)


@YaoZengzeng
Member

I tried to load test the HTTP/3 upgrade path between two Envoy proxies. The deployment model is:

fortio --http1--> client side envoy proxy --http3--> server side envoy proxy --http1--> envoy as application

The config of the client-side Envoy proxy is:

admin:
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 9902
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10001
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  host_rewrite_literal: domain1.example.com
                  cluster: service_google
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: service_google
    connect_timeout: 30s
    type: LOGICAL_DNS
    # Comment out the following line to test on v6 networks
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: service_google
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: 127.0.0.1
                port_value: 10000
    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http3_protocol_options: {}
        common_http_protocol_options:
          idle_timeout: 1s
    transport_socket:
      name: envoy.transport_sockets.quic
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.quic.v3.QuicUpstreamTransport
        upstream_tls_context:
          sni: proxy-postgres-backend.example.com
          common_tls_context:
            validation_context:
              match_subject_alt_names:
              - exact: proxy-postgres-backend.example.com
              trusted_ca:
                filename: certs/cacert.pem

The config of the server-side Envoy proxy is:

admin:
  address:
    socket_address:
      protocol: TCP
      address: 0.0.0.0
      port_value: 9901
static_resources:
  listeners:
  - name: listener_tcp
    address:
      socket_address:
        protocol: TCP
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.tls
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:
            - certificate_chain:
                filename: certs/servercert.pem
              private_key:
                filename: certs/serverkey.pem
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: HTTP2
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              response_headers_to_add:
              - header:
                  key: alt-svc
                  value: h3=":10000"; ma=86400, h3-29=":10000"; ma=86400
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  host_rewrite_literal: www.envoyproxy.io
                  cluster: service_envoyproxy_io
          http3_protocol_options:
          http_filters:
          - name: envoy.filters.http.router

  - name: listener_udp
    address:
      socket_address:
        protocol: UDP
        address: 0.0.0.0
        port_value: 10000
    udp_listener_config:
      quic_options: {}
      downstream_socket_config:
        prefer_gro: true
    filter_chains:
    - transport_socket:
        name: envoy.transport_sockets.quic
        typed_config:
          '@type': type.googleapis.com/envoy.extensions.transport_sockets.quic.v3.QuicDownstreamTransport
          downstream_tls_context:
            common_tls_context:
              tls_certificates:
              - certificate_chain:
                  filename: certs/servercert.pem
                private_key:
                  filename: certs/serverkey.pem
      filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          codec_type: HTTP3
          stat_prefix: ingress_http
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  #host_rewrite_literal: www.google.com
                  cluster: service_envoyproxy_io
          http3_protocol_options:
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: service_envoyproxy_io
    connect_timeout: 30s
    type: LOGICAL_DNS
    # Comment out the following line to test on v6 networks
    dns_lookup_family: V4_ONLY
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: service_envoyproxy_io
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                #address: www.google.com
                address: 127.0.0.1
                port_value: 8080
    #transport_socket:
    #  name: envoy.transport_sockets.tls
    #  typed_config:
    #    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
    #    sni: www.google.com

The Envoy version is 1.20, running with --concurrency 1.

If I test with the following command (10 connections, no QPS limit):

fortio load -qps -1 -c 10 -t 10s --timeout 120s http://127.0.0.1:10001

Envoy doesn't crash, but the server-side proxy uses almost a full core while the client side uses only about 0.1 of a core, and the resulting QPS is poor.

But when testing with 100 connections, the client-side Envoy crashes.
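That is, the same fortio invocation with the connection count raised to 100; the exact command is not shown in the report above, so this reconstruction is presumed:

fortio load -qps -1 -c 100 -t 10s --timeout 120s http://127.0.0.1:10001

The client-side log at the time of the crash is as follows: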

[2021-09-17 09:52:05.915][16036][info][main] [source/server/server.cc:803] all clusters initialized. initializing init manager
[2021-09-17 09:52:05.915][16036][info][config] [source/server/listener_manager_impl.cc:779] all dependencies initialized. starting workers
[2021-09-17 09:52:05.916][16036][info][main] [source/server/server.cc:822] starting main dispatch loop
[2021-09-17 09:52:11.291][16043][info][quic] [bazel-out/k8-opt/bin/external/com_github_google_quiche/quiche/quic/core/tls_client_handshaker.cc:463] Client: handshake finished
[2021-09-17 09:52:15.917][16036][info][main] [source/server/drain_manager_impl.cc:171] shutting down parent after drain
[2021-09-17 09:52:29.290][16043][info][quic] [bazel-out/k8-opt/bin/external/com_github_google_quiche/quiche/quic/core/tls_client_handshaker.cc:463] Client: handshake finished
[2021-09-17 09:53:59.468][16043][info][quic] [bazel-out/k8-opt/bin/external/com_github_google_quiche/quiche/quic/core/tls_client_handshaker.cc:463] Client: handshake finished
[2021-09-17 09:53:59.719][16043][critical][backtrace] [./source/server/backtrace.h:104] Caught Segmentation fault, suspect faulting address 0x0
[2021-09-17 09:53:59.719][16043][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-09-17 09:53:59.719][16043][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: cc1d41e7ee9fbfb7ee3c8f73724cdc41d7c6bbb0/1.20.0-dev/Clean/RELEASE/BoringSSL
[2021-09-17 09:53:59.720][16043][critical][backtrace] [./source/server/backtrace.h:96] #0: __restore_rt [0x7f7e471c6980]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #1: [0x55c355aa1247]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #2: [0x55c355a445b3]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #3: [0x55c355a47fcc]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #4: [0x55c355a48e34]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #5: [0x55c355a442cf]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #6: [0x55c355c3b9d6]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #7: [0x55c355c41864]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #8: [0x55c355b8496e]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #9: [0x55c355b764a5]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #10: [0x55c355bba357]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #11: [0x55c355bb7af7]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #12: [0x55c355bc017f]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #13: [0x55c355eef897]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #14: [0x55c355bbfbaf]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #15: [0x55c355bb5f44]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #16: [0x55c355bb56df]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #17: [0x55c355bb6185]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #18: [0x55c355b7203c]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #19: [0x55c355e0b6ff]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #20: [0x55c355e04afa]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #21: [0x55c355e02649]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #22: [0x55c355df39b1]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #23: [0x55c355df4c5c]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #24: [0x55c355efbb78]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #25: [0x55c355efa571]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #26: [0x55c35581d362]
[2021-09-17 09:53:59.762][16043][critical][backtrace] [./source/server/backtrace.h:98] #27: [0x55c35613aa73]
[2021-09-17 09:53:59.763][16043][critical][backtrace] [./source/server/backtrace.h:96] #28: start_thread [0x7f7e471bb6db]
ActiveStream 0x5e05bf517e00, stream_id_: 9246261070474838081&filter_manager_: 
  FilterManager 0x5e05bf517e78, state_.has_continue_headers_: 0
  filter_manager_callbacks_.requestHeaders(): 
    ':authority', 'domain1.example.com'
    ':path', '/'
    ':method', 'GET'
    ':scheme', 'http'
    'user-agent', 'fortio.org/fortio-dev'
    'x-forwarded-proto', 'http'
    'x-request-id', '00ce776b-f1ba-43d4-88c8-e10942fd7e69'
    'x-envoy-expected-rq-timeout-ms', '15000'
  filter_manager_callbacks_.requestTrailers():   null
  filter_manager_callbacks_.responseHeaders():   null
  filter_manager_callbacks_.responseTrailers():   null
  &stream_info_: 
    StreamInfoImpl 0x5e05bf517f78, upstream_connection_id_: null, protocol_: 1, response_code_: null, response_code_details_: null, attempt_count_: 1, health_check_request_: 0, route_name_: 
    OverridableRemoteConnectionInfoSetterStreamInfo 0x5e05bf517f78, remoteAddress(): 127.0.0.1:37496, directRemoteAddress(): 127.0.0.1:37496, localAddress(): 127.0.0.1:10001
Http1::ConnectionImpl 0x5e05bf44d508, dispatching_: 1, dispatching_slice_already_drained_: 0, reset_stream_called_: 0, handling_upgrade_: 0, deferred_end_stream_headers_: 1, require_strict_1xx_and_204_headers_: 1, send_strict_1xx_and_204_headers_: 1, processing_trailers_: 0, no_chunked_encoding_header_for_304_: 1, buffered_body_.length(): 0, header_parsing_state_: Done, current_header_field_: , current_header_value_: 
active_request_: 
, request_url_: null, response_encoder_.local_end_stream_: 0
absl::get<RequestHeaderMapPtr>(headers_or_trailers_): null
current_dispatching_buffer_ front_slice length: 76 contents: "GET / HTTP/1.1\r\nHost: 127.0.0.1:10001\r\nUser-Agent: fortio.org/fortio-dev\r\n\r\n"
ConnectionImpl 0x5e05bf465140, connecting_: 0, bind_error_: 0, state(): Open, read_buffer_limit_: 1048576
socket_: 
  ListenSocketImpl 0x5e05bf7a8900, transport_protocol_: raw_buffer
  connection_info_provider_: 
    ConnectionInfoSetterImpl 0x5e05bf735b18, remote_address_: 127.0.0.1:37496, direct_remote_address_: 127.0.0.1:37496, local_address_: 127.0.0.1:10001, server_name_: 
Segmentation fault (core dumped)

Any ideas? cc @alyssawilk :)

YaoZengzeng added the bug and triage (Issue requires triage) labels on Sep 17, 2021
@alyssawilk
Contributor

Just a reminder: while QUIC is in alpha we're working hard to get it to GA, so please do check in with envoy-security before posting crash reports publicly :-)

That said, thanks for the report. Any chance you can get symbols on that stack trace? cc @danzh2010

alyssawilk added the area/quic and quic-mvp (Required for QUIC MVP) labels and removed the triage label on Sep 20, 2021
@alyssawilk
Contributor

Actually, I'm going to pull this from the MVP list: the connect config is separately tagged alpha, so I think QUIC in general can GA without connect support being trusted in prod.

alyssawilk removed the quic-mvp (Required for QUIC MVP) label on Sep 21, 2021
@alyssawilk
Contributor

Oops, redacting: this isn't an upgrade-specific issue, I misread.

alyssawilk added the quic-mvp (Required for QUIC MVP) label on Sep 21, 2021
@YaoZengzeng
Member Author

[2021-10-07 15:43:59.004][4210][critical][assert] [source/common/quic/codec_impl.cc:89] assert failure: stream != nullptr. Details: Fail to create QUIC stream.
[2021-10-07 15:43:59.004][4210][critical][backtrace] [./source/server/backtrace.h:104] Caught Aborted, suspect faulting address 0x106b
[2021-10-07 15:43:59.004][4210][critical][backtrace] [./source/server/backtrace.h:91] Backtrace (use tools/stack_decode.py to get line numbers):
[2021-10-07 15:43:59.004][4210][critical][backtrace] [./source/server/backtrace.h:92] Envoy version: f90dd0b5525ae3ac6e5af320c353ff3135f4ff06/1.20.0-dev/Clean/DEBUG/BoringSSL
[2021-10-07 15:43:59.023][4210][critical][backtrace] [./source/server/backtrace.h:96] #0: Envoy::SignalAction::sigHandler() [0x55efb4fcae29]
[2021-10-07 15:43:59.023][4210][critical][backtrace] [./source/server/backtrace.h:96] #1: __restore_rt [0x7fa8a05a4980]
[2021-10-07 15:43:59.040][4210][critical][backtrace] [./source/server/backtrace.h:96] #2: Envoy::Http::CodecClient::newStream() [0x55efb4258e14]
[2021-10-07 15:43:59.057][4210][critical][backtrace] [./source/server/backtrace.h:96] #3: Envoy::Http::MultiplexedActiveClientBase::newStreamEncoder() [0x55efb4148add]
[2021-10-07 15:43:59.075][4210][critical][backtrace] [./source/server/backtrace.h:96] #4: Envoy::Http::HttpConnPoolImplBase::onPoolReady() [0x55efb41471ec]
[2021-10-07 15:43:59.092][4210][critical][backtrace] [./source/server/backtrace.h:96] #5: Envoy::ConnectionPool::ConnPoolImplBase::attachStreamToClient() [0x55efb415051e]
[2021-10-07 15:43:59.109][4210][critical][backtrace] [./source/server/backtrace.h:96] #6: Envoy::ConnectionPool::ConnPoolImplBase::newStreamImpl() [0x55efb4152812]
[2021-10-07 15:43:59.127][4210][critical][backtrace] [./source/server/backtrace.h:96] #7: Envoy::Http::HttpConnPoolImplBase::newStream() [0x55efb4146d6e]

From the core dump and the corresponding code, the cause of this issue is that Envoy does not handle QUIC stream creation failure gracefully. It's a known issue, so I'm closing this.

By the way, I find that HTTP/3 support is still under heavy development and there are still many things to do. I'd like to make some contributions in my spare time. Are there any suggestions on where to start, @alyssawilk? Thanks :)

alyssawilk reopened this on Oct 7, 2021
@alyssawilk
Contributor

cc @danzh2010 this is the same issue Chidera hit. I think we need to fix this one.
I'll try to pick it up next week.

alyssawilk self-assigned this on Oct 7, 2021
@danzh2010
Contributor

danzh2010 commented Oct 7, 2021

Thanks for the core dump, @YaoZengzeng! And welcome to contributing to Envoy HTTP/3. The issues currently in the MVP list already have owners, but if you would like to work on new features or fix bugs, feel free to create an issue for discussion first and then create PRs following the general Envoy contribution guide. You can refer to the QUIC dev doc for a design and implementation overview. And feel free to join the envoy-udp-quic-dev channel (channel ID: C9UGU858E) on Slack for announcements and discussion.

@YaoZengzeng
Member Author

@danzh2010 Thanks for your advice :)

alyssawilk added a commit that referenced this issue Oct 20, 2021
This is a 99% fix to the too-many-streams crash: we weren't setting the 100-stream limit in quiche, and the default of 0 on the Envoy side meant "allow thousands of streams", which resulted in stream creation failure and a null pointer dereference.
Unfortunately, even with the stream limits aligned, we still fail a check in quiche where the stream count in Envoy and the stream count in QUICHE don't match up. To overcome this I'm overriding ShouldCreateOutgoingBidirectionalStream, which turns a fatal crash into a QUIC_BUG, which seems preferable. Added a TODO about sorting out the underlying issue entirely.

Risk Level: low
Testing: new integration test
Part of #18160

Signed-off-by: Alyssa Wilk <[email protected]>
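
The commit above aligns Envoy's internal default with quiche's 100-stream limit. For illustration only, a similar per-connection cap can also be expressed directly in the upstream cluster's HTTP/3 options; a minimal sketch of the client-side cluster fragment, assuming the quic_protocol_options and max_concurrent_streams fields are available in the Envoy build in use:

    typed_extension_protocol_options:
      envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
        "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
        explicit_http_config:
          http3_protocol_options:
            quic_protocol_options:
              # Assumed field: explicitly cap concurrent streams per QUIC connection so the
              # Envoy-side limit matches what QUICHE enforces, rather than relying on the default.
              max_concurrent_streams: 100

With such a cap in place, requests beyond the limit should be queued or spread across additional connections by the pool rather than hitting stream creation failure.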
alyssawilk added a commit that referenced this issue Oct 26, 2021
…8694)

Actually fixing a QUIC stream limit issue. Also fixing an unrelated bug where clean stream shutdown occasionally caused spurious stream-close writes to an already-closed connection.

Risk Level: High (changing connection pool limits)
Testing: new integration test
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a
Fixes #18160

Signed-off-by: Alyssa Wilk <[email protected]>