Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTTP health checker does not gracefully handle HTTP2 server shutdown (GOAWAY with NO_ERROR) #13544

Closed
mpuncel opened this issue Oct 13, 2020 · 0 comments · Fixed by #13599
Closed

Comments

@mpuncel
Copy link
Contributor

mpuncel commented Oct 13, 2020

Title: HTTP health checker does not gracefully handle HTTP2 server shutdown (GOAWAY with NO_ERROR)

Description:
The HTTP health checker does not gracefully handle when an HTTP2 server performs the graceful drain sequence. GOAWAY is ignored (no callbacks registered) and then health checks on the old connection fail with REFUSED_STREAM until the connection is forcibly closed by the upstream, at which point the health checker reconnects.

Repro steps:
I've reproduced with this Envoy bootstrap config:

admin:
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8082
  access_log_path: '/tmp/envoytesting/admin.log'

dynamic_resources:
  lds_config:
    path: /tmp/envoytesting/lds.yaml

static_resources:
  clusters:
  - name: self_health_checked
    connect_timeout: 5s
    type: STATIC
    lb_policy: RANDOM
    http2_protocol_options: {}
    health_checks:
    - interval: "1s"
      no_traffic_interval: 1s
      timeout: 5s
      unhealthy_threshold: 1
      healthy_threshold: 1
      http_health_check:
        path: /_status
        codec_client_type: HTTP2
      always_log_health_check_failures: true
      event_log_path: /tmp/envoytesting/healthcheck.log
    load_assignment:
      cluster_name: self_health_checked
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 127.0.0.1, port_value: 8081 }
    http2_protocol_options: {}
node:
  id: foo
  cluster: foo

And this in /tmp/envoytesting/lds.yaml

version_info: "http_stats2"
type_url: type.googleapis.com/envoy.config.listener.v3.Listener
resources:
- "@type": type.googleapis.com/envoy.config.listener.v3.Listener
  name: http_listener
  address:
    socket_address:
      address: 127.0.0.1
      port_value: 8081
  filter_chains:
  - filters:
    - name: envoy.filters.network.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: http_stats2
        http_filters:
        - name: health
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
            pass_through_mode: false
            headers:
              name: :path
              exact_match: /_status
        - name: router
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
        route_config:
          virtual_hosts:
          - name: "only_host"
            domains: ["*"]
            routes:
            - match:
                prefix: "/"
              direct_response:
                status: 200
                body:
                  inline_string: "direct response"

Triggering an lds update by updating something like the stat prefix and mv over lds.yaml causes health checks to fail.

Logs from running Envoy with --component-log-level connection:debug,http2:debug,client:debug,hc:debug:


[2020-10-13 15:24:20.877][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:295] [C1] hc response=200 health_flags=healthy
[2020-10-13 15:24:20.877][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 0
[2020-10-13 15:24:21.229][69][info][upstream] [external/envoy/source/server/lds_api.cc:79] lds: add/update listener 'http_listener'
[2020-10-13 15:24:21.877][77][debug][http2] [external/envoy/source/common/http/http2/codec_impl_legacy.cc:813] [C2] sent goaway code=0
[2020-10-13 15:24:21.877][77][debug][http2] [external/envoy/source/common/http/http2/codec_impl_legacy.cc:933] [C2] stream closed: 0
[2020-10-13 15:24:21.877][69][debug][client] [external/envoy/source/common/http/codec_client.cc:109] [C1] response complete
[2020-10-13 15:24:21.877][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:295] [C1] hc response=200 health_flags=healthy
[2020-10-13 15:24:21.877][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 0
[2020-10-13 15:24:22.878][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:22.878][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:22.878][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=healthy
[2020-10-13 15:24:23.879][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:23.880][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:23.880][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=/failed_active_hc
[2020-10-13 15:24:24.880][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:24.881][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:24.881][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=/failed_active_hc
[2020-10-13 15:24:25.883][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:25.883][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:25.883][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=/failed_active_hc
[2020-10-13 15:24:26.879][77][debug][http2] [external/envoy/source/common/http/http2/codec_impl_legacy.cc:813] [C2] sent goaway code=0
[2020-10-13 15:24:26.880][77][debug][connection] [external/envoy/source/common/network/connection_impl.cc:106] [C2] closing data_to_write=17 type=2
[2020-10-13 15:24:26.880][77][debug][connection] [external/envoy/source/common/network/connection_impl_base.cc:40] [C2] setting delayed close timer with timeout 1000 ms
[2020-10-13 15:24:26.880][77][debug][connection] [external/envoy/source/common/network/connection_impl.cc:641] [C2] write flush complete
[2020-10-13 15:24:26.882][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:26.882][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:26.882][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=/failed_active_hc
[2020-10-13 15:24:27.884][77][debug][connection] [external/envoy/source/common/network/connection_impl_base.cc:54] [C2] triggered delayed close
[2020-10-13 15:24:27.884][77][debug][connection] [external/envoy/source/common/network/connection_impl.cc:202] [C2] closing socket: 1
[2020-10-13 15:24:27.884][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:960] [C1] stream closed: 7
[2020-10-13 15:24:27.884][69][debug][client] [external/envoy/source/common/http/codec_client.cc:119] [C1] request reset
[2020-10-13 15:24:27.885][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:287] [C1] connection/stream error health_flags=/failed_active_hc
[2020-10-13 15:24:27.885][69][debug][connection] [external/envoy/source/common/network/connection_impl.cc:579] [C1] remote close
[2020-10-13 15:24:27.885][69][debug][connection] [external/envoy/source/common/network/connection_impl.cc:202] [C1] closing socket: 0
[2020-10-13 15:24:27.885][69][debug][client] [external/envoy/source/common/http/codec_client.cc:95] [C1] disconnect. resetting 0 pending requests
[2020-10-13 15:24:28.887][69][debug][client] [external/envoy/source/common/http/codec_client.cc:39] [C3] connecting
[2020-10-13 15:24:28.887][69][debug][connection] [external/envoy/source/common/network/connection_impl.cc:755] [C3] connecting to 127.0.0.1:8081
[2020-10-13 15:24:28.887][69][debug][connection] [external/envoy/source/common/network/connection_impl.cc:771] [C3] connection in progress
[2020-10-13 15:24:28.887][69][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:1168] [C3] updating connection-level initial window size to 268435456
[2020-10-13 15:24:28.888][69][debug][connection] [external/envoy/source/common/network/connection_impl.cc:611] [C3] connected
[2020-10-13 15:24:28.888][69][debug][client] [external/envoy/source/common/http/codec_client.cc:77] [C3] connected
[2020-10-13 15:24:28.888][77][debug][http2] [external/envoy/source/common/http/http2/codec_impl_legacy.cc:1139] [C4] updating connection-level initial window size to 268435456
[2020-10-13 15:24:28.888][77][debug][http2] [external/envoy/source/common/http/http2/codec_impl_legacy.cc:933] [C4] stream closed: 0
[2020-10-13 15:24:28.889][69][debug][client] [external/envoy/source/common/http/codec_client.cc:109] [C3] response complete
[2020-10-13 15:24:28.889][69][debug][hc] [external/envoy/source/common/upstream/health_checker_impl.cc:295] [C3] hc response=200 health_flags=/failed_active_hc

You can see this sequence:

  1. health checks are succeeding
  2. the server says sent goaway code=0
  3. the in-flight health check succeeds one last time
  4. a sequence of "stream closed: 7" (7 is REFUSED_STREAM) causing health checks to fail
  5. server says closing socket: 1, client says remote close
  6. client reconnects and health checks start succeeding again.

I'm working on a PR to make the http health checker handle GOAWAY similarly to how the gRPC one does, and will test with my same local repro.

@mpuncel mpuncel added bug triage Issue requires triage labels Oct 13, 2020
@mpuncel mpuncel changed the title HTTP health checker does not gracefully handle server shutdown (GOAWAY with NO_ERROR) HTTP health checker does not gracefully handle HTTP2 server shutdown (GOAWAY with NO_ERROR) Oct 13, 2020
@htuch htuch added area/health_checking area/http and removed triage Issue requires triage labels Oct 14, 2020
@mattklein123 mattklein123 added the help wanted Needs help! label Oct 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants