
bug: error_log reports errors when APISIX connects to etcd via gRPC #9336

Closed
hansedong opened this issue Apr 19, 2023 · 33 comments
Labels: bug (Something isn't working)

Comments

@hansedong
Contributor

Current Behavior

I am using APISIX version 3.2.0.
The data plane of APISIX connects to etcd through gRPC.
At first everything ran well, but after a period of time (around 8 hours), the APISIX error log continuously showed the following error message:

2023/04/19 10:27:56 [error] 25793#25793: *5597441 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/upstreams, context: ngx.timer
2023/04/19 10:27:56 [error] 25793#25793: *5597452 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/routes, context: ngx.timer
2023/04/19 10:27:56 [error] 25788#25788: *5597471 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/upstreams, context: ngx.timer

After I restarted APISIX, the error messages in the log disappeared. So far I have encountered this problem twice, and each time it was only resolved by restarting APISIX.
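
As a side note on what this class of error means: in Lua, a message like "attempt to index local 'item' (a boolean value)" is raised whenever code indexes a value that turned out to be a boolean instead of a table. The following minimal, standalone sketch (an illustration only, not the APISIX source; the helper name fire_clean_handlers is hypothetical) shows how such a message is produced when a boolean placeholder ends up where a config item table was expected:

-- Standalone Lua sketch, NOT the APISIX source: indexing a boolean "item"
-- produces the same class of error as seen in the logs above (LuaJIT wording).
local function fire_clean_handlers(item)
    for _, handler in ipairs(item.clean_handlers) do  -- errors if item is a boolean
        handler()
    end
end

local items = {
    { clean_handlers = { function() print("clean") end } },
    false,  -- a boolean placeholder where a table was expected
}

for _, item in ipairs(items) do
    local ok, err = pcall(fire_clean_handlers, item)
    if not ok then
        print(err)  -- e.g. "...: attempt to index local 'item' (a boolean value)"
    end
end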

Additionally, when APISIX hits this error, I noticed that traffic to etcd increased and stayed at an elevated level. I suspect this is caused by APISIX continuously re-requesting etcd.

[Screenshot WeCom20230419-114921: etcd traffic graph]

In the picture, the time when Etcd traffic decreased was when I restarted APISIX.

Additionally, there is a potentially important piece of information: I upgraded from version 2.13.3 to version 3.2.0. Currently, I have only upgraded the APISIX data plane and have not yet upgraded the APISIX control plane or Dashboard. I followed the upgrade documentation for this process. Furthermore, I think that this issue may not be closely related to the upgrade of APISIX since it mainly affects communication between APISIX and Etcd.

Expected Behavior

I would like to know what could be the reason for this issue. If needed, I can try to provide more information.

Error Logs

No response

Steps to Reproduce

I don't know how to reproduce this because after APISIX restarts, the problem disappears. It only reappears after a period of time and cannot be consistently reproduced.

Environment

  • APISIX version (run apisix version): 3.2.0
  • Operating system (run uname -a): Linux knode10-72-73-177 5.15.29-200.el7.x86_64 #1 SMP Thu Mar 31 14:09:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V):
nginx version: openresty/1.21.4.1
built by gcc 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
built with OpenSSL 1.1.1s  1 Nov 2022 (running with OpenSSL 1.1.1n  15 Mar 2022)
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -DAPISIX_BASE_VER=1.21.4.1.7 -DNGX_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_HTTP_GRPC_CLI_ENGINE_PATH=/usr/local/openresty/libgrpc_engine.so -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/usr/local/openresty/openssl111/include' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.33 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.09 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.21 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.9 --add-module=../ngx_stream_lua-0.0.11 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-rpath,/usr/local/openresty/wasmtime-c-api/lib -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -L/usr/local/openresty/openssl111/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib:/usr/local/openresty/openssl111/lib' --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../mod_dubbo-1.0.2 --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../ngx_multi_upstream_module-1.1.1 --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../apisix-nginx-module-1.12.0 --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/stream --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../apisix-nginx-module-1.12.0/src/meta --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../wasm-nginx-module-0.6.4 --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../lua-var-nginx-module-v0.5.3 --add-module=/tmp/tmp.xYfSxgSEeh/openresty-1.21.4.1/../grpc-client-nginx-module-v0.4.2 --with-poll_module --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --with-compat --with-stream --with-http_ssl_module
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.5.4
  • APISIX Dashboard version, if relevant:
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
@hansedong
Contributor Author

Here is the etcd-related configuration in my APISIX:

  etcd:
    host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.72.8.66:2379"   # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.72.8.67:2379"   # e.g. https://127.0.0.1:2379.
    - "https://10.72.8.68:2379"

    prefix: /apisix                 # configuration prefix in etcd
    use_grpc: true                  # enable the experimental configuration sync via gRPC
    timeout: 7200                     # 30 seconds. Use a much higher timeout (like an hour) if the `use_grpc` is true.
    #resync_delay: 5                # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
    #health_check_timeout: 10       # etcd retry the unhealthy nodes after the configured seconds
    startup_retry: 2                # the number of retry to etcd during the startup, default to 2
    #user: root                     # root username for etcd
    #password: 5tHkHhYkjr6cQY       # root password for etcd
    tls:
      # To enable etcd client certificate you need to build APISIX-Base, see
      # https://apisix.apache.org/docs/apisix/FAQ#how-do-i-build-the-apisix-base-environment
      cert: /etc/etcd/etcdSSL/apisix.pem    # path of certificate used by the etcd client
      key: /etc/etcd/etcdSSL/apisix-key.pem     # path of key used by the etcd client
      verify: true                  # whether to verify the etcd endpoint certificate when setup a TLS connection to etcd,
                                    # the default value is true, e.g. the certificate will be verified strictly.
      sni: etcdservervip-dev.xxx.com                        # the SNI for etcd TLS requests. If missed, the host part of the URL will be used.

@hansedong
Contributor Author

I also found the following error in the error log.

It is worth noting that another instance of APISIX (version 2.13.3) using HTTP to connect to the same Etcd does not have any Etcd-related errors.

2023/04/19 13:05:51 [warn] 149396#149396: *16028 [lua] health_check.lua:105: report_failure(): update endpoint: https://10.72.8.68:2379 to unhealthy, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16028 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16029 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16030 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16031 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16032 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16033 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16034 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16035 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16036 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [error] 149396#149396: *16037 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:05:51 [warn] 149395#149395: *206337 [lua] health_check.lua:105: report_failure(): update endpoint: https://10.72.8.67:2379 to unhealthy, context: ngx.timer
2023/04/19 13:07:50 [warn] 149398#149398: *214112 [lua] health_check.lua:105: report_failure(): update endpoint: https://10.72.8.66:2379 to unhealthy, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16026 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16027 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [warn] 149396#149396: *16028 [lua] health_check.lua:105: report_failure(): update endpoint: https://10.72.8.68:2379 to unhealthy, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16028 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16029 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16030 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16031 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16032 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16033 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16034 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16035 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16036 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:51 [error] 149396#149396: *16037 [lua] config_etcd.lua:136: grpc_waitdir(): create watch stream failed: failed to new stream: timeout, context: ngx.timer
2023/04/19 13:07:52 [warn] 149395#149395: *214265 [lua] health_check.lua:105: report_failure(): update endpoint: https://10.72.8.67:2379 to unhealthy, context: ngx.timer

@tao12345666333
Member

@kingluo @monkeyDluffy6017 PTAL. We seem to have discussed a similar issue recently.

@hansedong
Contributor Author

Unfortunately, the APISIX service node in my production environment encountered the exact same issue this morning. All APISIX nodes using gRPC are affected by this problem. It seems that this is not an isolated incident.

@monkeyDluffy6017
Contributor

Could you use HTTP instead? gRPC is not ready yet.

@hansedong
Contributor Author

Could you use HTTP instead? gRPC is not ready yet.

Yes, I have already reverted to using HTTP. As for the gRPC issue, I mainly wanted to report it here as feedback.

@hansedong
Contributor Author

@tao12345666333 @monkeyDluffy6017

I want to point out that there may be problems with both gRPC and HTTP when connecting APISIX 3.2.0 to etcd.

The phenomenon is as follows:

  1. I have two APISIX clusters, and I upgraded some of the nodes in each cluster to version 3.2.0.
  2. Only the data plane was upgraded (I followed the documentation for upgrading to APISIX 3.x.x), and after the upgrade APISIX forwarded traffic normally (I confirmed this because I also upgraded two nodes in my production environment).
  3. Yesterday, I reported that there were problems with gRPC when connecting APISIX version 3.2.0 to Etcd, but later on, I changed the connection method from gRPC to HTTP and it worked normally.
  4. Early this morning, I found abnormal traffic on the etcd cluster used by APISIX, and then discovered a large number of continuous error logs.

The error messages in the log are as follows:

2023/04/19 10:27:56 [error] 25793#25793: *5597452 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/routes, context: ngx.timer
2023/04/19 10:27:56 [error] 25788#25788: *5597471 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/upstreams, context: ngx.timer
2023/04/19 10:27:57 [error] 25787#25787: *5597542 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/upstreams, context: ngx.timer
2023/04/19 10:27:58 [error] 25789#25789: *5597586 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/routes, context: ngx.timer
2023/04/19 10:27:58 [error] 25790#25790: *5597612 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/upstreams, context: ngx.timer
2023/04/19 10:28:00 [error] 25788#25788: *5597711 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/routes, context: ngx.timer
2023/04/19 10:28:00 [error] 25787#25787: *5597717 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
	/usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
	/usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
	/usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
	[C]: in function 'xpcall'
	/usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/routes, context: ngx.timer

You can see that when APISIX 3.2.0 encounters an error connecting to Etcd, regardless of whether the connection method is gRPC or HTTP, the error message is basically the same.

This issue is reproducible and has been reproduced in different APISIX clusters of mine. As long as APISIX runs for a period of time (in both my clusters, it did not exceed 24 hours), it will start to report errors.

@tao12345666333
Member

@kingluo PTAL

@AlinsRan
Contributor

AlinsRan commented Apr 21, 2023

@hansedong Does this still happen even if there are no routes?
Can you provide the steps to reproduce it? I will try to reproduce it.

@hansedong
Contributor Author

hansedong commented Apr 21, 2023

@hansedong Does this still happen even if there are no routes? I will try to reproduce it.

I'm not sure about the scenario without routes; all of my APISIX clusters have routes. But I think this problem is probably not related to whether there are routes or not.

Regarding how to reproduce this issue: I encountered it after upgrading APISIX from 2.15.x to 3.2.0 (only the data plane nodes were upgraded). My APISIX has over 600 routes and over 600 upstreams. I suspect this issue may also exist when using version 3.2.0 directly, so I plan to set up a dedicated 3.2.0 environment and try to verify it.

Reproduction method: upgrade APISIX from 2.15.x to 3.2.0 in your own environment and connect to etcd via gRPC (use_grpc: true). That's it. The problem won't appear at first; wait a few hours or a day and then check the error log.

@AlinsRan
Contributor

@hansedong Does this still happen even if there are no routes? I will try to reproduce it.

I'm not sure about the scenario without routes; all of my APISIX clusters have routes. But I think this problem is probably not related to whether there are routes or not.

Regarding how to reproduce this issue: I encountered it after upgrading APISIX from 2.15.x to 3.2.0 (only the data plane nodes were upgraded). My APISIX has over 600 routes and over 600 upstreams. I suspect this issue may also exist when using version 3.2.0 directly, so I plan to set up a dedicated 3.2.0 environment and try to verify it.

Reproduction method: upgrade APISIX from 2.15.x to 3.2.0 in your own environment and connect to etcd via gRPC (use_grpc: true). That's it. The problem won't appear at first; wait a few hours or a day and then check the error log.

I tried running APISIX 3.2.0 for a long time and didn't seem to encounter this issue.
Can you provide detailed steps to reproduce?

@hansedong
Contributor Author

@AlinsRan
Indeed, it is not easy to reproduce this issue.
The last time I encountered it in my cluster, it took 3 days to reproduce. Can you check if the error log of your cluster is normal now?
In addition, this afternoon, I will set up a new cluster to try to see if I can reproduce this issue. If I can reproduce it, I will provide complete cluster information, including the Etcd version, APISIX version, and so on.

@AlinsRan
Contributor

@hansedong
I only ran for two days and did not encounter this issue.
I have tried running in multiple environments, such as K8S, Docker, and Virtual Machine, but still haven't encountered it. Perhaps it is due to specific configurations?

@hansedong
Contributor Author

@hansedong I only ran for two days and did not encounter this issue. I have tried running in multiple environments, such as K8S, Docker, and Virtual Machine, but still haven't encountered it. Perhaps it is due to specific configurations?

May I ask if your APISIX communicates with Etcd using TLS?

@TakiJoe

TakiJoe commented Apr 25, 2023

@hansedong this issue has been fixed in #8493

@hansedong
Contributor Author

@TakiJoe Thank you very much for your reminder. I will build an APISIX version based on the source code to verify it.

@zxyao145

I encountered the same problem in APISIX 3.2.0.

@kingluo
Contributor

kingluo commented May 4, 2023

@hansedong this issue has been fixed in #8493

@hansedong @zxyao145 The bugfix has been merged into master, so have you confirmed whether it works? If so, please close this issue. Thank you.

@hansedong
Contributor Author

@hansedong @zxyao145 The bugfix has been merged into master, so have you confirmed whether it works? If so, please close this issue. Thank you.

In the version I compiled myself, there are still some issues. I will try the latest version and reply later.

@hansedong
Contributor Author

hansedong commented May 4, 2023

@kingluo

I built APISIX based on the latest code. When I use gRPC to connect to Etcd, I encounter the following error message:

2023/05/04 14:23:50 [warn] 500026#500026: *4 [lua] plugin.lua:202: load(): new plugins: {"udp-logger":true,"client-control":true,"cors":true,"batch-requests":true,"gzip":true,"syslog":true,"authz-keycloak":true,"zipkin":true,"forward-auth":true,"tcp-logger":true,"aws-lambda":true,"skywalking-logger":true,"fault-injection":true,"redirect":true,"http-logger":true,"cas-auth":true,"kafka-logger":true,"openid-connect":true,"kafka-proxy":true,"authz-casdoor":true,"loggly":true,"uri-blocker":true,"key-auth":true,"azure-functions":true,"limit-conn":true,"basic-auth":true,"openfunction":true,"public-api":true,"datadog":true,"clickhouse-logger":true,"opa":true,"proxy-control":true,"proxy-cache":true,"csrf":true,"elasticsearch-logger":true,"limit-req":true,"ext-plugin-post-resp":true,"ext-plugin-post-req":true,"server-info":true,"jwt-auth":true,"openwhisk":true,"degraphql":true,"traffic-split":true,"log-rotate":true,"google-cloud-logging":true,"wolf-rbac":true,"inspect":true,"api-breaker":true,"real-ip":true,"echo":true,"rocketmq-logger":true,"file-logger":true,"node-status":true,"mocking":true,"body-transformer":true,"sls-logger":true,"splunk-hec-logging":true,"consumer-restriction":true,"ext-plugin-pre-req":true,"referer-restriction":true,"ua-restriction":true,"workflow":true,"authz-casbin":true,"ip-restriction":true,"request-validation":true,"hmac-auth":true,"grpc-web":true,"response-rewrite":true,"example-plugin":true,"limit-count":true,"proxy-rewrite":true,"serverless-pre-function":true,"prometheus":true,"request-id":true,"grpc-transcode":true,"proxy-mirror":true,"ldap-auth":true,"tencent-cloud-cls":true,"serverless-post-function":true,"ai":true}, context: init_worker_by_lua*
2023/05/04 14:23:50 [warn] 500025#500025: *2 [lua] plugin.lua:252: load_stream(): new plugins: {"prometheus":true,"syslog":true,"mqtt-proxy":true,"limit-conn":true,"ip-restriction":true}, context: init_worker_by_lua*
2023/05/04 14:23:50 [warn] 500027#500027: *3 [lua] plugin.lua:252: load_stream(): new plugins: {"prometheus":true,"syslog":true,"mqtt-proxy":true,"limit-conn":true,"ip-restriction":true}, context: init_worker_by_lua*
2023/05/04 14:23:50 [warn] 500026#500026: *4 [lua] plugin.lua:252: load_stream(): new plugins: {"prometheus":true,"syslog":true,"mqtt-proxy":true,"limit-conn":true,"ip-restriction":true}, context: init_worker_by_lua*
2023/05/04 14:23:50 [warn] 500024#500024: *5 [lua] plugin.lua:252: load_stream(): new plugins: {"prometheus":true,"syslog":true,"mqtt-proxy":true,"limit-conn":true,"ip-restriction":true}, context: init_worker_by_lua*
2023/05/04 14:23:50 [warn] 500031#500031: *1 [lua] plugin.lua:252: load_stream(): new plugins: {"prometheus":true,"syslog":true,"mqtt-proxy":true,"limit-conn":true,"ip-restriction":true}, context: init_worker_by_lua*


2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.67:2379", host: "localhost"
2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.68:2379", host: "localhost"
2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.67:2379", host: "localhost"
2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.66:2379", host: "localhost"
2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.66:2379", host: "localhost"
2023/05/04 14:24:51 [error] 500025#500025: *19 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /etcdserverpb.Watch/Watch HTTP/2.0", upstream: "grpcs://10.72.8.68:2379", host: "localhost"

It can be clearly seen that the time from starting APISIX to hitting the errors above is about 1 minute; to be exact, 61 seconds!

Here is the etcd-related part of my APISIX configuration:

  etcd:
    host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.72.8.66:2379"   # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.72.8.67:2379"   # e.g. https://127.0.0.1:2379.
    - "https://10.72.8.68:2379"
    prefix: /apisix                 # configuration prefix in etcd
    use_grpc: true                  # enable the experimental configuration sync via gRPC
    timeout: 300                     # 30 seconds. Use a much higher timeout (like an hour) if the `use_grpc` is true.
    #resync_delay: 5                # when sync failed and a rest is needed, resync after the configured seconds plus 50% random jitter
    #health_check_timeout: 10       # etcd retry the unhealthy nodes after the configured seconds
    startup_retry: 2                # the number of retry to etcd during the startup, default to 2
    #user: root                     # root username for etcd
    #password: 5tHkHhYkjr6cQY       # root password for etcd

What needs to be emphasized is that after I changed the timeout configuration item to 30, the error above no longer appeared (I observed for several hours), and as long as the timeout value exceeds 60, the error reappears after 1 minute (exactly 61 seconds).

Above, I mentioned that when connecting to etcd via gRPC, the error occurs whenever the timeout value exceeds 60. If I switch APISIX's connection to etcd from gRPC back to HTTP (use_grpc: false), there are still issues. Below is the specific error:

2023/05/04 14:58:42 [error] 503639#503639: *69 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.67:2379/v3/watch", host: "127.0.0.1"
2023/05/04 14:58:42 [error] 503639#503639: *73 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.68:2379/v3/watch", host: "127.0.0.1"
2023/05/04 14:58:42 [error] 503639#503639: *79 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.67:2379/v3/watch", host: "127.0.0.1"
2023/05/04 14:58:42 [error] 503645#503645: *6 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /apisix/plugin_metadata, context: ngx.timer
2023/05/04 14:58:42 [error] 503645#503645: *10 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /apisix/ssl, context: ngx.timer
2023/05/04 14:58:42 [error] 503639#503639: *77 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.67:2379/v3/watch", host: "127.0.0.1"
2023/05/04 14:58:42 [error] 503645#503645: *16 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /apisix/plugin_configs, context: ngx.timer
2023/05/04 14:58:42 [error] 503639#503639: *37 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.66:2379/v3/watch", host: "127.0.0.1"
2023/05/04 14:58:42 [error] 503645#503645: *14 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /apisix/services, context: ngx.timer
2023/05/04 14:58:42 [error] 503645#503645: *4 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /apisix/protos, context: ngx.timer
2023/05/04 14:58:42 [error] 503639#503639: *81 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "https://10.72.8.68:2379/v3/watch", host: "127.0.0.1"

Similarly, after testing, no errors occur when the timeout value is below 60.

@kingluo
Contributor

kingluo commented May 4, 2023

That looks like a different issue, probably a network error. Make sure you can access etcd via etcdctl.

@hansedong
Contributor Author

That looks like a different issue, probably a network error. Make sure you can access etcd via etcdctl.

I don't think network issues are likely, because my APISIX is a cluster (actually our company's development environment). I am very certain that my etcd cluster is available, and etcdctl can access it. This cluster has been in official use within the company for more than a year.

Within the entire cluster, this error occurs only with the higher version of APISIX; the lower-version APISIX nodes in the cluster have no problems. I ran a test and found that even with the timeout parameter set higher than 60, a lower version of APISIX produces no errors.

For now, please don't close the issue. I will follow up on this problem in depth from the source code level later.

@wklken

wklken commented May 10, 2023

We met the same error after upgrading from 3.2 to 3.3 (same config; 3.2 was fine).

2023/05/10 04:16:12 [error] 584#584: *263473 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "http://x.x.x.x:2379/v3/watch", host: "127.0.0.1"

In the same container, using curl to test /v3/watch produced no timeout error.

@kingluo
Contributor

kingluo commented May 11, 2023

@hansedong @wklken
In fact, the timeout error logs are expected.
Since APISIX 2.15, the connection between APISIX and etcd has been proxied through an internal conf server, which uses an nginx upstream rather than a resty cosocket to connect to etcd. That upstream is configured with a timeout value; when it times out, nginx closes the connection on its own and prints the log.
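
To illustrate why these logs are harmless, here is a minimal Lua sketch of the watch/retry pattern described above (an illustration only, not the APISIX source; read_watch_event is a hypothetical stand-in for the proxied /v3/watch call):

-- Sketch only, NOT the APISIX source. When the conf-server proxy hits its read
-- timeout, the client side of the watch sees the connection as closed, logs the
-- error, and simply re-issues the watch, so no configuration update is lost.
local function watch_loop(read_watch_event)
    while true do
        local event, err = read_watch_event()  -- blocks until an event arrives or the proxy times out
        if not event then
            print("failed to fetch data from etcd: " .. tostring(err))  -- e.g. "closed"
            -- fall through and retry the watch
        else
            -- apply the configuration change here
        end
    end
end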

@ryanli870929

ryanli870929 commented May 16, 2023

I used the image apache/apisix:3.2.0-centos and met the same error.

2023/05/15 13:12:03 [warn] 50#50: *736693 [lua] config_etcd.lua:395: sync_data(): waitdir [/apisix/plugin_metadata] err: compacted, will read the configuration again via readdir, context: ngx.timer
2023/05/15 13:12:36 [warn] 47#47: *725726 [lua] config_etcd.lua:395: sync_data(): waitdir [/apisix/ssls] err: compacted, will read the configuration again via readdir, context: ngx.timer
2023/05/15 13:12:36 [error] 47#47: *725726 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/ssls, context: ngx.timer
2023/05/15 13:12:36 [warn] 49#49: *725746 [lua] config_etcd.lua:395: sync_data(): waitdir [/apisix/ssls] err: compacted, will read the configuration again via readdir, context: ngx.timer
2023/05/15 13:12:36 [warn] 54#54: *725750 [lua] config_etcd.lua:395: sync_data(): waitdir [/apisix/ssls] err: compacted, will read the configuration again via readdir, context: ngx.timer
2023/05/15 13:12:36 [error] 49#49: *725746 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/ssls, context: ngx.timer
2023/05/15 13:12:36 [error] 54#54: *725750 [lua] config_etcd.lua:659: failed to fetch data from etcd: /usr/local/apisix/apisix/core/config_util.lua:99: attempt to index local 'item' (a boolean value)
stack traceback:
        /usr/local/apisix/apisix/core/config_util.lua:99: in function 'fire_all_clean_handlers'
        /usr/local/apisix/apisix/core/config_etcd.lua:376: in function 'sync_data'
        /usr/local/apisix/apisix/core/config_etcd.lua:606: in function </usr/local/apisix/apisix/core/config_etcd.lua:596>
        [C]: in function 'xpcall'
        /usr/local/apisix/apisix/core/config_etcd.lua:596: in function </usr/local/apisix/apisix/core/config_etcd.lua:574>,  etcd key: /apisix/ssls, context: ngx.timer

@kingluo
Contributor

kingluo commented May 16, 2023

@ryanli870929 Yes, the bugfix has been applied since 3.3.0.

@wklken

wklken commented May 18, 2023

@hansedong @wklken In fact, the timeout error logs are expected. Since APISIX 2.15, the connection between APISIX and etcd has been proxied through an internal conf server, which uses an nginx upstream rather than a resty cosocket to connect to etcd. That upstream is configured with a timeout value; when it times out, nginx closes the connection on its own and prints the log.

When I switched the deployment.etcd.timeout from 120 to 30, the error disappeared.

And if I change it back to 120, the timeout error logs showed up 100% after reloading.

2023/05/18 08:10:58 [error] 837#837: *59888 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "http://x.x.x.x:2379/v3/watch", host: "127.0.0.1"
2023/05/18 08:10:58 [error] 841#841: *59837 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /bk-gateway-0-apisix/global_rules, context: ngx.timer

@hansedong
Contributor Author

When I switched the deployment.etcd.timeout from 120 to 30, the error disappeared.

And if I change it back to 120, the timeout error logs showed up 100% after reloading.

2023/05/18 08:10:58 [error] 837#837: *59888 upstream timed out (110: Connection timed out) while reading upstream, client: unix:, server: , request: "POST /v3/watch HTTP/1.1", upstream: "http://x.x.x.x:2379/v3/watch", host: "127.0.0.1"
2023/05/18 08:10:58 [error] 841#841: *59837 [lua] config_etcd.lua:634: failed to fetch data from etcd: closed,  etcd key: /bk-gateway-0-apisix/global_rules, context: ngx.timer

@wklken Yes, it's the same as my situation.

In fact, the timeout error logs are expected.
Since APISIX 2.15, the connection between APISIX and etcd has been proxied through an internal conf server, which uses an nginx upstream rather than a resty cosocket to connect to etcd. That upstream is configured with a timeout value; when it times out, nginx closes the connection on its own and prints the log.

@kingluo what is the timeout value used by the internal conf server to connect to Etcd?

@kingluo
Contributor

kingluo commented Jun 10, 2023

@hansedong The timeout is 60 secs, which reuses the default value of proxy_read_timeout.

@kingluo
Contributor

kingluo commented Jun 10, 2023

The HTTP way of accessing etcd has now been refactored and optimized in the APISIX master branch (which also avoids the confusing timeout logs), and it has the same advantages as the gRPC way. Please give it a try.

@lingsamuel lingsamuel added the bug Something isn't working label Jun 20, 2023
@lingsamuel lingsamuel moved this to 📋 Backlog in Apache APISIX backlog Jun 26, 2023
@hansedong
Contributor Author

hansedong commented Jul 4, 2023

I would like to add some information. I compiled an RPM based on version 3.4.0 and reinstalled APISIX.

Here is my configuration:

deployment:
  role: traditional
  role_traditional:
    config_provider: etcd
  admin:
    # Default token when use API to call for Admin API.
    # *NOTE*: Highly recommended to modify this value to protect APISIX's Admin API.
    # Disabling this configuration item means that the Admin API does not
    # require any authentication.
    admin_key:
      -
        name: admin
        key: xxx
        role: admin                 # admin: manage all configuration data
                                    # viewer: only can view configuration data
      -
        name: viewer
        key: xxx
        role: viewer

    enable_admin_cors: true         # Admin API support CORS response headers.
    allow_admin:                    # http://nginx.org/en/docs/http/ngx_http_access_module.html#allow
      - 0.0.0.0/0                # If we don't set any IP list, then any IP access is allowed by default.
      #- "::/64"
    admin_listen:                 # use a separate port
      ip: 0.0.0.0                 # Specific IP, if not set, the default value is `0.0.0.0`.
      port: 9180                  # Specific port, which must be different from node_listen's port.

    #https_admin: true            # enable HTTPS when use a separate port for Admin API.
                                  # Admin API will use conf/apisix_admin_api.crt and conf/apisix_admin_api.key as certificate.

    admin_api_mtls:               # Depends on `admin_listen` and `https_admin`.
      admin_ssl_cert: ""          # Path of your self-signed server side cert.
      admin_ssl_cert_key: ""      # Path of your self-signed server side key.
      admin_ssl_ca_cert: ""       # Path of your self-signed ca cert.The CA is used to sign all admin api callers' certificates.

    admin_api_version: v3         # The version of admin api, latest version is v3.

  etcd:
    host:                           # it's possible to define multiple etcd hosts addresses of the same etcd cluster.
    - "https://10.72.8.66:2479"   # multiple etcd address, if your etcd cluster enables TLS, please use https scheme,
    - "https://10.72.8.67:2479"   # e.g. https://127.0.0.1:2379.
    - "https://10.72.8.68:2479"

Here are some questions that I want to clarify:

  1. When I use gRPC to connect to etcd (configuration: use_grpc: true), I expected this method to reduce the number of connections between APISIX and etcd. However, in reality, I found that for a single Nginx instance there are still many gRPC connections to etcd:
[root@knode10-72-8-63 logs]# ps -ef |grep nginx |grep worker
nobody     74123   74122  0 17:19 ?        00:00:01 nginx: worker process
nobody     74124   74122  0 17:19 ?        00:00:01 nginx: worker process
nobody     74125   74122  0 17:19 ?        00:00:01 nginx: worker process
nobody     74126   74122  0 17:19 ?        00:00:01 nginx: worker process
[root@knode10-72-8-63 logs]# netstat -apn |grep ESTABLISHED |grep nginx|grep -v unix|grep 2479
tcp        0      0 10.72.8.63:52689        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52695        10.72.8.67:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:64456        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:53235        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40532        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52685        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52681        10.72.8.67:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:52751        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52679        10.72.8.67:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:51521        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40604        10.72.8.66:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:40588        10.72.8.66:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:40544        10.72.8.66:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:64418        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40526        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64430        10.72.8.68:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:40522        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:41104        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52713        10.72.8.67:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:52723        10.72.8.67:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:40524        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52741        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64496        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64498        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40626        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40610        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64440        10.72.8.68:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:40618        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52671        10.72.8.67:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:40554        10.72.8.66:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:40596        10.72.8.66:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:52711        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64458        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64422        10.72.8.68:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:39390        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64410        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64412        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40608        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52667        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40546        10.72.8.66:2479         ESTABLISHED 74126/nginx: worker
tcp        0      0 10.72.8.63:64488        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64474        10.72.8.68:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:52701        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64472        10.72.8.68:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:64506        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64444        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64484        10.72.8.68:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:64450        10.72.8.68:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52749        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52709        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:52715        10.72.8.67:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:40534        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40578        10.72.8.66:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:40598        10.72.8.66:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:64482        10.72.8.68:2479         ESTABLISHED 74123/nginx: worker
tcp        0      0 10.72.8.63:52699        10.72.8.67:2479         ESTABLISHED 74124/nginx: worker
tcp        0      0 10.72.8.63:64446        10.72.8.68:2479         ESTABLISHED 74126/nginx: worker

It is evident that there are only 4 Nginx processes in total, but each process has a relatively high number of connections to etcd. The crucial point is that process 74124 has 35 connections, while processes 74123 and 74126 have only 11 each, which is a significant difference. I would like to ask whether this is expected or whether there might be an issue in my environment.

  2. In APISIX 3.4.0, I noticed that it is possible to use a single long-lived HTTP connection to etcd (feat(config_etcd): use a single long http connection to watch all resources #9456). Indeed, compared to using gRPC, the number of connections has been greatly reduced:
[root@knode10-72-8-63 logs]# ps -ef |grep nginx |grep worker
nobody     77904   77903  0 17:49 ?        00:00:00 nginx: worker process
nobody     77905   77903  0 17:49 ?        00:00:00 nginx: worker process
nobody     77906   77903  0 17:49 ?        00:00:00 nginx: worker process
nobody     77907   77903  0 17:49 ?        00:00:00 nginx: worker process
[root@knode10-72-8-63 logs]# netstat -apn |grep nginx|grep -v unix|grep 2479
tcp        0      0 10.72.8.63:39949        10.72.8.66:2479         ESTABLISHED 77905/nginx: worker
tcp        0      0 10.72.8.63:62609        10.72.8.68:2479         ESTABLISHED 77904/nginx: worker
tcp        0      0 10.72.8.63:63719        10.72.8.68:2479         ESTABLISHED 77905/nginx: worker
tcp        0      0 10.72.8.63:49646        10.72.8.67:2479         ESTABLISHED 77905/nginx: worker
tcp        0      0 10.72.8.63:50858        10.72.8.67:2479         ESTABLISHED 77904/nginx: worker
tcp        0      0 10.72.8.63:39839        10.72.8.66:2479         ESTABLISHED 77907/nginx: worker
tcp        0      0 10.72.8.63:63829        10.72.8.68:2479         ESTABLISHED 77905/nginx: worker
tcp        0      0 10.72.8.63:39835        10.72.8.66:2479         ESTABLISHED 77905/nginx: worker

From the above, there are a total of 4 Nginx processes, but each process has a different number of connections to etcd.

For example, process 77907 has only 1 connection, process 77904 has 2 connections, and process 77905 has 5 connections, while process 77906 may not have any connection to etcd at all. After a period of time, the set of ESTABLISHED connections changes, but the number of connections per process stays basically the same.

According to the documentation, each Nginx worker process should hold only one long-lived HTTP connection to etcd, which seems different from what I observed. In my understanding, even if a new HTTP connection is rebuilt after the previous one times out, there should still be only one persistent HTTP connection per worker at any time. I would like to ask what is going on here?

@kingluo Can you help me clarify the points of confusion mentioned above? Thank you very much.

@kingluo
Contributor

kingluo commented Jul 4, 2023

@hansedong
Do not enable use_grpc; it's not stable for the time being.

It looks like the HTTP connections are not distributed evenly among worker processes. Yes, it's a bug whose cause is unknown; it exists on Ubuntu 20.04 at least. But it will not cause any issues: in fact, each worker process owns only one connection to etcd.

Also, the server-info plugin connects to etcd at an interval, so there are some transient connections as well, and the connection owned by the privileged process is also counted among the worker processes.

@kingluo
Contributor

kingluo commented Sep 3, 2023

The etcd.use_grpc option has been removed, so etcd communication rolls back to how the 2.x versions behaved, i.e. both CP and DP connect to etcd directly via HTTP. Closed by #10015
