bug: error_log report error in the way apisix connected to etcd using gRPC #9336
Comments
I will add some configuration related to Etcd in APISIX.
I also found the following error in the error log.
@kingluo @monkeyDluffy6017 PTAL. We seem to have discussed a similar issue recently. |
Unfortunately, the APISIX service node in my production environment encountered the exact same issue this morning. All APISIX nodes using gRPC are affected by this problem. It seems that this is not an isolated incident. |
Could you use HTTP instead? gRPC support is not ready yet.
Yes, I have already reverted back to using HTTP. As for the issue with gRPC, I mainly wanted to provide feedback here. |
@tao12345666333 @monkeyDluffy6017 I want to say that maybe there are problems with both gRPC and HTTP when connecting APISIX version 3.2.0 to Etcd. The phenomenon is as follows:
The error message of the error log is as follows:
This issue is reproducible and has been reproduced in different APISIX clusters of mine. As long as APISIX runs for a period of time (in both my clusters, it did not exceed 24 hours), it will start to report errors. |
@kingluo PTAL |
@hansedong Even if there is no route, will it still be like this? |
I'm not sure about the scenario without a route. Multiple APISIX clusters on my side all have routes, but I think this problem is probably unrelated to whether routes exist. As for how to reproduce this issue: I encountered it after upgrading my APISIX from 2.15.x to 3.2.0 (only the APISIX data plane nodes were upgraded). My APISIX has over 600 routes and over 600 upstreams. I suspect that this issue may also exist when using version 3.2.0 directly, so I plan to set up a dedicated environment for version 3.2.0 and try to verify it. Reproduction method: upgrade APISIX from 2.15.x to 3.2.0 in your own environment, and connect to Etcd using gRPC (
I tried running APISIX 3.2.0 for a long time and didn't seem to encounter this issue. |
@AlinsRan |
@hansedong |
May I ask if your APISIX communicates with Etcd using TLS? |
@hansedong this issue has been fixed in #8493 |
@TakiJoe Thank you very much for your reminder. I will build an APISIX version based on the source code to verify it. |
I encountered the same problem in APISIX 3.2.0.
@hansedong @zxyao145 The bugfix was merged into the master, so have you confirmed if the bugfix works? If so, please close this issue. Thank you. |
In the version I compiled myself, there are still some issues. I will try the latest version and reply later. |
I built APISIX based on the latest code. When I use gRPC to connect to Etcd, I encounter the following error message:
The time from starting APISIX to the first occurrence of this error is about 1 minute; to be exact, 61 seconds. Here is part of my APISIX configuration for Etcd:
I want to emphasize that after I changed the timeout configuration item to 30, the above error no longer appeared (I observed for several hours), and whenever the value of timeout exceeds 60, the error reappears after 1 minute (exactly 61 seconds). All of the above concerns connecting to Etcd over gRPC. If I switch APISIX's connection to Etcd from gRPC back to HTTP (use_grpc: false), there are still issues. Below is the specific error:
Similarly, after testing, no errors will occur when the value of the |
This is a different issue; it looks like a network error. Make sure you can access etcd via etcdctl.
I don't think there is a high possibility of network issues, because my APISIX is a cluster (actually our company's development environment). I am very certain that my Etcd cluster is available, and etcdctl can also access the Etcd cluster. This cluster has been running officially within the company for more than a year. Within the entire cluster, only when using a higher version of APISIX will this error occur. There are no problems with lower versions of APISIX in the cluster. I have conducted a test and found that even if the value of the timeout parameter is set higher than 60 in a lower version of APISIX, no errors will occur. For now, please don't close the issue. I will follow up on this problem in depth from the source code level later. |
We hit the same error after upgrading from 3.2 to 3.3 (same config; 3.2 works fine)
and in the same container, use curl to test the |
@hansedong @wklken |
I used the image apache/apisix:3.2.0-centos and hit the same error.
@ryanli870929 yes, the bugfix is applied since 3.3.0 |
When I switched the And if I change it back to
@wklken "Yes, it's the same as my situation."
@kingluo what is the timeout value used by the internal conf server to connect to Etcd? |
@hansedong The timeout is 60 secs, which reuses the default value of |
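One way to read the observations in this thread: if the internal conf server gives up after 60 seconds while the configured etcd timeout is longer, the conf server's timer fires first, which would match errors showing up at about 61 seconds. The sketch below only illustrates that arithmetic. It is an assumption about the mechanism, not APISIX code, and the names `first_to_expire` and `CONF_SERVER_TIMEOUT` are hypothetical.

```python
# Illustrative only: compares two timers, it does not talk to APISIX or etcd.
CONF_SERVER_TIMEOUT = 60  # seconds; the value reported in this thread


def first_to_expire(etcd_timeout, conf_server_timeout=CONF_SERVER_TIMEOUT):
    """Return which side times out first on an idle long-lived request.

    Assumption: both timers start at the same moment and no etcd event
    arrives in between.
    """
    if etcd_timeout <= conf_server_timeout:
        return "client"
    return "conf_server"


if __name__ == "__main__":
    for timeout in (30, 60, 61, 120):
        print(f"timeout={timeout}s -> {first_to_expire(timeout)} expires first")
```

Under this reading, a timeout of 30 lets the client side end the request cleanly, while any value above 60 leaves the conf server to cut the connection first.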
The HTTP path for accessing etcd has now been refactored and optimized in the APISIX master branch (it also avoids the confusing timeout logs), so it has the same advantages as the gRPC path. Please give it a try.
I would like to add some information. I compiled RPM based on version 3.4.0 and reinstalled APISIX. Here is my configuration:
Here are some questions that I want to clarify:
It is evident that there are only 4 Nginx processes in total, but each process has a relatively high number of connections with Etcd. The crucial point is that the process 74124 has 35 connections with Etcd, while the processes 74123 and 74126 have only 11 connections each. There is a significant difference between them. I would like to ask if this situation is expected or if there might be an issue in my environment?
From the above, there are 4 Nginx processes in total, but each has a different number of connections with Etcd. For example, process 77907 has only 1 connection, process 77904 has 2 connections, and process 77905 has 5 connections, while process 77906 may have no connection with Etcd at all. After a period of time, the number of these ESTABLISHED connections changes, but then the number of connections per process stays basically the same. According to the documentation, each Nginx worker process should have only one long-lived HTTP connection with Etcd, which differs from what I observe. As I understand it, even if a new HTTP connection is established after the previous one times out, there should still be only one persistent HTTP connection per worker. What is going on here? @kingluo Can you help me clarify the points of confusion mentioned above? Thank you very much.
@hansedong It looks like the HTTP connections are not distributed evenly among the worker processes. Yes, it's a bug; the cause is unknown, and it exists on Ubuntu 20.04 at least. But it will not cause any issues. In fact, each worker process owns only one connection to etcd. The server-info plugin also connects to etcd at some interval, so there are some transient connections too, and the connection owned by the privileged process is also counted under the worker processes.
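The per-process connection counts discussed above can be tallied with a short script. The following is a minimal sketch and not part of APISIX. It assumes the input looks like `ss -tnp` output and that etcd listens on its default client port 2379. The sample lines (addresses, file descriptors) are made up for illustration, and the function name `count_etcd_connections` is hypothetical.

```python
import re
from collections import Counter

# Matches the process info that `ss -tnp` appends,
# e.g. users:(("nginx",pid=77904,fd=21))
PID_RE = re.compile(r"pid=(\d+)")


def count_etcd_connections(ss_output, etcd_port=2379):
    """Count ESTABLISHED connections to the etcd port, grouped by owning PID."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port Process
        if len(fields) < 6 or fields[0] != "ESTAB":
            continue
        if not fields[4].endswith(f":{etcd_port}"):
            continue
        match = PID_RE.search(fields[5])
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Made-up sample; real input would come from running `ss -tnp` on the APISIX host.
SAMPLE = """\
ESTAB 0 0 10.0.0.5:40112 10.0.0.9:2379 users:(("nginx",pid=77904,fd=21))
ESTAB 0 0 10.0.0.5:40114 10.0.0.9:2379 users:(("nginx",pid=77904,fd=23))
ESTAB 0 0 10.0.0.5:40120 10.0.0.9:2379 users:(("nginx",pid=77907,fd=19))
ESTAB 0 0 10.0.0.5:51000 10.0.0.7:9080 users:(("nginx",pid=77905,fd=30))
TIME-WAIT 0 0 10.0.0.5:40130 10.0.0.9:2379
"""

if __name__ == "__main__":
    for pid, n in sorted(count_etcd_connections(SAMPLE).items()):
        print(f"pid {pid}: {n} connection(s) to etcd")
```

Running the tally several times a few seconds apart would help separate long-lived watch connections from the transient ones mentioned above, since transient connections should come and go between samples.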
etcd.use_grpc has been removed, so etcd communication goes back to how the 2.x versions behave: both CP and DP connect to etcd directly via HTTP. Closed by #10015
Current Behavior
I am using APISIX version 3.2.0.
The data plane of APISIX connects to etcd through gRPC.
At the beginning everything was running well, but after a period of time (around 8 hours), the APISIX error log began to show the following error message continuously:
After I restarted APISIX, the error message in the log disappeared. Currently, I have encountered this problem twice and each time it was only resolved by restarting APISIX.
Additionally, when APISIX encounters this error, I noticed an increase in traffic to Etcd and it remained at a consistent level. I feel that this is caused by APISIX continuously requesting Etcd.
Additionally, there is a potentially important piece of information: I upgraded from version 2.13.3 to version 3.2.0. Currently, I have only upgraded the APISIX data plane and have not yet upgraded the APISIX control plane or Dashboard. I followed the upgrade documentation for this process. Furthermore, I think that this issue may not be closely related to the upgrade of APISIX since it mainly affects communication between APISIX and Etcd.
Expected Behavior
I would like to know what could be the reason for this issue. If needed, I can try to provide more information.
Error Logs
No response
Steps to Reproduce
I don't know how to reproduce this because after APISIX restarts, the problem disappears. It only reappears after a period of time and cannot be consistently reproduced.
Environment
apisix version: 3.2.0
uname -a: Linux knode10-72-73-177 5.15.29-200.el7.x86_64 #1 SMP Thu Mar 31 14:09:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
openresty -V or nginx -V: (not provided)
curl http://127.0.0.1:9090/v1/server_info: 3.5.4
luarocks --version: (not provided)