[🐛 Bug]: Selenium nodes stop connecting to the grid after some time has passed. #1913
Comments
@Doofus100500, thank you for creating this issue. We will troubleshoot it as soon as we can. Info for maintainers: Triage this issue by using labels.
If information is missing, add a helpful comment and then
If the issue is a question, add the
If the issue is valid but there is no time to troubleshoot it, consider adding the
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C),
add the applicable
After troubleshooting the issue, please add the Thank you! |
This normally happens because those Nodes are in a "new" section of the cluster, and the Distributor's DNS has no information about those IP addresses. Can you double check that? |
But after restarting the distributor, it once again starts "seeing" the nodes created after its restart.
|
Yes, because the DNS information is updated, I believe. You know better your environment, can you check that? |
While studying the issue, I discovered an interesting behavior: if at least one node is connected to the grid, the distributor continues to connect newly created nodes correctly (however, this has not yet been confirmed over an extended period of time). Also, I don't understand where (at the network level) the distributor gets the list of nodes from, since, as practice has shown, EventBus doesn't write anything to it. The node sends information to EventBus, and after that I have a gap in my understanding of what happens. Could you please explain in more detail what is happening? |
Well, based on this description, it's not clear. Does the Distributor itself have to connect to the node? "Interacts" is too loose a term. |
This is the part I linked:
|
There is definitely no DNS in it. The externalUri confirms this. It's an IP address there, not a DNS name. Let's go back to my hypothesis: is it possible that the Distributor at some point stops "searching" for nodes because there are none and no longer resumes its search? |
The hypothesis has been confirmed. If there is at least one node in the grid, the distributor continues to reliably register new nodes. Testing was conducted from the moment of the last comment. Thank you for the responses. |
There is also I am not 100% sure about the hypothesis because a Distributor registers a Node if it can reach it via HTTP. It might be that, for some reason, in your environment it takes longer than 2 minutes for the message to reach the Distributor. |
But then how to explain that everything is working now? |
There is nothing in the code that can confirm that hypothesis, which is why I'm not sure about it. What I am sure of is that other people have reported very similar issues, and they were due to network connectivity between the Distributor and the Nodes. I cannot troubleshoot your environment. |
I can also report that this happened on our deployment, which is similar to what has been reported. We use NodePools, and the issue started when a new node was added to the pool and chrome pods were deployed on it while the rest of the components were on the first node. Restarting also did the trick and solved it for now... |
And it happened again today when a new node was added to the cluster's nodepool. |
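The connectivity point above can be checked directly. The following is a minimal sketch, assuming it is run from inside the Distributor pod (or any pod on the same cluster node) and that the Node answers on its `/status` endpoint at the address it advertises as its external URI. The function name `node_is_reachable` and the example address are hypothetical; this is not how the Distributor itself performs the check.

```python
import json
import urllib.error
import urllib.request


def node_is_reachable(node_uri: str, timeout: float = 5.0) -> bool:
    """Return True if GET <node_uri>/status answers with HTTP 200 and a JSON body.

    node_uri is assumed to be the address the Node registered with,
    for example an IP-based URI such as http://10.0.0.12:5555 (hypothetical).
    """
    url = node_uri.rstrip("/") + "/status"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, bad URL, or non-JSON reply.
        return False


if __name__ == "__main__":
    # Hypothetical address; replace with the URI reported by the Node.
    print(node_is_reachable("http://10.0.0.12:5555", timeout=2.0))
```

If this returns False from the Distributor's side for a Node scheduled on a newly added cluster node, the problem is likely in the network path rather than in the Grid itself.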
I apologize for being off-topic, but what do you mean by "NodePools"? Are you referring to Selenium nodes? And how can this be done? |
Hi @Doofus100500, the node pools are GKE (Google Kubernetes Engine) node pools. This might require diving a little bit into k8s. We have a GKE node pool which is configured using the k8s autoscaler. Basically, this would be a valid use case for Selenium with KEDA and the k8s autoscaler. Otherwise, just keep as many Chrome instances as possible and that's it 😃 |
I read through all the comments and have something to add.
I saw by default These 2 above will be available in new release images tag and chart version on top of SE 4.18.0 |
Thank you for the new unquestionably useful features, but my issue is not related to the node, but to the distributor. After its reboot, nodes immediately register successfully. In other words, if you start the grid and do not connect any nodes to it, after some time, the distributor stops accepting requests from nodes. |
Yes, I think we can look for an endpoint or signal that reflects the health of the Distributor. Then we can rely on it to implement a liveness probe that restarts the container if it cannot recover by itself within a given period. |
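One possible signal is the Grid `/status` endpoint, whose JSON body carries a `value` object with `ready`, `message`, and a `nodes` list. The sketch below only summarizes that payload; whether `ready` or the node count is the right trigger for a restart is an assumption that this thread does not settle, and the function names are hypothetical.

```python
import json
import urllib.request


def summarize_grid_status(payload: dict) -> dict:
    """Reduce a Grid /status payload to the fields a probe might look at."""
    value = payload.get("value", {})
    nodes = value.get("nodes", [])
    return {
        "ready": bool(value.get("ready", False)),
        "node_count": len(nodes),
        "message": value.get("message", ""),
    }


def fetch_grid_status(grid_url: str, timeout: float = 5.0) -> dict:
    """Fetch <grid_url>/status and summarize it (grid_url is an assumption)."""
    url = grid_url.rstrip("/") + "/status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return summarize_grid_status(json.load(resp))


if __name__ == "__main__":
    # Offline example with a hand-written payload instead of a live Grid.
    sample = {"value": {"ready": False, "message": "not ready", "nodes": []}}
    print(summarize_grid_status(sample))
```

A probe built on this would have to distinguish "no nodes because KEDA scaled to zero" from "no nodes because registration is stuck", for example by only acting when node pods exist but none are registered after a grace period.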
Is it expected that SE_GRID_URL in the nodeProbe.sh script is empty for SE_NODE_GRID_URL pointing to ingress hostname? I get infinite "Node ID: ${NODE_ID} is not found in the Grid. The registration could be in progress.". (Selenium Helm Chart v0.28) |
Hi, may I know how your SE_NODE_GRID_URL value is rendered in your deployment? |
It points to hostname and path defined in the ingress.hostname and ingress.path. Protocol is probably derived from tls.enabled state (I use tls.enabled=true). Example value: https://se-grid.mycompany.com/selenium For version 0.27 everything works fine. |
Thank you for your feedback; there was actually a bug in nodeProbe.sh. In the meantime, you can work around it by disabling the startup probe in the node. I will provide a patch ASAP. |
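When the Grid sits behind an ingress with a sub-path, as in the `https://se-grid.mycompany.com/selenium` example above, a probe has to keep that path when it builds the status URL. The sketch below illustrates the idea only; it is not the logic of nodeProbe.sh, and `grid_status_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit, urlunsplit


def grid_status_url(grid_url: str) -> str:
    """Append /status to a Grid URL while preserving any ingress sub-path."""
    parts = urlsplit(grid_url)
    # Strip a trailing slash so "/selenium/" and "/selenium" give the same result.
    path = parts.path.rstrip("/") + "/status"
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


if __name__ == "__main__":
    print(grid_status_url("https://se-grid.mycompany.com/selenium"))
```

Note that `urllib.parse.urljoin(base, "/status")` would drop the `/selenium` prefix, which is why the path is assembled by hand here.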
@aafeltowicz, chart |
v0.28.1 works like a charm, thx :) BTW, I forgot to mention that I also have to set global.K8S_PUBLIC_IP to the external host DNS name to make this setup work; otherwise nodes have problems communicating with other components. |
I don't know if the original issue has been resolved recently. However, PR #2272 adds a proactive approach: a liveness probe in K8s that checks whether the Distributor is healthy and restarts it if session requests in the queue are not being picked up. |
We are closing this since there have been a few improvements in the grid core implementation and the deployment layer. We recommend using Grid components v4.25.0+. |
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
What happened?
Selenium nodes stop connecting to the grid after a certain period of time (one night), but if the distributor is restarted at that moment, they start connecting again. We are scaling the Nodes deployments with KEDA, 0->N. No useful information could be found in the logs. Could you please suggest where to look?
Command used to start Selenium Grid with Docker
Relevant log output
Operating System
k8s
Docker Selenium version (tag)
4.11.0-20230801