[🐛 Bug]: Tests gets pointed to Terminating PODs in K8 #2155

Closed
kherath17 opened this issue Feb 29, 2024 · 8 comments

@kherath17

What happened?

Background:
I have configured Selenium Grid in Kubernetes with all relevant services. A custom scaler understands incoming requests, creates browser pods on demand, and brings the browsers down after test execution. This is further enhanced with a preStop step in the pod lifecycle that drains the node.

Issue:
When pod deletion is triggered after a test execution completes, the node drain command defined in the pod lifecycle runs and keeps the pod in 'Terminating' status for a few seconds. During that window, the next test gets routed to the terminating pod and starts executing on it; the pod is then deleted halfway through the test, resulting in a session ID unknown exception.

Sample POD file:

"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": pod_name,
"labels": {
"name": pod_name,
"app": "selenium-node-edge"
}
},
"spec": {
"volumes": [
{
"name": "dshm",
"emptyDir": {
"medium": "Memory"
}
}
],
"containers": [
{
"name": "selenium-node-edge",
"image": "selenium/node-edge:latest",
"imagePullPolicy": "IfNotPresent", #Newly Added
"ports": [
{
"containerPort": 5555
}
],
"volumeMounts": [
{
"mountPath": "/dev/shm",
"name": "dshm"
}
],
"env": [
{
"name": "SE_EVENT_BUS_HOST",
"value": "selenium-hub"
},
{
"name": "SE_EVENT_BUS_SUBSCRIBE_PORT",
"value": "4443"
},
{
"name": "SE_EVENT_BUS_PUBLISH_PORT",
"value": "4442"
},
{
"name": "SE_NODE_MAX_SESSIONS",
"value": "1"
},
{
"name": "SE_NODE_GRID_URL",
"value": "https://test.cloud.test.net/qlabv2"
}
],
"resources": {
"requests": { #Newly Added
"memory": "1000Mi",
"cpu": ".1"
},
"limits": {
"memory": "1000Mi",
"cpu": ".2" #0.5
}
},
"lifecycle": {
"preStop": {
"exec": {
"command": [
"/bin/bash",
"-c",
'if [ ! -z "${SE_REGISTRATION_SECRET}" ]; then HEADERS="X-REGISTRATION-SECRET: ${SE_REGISTRATION_SECRET}"; else HEADERS="X-REGISTRATION-SECRET;"; fi; curl -k -X POST http://127.0.0.1:5555/se/grid/node/drain --header "${HEADERS}"; while curl -sfk http://127.0.0.1:5555/status; do sleep 1; done;'
]
}
}
}
}
]
}
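
The manifest appears to be a Python dict (pod_name is a variable), so the browser pods are presumably created programmatically. A minimal sketch of how it could be submitted with the official Kubernetes Python client; the function name, pod_manifest parameter, and namespace are assumptions, not taken from the issue:

    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster

    def create_node_pod(pod_manifest, namespace="selenium"):
        # Submit the dict above as a Pod; the namespace is an example value.
        api = client.CoreV1Api()
        return api.create_namespaced_pod(namespace=namespace, body=pod_manifest)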

Command used to start Selenium Grid with Docker (or Kubernetes)

I start the Selenium Grid hub through the Deployment file below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-hub
  labels:
    app: selenium-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-hub
  template:
    metadata:
      labels:
        app: selenium-hub
    spec:
      containers:
      - name: selenium-hub
        #image: selenium/hub:4.13.0-20231004 #4.1,4.13
        image: selenium/hub:latest
        ports:
          - containerPort: 4444
          - containerPort: 4443
          - containerPort: 4442
        resources:
          limits:
            memory: "4000Mi"
            cpu: "2"
          requests:
            memory: "4000Mi"
            cpu: "2"  
        livenessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 30
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 30
          timeoutSeconds: 5
        env:          
          - name: SE_SUB_PATH 
            value: /something
          - name: SE_SESSION_REQUEST_TIMEOUT
            value: "900"
          - name: SE_SESSION_RETRY_INTERVAL
            value: "900"

Relevant log output

Session ID Unknown Exception (Due to POD deletion)

Operating System

Kubernetes - EKS

Docker Selenium version (image tag)

4.17.0

Selenium Grid chart version (chart version)

No response


@kherath17, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Mar 6, 2024

In this case, how do you identify which it is: a new session being assigned to a Terminating pod, a session already in progress that is waiting for completion, or another possible case, a new session and a pod scale-down happening in the same second?

@kherath17
Author

@VietND96 that's the problem, I have not set any explicit logic for that. I've let Selenium Grid handle how sessions are assigned to available free nodes, but it routes the new session to the terminating pod, which deletes the node halfway through the test.

@VietND96
Member

VietND96 commented Mar 6, 2024

I think it would be different. Following https://www.selenium.dev/documentation/grid/advanced_features/endpoints/#drain-node, a draining node will not be assigned any new session. I think this is tested well enough via unit tests in the upstream project:

https://github.com/SeleniumHQ/selenium/blob/b9a95a32a2897b3d939ec96e6083ef3b812f75e6/java/test/org/openqa/selenium/grid/distributor/local/LocalDistributorTest.java#L313

https://github.com/SeleniumHQ/selenium/blob/b9a95a32a2897b3d939ec96e6083ef3b812f75e6/java/test/org/openqa/selenium/grid/node/local/LocalNodeTest.java#L132
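
One way to confirm this behaviour in a running Grid is to poll the hub's status endpoint and look at each node's reported availability; a draining node should report DRAINING rather than UP. A minimal sketch, assuming the hub is reachable at http://selenium-hub:4444 on the /wd/hub/status path probed in the Deployment above (with SE_SUB_PATH set the path may differ, and field names may vary across Grid versions):

    import requests

    GRID_STATUS_URL = "http://selenium-hub:4444/wd/hub/status"  # assumed hub address

    def print_node_availability():
        # List every node the distributor knows about and its availability (e.g. UP, DRAINING).
        payload = requests.get(GRID_STATUS_URL, timeout=5).json()
        for node in payload.get("value", {}).get("nodes", []):
            print(node.get("uri"), node.get("availability"))

    if __name__ == "__main__":
        print_node_availability()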

Your situation here could be different. In your tests, a session was created and used across tests (or it served a long-running execution). At some point, when the queue reached 0 (all requests were served), the Scaler changed the number of replicas of the Node deployment, and Node pods were randomly selected to terminate. A pod gets stuck at Terminating because the preStop script waits so that any session in progress can finish gracefully; how long it stays in Terminating depends on terminationGracePeriodSeconds.
If the error is a session ID unknown exception, I think the session in the Node was terminated before the test finished executing. It relates to how long-running executions are handled. You need to identify information such as: which Node pod created the sessionId of the failed test, when it was created, how many sessionIds were created on that Node during the period, when the Node pod was scaled down, how long it stayed in Terminating, and whether that was enough for the test to finish.
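
For reference, the grace period is a standard pod-spec field. In the Python-dict manifest above it would sit alongside "volumes" and "containers" under "spec"; the variable name and the 30-second value below are only illustrative:

    pod_spec_patch = {
        "spec": {
            # How long Kubernetes waits after running preStop/SIGTERM before force-killing
            # the pod; it must cover the drain loop plus any in-flight session.
            "terminationGracePeriodSeconds": 30,  # illustrative value only
        }
    }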

@kherath17
Author

@VietND96 thanks for the insights. To provide more info: our scale-down is triggered when a user calls the driver.quit(); method (detected by watching the pod log values), so this issue comes into play when a user has a sequence like the one below.

driver.quit(); followed on the immediate next line by driver.create();. In this case, the scale-down service triggers pod deletion, which then executes the node drain command. But in the few milliseconds before the node drain takes effect, the driver creation is triggered and starts a browser on that pod, which is then brought down halfway through the script.

FYI: this issue does not occur if I add a thread sleep of around 2000 ms between driver.quit(); and driver.create();.

Note - driver.create(); is not an actual method; it is used here only to demonstrate the problem.
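
For illustration, the back-to-back pattern described above looks roughly like this in Python (the reporter's tests may be in another language; the URL is taken from SE_NODE_GRID_URL in the pod manifest and is assumed to be the client-facing Grid endpoint):

    from selenium import webdriver

    GRID_URL = "https://test.cloud.test.net/qlabv2"  # assumed client-facing Grid URL

    options = webdriver.EdgeOptions()

    driver = webdriver.Remote(command_executor=GRID_URL, options=options)
    # ... test steps ...
    driver.quit()  # the custom scaler spots this in the pod logs and triggers pod deletion

    # time.sleep(2)  # the ~2000 ms sleep mentioned above avoids the race

    driver = webdriver.Remote(command_executor=GRID_URL, options=options)
    # Without the sleep, this new session can land on the pod that is still Terminating,
    # and the test later fails with a session ID unknown error.
    driver.quit()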

@VietND96
Member

VietND96 commented Mar 8, 2024

So in your deployment there is a possible case where closing a session, creating a new session, and the Scaler scaling down Node pods all happen in the same second; it seems unpredictable.
I think in the implementation of the Scaler you could consider adding a period, i.e. a wait time between the last time the Scaler saw the queue empty (or decreasing) and the action of scaling the resource down accordingly. You can refer to the cooldownPeriod concept in KEDA - https://keda.sh/docs/latest/concepts/scaling-deployments/#cooldownperiod
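
A rough sketch of what such a cooldown could look like inside a custom scaler loop; the queue-polling and pod-deletion callables are hypothetical placeholders for the scaler's own logic, and KEDA's cooldownPeriod achieves the same thing declaratively without this code:

    import time

    COOLDOWN_SECONDS = 60  # assumed value; tune to how quickly new sessions follow driver.quit()

    last_activity = time.monotonic()

    def maybe_scale_down(get_queue_size, delete_one_node_pod):
        """Scale down only after the session queue has stayed empty for a full cooldown window.

        get_queue_size and delete_one_node_pod are placeholders for the scaler's own
        logic (polling the Grid's new-session queue, deleting a node pod).
        """
        global last_activity
        if get_queue_size() > 0:
            last_activity = time.monotonic()
            return
        if time.monotonic() - last_activity >= COOLDOWN_SECONDS:
            delete_one_node_pod()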

@kherath17
Author

Closing this to investigate a solution more aligned with the above comment. Thanks @VietND96


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2024