[🐛 Bug]: Tests gets pointed to Terminating PODs in K8 #2155

Closed
kherath17 opened this issue Feb 29, 2024 · 8 comments

@kherath17

What happened?

Background:
I have configured Selenium Grid in Kubernetes with all relevant services. A custom scaler understands incoming requests, creates browser pods on demand, and brings the browsers down after test execution. This is further enhanced with a preStop step in the pod lifecycle that drains the node.

Issue:
When pod deletion is triggered after a test execution completes, the node drain command defined in the pod lifecycle runs and keeps the pod in 'Terminating' status for a few seconds. During that window, the next test gets routed to the terminating pod and starts executing on it; the pod is then deleted halfway through the test, resulting in a session ID unknown exception.

Sample POD file:

"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": pod_name,
"labels": {
"name": pod_name,
"app": "selenium-node-edge"
}
},
"spec": {
"volumes": [
{
"name": "dshm",
"emptyDir": {
"medium": "Memory"
}
}
],
"containers": [
{
"name": "selenium-node-edge",
"image": "selenium/node-edge:latest",
"imagePullPolicy": "IfNotPresent", #Newly Added
"ports": [
{
"containerPort": 5555
}
],
"volumeMounts": [
{
"mountPath": "/dev/shm",
"name": "dshm"
}
],
"env": [
{
"name": "SE_EVENT_BUS_HOST",
"value": "selenium-hub"
},
{
"name": "SE_EVENT_BUS_SUBSCRIBE_PORT",
"value": "4443"
},
{
"name": "SE_EVENT_BUS_PUBLISH_PORT",
"value": "4442"
},
{
"name": "SE_NODE_MAX_SESSIONS",
"value": "1"
},
{
"name": "SE_NODE_GRID_URL",
"value": "https://test.cloud.test.net/qlabv2"
}
],
"resources": {
"requests": { #Newly Added
"memory": "1000Mi",
"cpu": ".1"
},
"limits": {
"memory": "1000Mi",
"cpu": ".2" #0.5
}
},
"lifecycle": {
"preStop": {
"exec": {
"command": [
"/bin/bash",
"-c",
'if [ ! -z "${SE_REGISTRATION_SECRET}" ]; then HEADERS="X-REGISTRATION-SECRET: ${SE_REGISTRATION_SECRET}"; else HEADERS="X-REGISTRATION-SECRET;"; fi; curl -k -X POST http://127.0.0.1:5555/se/grid/node/drain --header "${HEADERS}"; while curl -sfk http://127.0.0.1:5555/status; do sleep 1; done;'
]
}
}
}
}
]
}
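
The manifest appears to be a Python dict (pod_name is a variable), so the browser pods are presumably created programmatically. A minimal sketch of how it could be submitted with the official Kubernetes Python client; the function name, pod_manifest parameter, and namespace are assumptions, not taken from the issue:

    from kubernetes import client, config

    config.load_incluster_config()  # or config.load_kube_config() when running outside the cluster

    def create_node_pod(pod_manifest, namespace="selenium"):
        # Submit the dict above as a Pod; the namespace is an example value.
        api = client.CoreV1Api()
        return api.create_namespaced_pod(namespace=namespace, body=pod_manifest)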

Command used to start Selenium Grid with Docker (or Kubernetes)

I start the Selenium Grid hub through the Deployment file below:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-hub
  labels:
    app: selenium-hub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: selenium-hub
  template:
    metadata:
      labels:
        app: selenium-hub
    spec:
      containers:
      - name: selenium-hub
        #image: selenium/hub:4.13.0-20231004 #4.1,4.13
        image: selenium/hub:latest
        ports:
          - containerPort: 4444
          - containerPort: 4443
          - containerPort: 4442
        resources:
          limits:
            memory: "4000Mi"
            cpu: "2"
          requests:
            memory: "4000Mi"
            cpu: "2"  
        livenessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 30
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /wd/hub/status
            port: 4444
          initialDelaySeconds: 30
          timeoutSeconds: 5
        env:          
          - name: SE_SUB_PATH 
            value: /something
          - name: SE_SESSION_REQUEST_TIMEOUT
            value: "900"
          - name: SE_SESSION_RETRY_INTERVAL
            value: "900"

Relevant log output

Session ID Unknown Exception (Due to POD deletion)

Operating System

Kubernetes - EKS

Docker Selenium version (image tag)

4.17.0

Selenium Grid chart version (chart version)

No response


@kherath17, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

VietND96 commented Mar 6, 2024

In this case, how do you identify which it is: a new session being assigned to a Terminating pod, a session already in progress that is waiting for completion, or another possible case, a new session and a pod scale-down happening in the same second?

@kherath17
Author

@VietND96 that's the problem, I have not set any explicit logic for that. I've let Selenium Grid handle how sessions are assigned to available free nodes, but it routes the new session to the terminating pod, which deletes the node halfway through the test.

@VietND96
Member

VietND96 commented Mar 6, 2024

I think it would be different. Following https://www.selenium.dev/documentation/grid/advanced_features/endpoints/#drain-node, a draining node will not be assigned any new session. I think this is tested well enough via unit tests in the upstream project:

https://github.com/SeleniumHQ/selenium/blob/b9a95a32a2897b3d939ec96e6083ef3b812f75e6/java/test/org/openqa/selenium/grid/distributor/local/LocalDistributorTest.java#L313

https://github.com/SeleniumHQ/selenium/blob/b9a95a32a2897b3d939ec96e6083ef3b812f75e6/java/test/org/openqa/selenium/grid/node/local/LocalNodeTest.java#L132
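
One way to confirm this behaviour in a running Grid is to poll the hub's status endpoint and look at each node's reported availability; a draining node should report DRAINING rather than UP. A minimal sketch, assuming the hub is reachable at http://selenium-hub:4444 on the /wd/hub/status path probed in the Deployment above (with SE_SUB_PATH set the path may differ, and field names may vary across Grid versions):

    import requests

    GRID_STATUS_URL = "http://selenium-hub:4444/wd/hub/status"  # assumed hub address

    def print_node_availability():
        # List every node the distributor knows about and its availability (e.g. UP, DRAINING).
        payload = requests.get(GRID_STATUS_URL, timeout=5).json()
        for node in payload.get("value", {}).get("nodes", []):
            print(node.get("uri"), node.get("availability"))

    if __name__ == "__main__":
        print_node_availability()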

Your situation here could be different. In your tests, a session was created and used across tests (or it served a long-running execution). At some point, when the queue reached 0 (all requests were served), the Scaler changed the number of replicas of the Node deployment, and Node pods were randomly selected to terminate. A pod gets stuck at Terminating because the preStop script waits so that any session in progress can finish gracefully; how long it stays in Terminating depends on terminationGracePeriodSeconds.
If the error is a session ID unknown exception, I think the session in the Node was terminated before the test finished executing. It relates to how long-running executions are handled. You need to identify information such as: which Node pod created the sessionId of the failed test, when it was created, how many sessionIds were created on that Node during the period, when the Node pod was scaled down, how long it stayed in Terminating, and whether that was enough for the test to finish.
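
For reference, the grace period is a standard pod-spec field. In the Python-dict manifest above it would sit alongside "volumes" and "containers" under "spec"; the variable name and the 30-second value below are only illustrative:

    pod_spec_patch = {
        "spec": {
            # How long Kubernetes waits after running preStop/SIGTERM before force-killing
            # the pod; it must cover the drain loop plus any in-flight session.
            "terminationGracePeriodSeconds": 30,  # illustrative value only
        }
    }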

@kherath17
Author

@VietND96 thanks for the insights. To provide more info: our scale-down is triggered when a user calls the driver.quit(); method (detected by watching the pod log values), so this issue comes into play when a user has a sequence like the one below.

driver.quit(); followed on the immediate next line by driver.create();. In this case, the scale-down service triggers pod deletion, which then executes the node drain command. But in the few milliseconds before the node drain takes effect, the driver creation is triggered and starts a browser on that pod, which is then brought down halfway through the script.

FYI: this issue does not occur if I add a thread sleep of around 2000 ms between driver.quit(); and driver.create();.

Note - driver.create(); is not an actual method; it is used here only to demonstrate the problem.
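
For illustration, the back-to-back pattern described above looks roughly like this in Python (the reporter's tests may be in another language; the URL is taken from SE_NODE_GRID_URL in the pod manifest and is assumed to be the client-facing Grid endpoint):

    from selenium import webdriver

    GRID_URL = "https://test.cloud.test.net/qlabv2"  # assumed client-facing Grid URL

    options = webdriver.EdgeOptions()

    driver = webdriver.Remote(command_executor=GRID_URL, options=options)
    # ... test steps ...
    driver.quit()  # the custom scaler spots this in the pod logs and triggers pod deletion

    # time.sleep(2)  # the ~2000 ms sleep mentioned above avoids the race

    driver = webdriver.Remote(command_executor=GRID_URL, options=options)
    # Without the sleep, this new session can land on the pod that is still Terminating,
    # and the test later fails with a session ID unknown error.
    driver.quit()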

@VietND96
Member

VietND96 commented Mar 8, 2024

So in your deployment there is a possible case where closing a session, creating a new session, and the Scaler scaling down Node pods all happen in the same second; it seems unpredictable.
I think in the implementation of the Scaler you could consider adding a period, i.e. a wait time between the last time the Scaler saw the queue empty (or decreasing) and the action of scaling the resource down accordingly. You can refer to the cooldownPeriod concept in KEDA - https://keda.sh/docs/latest/concepts/scaling-deployments/#cooldownperiod
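
A rough sketch of what such a cooldown could look like inside a custom scaler loop; the queue-polling and pod-deletion callables are hypothetical placeholders for the scaler's own logic, and KEDA's cooldownPeriod achieves the same thing declaratively without this code:

    import time

    COOLDOWN_SECONDS = 60  # assumed value; tune to how quickly new sessions follow driver.quit()

    last_activity = time.monotonic()

    def maybe_scale_down(get_queue_size, delete_one_node_pod):
        """Scale down only after the session queue has stayed empty for a full cooldown window.

        get_queue_size and delete_one_node_pod are placeholders for the scaler's own
        logic (polling the Grid's new-session queue, deleting a node pod).
        """
        global last_activity
        if get_queue_size() > 0:
            last_activity = time.monotonic()
            return
        if time.monotonic() - last_activity >= COOLDOWN_SECONDS:
            delete_one_node_pod()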

@kherath17
Author

Closing this to investigate a solution more aligned with the above comment. Thanks @VietND96


This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2024