Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(chart): probe checks for Distributor and Router #2272

Merged
merged 1 commit into from
May 30, 2024

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented May 29, 2024

User description

Thanks for contributing to the Docker-Selenium project!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines, applied for this repository.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

feat(chart): probe checks for Distributor and Router

Motivation and Context

Distributor Probes

By default, startupProbe, readinessProbe and livenessProbe are enabled for this component in both full distributed and Hub-Nodes mode.

There is a script in chart configs/distributor/distributorProbe.sh is loaded into ConfigMap and mounted to the container is used by livenessProbe. You can customize the script via --set-file distributorConfigMap.extraScripts.distributorProbe\.sh=/path/to/your_script.sh or set via YAML values.

There are some reports on a scenario that would be difficult to reproduce or rare: Grid UI is accessible but no nodes can be fetched or registered. Or something like there are few requests in session queue but could not be accepted. After restarting the Distributor, the issue is resolved. Based on that, a proactive approach to do automatic restart whenever detecting it is not healthy via livenessProbe and the condition check is executed. The script queries GraphQL endpoint to get sessionCount, and sessionQueueSize. If the sessionQueueSize is greater than 0 and sessionCount is 0 until the failureThreshold, the Distributor will be restarted. You can adjust the threshold as well as interval via probe settings.

Router Probes

By default, startupProbe, readinessProbe and livenessProbe are enabled for this component in full distributed mode.

There is a script in chart configs/router/routerProbe.sh loaded into ConfigMap and mounted to the container is used by livenessProbe. You can customize the script via --set-file routerConfigMap.extraScripts.routerProbe\.sh=/path/to/your_script.sh or set via YAML values.

The script checks GraphQL endpoint is reachable. If the http_code is not 200 until the failureThreshold, the Router will be restarted. You can adjust the threshold as well as interval via probe settings.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Enhancement, Tests


Description

  • Added liveness probe scripts for distributor and router components.
  • Integrated probe scripts into distributor, router, and hub deployments.
  • Updated Dockerfile to reduce image size by using --no-install-recommends.
  • Fixed ChromeDriver version retrieval for arm64 architecture.
  • Improved test script robustness and platform handling.
  • Updated README with new probe configuration sections.
  • Added configuration for distributor and router probes in values.yaml.

Changes walkthrough 📝

Relevant files
Enhancement
12 files
distributorProbe.sh
Add distributor liveness probe script for health checks. 

charts/selenium-grid/configs/distributor/distributorProbe.sh

  • Added a script for distributor liveness probe.
  • Script checks GraphQL endpoint and session queue size.
  • Restarts distributor if unhealthy conditions are met.
  • +39/-0   
    routerProbe.sh
    Add router liveness probe script for health checks.           

    charts/selenium-grid/configs/router/routerProbe.sh

  • Added a script for router liveness probe.
  • Script checks GraphQL endpoint reachability.
  • Restarts router if endpoint is unreachable.
  • +27/-0   
    generate_release_notes.sh
    Fix ChromeDriver version retrieval and update release notes.

    generate_release_notes.sh

  • Fixed ChromeDriver version retrieval for arm64 architecture.
  • Updated release notes generation to include ChromiumDriver version.
  • +3/-4     
    _nameHelpers.tpl
    Add templates for distributor and router ConfigMap names.

    charts/selenium-grid/templates/_nameHelpers.tpl

    • Added templates for distributor and router ConfigMap full names.
    +14/-0   
    Dockerfile
    Optimize Dockerfile to reduce image size.                               

    NodeChromium/Dockerfile

    • Added --no-install-recommends flag to reduce image size.
    +2/-2     
    distributor-configmap.yaml
    Add ConfigMap template for distributor probe scripts.       

    charts/selenium-grid/templates/distributor-configmap.yaml

    • Added ConfigMap template for distributor probe scripts.
    +31/-0   
    distributor-deployment.yaml
    Integrate distributor probe scripts into deployment configuration.

    charts/selenium-grid/templates/distributor-deployment.yaml

  • Integrated distributor probe scripts into deployment.
  • Added environment variables and volume mounts for probes.
  • +63/-0   
    hub-deployment.yaml
    Integrate distributor probe scripts into hub deployment configuration.

    charts/selenium-grid/templates/hub-deployment.yaml

  • Integrated distributor probe scripts into hub deployment.
  • Added environment variables and volume mounts for probes.
  • +15/-0   
    router-configmap.yaml
    Add ConfigMap template for router probe scripts.                 

    charts/selenium-grid/templates/router-configmap.yaml

    • Added ConfigMap template for router probe scripts.
    +30/-0   
    router-deployment.yaml
    Integrate router probe scripts into deployment configuration.

    charts/selenium-grid/templates/router-deployment.yaml

  • Integrated router probe scripts into deployment.
  • Added environment variables and volume mounts for probes.
  • +13/-0   
    server-configmap.yaml
    Add environment variable configuration to server ConfigMap.

    charts/selenium-grid/templates/server-configmap.yaml

    • Added environment variable configuration for server.
    +3/-0     
    values.yaml
    Add configuration for distributor and router probes.         

    charts/selenium-grid/values.yaml

  • Added configuration for distributor and router probes.
  • Added default liveness probe method for components.
  • +61/-0   
    Tests
    2 files
    chart_test.sh
    Improve test script robustness and platform handling.       

    tests/charts/make/chart_test.sh

  • Added waiting mechanism for terminating pods.
  • Fixed platform variable usage in test scripts.
  • +17/-4   
    base-resources-values.yaml
    Add liveness probe failure threshold for router and distributor.

    tests/charts/ci/base-resources-values.yaml

    • Added liveness probe failure threshold for router and distributor.
    +6/-0     
    Bug fix
    1 files
    __init__.py
    Fix typo in TEST_PLATFORMS environment variable.                 

    tests/SeleniumTests/init.py

    • Corrected typo in TEST_PLATFORMS default value.
    +1/-1     
    Documentation
    1 files
    README.md
    Update README with distributor and router probe configurations.

    charts/selenium-grid/README.md

  • Updated README with new probe configuration sections.
  • Added details for distributor and router probes.
  • +38/-16 

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link

    PR Review 🔍

    ⏱️ Estimated effort to review [1-5]

    4, due to the extensive changes across multiple scripts and configurations, including new probe checks and environment variable adjustments. The complexity of the changes, especially around the probe scripts and deployment configurations, requires careful review to ensure they function as intended without side effects.

    🧪 Relevant tests

    No

    ⚡ Possible issues

    Possible Bug: The use of environment variables like ROUTER_USERNAME and ROUTER_PASSWORD in the probe scripts without proper validation or default values could lead to issues if these variables are not set, potentially causing the script to fail or behave unexpectedly.

    Performance Concern: The probe scripts make multiple external calls using curl which could impact the performance if not properly managed, especially under high load or in a production environment.

    🔒 Security concerns

    No

    Copy link

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Possible issue
    Add a timeout mechanism to the loop that waits for pods to terminate to avoid infinite loops

    Add a timeout mechanism to the loop that waits for pods to terminate, to avoid potential
    infinite loops.

    tests/charts/make/chart_test.sh [262-271]

    -while true; do
    +timeout=300
    +elapsed=0
    +while [ $elapsed -lt $timeout ]; do
       terminating_pods=$(kubectl get pods -n ${SELENIUM_NAMESPACE} --no-headers | grep Terminating | wc -l)
       if [ $terminating_pods -eq 0 ]; then
         echo "No pods in 'Terminating' state."
         break
       else
         echo "Waiting for $terminating_pods pod(s) to terminate..."
         sleep 2
    +    elapsed=$((elapsed + 2))
       fi
     done
    +if [ $elapsed -ge $timeout ]; then
    +  echo "Timeout reached while waiting for pods to terminate."
    +fi
     
    Suggestion importance[1-10]: 9

    Why: Implementing a timeout mechanism is critical to avoid potential infinite loops in the script, which can lead to resource exhaustion or stalled operations, especially in production environments.

    9
    Add quotes around a variable to handle cases where the variable might be empty or contain spaces

    Add quotes around the variable SELENIUM_GRID_HOST to prevent potential issues if the
    variable is empty or contains spaces.

    tests/charts/make/chart_test.sh [31]

    -HOSTNAME_ADDRESS=${HOSTNAME_ADDRESS:-${SELENIUM_GRID_HOST}}
    +HOSTNAME_ADDRESS=${HOSTNAME_ADDRESS:-"${SELENIUM_GRID_HOST}"}
     
    Suggestion importance[1-10]: 8

    Why: Adding quotes around the variable SELENIUM_GRID_HOST is crucial to prevent script errors in cases where the variable is empty or contains spaces, which could lead to unexpected behavior or security issues.

    8
    Add quotes around variables to handle cases where the variables might be empty or contain spaces

    Add double quotes around the variable ${ROUTER_USERNAME} and ${ROUTER_PASSWORD} to prevent
    potential issues if the variables are empty or contain spaces.

    charts/selenium-grid/configs/distributor/distributorProbe.sh [8]

    -if [ -n ${ROUTER_USERNAME} ] && [ -n ${ROUTER_PASSWORD} ]; then
    +if [ -n "${ROUTER_USERNAME}" ] && [ -n "${ROUTER_PASSWORD}" ]; then
     
    Suggestion importance[1-10]: 8

    Why: Adding quotes around the variables ${ROUTER_USERNAME} and ${ROUTER_PASSWORD} is important to ensure that the script handles cases where these variables might be empty or contain spaces, preventing script failures or unintended behavior.

    8
    Add a timeoutSeconds value to the livenessProbe configuration to prevent indefinite hangs

    The livenessProbe configuration for the distributor should include a timeoutSeconds value
    to ensure that the probe does not hang indefinitely.

    charts/selenium-grid/templates/distributor-deployment.yaml [132-133]

     exec:
       command: ["bash", "-c", "{{ $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"]
    +  timeoutSeconds: 10
     
    Suggestion importance[1-10]: 7

    Why: The suggestion is valid as it addresses a potential issue where the liveness probe could hang indefinitely. However, the PR already includes timeoutSeconds in the probe configurations, making this suggestion slightly redundant but still valuable for clarity.

    7
    Add a default value for extraScriptsDirectory to avoid potential issues with undefined values

    To ensure consistency and avoid potential issues with missing or incorrect values,
    consider adding a default value for $.Values.distributorConfigMap.extraScriptsDirectory in
    case it is not defined.

    charts/selenium-grid/templates/hub-deployment.yaml [89]

    -command: ["bash", "-c", "{{ $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"]
    +command: ["bash", "-c", "{{ default "/default/path" $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"]
     
    Suggestion importance[1-10]: 7

    Why: The suggestion to add a default value for extraScriptsDirectory is valid and improves robustness by handling cases where the variable might not be defined.

    7
    Maintainability
    Remove the duplicate entry for global.seleniumGrid.imagePullSecret in the global configuration table

    The entry for global.seleniumGrid.imagePullSecret is duplicated in the global
    configuration table. This redundancy can cause confusion and should be removed.

    charts/selenium-grid/README.md [312-313]

    -| `global.seleniumGrid.imagePullSecret`               | `""`                    | Pull secret to be used for all images       |
     | `global.seleniumGrid.imagePullSecret`               | `""`                    | Pull secret to be used for all images       |
     
    Suggestion importance[1-10]: 8

    Why: The suggestion correctly identifies and resolves a redundancy issue in the documentation, which improves maintainability and reduces confusion.

    8
    Best practice
    Change the default value of extraScripts to null to avoid potential script loading issues

    The extraScripts field for distributorConfigMap and routerConfigMap should have a default
    value of null instead of an empty string to avoid potential issues with script loading.

    charts/selenium-grid/values.yaml [148-149]

     extraScripts:
    -  distributorProbe.sh: ""
    +  distributorProbe.sh: null
     
    Suggestion importance[1-10]: 6

    Why: The suggestion to use null instead of an empty string for extraScripts could potentially avoid issues in script handling. This is a good practice but not critical.

    6
    Adjust indentation for env values to improve readability and consistency

    To ensure that the env values are properly indented and formatted, consider using nindent
    4 instead of nindent 2 for better readability and consistency with other YAML structures.

    charts/selenium-grid/templates/server-configmap.yaml [16-18]

     {{- with $.Values.serverConfigMap.env }}
    -  {{- toYaml . | nindent 2 }}
    +  {{- toYaml . | nindent 4 }}
     {{- end }}
     
    Suggestion importance[1-10]: 5

    Why: Adjusting the indentation for better readability and consistency is a minor improvement, but it does enhance the maintainability of the YAML configuration.

    5
    Performance
    Combine apt-get commands to reduce image size and improve security

    To reduce the size of the Docker image and improve security, consider combining the
    apt-get commands into a single RUN statement and cleaning up unnecessary files in one
    step.

    NodeChromium/Dockerfile [10-14]

     RUN echo "deb http://deb.debian.org/debian/ sid main" >> /etc/apt/sources.list \
       && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0E98404D386FA1D9 6ED0E7B82643E131 \
       && apt-get update -qqy \
    -  && apt-get -qqy --no-install-recommends install chromium \
    +  && apt-get -qqy --no-install-recommends install chromium chromium-driver \
       && rm -rf /var/lib/apt/lists/* /var/cache/apt/*
     
    Suggestion importance[1-10]: 6

    Why: Combining apt-get commands into a single RUN statement is a good practice for reducing layers in Docker images, which can enhance performance and security.

    6
    Clarity
    Clarify the description of the distributorProbe.sh script to specify its use for the livenessProbe

    The description for the distributorProbe.sh script in the Distributor Probes section
    should clarify that the script is used for the livenessProbe.

    charts/selenium-grid/README.md [424]

    -There is a script in chart `configs/distributor/distributorProbe.sh` is loaded into ConfigMap and mounted to the container is used by `livenessProbe`.
    +There is a script in chart `configs/distributor/distributorProbe.sh` that is loaded into ConfigMap and mounted to the container, which is used by the `livenessProbe`.
     
    Suggestion importance[1-10]: 5

    Why: The suggestion improves clarity in the documentation by specifying the use of the script. However, the existing description in the PR is already quite clear about the script's purpose, making this improvement minor.

    5

    @VietND96 VietND96 merged commit f2280e9 into SeleniumHQ:trunk May 30, 2024
    27 checks passed
    @VietND96 VietND96 added this to the 4.22 milestone May 30, 2024
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    1 participant