feat(chart): probe checks for Distributor and Router #2272

VietND96 · 2024-05-29T11:29:05Z

User description

Thanks for contributing to the Docker-Selenium project!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines, applied for this repository.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

feat(chart): probe checks for Distributor and Router

Motivation and Context

Distributor Probes

By default, startupProbe, readinessProbe and livenessProbe are enabled for this component in both full distributed and Hub-Nodes mode.

There is a script in chart configs/distributor/distributorProbe.sh is loaded into ConfigMap and mounted to the container is used by livenessProbe. You can customize the script via --set-file distributorConfigMap.extraScripts.distributorProbe\.sh=/path/to/your_script.sh or set via YAML values.

There are some reports on a scenario that would be difficult to reproduce or rare: Grid UI is accessible but no nodes can be fetched or registered. Or something like there are few requests in session queue but could not be accepted. After restarting the Distributor, the issue is resolved. Based on that, a proactive approach to do automatic restart whenever detecting it is not healthy via livenessProbe and the condition check is executed. The script queries GraphQL endpoint to get sessionCount, and sessionQueueSize. If the sessionQueueSize is greater than 0 and sessionCount is 0 until the failureThreshold, the Distributor will be restarted. You can adjust the threshold as well as interval via probe settings.

Router Probes

By default, startupProbe, readinessProbe and livenessProbe are enabled for this component in full distributed mode.

There is a script in chart configs/router/routerProbe.sh loaded into ConfigMap and mounted to the container is used by livenessProbe. You can customize the script via --set-file routerConfigMap.extraScripts.routerProbe\.sh=/path/to/your_script.sh or set via YAML values.

The script checks GraphQL endpoint is reachable. If the http_code is not 200 until the failureThreshold, the Router will be restarted. You can adjust the threshold as well as interval via probe settings.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist

I have read the contributing document.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

PR Type

Enhancement, Tests

Description

Added liveness probe scripts for distributor and router components.
Integrated probe scripts into distributor, router, and hub deployments.
Updated Dockerfile to reduce image size by using --no-install-recommends.
Fixed ChromeDriver version retrieval for arm64 architecture.
Improved test script robustness and platform handling.
Updated README with new probe configuration sections.
Added configuration for distributor and router probes in values.yaml.

Changes walkthrough 📝

Relevant files

Enhancement

12 files

distributorProbe.sh `Add distributor liveness probe script for health checks.` charts/selenium-grid/configs/distributor/distributorProbe.sh Added a script for distributor liveness probe. Script checks GraphQL endpoint and session queue size. Restarts distributor if unhealthy conditions are met.	+39/-0
routerProbe.sh `Add router liveness probe script for health checks.` charts/selenium-grid/configs/router/routerProbe.sh Added a script for router liveness probe. Script checks GraphQL endpoint reachability. Restarts router if endpoint is unreachable.	+27/-0
generate_release_notes.sh `Fix ChromeDriver version retrieval and update release notes.` generate_release_notes.sh Fixed ChromeDriver version retrieval for arm64 architecture. Updated release notes generation to include ChromiumDriver version.	+3/-4
_nameHelpers.tpl `Add templates for distributor and router ConfigMap names.` charts/selenium-grid/templates/_nameHelpers.tpl Added templates for distributor and router ConfigMap full names.	+14/-0
Dockerfile `Optimize Dockerfile to reduce image size.` NodeChromium/Dockerfile Added `--no-install-recommends` flag to reduce image size.	+2/-2
distributor-configmap.yaml `Add ConfigMap template for distributor probe scripts.` charts/selenium-grid/templates/distributor-configmap.yaml Added ConfigMap template for distributor probe scripts.	+31/-0
distributor-deployment.yaml `Integrate distributor probe scripts into deployment configuration.` charts/selenium-grid/templates/distributor-deployment.yaml Integrated distributor probe scripts into deployment. Added environment variables and volume mounts for probes.	+63/-0
hub-deployment.yaml `Integrate distributor probe scripts into hub deployment configuration.` charts/selenium-grid/templates/hub-deployment.yaml Integrated distributor probe scripts into hub deployment. Added environment variables and volume mounts for probes.	+15/-0
router-configmap.yaml `Add ConfigMap template for router probe scripts.` charts/selenium-grid/templates/router-configmap.yaml Added ConfigMap template for router probe scripts.	+30/-0
router-deployment.yaml `Integrate router probe scripts into deployment configuration.` charts/selenium-grid/templates/router-deployment.yaml Integrated router probe scripts into deployment. Added environment variables and volume mounts for probes.	+13/-0
server-configmap.yaml `Add environment variable configuration to server ConfigMap.` charts/selenium-grid/templates/server-configmap.yaml Added environment variable configuration for server.	+3/-0
values.yaml `Add configuration for distributor and router probes.` charts/selenium-grid/values.yaml Added configuration for distributor and router probes. Added default liveness probe method for components.	+61/-0

Tests

2 files

chart_test.sh `Improve test script robustness and platform handling.` tests/charts/make/chart_test.sh Added waiting mechanism for terminating pods. Fixed platform variable usage in test scripts.	+17/-4
base-resources-values.yaml `Add liveness probe failure threshold for router and distributor.` tests/charts/ci/base-resources-values.yaml Added liveness probe failure threshold for router and distributor.	+6/-0

Bug fix

1 files

__init__.py `Fix typo in TEST_PLATFORMS environment variable.` tests/SeleniumTests/init.py Corrected typo in `TEST_PLATFORMS` default value.	+1/-1

Documentation

1 files

README.md `Update README with distributor and router probe configurations.` charts/selenium-grid/README.md Updated README with new probe configuration sections. Added details for distributor and router probes.	+38/-16

💡 PR-Agent usage:
Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

qodo-merge-pro · 2024-05-29T11:29:56Z

PR Review 🔍

⏱️ Estimated effort to review [1-5]	4, due to the extensive changes across multiple scripts and configurations, including new probe checks and environment variable adjustments. The complexity of the changes, especially around the probe scripts and deployment configurations, requires careful review to ensure they function as intended without side effects.
🧪 Relevant tests	No
⚡ Possible issues	Possible Bug: The use of environment variables like `ROUTER_USERNAME` and `ROUTER_PASSWORD` in the probe scripts without proper validation or default values could lead to issues if these variables are not set, potentially causing the script to fail or behave unexpectedly.
⚡ Possible issues	Performance Concern: The probe scripts make multiple external calls using `curl` which could impact the performance if not properly managed, especially under high load or in a production environment.
🔒 Security concerns	No

qodo-merge-pro · 2024-05-29T11:30:32Z

PR Code Suggestions ✨

Category	Suggestion	Score
Possible issue	Add a timeout mechanism to the loop that waits for pods to terminate to avoid infinite loops Add a timeout mechanism to the loop that waits for pods to terminate, to avoid potential infinite loops. tests/charts/make/chart_test.sh [262-271] -while true; do +timeout=300 +elapsed=0 +while [ $elapsed -lt $timeout ]; do terminating_pods=$(kubectl get pods -n ${SELENIUM_NAMESPACE} --no-headers \| grep Terminating \| wc -l) if [ $terminating_pods -eq 0 ]; then echo "No pods in 'Terminating' state." break else echo "Waiting for $terminating_pods pod(s) to terminate..." sleep 2 + elapsed=$((elapsed + 2)) fi done +if [ $elapsed -ge $timeout ]; then + echo "Timeout reached while waiting for pods to terminate." +fi Suggestion importance[1-10]: 9 Why: Implementing a timeout mechanism is critical to avoid potential infinite loops in the script, which can lead to resource exhaustion or stalled operations, especially in production environments.	9
	Add quotes around a variable to handle cases where the variable might be empty or contain spaces Add quotes around the variable `SELENIUM_GRID_HOST` to prevent potential issues if the variable is empty or contains spaces. tests/charts/make/chart_test.sh [31] -HOSTNAME_ADDRESS=${HOSTNAME_ADDRESS:-${SELENIUM_GRID_HOST}} +HOSTNAME_ADDRESS=${HOSTNAME_ADDRESS:-"${SELENIUM_GRID_HOST}"} Suggestion importance[1-10]: 8 Why: Adding quotes around the variable `SELENIUM_GRID_HOST` is crucial to prevent script errors in cases where the variable is empty or contains spaces, which could lead to unexpected behavior or security issues.	8
	Add quotes around variables to handle cases where the variables might be empty or contain spaces Add double quotes around the variable `${ROUTER_USERNAME}` and `${ROUTER_PASSWORD}` to prevent potential issues if the variables are empty or contain spaces. charts/selenium-grid/configs/distributor/distributorProbe.sh [8] -if [ -n ${ROUTER_USERNAME} ] && [ -n ${ROUTER_PASSWORD} ]; then +if [ -n "${ROUTER_USERNAME}" ] && [ -n "${ROUTER_PASSWORD}" ]; then Suggestion importance[1-10]: 8 Why: Adding quotes around the variables `${ROUTER_USERNAME}` and `${ROUTER_PASSWORD}` is important to ensure that the script handles cases where these variables might be empty or contain spaces, preventing script failures or unintended behavior.	8
	Add a `timeoutSeconds` value to the `livenessProbe` configuration to prevent indefinite hangs The `livenessProbe` configuration for the distributor should include a `timeoutSeconds` value to ensure that the probe does not hang indefinitely. charts/selenium-grid/templates/distributor-deployment.yaml [132-133] exec: command: ["bash", "-c", "{{ $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"] + timeoutSeconds: 10 Suggestion importance[1-10]: 7 Why: The suggestion is valid as it addresses a potential issue where the liveness probe could hang indefinitely. However, the PR already includes `timeoutSeconds` in the probe configurations, making this suggestion slightly redundant but still valuable for clarity.	7
	Add a default value for `extraScriptsDirectory` to avoid potential issues with undefined values To ensure consistency and avoid potential issues with missing or incorrect values, consider adding a default value for `$.Values.distributorConfigMap.extraScriptsDirectory` in case it is not defined. charts/selenium-grid/templates/hub-deployment.yaml [89] -command: ["bash", "-c", "{{ $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"] +command: ["bash", "-c", "{{ default "/default/path" $.Values.distributorConfigMap.extraScriptsDirectory }}/distributorProbe.sh Liveness {{ include "seleniumGrid.probe.stdout" $ }}"] Suggestion importance[1-10]: 7 Why: The suggestion to add a default value for `extraScriptsDirectory` is valid and improves robustness by handling cases where the variable might not be defined.	7
Maintainability	Remove the duplicate entry for `global.seleniumGrid.imagePullSecret` in the global configuration table The entry for `global.seleniumGrid.imagePullSecret` is duplicated in the global configuration table. This redundancy can cause confusion and should be removed. charts/selenium-grid/README.md [312-313] -\| `global.seleniumGrid.imagePullSecret` \| `""` \| Pull secret to be used for all images \| \| `global.seleniumGrid.imagePullSecret` \| `""` \| Pull secret to be used for all images \| Suggestion importance[1-10]: 8 Why: The suggestion correctly identifies and resolves a redundancy issue in the documentation, which improves maintainability and reduces confusion.	8
Best practice	Change the default value of `extraScripts` to `null` to avoid potential script loading issues The `extraScripts` field for `distributorConfigMap` and `routerConfigMap` should have a default value of `null` instead of an empty string to avoid potential issues with script loading. charts/selenium-grid/values.yaml [148-149] extraScripts: - distributorProbe.sh: "" + distributorProbe.sh: null Suggestion importance[1-10]: 6 Why: The suggestion to use `null` instead of an empty string for `extraScripts` could potentially avoid issues in script handling. This is a good practice but not critical.	6
Best practice	Adjust indentation for `env` values to improve readability and consistency To ensure that the `env` values are properly indented and formatted, consider using `nindent` `4` instead of `nindent 2` for better readability and consistency with other YAML structures. charts/selenium-grid/templates/server-configmap.yaml [16-18] {{- with $.Values.serverConfigMap.env }} - {{- toYaml . \| nindent 2 }} + {{- toYaml . \| nindent 4 }} {{- end }} Suggestion importance[1-10]: 5 Why: Adjusting the indentation for better readability and consistency is a minor improvement, but it does enhance the maintainability of the YAML configuration.	5
Performance	Combine `apt-get` commands to reduce image size and improve security To reduce the size of the Docker image and improve security, consider combining the `apt-get` commands into a single `RUN` statement and cleaning up unnecessary files in one step. NodeChromium/Dockerfile [10-14] RUN echo "deb http://deb.debian.org/debian/ sid main" >> /etc/apt/sources.list \ && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 0E98404D386FA1D9 6ED0E7B82643E131 \ && apt-get update -qqy \ - && apt-get -qqy --no-install-recommends install chromium \ + && apt-get -qqy --no-install-recommends install chromium chromium-driver \ && rm -rf /var/lib/apt/lists/* /var/cache/apt/* Suggestion importance[1-10]: 6 Why: Combining `apt-get` commands into a single `RUN` statement is a good practice for reducing layers in Docker images, which can enhance performance and security.	6
Clarity	Clarify the description of the `distributorProbe.sh` script to specify its use for the `livenessProbe` The description for the `distributorProbe.sh` script in the `Distributor Probes` section should clarify that the script is used for the `livenessProbe`. charts/selenium-grid/README.md [424] -There is a script in chart `configs/distributor/distributorProbe.sh` is loaded into ConfigMap and mounted to the container is used by `livenessProbe`. +There is a script in chart `configs/distributor/distributorProbe.sh` that is loaded into ConfigMap and mounted to the container, which is used by the `livenessProbe`. Suggestion importance[1-10]: 5 Why: The suggestion improves clarity in the documentation by specifying the use of the script. However, the existing description in the PR is already quite clear about the script's purpose, making this improvement minor.	5

Signed-off-by: Viet Nguyen Duc <[email protected]>

qodo-merge-pro bot added enhancement tests labels May 29, 2024

qodo-merge-pro bot added the Review effort [1-5]: 4 label May 29, 2024

VietND96 force-pushed the trunk branch from 480732c to 2728e61 Compare May 29, 2024 11:32

feat(chart): probe checks for Distributor and Router

75c0470

Signed-off-by: Viet Nguyen Duc <[email protected]>

VietND96 force-pushed the trunk branch from 2728e61 to 75c0470 Compare May 29, 2024 20:30

VietND96 merged commit f2280e9 into SeleniumHQ:trunk May 30, 2024
27 checks passed

VietND96 added this to the 4.22 milestone May 30, 2024

VietND96 mentioned this pull request May 30, 2024

[🐛 Bug]: Selenium nodes stop connecting to the grid after some time has passed. #1913

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(chart): probe checks for Distributor and Router #2272

feat(chart): probe checks for Distributor and Router #2272

VietND96 commented May 29, 2024 •

edited by qodo-merge-pro bot

Loading

qodo-merge-pro bot commented May 29, 2024

qodo-merge-pro bot commented May 29, 2024

feat(chart): probe checks for Distributor and Router #2272

feat(chart): probe checks for Distributor and Router #2272

Conversation

VietND96 commented May 29, 2024 • edited by qodo-merge-pro bot Loading

User description

Description

Motivation and Context

Distributor Probes

Router Probes

Types of changes

Checklist

PR Type

Description

Changes walkthrough 📝

qodo-merge-pro bot commented May 29, 2024

PR Review 🔍

qodo-merge-pro bot commented May 29, 2024

PR Code Suggestions ✨

VietND96 commented May 29, 2024 •

edited by qodo-merge-pro bot

Loading