Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[grid] Exclude status DRAINING when distributor getting available nodes #14282

Merged
merged 5 commits into from
Jul 25, 2024

Conversation

VietND96
Copy link
Member

@VietND96 VietND96 commented Jul 19, 2024

User description

Thanks for contributing to Selenium!
A PR well described will help maintainers to quickly review and merge it

Before submitting your PR, please check our contributing guidelines.
Avoid large PRs, help reviewers by making them as simple and short as possible.

Description

[grid] Exclude status DRAINING when distributor getting available nodes

Motivation and Context

Fix for an intermittent issue when Selenium Grid autoscaling in Kubernetes.

In autoscaling, a Node can be a one-shot node (run as Job, pod will be terminated after 1 session) or a long-run node (run as Deployment, pod will be scaled down after a cool-down period). The common thing, when the pod is terminated, node will send a drain request, and stay with status DRAINING until the current session is completed or the container stops completely.
A race condition in multi-threading allows the function `getAvailableNodes () ' to pick a free slot with status DRAINING and assign new session since only status DOWN is excluded.

The error could be seen as below

08:34:45.576 INFO [GridModel.setAvailability] - Switching Node 0f34182b-a13e-45a9-9e7a-e6fc2d09c0e2 (uri: http://10.244.114.211:5555) from UP to DRAINING
08:34:45.800 INFO [LocalDistributor.newSession] - Session created by the Distributor. Id: 139c7bd4f35e5a876129e80381c3dcff
...
08:35:14.291 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: 139c7bd4f35e5a876129e80381c3dcff
08:35:14.291 INFO [GridModel.release] - Releasing slot for session id 139c7bd4f35e5a876129e80381c3dcff
08:35:18.803 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: 2b39e48d266fdf2f5f5ef5b08560343e
08:35:18.803 INFO [GridModel.release] - Releasing slot for session id 2b39e48d266fdf2f5f5ef5b08560343e
08:35:19.493 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: 1ddaa033cdf8ad296ccc1a23a2844a0b
08:35:19.493 INFO [GridModel.release] - Releasing slot for session id 1ddaa033cdf8ad296ccc1a23a2844a0b
08:35:20.052 INFO [GridModel.release] - Releasing slot for session id b4b2087966998e5fd79012981851d739
08:35:20.053 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: b4b2087966998e5fd79012981851d739
08:35:24.197 DEBUG [Probe.Liveness] - Distributor is healthy.
08:35:24.909 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: f5799d894feb434e7f703c2ade9206ff
08:35:24.910 INFO [GridModel.release] - Releasing slot for session id f5799d894feb434e7f703c2ade9206ff
08:35:26.119 INFO [LocalSessionMap.lambda$new$0] - Deleted session from local Session Map, Id: 8ba26a73cebb4a2cd4358eb31a762fa4
08:35:26.120 INFO [GridModel.release] - Releasing slot for session id 8ba26a73cebb4a2cd4358eb31a762fa4
08:35:26.123 INFO [LocalDistributor.newSession] - Session request received by the Distributor: 
 [Capabilities {browserName: MicrosoftEdge, ms:edgeOptions: {args: [disable-features=DownloadBu...], extensions: []}, pageLoadStrategy: normal, se:downloadsEnabled: true, se:name: test_play_video (EdgeTests), se:recordVideo: true, se:screenResolution: 1920x1080}]
08:35:26.126 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "169a11fdb1e0dbd52c0f2be915bd2273","eventTime": 1721378126123213379,"eventName": "Session request received by the Distributor","attributes": {"logger": "org.openqa.selenium.grid.distributor.local.LocalDistributor","request.payload": "[Capabilities {browserName: MicrosoftEdge, ms:edgeOptions: {args: [disable-features=DownloadBu...], extensions: []}, pageLoadStrategy: normal, se:downloadsEnabled: true, se:name: test_play_video (EdgeTests), se:recordVideo: true, se:screenResolution: 1920x1080}]"}}

08:35:26.126 WARN [SeleniumSpanExporter$1.lambda$export$1] - Unable to create session: Could not start a new session. Unable to create new session 
Host info: host: 'test-selenium-hub-66db98d5c8-2sgtv', ip: '10.244.114.199'
Build info: version: '4.23.0', revision: 'Unknown'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.5.0-1023-azure', java.version: '17.0.11'
Driver info: driver.version: unknown
08:35:26.126 WARN [SeleniumSpanExporter$1.lambda$export$1] - org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Unable to create new session 
Host info: host: 'test-selenium-hub-66db98d5c8-2sgtv', ip: '10.244.114.199'
Build info: version: '4.23.0', revision: 'Unknown'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.5.0-1023-azure', java.version: '17.0.11'
Driver info: driver.version: unknown
	at org.openqa.selenium.grid.distributor.local.LocalDistributor.newSession(LocalDistributor.java:544)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.handleNewSessionRequest(LocalDistributor.java:834)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.lambda$run$1(LocalDistributor.java:791)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840) 

Linking tickets that could be similar

SeleniumHQ/docker-selenium#2129
SeleniumHQ/docker-selenium#2155

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • I have read the contributing document.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

PR Type

Bug fix


Description

  • Fixed an intermittent issue in Selenium Grid autoscaling in Kubernetes by excluding nodes with DRAINING status when the distributor fetches available nodes.
  • Prevented a race condition where a node with DRAINING status could be incorrectly assigned a new session.

Changes walkthrough 📝

Relevant files
Bug fix
LocalDistributor.java
Exclude nodes with DRAINING status from available nodes   

java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java

  • Added a condition to exclude nodes with DRAINING status when fetching
    available nodes.
  • Ensured that only nodes with statuses other than DOWN and DRAINING are
    considered available.
  • +3/-1     

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ No key issues to review

    Copy link
    Contributor

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Maintainability
    Define excluded states in a static Set to simplify the filter logic

    To make the code more maintainable, consider defining the node availability states
    that should be excluded as a static Set and use it in the filter.

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [508-510]

    -.filter(
    -    node ->
    -        !DOWN.equals(node.getAvailability()) && !DRAINING.equals(node.getAvailability()))
    +private static final Set<Availability> EXCLUDED_STATES = Set.of(DOWN, DRAINING);
     
    +.filter(node -> !EXCLUDED_STATES.contains(node.getAvailability()))
    +
    • Apply this suggestion
    Suggestion importance[1-10]: 9

    Why: Defining excluded states in a static Set improves maintainability and readability by centralizing the exclusion logic, making it easier to update and understand.

    9
    Thread-safety
    Use a local variable for the snapshot to ensure thread-safety

    To ensure thread-safety and consistency, consider using a read-only snapshot of the
    node statuses before filtering, especially if the underlying model can change during
    the operation.

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [507]

    -return model.getSnapshot().stream()
    +var snapshot = model.getSnapshot();
    +return snapshot.stream()
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Using a local variable for the snapshot can enhance thread-safety by ensuring the snapshot is consistent during the operation, which is crucial in a multi-threaded environment.

    8
    Enhancement
    Replace the lambda with a method reference to enhance readability and performance

    Consider using a method reference for the availability checks to improve readability
    and potentially enhance performance by reducing lambda creation overhead.

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [508-510]

    -.filter(
    -    node ->
    -        !DOWN.equals(node.getAvailability()) && !DRAINING.equals(node.getAvailability()))
    +.filter(this::isNodeAvailable)
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Using a method reference can improve readability and potentially reduce lambda creation overhead, but it requires defining the method, which adds complexity.

    7
    Robustness
    Add exception handling in the filter to avoid runtime failures

    Consider handling potential exceptions that might occur during the stream operation,
    such as from getAvailability(), to prevent runtime failures.

    java/src/org/openqa/selenium/grid/distributor/local/LocalDistributor.java [508-510]

    -.filter(
    -    node ->
    -        !DOWN.equals(node.getAvailability()) && !DRAINING.equals(node.getAvailability()))
    +.filter(node -> {
    +    try {
    +        return !DOWN.equals(node.getAvailability()) && !DRAINING.equals(node.getAvailability());
    +    } catch (Exception e) {
    +        // Handle exception or log it
    +        return false;
    +    }
    +})
     
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Adding exception handling can prevent runtime failures, but it may introduce additional complexity and performance overhead. It is also less likely that getAvailability() will throw an exception.

    6

    @VietND96
    Copy link
    Member Author

    @diemol, can you review this?

    Copy link
    Contributor

    @pujagani pujagani left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    @VietND96 Thank you!

    @pujagani pujagani merged commit 139af72 into SeleniumHQ:trunk Jul 25, 2024
    28 of 29 checks passed
    @edsherwin
    Copy link

    Waiting for this release as well as I encountered issues when running parallel tests using Kubernetes with auto KEDA autoscaling. Hope this will fix my issue. Thanks

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants