Improve support for running large-scale grids #3574

gregjhogan · 2017-02-26T16:56:59Z

My experience has been that the hub can be a real bottleneck when scaling above 50 nodes. It appears others have run into scalability issues, too, based on projects like seleniumkit/gridrouter.

I can think of a couple of potential solutions:

1. The hub no longer proxies all requests - Use the hub to request a new session (let it find and choose a node), return the hostname/IP of the node back to the client, then have the client talk directly to the node to run the test.
2. Use a high performance layer 7 application proxy (like nginx) - Maybe I am wrong, but I thought this might increase scalability of the hub component. I feel like this could be a natural fit for layer 7 URL based routing. The /session requests would get routed to the hub node, which would dynamically inject a URL path based routing rule /session/{session-id}/* which routes only to a single node (where the session was created).

I am sure that I am over-simplifying things, but I am curious what others think. My goal is to run a grid with 250+ nodes.
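The URL-based routing rule described in the second option can be sketched in plain Java. This is only an illustration of the idea, not something that exists in Selenium or nginx: the class name `SessionRouter`, its methods, and the host names in the usage note are all hypothetical, and the optional `/wd/hub` prefix is an assumption about how clients address the grid.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical routing table: session ID -> node base URL. Not part of Selenium. */
final class SessionRouter {
    // Matches /session/{session-id} and anything below it; an optional
    // /wd/hub prefix is tolerated (assumption about the client base path).
    private static final Pattern SESSION_PATH =
            Pattern.compile("^(?:/wd/hub)?/session/([^/]+)(?:/.*)?$");

    private final String hubUrl;
    private final Map<String, String> nodeBySession = new ConcurrentHashMap<>();

    SessionRouter(String hubUrl) {
        this.hubUrl = hubUrl;
    }

    /** Called once the hub has created a session on a node. */
    void register(String sessionId, String nodeUrl) {
        nodeBySession.put(sessionId, nodeUrl);
    }

    /** Called when the session ends, so the rule disappears. */
    void unregister(String sessionId) {
        nodeBySession.remove(sessionId);
    }

    /** Returns the upstream that should receive a request for the given path. */
    String upstreamFor(String path) {
        Matcher m = SESSION_PATH.matcher(path);
        if (m.matches()) {
            String node = nodeBySession.get(m.group(1));
            if (node != null) {
                return node; // existing session: go straight to its node
            }
        }
        return hubUrl; // new-session requests and unknown sessions go to the hub
    }
}
```

With a rule registered for session `abc123`, a path such as `/session/abc123/url` resolves to that session's node, while a bare `/session` (the new-session request) still resolves to the hub.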

schmidtkp · 2017-02-27T15:11:36Z

In my grid configuration, I have several servlets registered as part of the hub/node configurations. Once a node is delegated a new session by the hub, the client can call those servlets, talking directly to the node. Conceptually, this is what #1 is suggesting to do for all communication once the session is established.

However, I recently set up a grid in AWS to allow individual teams to create/autoscale nodes as needed. Due to our AWS security architecture/VPC requirements, I had to change all servlets that communicated directly with a node so that they are routed to the hub through a public-facing AWS ELB. If I had left the servlets as they were, I would also have had to create a public-facing ELB for each node EC2 instance created, which would increase our AWS costs and probably the CloudFormation complexity.

To date, we have not used the AWS grid extensively enough to determine whether this will present a bottleneck at the hub. I'm hopeful that it will work well and alleviate our resource issue (we are currently limited to 56 nodes). If it is an issue, at least I can try increasing the Jetty thread max for the hub and also change the EC2 instance to be "beefier" and tuned for network performance.

I'm just putting this out there with respect to solution #1 as to how it could impede this AWS-based grid.

gregjhogan · 2017-02-27T16:30:58Z

@schmidtkp can you share more info on how you get the client (running the test) to talk directly to the node after the session is created? In my case we are using protractor to run the tests, and I am not sure how I would accomplish what you described.

I have something running in Azure, and I took a different approach. I created a few hundred VMs running selenium in standalone mode behind an internal load balancer which sprays the HTTP requests across all the nodes (disregarding which node should get each request). I then wrote something that sits in front of selenium on each node which a) handles finding an available node for new session requests and b) proxies existing session requests to the appropriate node based on session ID. It works, but it certainly isn't ideal.

schmidtkp · 2017-02-27T17:57:50Z

@gregjhogan I probably didn't explain things very well. I'm not doing anything like you are describing. The servlets I refer to that get set in the hub/node configuration are servlets that a client can explicitly call to perform specific actions - e.g. transfer a file to/from client/node, reboot the node, kill a process on the node, etc. They are not used to handle/intercept the specific selenium HTTP traffic between client/hub/node.

I was just describing a use case I have where I think your proposed solution #1 would break what I'm doing in AWS.

gregjhogan · 2017-02-27T18:44:21Z

@schmidtkp I understand now, thanks for sharing!

krmahadevan · 2017-03-12T16:29:25Z

@gregjhogan

The hub no longer proxies all requests - Use the hub to request a new session (let it find and choose a node), return the hostname/IP of the node back to the client, then have the client talk directly to the node to run the test.

The problem with this approach is that

You would need a custom implementation of RemoteWebDriver: after the new RemoteWebDriver instance is created, you pull out its HttpCommandExecutor and replace it with one that points at the IP address and port of the node to which the session was routed. All subsequent commands then hit the node directly, and once you are ready to quit, you fall back to the original HttpCommandExecutor.

Here's a working example of what this would look like in Java:

package com.rationaleemotions.webdriver;

import com.rationaleemotions.GridApiAssistant;
import com.rationaleemotions.pojos.Host;
import org.openqa.selenium.remote.*;

import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.URL;

public class GridGames {
    public static void main(String[] args) throws Exception {
        RemoteWebDriver driver = null;
        Host hub = new Host("localhost", "4444");
        try {
            driver = new RemoteWebDriver(new URL("http://localhost:4444/wd/hub"), DesiredCapabilities.firefox());
            CommandExecutor grid = driver.getCommandExecutor();
            String sessionId = driver.getSessionId().toString();
            GridApiAssistant assistant = new GridApiAssistant(hub);
            Host nodeHost = assistant.getNodeDetailsForSession(sessionId);
            CommandExecutor node = new HttpCommandExecutor(new URL(String.format("http://%s:%d/wd/hub", nodeHost.getIpAddress(),
                    nodeHost.getPort())));
            CommandCodec commandCodec = getCodec(grid, "commandCodec");
            ResponseCodec responseCodec = getCodec(grid, "responseCodec");
            setCodec(node, commandCodec, "commandCodec");
            setCodec(node, responseCodec, "responseCodec");
            appendListenerToWebDriver(driver, grid, node);
            driver.get("https://the-internet.herokuapp.com/");
            System.err.println("Page Title " + driver.getTitle());
        } finally {
            if (driver != null) {
                driver.quit();
            }
        }
    }

    @SuppressWarnings("unchecked")
    private static <T> T getCodec(CommandExecutor executor, String fieldName) throws Exception {
        Class clazz = executor.getClass();
        Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        return (T) field.get(executor);
    }

    private static <T> void setCodec(CommandExecutor executor, T codec, String fieldName) throws Exception {
        Class clazz = executor.getClass();
        Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        field.set(executor, codec);
    }

    @SuppressWarnings("unchecked")
    private static void appendListenerToWebDriver(RemoteWebDriver rwd, CommandExecutor grid, CommandExecutor node) {
        CommandExecutor executor = new CustomCommandExecutor(grid, node);
        Class clazz = rwd.getClass();
        while (!RemoteWebDriver.class.equals(clazz)) {
            clazz = clazz.getSuperclass();
        }
        try {
            Method m = clazz.getDeclaredMethod("setCommandExecutor", CommandExecutor.class);
            m.setAccessible(true);
            m.invoke(rwd, executor);
        } catch (NoSuchMethodException | InvocationTargetException | IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }


    /**
     * Routes QUIT through the Hub so that it can release the slot; every other
     * command goes straight to the node.
     */
    public static class CustomCommandExecutor implements CommandExecutor {
        private final CommandExecutor grid;
        private final CommandExecutor node;

        CustomCommandExecutor(CommandExecutor grid, CommandExecutor node) {
            this.grid = grid;
            this.node = node;
        }

        @Override
        public Response execute(Command command) throws IOException {
            // Only the end-of-session call needs to reach the Hub.
            CommandExecutor target = DriverCommand.QUIT.equals(command.getName()) ? grid : node;
            Response response = target.execute(command);
            String url = ((HttpCommandExecutor) target).getAddressOfRemoteServer().toString();
            System.err.println("Hitting the URL : " + url);
            return response;
        }
    }
}

The code snippet above makes use of Talk2Grid, a small library that I built to make it easy to interact with the Grid APIs.

With this logic, the traffic through the Hub should come down significantly, because the tests now talk to the Hub for only two things:

New Session
End Session

For everything else, the tests directly talk to the node, by-passing the Hub.

I am not sure how much this would ease the pressure on the Hub, but it's definitely worth a shot.

gregjhogan · 2017-03-12T16:52:35Z

@krmahadevan that sounds like a great solution. However, in my situation I am testing Angular websites using Protractor. There isn't a way for me to inject something like this into the javascript webdriver in protractor (without changing protractor) is there?

Perhaps the webdriver could be enhanced to support both the current mode of operation (all traffic through hub) and a new mode of operation (direct node communication) with a new config option allowing you to choose which mode you want?

krmahadevan · 2017-03-12T17:16:04Z

There isn't a way for me to inject something like this into the javascript webdriver in protractor (without changing protractor) is there?

@gregjhogan - I am not conversant in javascript, so I don't have an answer for this question

Perhaps the webdriver could be enhanced to support both the current mode of operation (all traffic through hub) and a new mode of operation (direct node communication) with a new config option allowing you to choose which mode you want?

Like I said, this is a change that is required in the client bindings, not on the Grid side. So this again boils down to how RemoteWebDriver is supported in your client bindings. Since answering that requires me to know JavaScript, I must admit I am not sure what to answer :(

gregjhogan · 2017-03-13T04:03:31Z

@krmahadevan I found a way to intercept all requests in nodejs, and there seems to be an issue with your approach. If traffic doesn't go through the hub, it considers sessions to be orphaned after the timeout has elapsed and kills them off. Did you work around this somehow?

krmahadevan · 2017-03-13T04:49:08Z

@gregjhogan - Duh! Yeah, I completely forgot about that... So the session cleaner logic within the Grid Hub is wreaking havoc there... An easy workaround would be to bump these values to something exorbitant on the Hub side (only on the Hub, not on the nodes):

-timeout (See here)
-browserTimeout set to 0 to simulate an indefinite wait (see https://github.com/SeleniumHQ/selenium/blob/master/java/server/src/org/openqa/grid/internal/utils/configuration/StandaloneConfiguration.java#L130-L135)

Please see if that would work for you
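Why the Hub kills sessions that are still in use can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Grid's real session cleaner: `HubSessionTracker` and its methods are hypothetical names, and treating a timeout of 0 as "never expire" is a convention of this sketch only.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of hub-side session tracking; not the Grid's real implementation. */
final class HubSessionTracker {
    // In this sketch, 0 or a negative value means "never expire" (assumption).
    private final long timeoutMillis;
    private final Map<String, Long> lastSeenByHub = new HashMap<>();

    HubSessionTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever a command for the session passes through the hub. */
    void touch(String sessionId, long nowMillis) {
        lastSeenByHub.put(sessionId, nowMillis);
    }

    /** The hub can only judge a session by the traffic it has seen itself. */
    boolean looksOrphaned(String sessionId, long nowMillis) {
        Long seen = lastSeenByHub.get(sessionId);
        if (seen == null) {
            return false; // unknown session: nothing to clean up
        }
        return timeoutMillis > 0 && nowMillis - seen > timeoutMillis;
    }
}
```

If the only call that reaches the hub is the new-session request at time 0, a tracker with a 30-second timeout reports the session as orphaned at 60 seconds, no matter how many commands the test has sent directly to the node in the meantime.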

diemol · 2017-03-13T07:55:18Z

I have also seen this issue with scalability when having > 50 nodes.
There is an option, not clearly documented, in the Selenium Grid wiki optional parameters, which says:

Really large (>50 node) Hub installations may need to increase the jetty threads by setting -DPOOL_MAX=512 (or larger) on the java command line.

However, I have never seen this in an example, or explained more in detail. Have you guys used this option? I am interested in this as well.

krmahadevan · 2017-03-13T09:02:28Z

@diemol - By default the Hub spins up a Jetty server with a thread pool size of 200, which means that at any given point in time the Jetty server can service 200 concurrent requests to any of the servlets that it hosts. So when you bump this value, you are essentially bumping up the number of concurrent requests that the Jetty server can service. That is more or less what that parameter is all about. But since the Hub acts as the single point of contact for all the nodes behind it, I think its network traffic can become quite chatty and saturate its bandwidth very soon when the Hub fronts a farm of more than 50 nodes.

The sample code that I shared essentially bypasses this by reducing the number of requests that go via the Hub; they hit the node directly instead.

Hope that adds some context
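The effect of a bounded thread pool on concurrency can be demonstrated with the JDK's own executor. This is an analogy only: it uses `java.util.concurrent`, not Jetty's QueuedThreadPool, and the class and method names (`PoolCapDemo`, `peakConcurrency`) are made up for the illustration. Each task blocks the way a slow proxied request would, so the pool size becomes the ceiling on requests in flight.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolCapDemo {
    /**
     * Submits `requests` blocking tasks to a pool of `poolSize` threads and
     * returns the largest number of tasks that were ever running at once.
     */
    static int peakConcurrency(int poolSize, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        // Counts down once per started task; reaches zero when the pool is saturated.
        CountDownLatch saturated = new CountDownLatch(Math.min(poolSize, requests));
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < requests; i++) {
            pool.execute(() -> {
                peak.accumulateAndGet(running.incrementAndGet(), Math::max);
                saturated.countDown();
                try {
                    release.await(); // simulate a request that is slow to complete
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        try {
            saturated.await();   // every pool thread is now busy
            release.countDown(); // let the queued tasks drain
            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return peak.get();
    }
}
```

Calling `PoolCapDemo.peakConcurrency(4, 20)` never observes more than 4 tasks running together; the remaining 16 wait in the queue, which is roughly what happens to requests arriving at a saturated Hub.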

diemol · 2017-03-13T09:39:21Z

@krmahadevan, got it. That is what I also understood from the docs, but I was wondering whether anyone has tried it and what results they obtained.

Do you know if this parameter does the same thing in the end?

-jettyThreads, -jettyMaxThreads

   <Integer> : max number of threads for Jetty. An unspecified, zero, or
   negative value means the Jetty default value (200) will be used.

I think that having all the requests through the hub is a pro and a con at the same time. This looks more like an architectural change, and perhaps this would include many more classes in the code base.

But I like the idea a lot. I think reducing the traffic through the Hub is a net positive in the end.

krmahadevan · 2017-03-13T10:10:40Z

@diemol - The parameters -jettyThreads/-jettyMaxThreads are essentially aimed at setting the size of the org.seleniumhq.jetty9.util.thread.QueuedThreadPool

Oh btw

Really large (>50 node) Hub installations may need to increase the jetty threads by setting -DPOOL_MAX=512 (or larger) on the java command line.

This is obsolete and perhaps is NO LONGER VALID. The documentation needs to be fixed to remove the reference to it.
I didn't find any reference to this JVM argument in the codebase. I guess it's now controlled via the configuration property -jettyThreads/-jettyMaxThreads

schmidtkp · 2017-03-13T14:11:29Z

@krmahadevan - I understand your solution and the value of off-loading requests from the Hub, but as I stated previously, this would break any AWS-based grid that has to adhere to security requirements that force all public communication through public ELBs.

As mentioned in my previous comment, I had several servlets that tests could call, once the session was established, to communicate directly with a node. These all had to be re-routed to go through the public ELB to the Hub and then on to the nodes. It would not have been cost effective to create a public ELB for each node in the grid (as ELBs cost money).

I have jettyMaxThreads=512 set in my Hub configuration and I'm considering upping it to 1024 once I can get around to doing some performance analysis. Also, I have the option of changing the EC2 instance type to be more network performant if necessary, which would be more cost effective than having to pay for one ELB per node.

Just stating another view point/use case for consideration.

krmahadevan · 2017-03-13T14:16:29Z

@schmidtkp - Fair enough. I don't have experience working with AWS cloud for Selenium Grid solutions. So I cannot comment on that part.

Also, I have the option of changing the EC2 instance type to be more network performant if necessary

I believe this would definitely help, especially for the machine on which the Hub is running.

schmidtkp · 2017-03-13T14:19:36Z

@krmahadevan - Keep in mind that for my particular AWS-based grid solution, the security constraint is imposed on me by my company. Therefore, this may not impact others using cloud-based grid solutions.

krmahadevan · 2017-03-13T14:21:38Z

@schmidtkp - Sure thing :) Oh btw.. on a side note, if you could please help point me to some documentation that details the things related to security on AWS that you are talking about, it would be a good learning exercise for me on AWS...

schmidtkp · 2017-03-13T14:29:03Z

@krmahadevan - Start here: https://aws.amazon.com and explore EC2, for creation of actual instances, S3, for data storage, CloudFormation, for JSON templates to create AWS-based stacks. Stacks define all the AWS resources you require - e.g. Amazon Machine Images (AMIs), Launch Configurations, Authentication, AutoScaling, Security Groups, Role Profiles, etc...

gregjhogan · 2017-03-14T00:25:42Z

@krmahadevan if I set the timeouts on the hub infinitely high and a session gets orphaned and cleaned up on the node (which seems to happen all too often for us), will the hub ever consider the node available again?

Also, I am curious what people think about adding a node-direct communication mode which you have to opt into (so it doesn't break people like @schmidtkp).

schmidtkp · 2017-03-14T14:39:19Z

I'd be in support of a node-direct communication mode which could be optional. If I didn't have the AWS security/cost constraints I'd use it 👍 .

krmahadevan · 2017-03-14T14:56:52Z

@gregjhogan

if I set the timeouts on the hub infinitely high and a session gets orphaned and cleaned up on the node (which seems to happen all too often for us), will the hub ever consider the node available again?

There are several things to consider here. The value has to be high enough that the Hub doesn't clean up a valid test session (mistaking it for an orphaned one because the Hub saw no activity on it), but low enough that the Hub eventually cleans up a genuinely dead session, for example when the test is talking directly to the node, the browser crashes, and the node cleans up the session at its end. During that window the node will not receive any new tests, because as far as the Hub is concerned the slot is still occupied. So yes, this can amount to a denial of service. We can plug this gap by building a servlet on the Hub which a test invokes with a session ID; the servlet then forces clean-up of that session through the Hub's registry.
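The forced clean-up idea can be sketched without any Grid classes. Everything here is hypothetical: `SlotRegistry` stands in for the Hub's registry, it models one slot per node, and `forceRelease` is what the proposed servlet would call. It is not the Hub's actual API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the Hub's view of busy slots (one slot per node). */
final class SlotRegistry {
    private final Map<String, String> sessionByNode = new ConcurrentHashMap<>();

    /** The hub hands a session to a node; fails if the hub believes the node is busy. */
    boolean assign(String nodeId, String sessionId) {
        return sessionByNode.putIfAbsent(nodeId, sessionId) == null;
    }

    /** What the proposed servlet would call: drop the session so that its node is free again. */
    boolean forceRelease(String sessionId) {
        return sessionByNode.values().remove(sessionId);
    }

    boolean isFree(String nodeId) {
        return !sessionByNode.containsKey(nodeId);
    }
}
```

Until `forceRelease` is called (or the Hub's own timeout fires), a second `assign` to the same node is refused even though the node has already discarded the session on its side, which is the denial-of-service window described above.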

node-direct communication mode

To the best of my knowledge, this would require re-architecting the Grid and, to some extent, RemoteWebDriver as well. It's not something that can be done with minimal work, considering that the Hub is responsible for caps matching, session creation, session clean-up, etc.

krmahadevan · 2017-03-19T06:39:08Z

@gregjhogan - I decided to add this capability to Talk2Grid, a library that I had already built to interact with the Grid's internals. Read more about it here.
Thought I would cross post, just in case someone is looking for something like this.

testphreak · 2017-05-17T19:26:29Z

@krmahadevan, @gregjhogan and @diemol How about creating a new role for selenium-server-standalone called NodeProxy which would serve the purpose of being an intermediary for all node-related requests (except new and end session)? When a Node joins the grid it would register with Hub and NodeProxy. When a new request comes in, NodeProxy would query the Hub with a sessionId, get the node information and route requests for that session directly to the nodes instead of the Hub.

This approach would work for the use cases described above, plus the scenario where the Hub and Nodes are part of a Docker Swarm. With the NodeProxy role, it would then be straightforward to have a container with the NodeProxy role join the Swarm.

Of course, creating the NodeProxy role can be made optional when setting up the grid and would be available for those users who want to support large grid installations (> 50 nodes).

Perhaps there are more details that need to be fleshed out for this to work, but I wanted to get your thoughts on it. Also, it may not be as big an architectural change to implement?
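The lookup step of the proposed NodeProxy role might look something like the following sketch. It assumes the Hub exposes some way to resolve a session ID to a node URL; that query is represented by a plain `Function` here, and `NodeProxyResolver` and its methods are hypothetical names, not existing Selenium classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical resolver used by a NodeProxy: asks the hub once per session, then caches. */
final class NodeProxyResolver {
    // Stands in for "query the Hub with a sessionId"; the real mechanism is unspecified.
    private final Function<String, String> hubLookup;
    private final Map<String, String> nodeBySession = new ConcurrentHashMap<>();

    NodeProxyResolver(Function<String, String> hubLookup) {
        this.hubLookup = hubLookup;
    }

    /** Returns the node URL for a session, or null if the hub does not know the session. */
    String nodeFor(String sessionId) {
        return nodeBySession.computeIfAbsent(sessionId, hubLookup);
    }

    /** Drops the cached mapping once the session has ended. */
    void sessionEnded(String sessionId) {
        nodeBySession.remove(sessionId);
    }
}
```

Because each proxy resolves a session from the Hub on first use and caches the answer, the Hub is consulted once per session per proxy instead of once per command.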

mach6 · 2017-05-17T21:32:11Z

@testphreak Interesting approach.

Based on my read, I have some follow-up questions.

When a new request comes in, NodeProxy would query the Hub with a sessionId, get the node information and route requests for that session directly to the nodes instead of the Hub.

So, the session communication would now flow through the NodeProxy instead of a Hub?

Also, if I'm reading this correctly, it means the Hub would be reduced to a capabilities matcher and a key/value store which tracks sessions and which proxy (node) they are being/will be routed to?

In this model, there's still only one Hub and it's still a potential bottleneck. Correct?

Where does session queueing happen? In the NodeProxy and the Hub, with some new polling mechanism?

To keep it generic and to address scale (ability to put many NodeProxy instances behind a single VIP), I assume any NodeProxy would have to be able to route to any Node (M-M relationship). Is that an accurate assumption?

What happens when subsequent commands for the same session are routed to a different NodeProxy? Is there stickiness required on the connection for each NodeProxy? Does each session command forwarded through a NodeProxy open a new connection to the Node or rely on a persistent connection to the Node? How would a Node handle this type of communication (session commands coming from different clients -- the NodeProxy servers -- which are forwarding the command)?

All in all, I think it would require a fair number of changes (perhaps still to the RemoteWebDriver client) to pull off, which I'm not opposed to, since the concern we are discussing here resonates with me.

testphreak · 2017-05-18T21:55:54Z

@mach6 great thoughts and ideas.

So, the session communication would now flow through the NodeProxy instead of a Hub?

Yes, I was just extending @gregjhogan and @krmahadevan's idea that tests could talk to the hub just for new session and end session calls, while the rest of the communication would go via the NodeProxy.

Also, if I'm reading this correctly, it means the Hub would be reduced to a capabilities matcher and a key/value store which tracks sessions and which proxy (node) they are being/will be routed to?

Yes and for new session and end session communication.

In this model, there's still only one Hub and it's still a potential bottleneck. Correct?

Yes, that bottleneck would exist, but be reduced by the fact that all session communication except for new and end session would be handled by the NodeProxy.

Where does session queueing happen? In the NodeProxy and the Hub, with some new polling mechanism?

Yes, there would need to be a new polling mechanism with session queueing happening in both NodeProxy and Hub. Perhaps there's a better way to do it.

To keep it generic and to address scale (ability to put many NodeProxy instances behind a single VIP), I assume any NodeProxy would have to be able to route to any Node (M-M relationship). Is that an accurate assumption?

What happens when subsequent commands for the same session are routed to a different NodeProxy? Is there stickiness required on the connection for each NodeProxy? Does each session command forwarded through a NodeProxy open a new connection to the Node or rely on a persistent connection to the Node? How would a Node handle this type of communication (session commands coming from different clients -- the NodeProxy servers -- which are forwarding the command)?

That's a great idea and something I hadn't thought through. In the case there are multiple NodeProxy instances, stickiness and persistent connection to the node would be required so requests don't get routed to a node that doesn't have the browser related to the test running. Maybe others can chime in on how this scenario can be better handled.

gregjhogan · 2017-05-22T00:21:59Z

@testphreak that sounds like it would work as a solution for me. I feel like you are talking about building a high-performance layer 7 reverse proxy which (as I mentioned in the original message) already exists (haproxy/nginx) and seems like a good fit based on the fact that the routing rules in the proxy would be /session/{session-id}/* -> specific node. Maybe we just need a hub that registers/manages these session id based rules in such a proxy.

diemol · 2020-04-07T15:55:37Z

Grid 4 (which is in alpha at the time of writing) has been designed to enable scalability in a more straightforward way. It should tackle several of the problems mentioned here. It can be tried out now; please check https://www.selenium.dev/downloads/
Docs for it are a work in progress, but keep an eye on future releases.

I will close this since there is no clear actionable item from this thread, and as mentioned, several improvements have been implemented for Grid 4.

lock · 2020-05-20T23:22:05Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

barancev added the C-grid label Mar 3, 2017

diemol closed this as completed Apr 7, 2020

lock bot locked and limited conversation to collaborators May 20, 2020
