Improve support for running large-scale grids #3574
Comments
In my grid configuration, I have several servlets registered as part of the hub/node configurations. Once the hub delegates a new session to a node, the client can call those servlets, talking directly to the node. Conceptually, this is what #1 suggests doing for all communication once the session is established. However, I recently set up a grid in AWS to allow individual teams to create/autoscale nodes as needed, and due to our AWS security architecture/VPC requirements, I had to change every servlet that communicated directly with a node so that it is routed to the hub through a public-facing AWS ELB. If I had left the servlets as they were, I would also have had to create a public-facing ELB for each node EC2 instance, which would increase our AWS costs and probably the CloudFormation complexity. To date, we have not used the AWS grid extensively enough to determine whether this will create a bottleneck at the hub. I'm hopeful that it will work well and alleviate our resource issue (we are currently limited to 56 nodes). If it is an issue, I can at least try increasing the Jetty thread max for the hub and change the EC2 instance to one that is "beefier" and tuned for network performance. I'm just putting this out there to show how solution #1 could impede this AWS-based grid. |
@schmidtkp can you share more info on how you get the client (running the test) to talk directly to the node after the session is created? In my case we are using protractor to run the tests, and I am not sure how I would accomplish what you described. I have something running in Azure, and I took a different approach. I created a few hundred VMs running selenium in standalone mode behind an internal load balancer which sprays the HTTP requests across all the nodes (disregarding which node should get each request). I then wrote something that sits in front of selenium on each node which a) handles finding an available node for new session requests and b) proxies existing session requests to the appropriate node based on session ID. It works, but it certainly isn't ideal. |
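The two responsibilities described in the comment above, (a) finding an available node for a new session and (b) sending later requests to the node that owns the session ID, can be sketched roughly as follows. This is only an illustration and not the commenter's actual component: the class name `SessionDispatcher`, its methods, and the `/wd/hub/session/{id}` path layout are assumptions made for the example, and a real proxy would also forward the HTTP traffic itself.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical dispatcher: reserves a free node for new sessions and pins existing sessions to their node. */
class SessionDispatcher {
    // Assumed path layout: /wd/hub/session/{id} or /wd/hub/session/{id}/...
    private static final Pattern SESSION_PATH =
            Pattern.compile("^/wd/hub/session/([^/]+)(/.*)?$");

    private final Deque<String> freeNodes = new ArrayDeque<>();
    private final Map<String, String> sessionToNode = new ConcurrentHashMap<>();

    SessionDispatcher(List<String> nodes) {
        freeNodes.addAll(nodes);
    }

    /** (a) New session request: reserve a free node, if there is one. */
    synchronized Optional<String> reserveNode() {
        return Optional.ofNullable(freeNodes.poll());
    }

    /** Remember which node created the session (called once the node has returned a session ID). */
    void bind(String sessionId, String node) {
        sessionToNode.put(sessionId, node);
    }

    /** (b) Existing session request: look up the owning node from the request path. */
    Optional<String> route(String path) {
        Matcher m = SESSION_PATH.matcher(path);
        if (!m.matches()) {
            return Optional.empty(); // not a per-session path, e.g. the new-session endpoint
        }
        return Optional.ofNullable(sessionToNode.get(m.group(1)));
    }

    /** Session ended: forget the mapping and return the node to the free pool. */
    synchronized void release(String sessionId) {
        String node = sessionToNode.remove(sessionId);
        if (node != null) {
            freeNodes.add(node);
        }
    }
}
```

With a table like this, any front-end instance that shares the mapping could serve any request, which is why the load balancer in front can spray requests without regard to the target node.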
@gregjhogan I probably didn't explain things very well. I'm not doing anything like what you are describing. The servlets I refer to, which get set in the hub/node configuration, are servlets that a client can explicitly call to perform specific actions, e.g. transfer a file to/from client/node, reboot the node, kill a process on the node, etc. They are not used to handle/intercept the Selenium HTTP traffic between client/hub/node. I was just describing a use case I have where I think your proposed solution #1 would break what I'm doing in AWS. |
@schmidtkp I understand now, thanks for sharing! |
The problem with this approach is that
Here's a working example of what this would look like in Java:

package com.rationaleemotions.webdriver;

import com.rationaleemotions.GridApiAssistant;
import com.rationaleemotions.pojos.Host;
import org.openqa.selenium.remote.*;

import java.io.IOException;
import java.lang.reflect.Field;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.net.URL;

public class GridGames {

    public static void main(String[] args) throws Exception {
        RemoteWebDriver driver = null;
        Host hub = new Host("localhost", "4444");
        try {
            // Create the session through the hub as usual.
            driver = new RemoteWebDriver(new URL("http://localhost:4444/wd/hub"), DesiredCapabilities.firefox());
            CommandExecutor grid = driver.getCommandExecutor();
            String sessionId = driver.getSessionId().toString();

            // Ask the hub which node is hosting this session.
            GridApiAssistant assistant = new GridApiAssistant(hub);
            Host nodeHost = assistant.getNodeDetailsForSession(sessionId);
            CommandExecutor node = new HttpCommandExecutor(new URL(String.format("http://%s:%d/wd/hub", nodeHost.getIpAddress(),
                    nodeHost.getPort())));

            // Copy the codecs from the hub executor to the node executor via reflection.
            CommandCodec commandCodec = getCodec(grid, "commandCodec");
            ResponseCodec responseCodec = getCodec(grid, "responseCodec");
            setCodec(node, commandCodec, "commandCodec");
            setCodec(node, responseCodec, "responseCodec");

            appendListenerToWebDriver(driver, grid, node);
            driver.get("https://the-internet.herokuapp.com/");
            System.err.println("Page Title " + driver.getTitle());
        } finally {
            if (driver != null) {
                driver.quit();
            }
        }
    }

    // Reads a private codec field from the given executor.
    @SuppressWarnings("unchecked")
    private static <T> T getCodec(CommandExecutor executor, String fieldName) throws Exception {
        Class<?> clazz = executor.getClass();
        Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        return (T) field.get(executor);
    }

    // Writes a private codec field on the given executor.
    private static <T> void setCodec(CommandExecutor executor, T codec, String fieldName) throws Exception {
        Class<?> clazz = executor.getClass();
        Field field = clazz.getDeclaredField(fieldName);
        field.setAccessible(true);
        field.set(executor, codec);
    }

    // Swaps the driver's command executor for one that routes between hub and node.
    private static void appendListenerToWebDriver(RemoteWebDriver rwd, CommandExecutor grid, CommandExecutor node) {
        CommandExecutor executor = new CustomCommandExecutor(grid, node);
        Class<?> clazz = rwd.getClass();
        while (!RemoteWebDriver.class.equals(clazz)) {
            clazz = clazz.getSuperclass();
        }
        try {
            Method m = clazz.getDeclaredMethod("setCommandExecutor", CommandExecutor.class);
            m.setAccessible(true);
            m.invoke(rwd, executor);
        } catch (NoSuchMethodException | InvocationTargetException | IllegalAccessException e) {
            throw new RuntimeException(e);
        }
    }

    public static class CustomCommandExecutor implements CommandExecutor {
        private final CommandExecutor grid;
        private final CommandExecutor node;

        CustomCommandExecutor(CommandExecutor grid, CommandExecutor node) {
            this.grid = grid;
            this.node = node;
        }

        @Override
        public Response execute(Command command) throws IOException {
            String url;
            Response response;
            if (DriverCommand.QUIT.equals(command.getName())) {
                // Quit goes through the hub.
                response = grid.execute(command);
                url = ((HttpCommandExecutor) grid).getAddressOfRemoteServer().toString();
            } else {
                // Every other command goes straight to the node.
                response = node.execute(command);
                url = ((HttpCommandExecutor) node).getAddressOfRemoteServer().toString();
            }
            System.err.println("Hitting the URL : " + url);
            return response;
        }
    }
}

The code snippet above makes use of a small library that I built, Talk2Grid, which makes it easy to interact with the Grid APIs. With this logic, the traffic on the Grid would come down significantly, because the tests would now be talking to the Grid for only 2 things, viz.,
For everything else, the tests talk directly to the node, bypassing the Hub. I am not sure how much this would ease the pressure on the Hub, but it's definitely worth a shot. |
@krmahadevan that sounds like a great solution. However, in my situation I am testing Angular websites using Protractor. There isn't a way for me to inject something like this into the javascript webdriver in protractor (without changing protractor) is there? Perhaps the webdriver could be enhanced to support both the current mode of operation (all traffic through hub) and a new mode of operation (direct node communication) with a new config option allowing you to choose which mode you want? |
@gregjhogan - I am not conversant in javascript, so I don't have an answer for this question
Like I said, this is a change that is required in the client bindings, not in the Grid side. So this would again boil down to how is |
@krmahadevan I found a way to intercept all requests in nodejs, and there seems to be an issue with your approach. If traffic doesn't go through the hub, it considers sessions to be orphaned after the timeout has elapsed and kills them off. Did you work around this somehow? |
@gregjhogan - Duh! Yeah, I completely forgot about that... So the session cleaner logic within the Grid Hub is wreaking havoc there... An easy workaround for that would be to bump these values on the Hub side to an exorbitant value (not on the nodes, only on the hub)
Please see if that would work for you |
I have also seen this issue with scalability when having > 50 nodes.
However, I have never seen this in an example, or explained more in detail. Have you guys used this option? I am interested in this as well. |
@diemol - By default the Hub spins up a Jetty server with a thread pool size of 200, which means that at any given point in time the Jetty server can service 200 concurrent requests to any of the servlets it hosts. So when you bump this value, you are essentially bumping up the number of concurrent requests that the Jetty server can service. That is more or less what that parameter is all about. But since the Hub acts as the single point of interface to all the nodes behind it, I think its network traffic can become quite chatty and hit its max very soon when the hub is put in front of a farm of more than 50 nodes. The sample code that I shared essentially bypasses this by reducing the number of requests that go via the Hub and having them hit the node directly instead. Hope that adds some context |
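To make the thread-pool point concrete, here is a small sketch of how a fixed-size pool caps the number of requests handled at once. It uses `java.util.concurrent` as a stand-in and is not Jetty's or the Hub's actual code; the class `BoundedPoolDemo`, the method `peakConcurrency`, and the 50 ms sleep that simulates proxying a command are all assumptions for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a server whose worker pool limits how many requests are in flight at once. */
class BoundedPoolDemo {

    /** Runs `requests` simulated requests on `poolSize` threads and returns the peak number in flight. */
    static int peakConcurrency(int poolSize, int requests) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(requests);
        for (int i = 0; i < requests; i++) {
            pool.submit(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max); // record the highest value seen so far
                try {
                    Thread.sleep(50); // stand-in for proxying one command to a node
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    inFlight.decrementAndGet();
                    done.countDown();
                }
            });
        }
        try {
            done.await(); // wait until every simulated request has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        // 20 requests against 4 workers: the peak can never exceed 4, the rest wait in the queue.
        System.out.println("peak in flight: " + peakConcurrency(4, 20));
    }
}
```

Raising the pool size raises that ceiling, but every extra request still passes through the same machine and network interface, which is the bottleneck the comment above is pointing at.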
@krmahadevan, got it. That is what I also understood from the docs, but I was wondering whether anyone had tried it and what results they obtained. Do you know if this parameter does the same thing in the end?
I think that having all the requests go through the hub is a pro and a con at the same time. This looks more like an architectural change, and it would perhaps touch many more classes in the code base. But I like the idea a lot; I think reducing the traffic through the Hub is a net positive in the end. |
@diemol - The parameter Oh btw
This is obsolete and perhaps is NO LONGER VALID. The documentation needs to be fixed to remove reference to it. |
@krmahadevan - I understand your solution and the value of off-loading requests from the Hub, but as I stated previously, this would break any AWS-based grid that has to adhere to security requirements that force all public communication through public ELBs. As mentioned in my previous comment, I had several servlets that tests could call, once the session was established, to communicate directly with a node. These all had to be re-routed to go through the public ELB to the Hub and then on to the nodes. It would not have been cost effective to create a public ELB for each node in the grid (as ELBs cost money). I have jettyMaxThreads=512 set in my Hub configuration and I'm considering raising it to 1024 once I get around to doing some performance analysis. Also, I have the option of changing the EC2 instance type to one with better network performance if necessary, which would be more cost effective than having to pay for (N-nodes * ELBs). Just stating another viewpoint/use case for consideration. |
@schmidtkp - Fair enough. I don't have experience working with AWS cloud for Selenium Grid solutions. So I cannot comment on that part.
I believe this would definitely help, especially for the machine on which the Hub is running. |
@krmahadevan - Keep in mind that for my particular AWS-based grid solution, the security constraint is imposed on me by my company. Therefore, this may not impact others using cloud-based grid solutions. |
@schmidtkp - Sure thing :) Oh btw.. on a side note, if you could please help point me to some documentation that details the things related to security on AWS that you are talking about, it would be a good learning exercise for me on AWS... |
@krmahadevan - Start here: https://aws.amazon.com and explore EC2, for creation of actual instances, S3, for data storage, and CloudFormation, for JSON templates to create AWS-based stacks. Stacks define all the AWS resources you require, e.g. Amazon Machine Images (AMIs), Launch Configurations, Authentication, AutoScaling, Security Groups, Role Profiles, etc... |
@krmahadevan if I set the timeouts on the hub infinitely high and a session gets orphaned and cleaned up on the node (which seems to happen all too often for us), will the hub ever consider the node available again? Also, I am curious what people think about adding a node-direct communication mode which you have to opt into (so it doesn't break people like @schmidtkp). |
I'd be in support of a node-direct communication mode which could be optional. If I didn't have the AWS security/cost constraints I'd use it 👍 . |
There are many things to consider here. The value has to be high enough that the Hub doesn't clean up a valid test session (mistaking it for an orphaned one because the Hub saw no activity on it), but low enough that, if a test talking directly to the node hits a browser crash or similar and the node cleans up the session at its end, the Hub eventually gets around to cleaning up this now-invalid session too. In that timespan, though, the node will not receive any new tests, because as far as the Hub is concerned the slot is occupied. So yes, there can be a denial of service. We can plug this gap by building a servlet at the Hub end which, when a test invokes it and passes in a session, forces cleanup of that session by accessing the Hub's registry.
To the best of my knowledge, this would require a re-architecting of the Grid and also some amount of re-architecting of |
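The timeout trade-off and the proposed cleanup hook can be modelled in a few lines. This is a toy sketch, not the Grid's real registry or session cleaner: `HubSessionRegistry` and its methods are hypothetical names, and the clock is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model of a hub that only knows about the traffic it proxies itself. */
class HubSessionRegistry {
    private final long timeoutMillis;
    // Session ID -> timestamp of the last command the hub saw for that session.
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    HubSessionRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever the hub itself handles a command for the session. */
    void touch(String sessionId, long nowMillis) {
        lastSeen.put(sessionId, nowMillis);
    }

    /** Periodic cleanup: drop every session the hub has not seen within the timeout. */
    List<String> reapIdle(long nowMillis) {
        List<String> reaped = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            boolean idleTooLong = nowMillis - e.getValue() > timeoutMillis;
            // remove(key, value) only succeeds if nobody touched the session in the meantime
            if (idleTooLong && lastSeen.remove(e.getKey(), e.getValue())) {
                reaped.add(e.getKey());
            }
        }
        return reaped;
    }

    /** The explicit cleanup hook: free the slot immediately instead of waiting for the timeout. */
    boolean forceRelease(String sessionId) {
        return lastSeen.remove(sessionId) != null;
    }

    /** True while the hub still considers the slot taken. */
    boolean isOccupied(String sessionId) {
        return lastSeen.containsKey(sessionId);
    }
}
```

When tests talk directly to the node, `touch` is never called after session creation, so a short timeout reaps healthy sessions while a very long one leaves a crashed session's slot occupied; a call like `forceRelease` is what the servlet idea above would provide.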
@gregjhogan - I decided to enrich a library that I had already built to interact with the Grid's internals (it's called Talk2Grid) with this capability. Read more about it here. |
@krmahadevan, @gregjhogan and @diemol How about creating a new role for This approach would work for the use cases described above, plus the scenario where the Of course, creating the Perhaps there are more details that need to be fleshed out for this to work, but I wanted to get your thoughts on it. Also, it may not be as big an architectural change to implement? |
@testphreak Interesting approach. Based on my read, I have some follow-up questions.
So, the session communication would now flow through the Also, if I'm reading this correctly, it means the In this model, there's still only one Where does session queueing happen? In the To keep it generic and to address scale (ability to put many What happens when subsequent commands for the same session are routed to a different All-in-all -- I think it would require a bit of changes (perhaps still to the |
@mach6 great thoughts and ideas.
Yes, I was just extending @gregjhogan and @krmahadevan's idea that tests could talk to the hub just for new session and end session calls, while rest of the communication would be via the
Yes and for new session and end session communication.
Yes, that bottleneck would exist, but be reduced by the fact that all session communication except for new and end session would be handled by the
Yes, there would need to be a new polling mechanism with session queueing happening in both
That's a great idea and something I hadn't thought through. In the case there are multiple |
@testphreak that sounds like it would work as a solution for me. I feel like you are talking about building a high-performance layer 7 reverse proxy which (as I mentioned in the original message) already exists (haproxy/nginx) and seems like a good fit based on the fact that the routing rules in the proxy would be /session/{session-id}/* -> specific node. Maybe we just need a hub that registers/manages these session id based rules in such a proxy. |
Grid 4 (which is currently in alpha as this comment is being written) has been designed to enable scalability in a more straightforward way. It should tackle several of the problems mentioned here. It can be tried out now; please check https://www.selenium.dev/downloads/. I will close this since there is no clear actionable item from this thread and, as mentioned, several improvements have been implemented for Grid 4. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
My experience has been that the hub can be a real bottleneck when scaling above 50 nodes. It appears others have run into scalability issues, too, based on projects like seleniumkit/gridrouter.
I can think of a couple potential solutions:
1. The hub no longer proxies all requests - Use the hub to request a new session (let it find and choose a node), return the hostname/IP of the node back to the client, then have the client talk directly to the node to run the test.
2. Use a high performance layer 7 application proxy (like nginx) - Maybe I am wrong, but I thought this might increase scalability of the hub component. I feel like this could be a natural fit for layer 7 URL based routing. The /session requests would get routed to the hub node, which would dynamically inject a URL path based routing rule /session/{session-id}/* which routes only to a single node (where the session was created).
I am sure that I am over-simplifying things, but I am curious what others think. My goal is to run a grid with 250+ nodes.