Check for running workers using the sockets instead of the worker-specific API #31
Current thinking on logic for launching and auto-scaling:
I like this because it does not require waiting for workers to start, and relies completely on the connection and the startup time to tell if a worker is running.
On second thought, maximum startup time is not very reliable. It can guard against nightmare scenarios, but so many things can affect the startup time of a worker that it is too hard to predict even in specific cases.
cf. #32 (comment)
Would it help you if I simply add counts to each server: tasks started and tasks completed for each node? Then you can just get this by calling
That would help so much! Tasks started would let me really know how many workers can accept new tasks (if I also take into account the "expected" workers which are starting up but have not yet connected to the client). And tasks completed is such useful load-balancing data. In the case of persistent workers, it could really help users figure out if e.g. they really need 50 workers or they can scale down to 25.
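As a rough illustration of that scale-down idea (a hypothetical base-R sketch, not part of mirai or crew), one could count how many persistent workers actually completed work:

```r
# Hypothetical helper: given tasks-completed counts per persistent worker,
# count how many workers did at least `min_tasks` tasks. A user could use
# a number like this to decide whether 50 workers could scale down to 25.
busy_workers <- function(tasks_complete, min_tasks = 1L) {
  sum(tasks_complete >= min_tasks)
}

busy_workers(c(10L, 7L, 0L, 0L, 3L)) # 3 of the 5 workers completed tasks
```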
As long as those counts are refreshed if the
Give shikokuchuo/mirai@ba5e84e (v0.7.2.9026) a try. Should give you everything you need. As a happy side effect, the active queue keeps getting more efficient even as we add more features.
Fantastic! I will test as soon as I have another free moment. (Trying to juggle other projects too, so this one may take me a few days.)
Cool. You'll want to pick up shikokuchuo/mirai@3777610 (v0.7.2.9028) instead. Having gone through #32 (comment) in more detail, I think this now has what you need.
I tested the specific worker counters with shikokuchuo/mirai@51f2f80, and they work perfectly.
I did more thinking about auto-scaling logic, and I care a lot about whether a given server is active. To determine if a server is active, I need three more definitions:
If the server is connected, then it is automatically active. If it is disconnected, then we should only consider it active if it is launching and not yet discovered.
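That rule could be sketched as a small predicate (illustrative only; the flags are hypothetical client-side bookkeeping, not fields of any mirai API):

```r
# Hypothetical sketch of the rule above:
# - a connected server is automatically active;
# - a disconnected server is active only if it is still launching
#   and has not yet been discovered.
server_is_active <- function(connected, launching, discovered) {
  connected || (launching && !discovered)
}

server_is_active(connected = TRUE,  launching = FALSE, discovered = TRUE)  # TRUE
server_is_active(connected = FALSE, launching = TRUE,  discovered = FALSE) # TRUE
server_is_active(connected = FALSE, launching = TRUE,  discovered = TRUE)  # FALSE
```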
Scenario for slow-to-launch transient workers:
Unfortunately, the counts in
Re. your diagram, I am not sure you need to be so explicit. If status is online, you have an active server. Otherwise you know that a server has either never connected (zero task columns) or disconnected (non-zero task columns). In both cases you check when it was launched by
If the server ran and then disconnected, I would strongly prefer not to wait for the rest of the expiry time. Unfortunately, I cannot currently tell if this is happening. A lot of the time, I see a snapshot like:

```
                    status_online status_busy tasks_assigned tasks_complete
ws://127.0.0.1:5000             0           0              1              1
```

If we are still inside the expiry time, then I cannot tell if the server already connected and disconnected, or if the worker is starting and these counts are left over from the previous server that used the websocket. Does this make sense? Would it be possible to add a new websocket-specific counter in
I think you have to look at this from the perspective that
Once you have re-launched, then you are effectively back to step 1. You should know your state at all times.
For a quick task, the server may connect and disconnect before I notice. Previously I thought of a workaround where I would make each task tell me which websocket it came from. This would solve some scenarios, but not all of them. If state is an obstacle for
Online status may go from 0 to 1 to 0 again, but the snapshot will also show non-zero tasks against the zero status. So you know the server has completed one task in this case and disconnected. You may be missing the fact that tasks only get zeroed when a new server connects. So if you don't choose to start up a new server to connect to this port, the stats will never change.
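The interpretation above could be summarized in a small classifier over one snapshot row (a sketch with made-up function and argument names, assuming counts only reset when a new server connects):

```r
# Hypothetical classifier for one websocket's snapshot row.
classify_socket <- function(status_online, tasks_assigned, tasks_complete) {
  if (status_online > 0L) {
    "connected"
  } else if (tasks_assigned == 0L && tasks_complete == 0L) {
    "never connected"
  } else {
    "connected, then disconnected" # non-zero counts persist after disconnect
  }
}

classify_socket(status_online = 0L, tasks_assigned = 1L, tasks_complete = 1L)
# "connected, then disconnected"
```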
This does not sound very plausible to me: you have something that takes potentially a very long time to spin up, stays online for a very short time, and exits without carrying out a task. As per my last answer, at each point you want to start up a new server, you have access to all the stats you need. You don't have to poll outside of those points. You won't lose stats because somehow you weren't quick enough to catch them; only a new server connection clears out the previous stats.
It is not at all an obstacle, but there must be some part of the
This is exactly where I am struggling: as you say, the tasks only get zeroed when a new server starts. So if I start my second server at the websocket and observe the counts, they could still be left over from the first server. A major goal of
And I could definitely avoid it by requiring
I could just as easily be missing something; I am finding it challenging (and interesting) to wrap my head around this problem. Thank you for sticking with me on this. If the first part of my answer does not make sense, please let me know and I may be able to describe it a different way.
Just to be clear, this is totally fine for the first server that dials into a websocket. The delay due to leftover counts only happens when subsequent servers launch at the same websocket long after the first server disconnects. But this is important: Just to make sure
To quote @brendanf from #32 (comment):
So the posited 30-minute expiry time from #31 (comment) may not be nearly enough for some users.
You know what? I may be able to handle all this in crew:

```r
crew_worker <- function(socket, uuid, ...) {
  server_socket <- nanonext::socket(protocol = "req", dial = socket)
  on.exit(nanonext::send(con = server_socket, data = uuid)) # Tell the client when the server is done.
  mirai::server(url = socket, ...)
}
```

On the server process:

```r
crew_worker(
  "ws://192.168.0.2:5000/finished_servers",
  uuid = "MY_UUID",
  idletime = 100,
  tasklimit = 0
)
```

On the client:

```r
sock <- nanonext::socket(protocol = "rep", listen = "ws://192.168.0.2:5000/finished_servers")
# ... Do some other work, do not poll at regular intervals.
uuid <- nanonext::recv(sock) # Did any workers finish since last I checked?
# ... Check the uuid against the known set of UUIDs I submitted workers with.
```

There is still a slim possibility that
Would I need an additional TCP port for this? I see:

```r
library(mirai)
library(nanonext)
daemons("ws://127.0.0.1:5000/1", nodes = 1)
#> [1] 1
connection_for_done_worker <- nanonext::socket(protocol = "rep", listen = "ws://127.0.0.1:5000/done")
#> Error in nanonext::socket(protocol = "rep", listen = "ws://127.0.0.1:5000/done") :
#>   10 | Address in use
```
Shouldn't do.
In that case, how would you recommend I listen to ws://127.0.0.1:5000/done using
Well, the error message says "address in use", so have you tried other random paths? No path? I don't see anything obviously wrong.
I just tried with different paths/ports, both on the loopback address and with other addresses:

```r
library(mirai)
library(nanonext)
daemons("ws://127.0.0.1:61903/path1", nodes = 1)
#> [1] 0
connection_for_done_worker <- socket(protocol = "rep", listen = "ws://127.0.0.1:61903/path2")
#> Error in socket(protocol = "rep", listen = "ws://127.0.0.1:61903/path2") :
#>   10 | Address in use
```
Oh, because you are listening using
I will throw in this suggestion though, as it will be less error-prone, and that is to just establish a connection and not actually send any messages. First use the 'bus' protocol as that is the lightest:

```r
connection_for_done_worker[[1L]] <- socket(protocol = "bus", listen = "ws://127.0.0.1:61903/UUID")
```

etc.

`stat(connection_for_done_worker[[1L]]$listener[[1L]], "accept")` will give you the total number of connections accepted at the listener (1 if the server has dialled in). `stat(connection_for_done_worker[[1L]]$listener[[1L]], "pipes")` will give you the number of current connections (0 if the server has disconnected). So a combination of 1 and 0 above means the server has dialled in and disconnected after presumably finishing its tasks.
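Combining those two `stat()` values, the "finished" check reduces to a tiny predicate (a sketch; in practice the two numbers would come from `nanonext::stat()` on the listener, as described above):

```r
# Hypothetical check: at least one accepted connection ever, and none now,
# means a server dialled in and has since disconnected.
server_finished <- function(accepted, pipes) {
  accepted >= 1L && pipes == 0L
}

server_finished(accepted = 1L, pipes = 0L) # TRUE: dialled in and gone
server_finished(accepted = 0L, pipes = 0L) # FALSE: never connected
server_finished(accepted = 1L, pipes = 1L) # FALSE: still connected
```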
Wow! This is so much better than trying to catch messages! So much easier. Thank you so much!
Notes to self:
Re #31 (comment), I am actually thinking of using these custom sockets to also send common data for #33. Would I use the push/pull protocol for that? Would the client use a
You wouldn't use push/pull unless you had a very specific need to ensure flow is only one way. The fewer semantics the better (unless you need guaranteed delivery, in which case use req/rep). If it is one-to-one, I would just use 'bus', or 'pair' if you want to be sure it is one-to-one (this won't allow 2 processes dialling into one side, for example). I'm not sure how exactly you plan to implement #33, but a typical pattern that works well is for one party to request an asynchronous receive
On this point - I have not been following what this automatic updating is about. But simply sending the
True, but would it be fast? I am thinking of a persistent workers scenario where all tasks operate on a shared common large in-memory object that is slow to serialize. I would rather send that large object once per persistent worker (infrequently) rather than once per task (much more frequently).
If e.g. `daemons(n = 8, common_data = list(...))` were to send the common data for servers to pick up, then I could definitely rely on mirai for this.
How large is large, to give me some idea?
In the general case for a targets pipeline, it could be a large fraction of the allowable R session memory, e.g. 2-4 GB. Small enough to all be in the global env, but slow to serialize and send over the local network.
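To get a feel for that cost, base R's `serialize()` shows the payload size directly (scaled down here to one million doubles, roughly 8 MB; a real pipeline object could be hundreds of times larger):

```r
# Stand-in for a large common data object.
x <- runif(1e6)

# Serialize to a raw vector, as would happen before sending over a socket.
payload <- serialize(x, connection = NULL)

length(payload) # roughly 8 MB for 1e6 doubles, plus a small header
```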
Right, I would estimate that would take a couple of seconds, a few seconds perhaps. Not an obstacle, I'd say, in the absence of a better alternative.
If you send this common object, is it meant to be immutable then, and if so how do you ensure that? Maybe you know what the right answer is already, but to me there are many potential pitfalls!
Yes, it is meant to be immutable, although immutability is hard to strictly enforce here.
It sounds like you have found a way that works, which is good. I put a heavy premium on correctness, so it would take a lot to convince me any of this is worthwhile. I would willingly spend the extra 5s per task.
Fair enough, I will go ahead with my original plan to implement common data in crew using bus sockets.
Is your plan to put this common data into the global environment of all the servers?
Pretty much. I am writing a wrapper that will `recv()` the data, assign it to the global env, then call `mirai::server()`. Looking back at your comments, maybe I do need guaranteed delivery for this (is it exactly what it sounds like?), which puts me with rep/req. Should I use rep/listen for the client and req/dial for the servers?
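A minimal sketch of such a wrapper, assuming req/rep with the client listening (the function name and the "ready" handshake are hypothetical, and the function is only defined here, not run):

```r
# Hypothetical server-side wrapper: fetch the common data once over req/rep,
# place it in the global environment, then start the mirai server.
crew_server <- function(task_url, data_url, ...) {
  con <- nanonext::socket(protocol = "req", dial = data_url)
  nanonext::send(con, data = "ready")   # req/rep: send a request before receiving
  common <- nanonext::recv(con)         # reply carries the common data list
  list2env(common, envir = globalenv()) # make the data visible to tasks
  close(con)
  mirai::server(url = task_url, ...)
}
```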
Closing this thread because it digressed to #33, and before that, @shikokuchuo solved it with #31 (comment). |
shikokuchuo/mirai#33 (comment)