This issue was moved to a discussion.

Necessity of linger on exit for servers that time out #42

Closed
shikokuchuo opened this issue Mar 23, 2023 · 7 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@shikokuchuo
Owner

As servers have the option to time out, or to exit after a set number of tasks ('task-out'), it would be ideal to exit the process immediately thereafter. At present, however, this is only possible after an 'exitlinger' period, which is set to 1s by default. This should be sufficient for sending objects of roughly 1 GB in size.

What is currently not possible is for exit to be conditional upon the send being completed.

This is, I believe, due to:

  1. If no linger period is implemented in R, the interpreter thinks execution has ended and reaps all child threads even though the send is in progress asynchronously at the C level.
  2. C functions that are part of the NNG library do not help, as sends are recorded as complete once the socket accepts the message for transport. In other words, NNG considers a send complete as soon as responsibility has been transferred to the system sockets. However, this does not guarantee that the send actually completes if the process is reaped in the meantime.

It would be great if a solution can be found.

@shikokuchuo shikokuchuo added enhancement New feature or request help wanted Extra attention is needed labels Mar 23, 2023
@shikokuchuo
Owner Author

shikokuchuo commented Mar 24, 2023

This man page for socket close suggests it may not be possible through the existing NNG interface: https://nng.nanomsg.org/man/tip/nng_close.3.html

Closing the socket while data is in transmission will likely lead to loss of that data. There is no automatic linger or flush to ensure that the socket send buffers have completely transmitted. It is recommended to wait a brief period after calling nng_send() or similar functions, before calling this function.

@wlandau

wlandau commented May 8, 2023

In the case of long-running computation, it seems like this would matter most when sending the result of a completed task back to the client, rather than receiving data for a new task. And in the former case, would it be possible for the server to pause its idle timers etc. before initiating a send? Unless I am missing something, it seems like this would just be a matter of expressing the timer logic differently in R.

@shikokuchuo
Owner Author

The issue is that we can do whatever we like prior to the send, or afterwards for that matter, but we simply do not know when it has finished. That depends on the interplay between the C process and the system TCP stack, which R has no access to at present.

@wlandau

wlandau commented May 8, 2023

That makes sense.

By the way, this discussion made me concerned that a server could exit and lose the data long before the client has a chance to download it. I am happy to see that the results of lightweight tasks seem to remain available somewhere well after the server exits. On my company's cluster, I started a dispatcher on one node:

library(mirai)
url <- sprintf("ws://%s:57000", getip::getip())
print(url)
daemons(
  n = 1L,
  url = url,
  dispatcher = TRUE,
  token = FALSE
)
while (!is.matrix(daemons()$daemons)) {
  Sys.sleep(0.1)
}
while (daemons()$daemons[, "online"] < 1L) {
  Sys.sleep(0.1)
}
tasks <- replicate(4, mirai(rnorm(n = 1)))
Sys.sleep(4)
print(as.numeric(lapply(tasks, function(task) task$data)))

During the while() loop with daemons()$daemons[, "online"], I launched a server on a different node on the local network:

R -e 'mirai::server(url = "ws://x.x.x.x:57000", idletime = 1000, exitlinger = 1000)'

The server visibly came and went, and the client did not make an attempt to collect the data until a couple of seconds after that. Yet no result went missing!

print(as.numeric(lapply(tasks, function(task) task$data)))
#> [1]  1.3502759 -0.2049120  0.1465165 -0.5801425

This is really amazing. Where do the results live between the server exit and the moment the client starts to collect them?

@shikokuchuo
Owner Author

> That makes sense.
>
> By the way, this discussion made me concerned that a server could exit and lose the data long before the client has a chance to download it. I am happy to see that the results of lightweight tasks seem to remain available somewhere well after the server exits. On my company's cluster, I started a dispatcher on one node:

Ha yes, TCP is surprisingly resilient.

> During the while() loop with daemons()$daemons[, "online"], I launched a server on a different node on the local network:
>
> R -e 'mirai::server(url = "ws://x.x.x.x:57000", idletime = 1000, exitlinger = 1000)'
>
> The server visibly came and went, and the client did not make an attempt to collect the data until a couple of seconds after that. Yet no result went missing!

The send is eager, so it is done while the server is still alive. Note, though, that this assumes the transmission finishes before the 'exitlinger' period elapses and the process dies.

> This is really amazing. Where do the results live between the server exit and the moment the client starts to collect them?

I believe the data is just buffered at the client (listener) TCP socket, so it can be collected at any time by NNG.
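To illustrate the buffering idea, here is a minimal sketch using only base R sockets (R >= 4.0.0), not mirai or NNG. It assumes a free local port (the number below is arbitrary) and only covers the benign case where the sender closes gracefully after a small write over loopback. It does not reproduce the problem case of a process being reaped mid-send.

```r
# Single-process sketch: the "sender" plays the exiting server,
# the "receiver" plays the listening client/dispatcher side.
port <- 57321L  # assumption: any free local port will do

listener <- serverSocket(port)
sender <- socketConnection("127.0.0.1", port, open = "wb", blocking = TRUE)
receiver <- socketAccept(listener, open = "rb", blocking = TRUE)

payload <- as.raw(1:100)
writeBin(payload, sender)  # hand the bytes to the operating system
close(sender)              # the sender goes away before anything is read

Sys.sleep(1)               # the receiver has not attempted a read so far

# The bytes are read from the receiving socket's buffer after the sender is gone.
received <- readBin(receiver, what = "raw", n = length(payload))

close(receiver)
close(listener)

identical(received, payload)
```

If the final comparison holds, the data was held on the receiving side until it was asked for, which is consistent with the explanation above.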

@wlandau

wlandau commented May 19, 2023

Seems like there would have to be new logic. Just for the sake of thinking out loud:

  1. Server: when beginning a send, increment a statistic like sends.
  2. Server: create a new condition variable to count dispatcher-side receives.
  3. Dispatcher: check for incoming data without actually downloading it, similar to .unresolved() (is this possible?)
  4. Dispatcher: in the event loop, if (3) shows that the data is completely ready for download from the listener TCP socket, then trigger a pipe event to increment the server-side receives condition variable.
  5. Server: if the sends statistic and receives CV are equal to each other, then it is safe to exit.

Is this all possible? Am I missing something? I'm not sure if (4) is possible because the dispatcher is non-polling. Without polling, I suppose a callback mechanism would be needed, and from #42 (comment) it sounds like a callback mechanism does not exist at the NNG level.
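The bookkeeping in steps 1, 2 and 5 can be sketched in plain R, leaving aside the open question of how the dispatcher would actually signal a receipt. Everything here is hypothetical: `new_send_tracker()` and its methods are illustrative names, not part of mirai or nanonext, and no networking is involved.

```r
# Hypothetical counters only: a 'sends' statistic and a 'receives' count.
new_send_tracker <- function() {
  sends <- 0L     # step 1: incremented when the server begins a send
  receives <- 0L  # steps 2 and 4: incremented when the dispatcher confirms receipt
  list(
    begin_send = function() {
      sends <<- sends + 1L
      invisible(sends)
    },
    confirm_receive = function() {
      receives <<- receives + 1L
      invisible(receives)
    },
    # step 5: safe to exit only when every send has been confirmed
    safe_to_exit = function() sends == receives
  )
}

tracker <- new_send_tracker()
tracker$begin_send()
before <- tracker$safe_to_exit()  # one send is still unconfirmed
tracker$confirm_receive()
after <- tracker$safe_to_exit()   # all sends are confirmed
```

The hard part remains whatever would call `confirm_receive()` on the server side, which is exactly what steps 3 and 4 leave open.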

@shikokuchuo
Owner Author

shikokuchuo commented May 19, 2023

It's just a question of efficiency. You can always have the dispatcher send an acknowledgement ('ack') when it receives the result from the server, and have the server wait for that. Just sending messages will be more efficient than establishing a new pipe as in step 4.

However this will mean having a 'receive task' state at server, followed by a 'receive ack' state. Probably robust, but likely 'something they did 30 years ago'...

And I think this would have to be done for every task; I don't think there's a good way for the server to signal 'I want to exit, send an ack next time'.
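A rough sketch of the two-state idea, as a pure R simulation with no sockets: the server alternates between a 'receive task' state and a 'receive ack' state, and only exits once the last result has been acknowledged. The function `run_server()` and the message shapes (`list(type = "task", ...)`, `list(type = "ack")`) are assumptions made for illustration and are not mirai's actual protocol.

```r
# Hypothetical server loop driven by a pre-recorded list of incoming messages.
run_server <- function(incoming, max_tasks = 1L) {
  state <- "receive_task"
  acked <- 0L
  events <- character()
  for (msg in incoming) {
    if (state == "receive_task" && msg$type == "task") {
      result <- msg$fun()  # evaluate the task
      events <- c(events, sprintf("sent result %s", format(result)))
      state <- "receive_ack"  # do not exit or accept new work yet
    } else if (state == "receive_ack" && msg$type == "ack") {
      acked <- acked + 1L
      events <- c(events, "ack received")
      if (acked >= max_tasks) {
        events <- c(events, "exit")  # the dispatcher has confirmed the last result
        break
      }
      state <- "receive_task"
    } else {
      events <- c(events, sprintf("ignored %s in state %s", msg$type, state))
    }
  }
  events
}

incoming <- list(
  list(type = "task", fun = function() 1 + 1),
  list(type = "ack"),
  list(type = "task", fun = function() 2 * 3),
  list(type = "ack")
)
server_log <- run_server(incoming, max_tasks = 2L)
```

Note that the ack is awaited after every task, not just the final one, which reflects the per-task cost mentioned above.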

Repository owner locked and limited conversation to collaborators Jun 27, 2023
@shikokuchuo shikokuchuo converted this issue into discussion #63 Jun 27, 2023

2 participants