Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File descriptor is closed before async callbacks are called from libuv adapter #814

Closed
dmitry-sles-novikov opened this issue Oct 18, 2024 · 6 comments · Fixed by #815
Closed
Assignees
Labels
defect Suspected defect such as a bug or regression

Comments

@dmitry-sles-novikov
Copy link

dmitry-sles-novikov commented Oct 18, 2024

Observed behavior

nats.c library are using uv_poll_init_socket in libuv adapter

From documentation:

The purpose of poll handles is to enable integrating external libraries that rely on the event loop to signal it about the socket status changes, like c-ares or libssh2

The user should not close a file descriptor while it is being polled by an active poll handle. This can cause the handle to report an error, but it might also start polling another socket. However the fd can be safely closed immediately after a call to uv_poll_stop() or uv_close()

But nats.c library close nc->sockCtx.fd before calling the detach callback

        // If there is no readLoop (or using external event loop), then it is
        // our responsibility to close the socket. Otherwise, _readLoop is the
        // one doing it.
        if (ttj.readLoop == NULL)
        {
            // If event loop attached, stop polling...
            if (nc->el.attached)
                _evStopPolling(nc); <--- async stop polling from libuv thread

            natsSock_Close(nc->sockCtx.fd); <--- closing socket
            nc->sockCtx.fd = NATS_SOCK_INVALID;

            // We need to cleanup some things if the connection was SSL.
            _clearSSL(nc);
        }
     .....
     if (detach)
    {
        nc->opts->evCbs.detach(nc->el.data); <--- async detach from libuv thread
        natsConn_release(nc);
    }

As a result, in the "good" case we have a race condition if the descriptor is assigned to another file, in the bad case the application terminates inside uv__epoll_ctl_flush function of the libuv library because cqe->read == -EBADF:

static void uv__epoll_ctl_flush(int epollfd,
                                struct uv__iou* ctl,
                                struct epoll_event (*events)[256]) {
  ...

  /* Failed submissions are either EPOLL_CTL_DEL commands for file descriptors
   * that have been closed, or EPOLL_CTL_ADD commands for file descriptors
   * that we are already watching. Ignore the former and retry the latter
   * with EPOLL_CTL_MOD.
   */
  while (*ctl->cqhead != *ctl->cqtail) {
    ...

    if (cqe->res != -EEXIST)
      abort(); <--- aborted here

    uv__epoll_ctl_prep(epollfd,
                       ctl,
                       events,
                       EPOLL_CTL_MOD,
                       fd,
                       &oldevents[oldslot]);
  }
}

Expected behavior

File descriptor is closed after calling of uv_close((uv_handle_t*) nle->handle, finalCloseCb);, for example in finalCloseCb

Server and client version

client version: 3.9.1 and early

Host environment

No response

Steps to reproduce

No response

@dmitry-sles-novikov dmitry-sles-novikov added the defect Suspected defect such as a bug or regression label Oct 18, 2024
@dmitry-sles-novikov
Copy link
Author

Thanks in advance =)

@kozlovic
Copy link
Member

@dmitry-sles-novikov The detach (and finalCloseCb) would be just solving the normal connection close, but we also have to take into account the reconnects. I had a look and it will take a bit of time since we need to make sure to not reintroduce a FD leak that was solved with this code (see #375, #376). I did a bit of a POC and moved some socket close from NATS library to the libuv adapter itself (but then needs to be done to the libevent adapter too), and it seems ok, but again, need to spend more time on that one.

@kozlovic kozlovic self-assigned this Oct 18, 2024
@dmitry-sles-novikov
Copy link
Author

@kozlovic Yes, you are right, I did not take into account the reconnect case. I wanted to give feedback on the founded errors.

Thank you very much, I will wait for updates

@kozlovic
Copy link
Member

@dmitry-sles-novikov It just occurred to me: if you say that there is a double close of the socket, in the NATS library and libuv, then I should not have to close the socket in the adapter. However, I noticed that if I remove from the library and don't touch the adapter, then I see socket leaked (at least stay in CLOSE_WAIT) when the server is restarted. So I am a bit confused. Could it be that you or ClickHouse is doing a socket close or is it the libuv library itself? I would need to know that. I see from your other PR that you have a dedicated NATS libuv adapter (not using the one from this repo), so maybe that's where you were doing a socket close?

@dmitry-sles-novikov
Copy link
Author

@kozlovic I created the adapter before I understood the causes of the problems and plan to remove it in the future. It minimizes the likelihood of test failures, but does not fix the error.

As far as I understood from the libuv documentation, uv_poll_t does not own the file descriptor, but only polls it. Therefore, responsibility for opening and closing the file descriptor lies with its owner. But it can be closed only after the socket polling is completed due to a call to uv_poll_stop() or uv_close():

From libuv tests

  uv_poll_t poll_handle;
  int fd;

->fd = epoll_create(1);
  ASSERT_NE(fd, -1);

->ASSERT_OK(uv_poll_init (uv_default_loop(), &poll_handle, fd));
  ASSERT_OK(uv_poll_start(&poll_handle, UV_READABLE, (uv_poll_cb) abort));
  ASSERT_NE(0, uv_run(uv_default_loop(), UV_RUN_NOWAIT));

->uv_close((uv_handle_t*) &poll_handle, NULL);
  ASSERT_OK(uv_run(uv_default_loop(), UV_RUN_DEFAULT));
->ASSERT_OK(close(fd));

P.S. I will still edit PR in clickhouse and check it, including for double socket closing

kozlovic added a commit that referenced this issue Oct 22, 2024
The socket was closed by the NATS library itself, which could cause
some issue when, specifically libuv, could still be polling it.
We now defer to the event loop adapter to make sure that the event
loop library is done polling before invoking a new function that
will take care of closing the socket.

Resolves #814

Signed-off-by: Ivan Kozlovic <[email protected]>
kozlovic added a commit that referenced this issue Oct 23, 2024
The socket was closed by the NATS library itself, which could cause
some issue when the event loop thread would still be polling it.
We now defer to the event loop adapter to make sure that the event
loop library is done polling before invoking a new function that
will take care of closing the socket.

I have updated the event loop test (that simulates what our adapters are
doing). The mockup event loop implementation is a bit too simplistic
but should be ok for now. If we have issues, we would have to make
the events a linked list.

Resolves #814

Signed-off-by: Ivan Kozlovic <[email protected]>
kozlovic added a commit that referenced this issue Oct 23, 2024
The socket was closed by the NATS library itself, which could cause
some issue when the event loop thread would still be polling it.
We now defer to the event loop adapter to make sure that the event
loop library is done polling before invoking a new function that
will take care of closing the socket.

I have updated the event loop test (that simulates what our adapters are
doing). The mockup event loop implementation is a bit too simplistic
but should be ok for now. If we have issues, we would have to make
the events a linked list.

Resolves #814

Signed-off-by: Ivan Kozlovic <[email protected]>
@kozlovic
Copy link
Member

@dmitry-sles-novikov Could you have a look at PR #815 and I would really appreciate if you could build from that branch (and use the new libuv adapter file) and see if that addresses the issue you have observed? Thanks!

@levb levb closed this as completed in #815 Nov 5, 2024
@levb levb closed this as completed in 297614a Nov 5, 2024
levb pushed a commit to levb/nats.c that referenced this issue Dec 9, 2024
…ng (nats-io#815)

* [FIXED] EventLoop: Socket now closed only after event loop done polling

The socket was closed by the NATS library itself, which could cause
some issue when the event loop thread would still be polling it.
We now defer to the event loop adapter to make sure that the event
loop library is done polling before invoking a new function that
will take care of closing the socket.

I have updated the event loop test (that simulates what our adapters are
doing). The mockup event loop implementation is a bit too simplistic
but should be ok for now. If we have issues, we would have to make
the events a linked list.

Resolves nats-io#814

Signed-off-by: Ivan Kozlovic <[email protected]>

* Updates based on PR feedback

- Move the `uv_async_send` under our lock to avoid crash/race
- Replace `uv_poll_stop` with `uv_close` and deal with nle->handle
in place and not again in the final close callback.

Signed-off-by: Ivan Kozlovic <[email protected]>

---------

Signed-off-by: Ivan Kozlovic <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
defect Suspected defect such as a bug or regression
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants