
Close the fd that we're removing from the pollfds list #96

Closed
wants to merge 1 commit

Conversation

jrmuizel
Contributor

We currently dup the file descriptor in add() by calling consume_fd(). We never close this dup'd fd. We should close it when we remove it from the pollfds list.

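A minimal sketch of what the fix amounts to, using hypothetical types rather than the actual ipc-channel internals: on removal, the fd that add() dup'd via consume_fd() gets closed along with its pollfds entry.

    // Minimal sketch of the fix; hypothetical types, not the real code.
    extern crate libc;

    use std::os::unix::io::RawFd;

    struct ReceiverSet {
        pollfds: Vec<libc::pollfd>,
    }

    impl ReceiverSet {
        fn remove(&mut self, fd: RawFd) {
            if let Some(pos) = self.pollfds.iter().position(|p| p.fd == fd) {
                self.pollfds.swap_remove(pos);
                // The set took ownership of this dup'd fd in add(), so it
                // must close it here or the descriptor leaks.
                unsafe { libc::close(fd); }
            }
        }
    }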
@glennw
Member

glennw commented Aug 16, 2016

r? @pcwalton

@pcwalton
Contributor

cc @antrik

@pcwalton
Contributor

@antrik Does this look right to you?

@antrik
Contributor

antrik commented Aug 17, 2016

@pcwalton yes, as the FD is owned by the receiver set once it's added to it, I'm pretty sure it's correct to close it when it's dropped from the set... Yet another case that convinces me we need a proper FD type to handle these things properly :-)

(Would be nice if the fix came with test cases, though...)

@nox
Contributor

nox commented Aug 18, 2016

@antrik How would you test that?

@antrik
Contributor

antrik commented Aug 18, 2016

@nox well, the most important thing would be a baseline test to check that channel close in a receiver set is still handled correctly. So -- unless there is already an existing test case that covers this -- we'd do something like dropping the send end of one channel in a set, and checking that select() correctly receives the notification, and that "ordinary" receives on the other channels in the set still work.

Ideally, it would also be nice to check that the FD is actually not leaked any more. (Mostly to prevent regressions in the future.) Admittedly, I don't know how to do this correctly off-hand... Might be non-trivial when tests are executed in parallel.
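For reference, a rough sketch of the baseline test described above, written against the public ipc_channel::ipc API as I understand it; the exact names, signatures, and return types here are assumptions, not verified code.

    // Rough sketch; assumes the public IpcReceiverSet API looks like this.
    use ipc_channel::ipc::{self, IpcSelectionResult};

    #[test]
    fn channel_close_in_receiver_set() {
        let mut set = ipc::IpcReceiverSet::new().unwrap();
        let (tx1, rx1) = ipc::channel::<i32>().unwrap();
        let (tx2, rx2) = ipc::channel::<i32>().unwrap();
        let closed_id = set.add(rx1).unwrap();
        let live_id = set.add(rx2).unwrap();

        // Drop the send end of the first channel...
        drop(tx1);

        // ...and check that select() reports the close for that id.
        let results = set.select().unwrap();
        assert!(results.iter().any(|r| match *r {
            IpcSelectionResult::ChannelClosed(id) => id == closed_id,
            _ => false,
        }));

        // "Ordinary" receives on the other channel should still work.
        tx2.send(42).unwrap();
        for result in set.select().unwrap() {
            if let IpcSelectionResult::MessageReceived(id, msg) = result {
                assert_eq!(id, live_id);
                assert_eq!(msg.to::<i32>().unwrap(), 42);
            }
        }
    }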

@metajack

Is there a Linux tool that will tell you which fds are still open at program termination? Similar to how valgrind can find leaked memory at the end of execution?

@jrmuizel
Contributor Author

It looks like there's another problem introduced by this patch:
If I add an assert(!self.handlers.contains_key(&new_receiver_id)) to router.rs just before it inserts new_receiver_id, the assert will sometimes fail. Without the assert, I believe we will get confused when we replace the handler for a duplicate id. Before my patch, leaking the fd meant we never saw duplicate fds, which avoided this issue.

@jrmuizel
Contributor Author

Indeed. Here's what happens, with fd = 75:

Inside select():

  • We get an AddRoute() message and add it to the results vec.
  • We get a ChannelClosed, causing us to close(75) and add an entry to the results vec that will cause us to remove 75 from the handlers hash map.

Then we iterate over the results vec:

  • We see the AddRoute message and dup() the fd that we just got; dup() returns 75.
    • We replace the existing handler associated with 75 with our new one.
  • We see the ChannelClosed message with fd 75. We remove the handler that we just added, and now we're missing a handler for 75, which causes us to panic as soon as we get a message from that fd.

To avoid this we should probably not close the fds until after they are removed from the HashMap.
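The underlying hazard is easy to reproduce in isolation: once close(fd) runs, POSIX requires dup() to return the lowest unused descriptor, so the same integer can come straight back. A standalone illustration (my own example, assuming the libc crate):

    // Standalone illustration of fd reuse after close().
    extern crate libc;

    use std::fs::File;
    use std::os::unix::io::AsRawFd;

    fn main() {
        let a = File::open("/dev/null").unwrap();
        let b = File::open("/dev/null").unwrap();
        let old = a.as_raw_fd();
        drop(a); // closes `old`
        // dup() returns the lowest unused descriptor, so this is very
        // likely to be exactly `old` again.
        let reused = unsafe { libc::dup(b.as_raw_fd()) };
        println!("closed fd {}, dup() returned {}", old, reused);
        unsafe { libc::close(reused); }
    }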

@antrik
Contributor

antrik commented Aug 20, 2016

@jrmuizel took me quite a while to figure out what you are talking about...

So, I didn't realise that the FD is returned along with the ChannelClosed message. In that case of course we should not close it inside select(), but rather leave it to the caller to take care of it while handling the ChannelClosed message, as ownership is passed back with the message.
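In code, the ordering described here might look something like the sketch below (hypothetical router-side types, not the actual router.rs): the handler entry keyed by the fd is removed first, and only then is the descriptor closed, so the integer cannot be reused while the map still references it.

    // Hypothetical sketch of caller-side handling; not the real router.rs.
    extern crate libc;

    use std::collections::HashMap;
    use std::os::unix::io::RawFd;

    enum SelectionResult {
        MessageReceived(RawFd, Vec<u8>),
        ChannelClosed(RawFd), // ownership of the fd comes back with this
    }

    fn handle(handlers: &mut HashMap<RawFd, Box<dyn Fn(Vec<u8>)>>,
              result: SelectionResult) {
        match result {
            SelectionResult::MessageReceived(fd, data) => {
                if let Some(handler) = handlers.get(&fd) {
                    handler(data);
                }
            }
            SelectionResult::ChannelClosed(fd) => {
                handlers.remove(&fd); // drop the map entry first...
                // ...and only now release the descriptor for reuse.
                unsafe { libc::close(fd); }
            }
        }
    }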

And it would be really nice if you could put that into a test case :-)

@dlrobertson
Contributor

👍 but FYI this exposes an intermittent failure that I hit in earlier revisions of #94

    Finished debug [unoptimized + debuginfo] target(s) in 0.0 secs
     Running target/debug/deps/ipc_channel-d988de77ccb3242d

running 56 tests
test platform::test::big_data ... ok
test platform::test::big_data_with_0_fds ... ok
test platform::test::big_data_with_1_fds ... ok
test platform::test::big_data_with_2_fds ... ok
test platform::test::big_data_with_3_fds ... ok
test platform::test::big_data_with_5_fds ... ok
test platform::test::big_data_with_4_fds ... ok
test platform::test::cross_process ... ok
test platform::test::big_data_with_6_fds ... ok
test platform::test::cross_process_sender_transfer ... ok
test platform::test::fragment_tests::full_packet ... ok
test platform::test::fragment_tests::full_packet_with_1_fds ... ok
test platform::test::fragment_tests::full_packet_with_2_fds ... ok
test platform::test::fragment_tests::full_packet_with_3_fds ... ok
test platform::test::fragment_tests::full_packet_with_4_fds ... ok
test platform::test::fragment_tests::full_packet_with_5_fds ... ok
test platform::test::fragment_tests::full_packet_with_64_fds ... ok
test platform::test::fragment_tests::full_packet_with_6_fds ... ok
test platform::test::fragment_tests::overfull_packet ... ok
test platform::test::medium_data ... ok
test platform::test::fragment_tests::overfull_packet_with_63_fds ... ok
test platform::test::medium_data_with_sender_transfer ... ok
test platform::test::multisender_transfer ... ok
test platform::test::no_senders_notification ... ok
test platform::test::receiver_transfer ... ok
test platform::test::receiver_set ... ok
test platform::test::sender_transfer ... ok
test platform::test::server ... ok
test platform::test::shared_memory_clone ... ok
test platform::test::simple ... ok
test platform::test::try_recv ... ok
test platform::test::shared_memory ... ok
test platform::test::big_data_with_sender_transfer ... ok
test test::bytes ... ok
test test::cross_process_embedded_senders ... ok
test test::embedded_bytes_receivers ... ok
test test::embedded_opaque_senders ... ok
test test::embedded_receivers ... ok
test test::embedded_senders ... ok
test test::multiple_paths_to_a_sender ... ok
test test::opaque_sender ... ok
test platform::test::try_recv_large ... ok
test platform::test::concurrent_senders ... ok
thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', ../src/libcore/option.rs:326
stack backtrace:
test test::router_drops_callbacks_on_sender_shutdown ... ok
test test::router_drops_callbacks_on_cloned_sender_shutdown ... ok
   1:     0x564ab56deb09 - std::sys::backtrace::tracing::imp::write::h482d45d91246faa2
   2:     0x564ab56e28ac - std::panicking::default_hook::_{{closure}}::h89158f66286b674e
   3:     0x564ab56e1ace - std::panicking::default_hook::h9e30d428ee3b0c43
   4:     0x564ab56e21e8 - std::panicking::rust_panic_with_hook::h2224f33fb7bf2f4c
   5:     0x564ab56e2082 - std::panicking::begin_panic::hcb11a4dc6d779ae5
   6:     0x564ab56e1fb0 - std::panicking::begin_panic_fmt::h310416c62f3935b3
   7:     0x564ab56e1f31 - rust_begin_unwind
   8:     0x564ab571c69f - core::panicking::panic_fmt::hc5789f4e80194729
   9:     0x564ab571c5cb - core::panicking::panic::h1953378f4b37b561
  10:     0x564ab557d18e - _<core..option..Option<T>>::unwrap::he7c58c58c9b9f3b3
test test::router_big_data ... ok
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/option.rs:21
  11:     0x564ab567c6c9 - ipc_channel::router::Router::run::h7e7ecc4af14c50b9
                        at /home/drobertson/git/servo/test_case/src/router.rs:119
  12:     0x564ab569a4db - ipc_channel::router::RouterProxy::new::_{{closure}}::hcc2e6631684eade8
                        at /home/drobertson/git/servo/test_case/src/router.rs:31
  13:     0x564ab5640982 - _<std..panic..AssertUnwindSafe<F> as core..ops..FnOnce<()>>::call_once::hd8dfcbe69fa0dd0d
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panic.rs:256
  14:     0x564ab55a24a8 - std::panicking::try::do_call::he984b35af97d3741
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:327
  15:     0x564ab56ea396 - __rust_maybe_catch_panic
  16:     0x564ab559f9ae - std::panicking::try::hdf2a572ade47ad8b
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panicking.rs:303
  17:     0x564ab5592551 - std::panic::catch_unwind::h608b762e2cb058c4
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libstd/panic.rs:312
  18:     0x564ab5689f3f - std::thread::Builder::spawn::_{{closure}}::h523461f7ed82eb18
  19:     0x564ab5604893 - _<F as alloc..boxed..FnBox<A>>::call_box::h8fbc29889d9e2772
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/liballoc/boxed.rs:587
  20:     0x564ab56e0752 - std::sys::thread::Thread::new::thread_start::he0bf102845911132
  21:     0x7f142e670403 - start_thread
  22:     0x7f142e1988dc - clone
  23:                0x0 - <unknown>
test test::router_multithreaded_multiplexing ... FAILED
test test::router_multiplexing ... FAILED
test test::router_routing_to_new_mpsc_receiver ... FAILED
test test::router_simple ... FAILED
test test::simple ... ok
test test::select ... ok
test test::test_so_linger ... ok
test test::try_recv ... ok
test test::shared_memory ... ok
test platform::test::try_recv_large_delayed ... ok

failures:

---- test::router_multithreaded_multiplexing stdout ----
    thread 'test::router_multithreaded_multiplexing' panicked at 'called `Result::unwrap()` on an `Err` value: RecvError', ../src/libcore/result.rs:788
stack backtrace:
   1:     0x564ab56deb09 - std::sys::backtrace::tracing::imp::write::h482d45d91246faa2
   2:     0x564ab56e28ac - std::panicking::default_hook::_{{closure}}::h89158f66286b674e
   3:     0x564ab56e19d7 - std::panicking::default_hook::h9e30d428ee3b0c43
   4:     0x564ab56e21e8 - std::panicking::rust_panic_with_hook::h2224f33fb7bf2f4c
   5:     0x564ab56e2082 - std::panicking::begin_panic::hcb11a4dc6d779ae5
   6:     0x564ab56e1fb0 - std::panicking::begin_panic_fmt::h310416c62f3935b3
   7:     0x564ab56e1f31 - rust_begin_unwind
   8:     0x564ab571c69f - core::panicking::panic_fmt::hc5789f4e80194729
   9:     0x564ab55fa3fe - core::result::unwrap_failed::h3e79b7b152b9c363
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:29
  10:     0x564ab55be58d - _<core..result..Result<T, E>>::unwrap::hffc222f316ec9d30
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:726
  11:     0x564ab56815cd - ipc_channel::test::router_multithreaded_multiplexing::h89237eebed95563e
                        at /home/drobertson/git/servo/test_case/src/test.rs:209
  12:     0x564ab56a756b - _<F as alloc..boxed..FnBox<A>>::call_box::h6d40ac6ff8180289
  13:     0x564ab56a06ff - std::panicking::try::do_call::hf4aa30d569353639
  14:     0x564ab56ea396 - __rust_maybe_catch_panic
  15:     0x564ab56a6f2a - _<F as alloc..boxed..FnBox<A>>::call_box::h116d9064e2ce57af
  16:     0x564ab56e0752 - std::sys::thread::Thread::new::thread_start::he0bf102845911132
  17:     0x7f142e670403 - start_thread
  18:     0x7f142e1988dc - clone
  19:                0x0 - <unknown>

---- test::router_multiplexing stdout ----
    thread 'test::router_multiplexing' panicked at 'called `Result::unwrap()` on an `Err` value: RecvError', ../src/libcore/result.rs:788
stack backtrace:
   1:     0x564ab56deb09 - std::sys::backtrace::tracing::imp::write::h482d45d91246faa2
   2:     0x564ab56e28ac - std::panicking::default_hook::_{{closure}}::h89158f66286b674e
   3:     0x564ab56e19d7 - std::panicking::default_hook::h9e30d428ee3b0c43
   4:     0x564ab56e21e8 - std::panicking::rust_panic_with_hook::h2224f33fb7bf2f4c
   5:     0x564ab56e2082 - std::panicking::begin_panic::hcb11a4dc6d779ae5
   6:     0x564ab56e1fb0 - std::panicking::begin_panic_fmt::h310416c62f3935b3
   7:     0x564ab56e1f31 - rust_begin_unwind
   8:     0x564ab571c69f - core::panicking::panic_fmt::hc5789f4e80194729
   9:     0x564ab55fa3fe - core::result::unwrap_failed::h3e79b7b152b9c363
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:29
  10:     0x564ab55be58d - _<core..result..Result<T, E>>::unwrap::hffc222f316ec9d30
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:726
  11:     0x564ab5680cf0 - ipc_channel::test::router_multiplexing::h5c857a363c29e06c
                        at /home/drobertson/git/servo/test_case/src/test.rs:190
  12:     0x564ab56a756b - _<F as alloc..boxed..FnBox<A>>::call_box::h6d40ac6ff8180289
  13:     0x564ab56a06ff - std::panicking::try::do_call::hf4aa30d569353639
  14:     0x564ab56ea396 - __rust_maybe_catch_panic
  15:     0x564ab56a6f2a - _<F as alloc..boxed..FnBox<A>>::call_box::h116d9064e2ce57af
  16:     0x564ab56e0752 - std::sys::thread::Thread::new::thread_start::he0bf102845911132
  17:     0x7f142e670403 - start_thread
  18:     0x7f142e1988dc - clone
  19:                0x0 - <unknown>

---- test::router_routing_to_new_mpsc_receiver stdout ----
    thread 'test::router_routing_to_new_mpsc_receiver' panicked at 'called `Result::unwrap()` on an `Err` value: RecvError', ../src/libcore/result.rs:788
stack backtrace:
   1:     0x564ab56deb09 - std::sys::backtrace::tracing::imp::write::h482d45d91246faa2
   2:     0x564ab56e28ac - std::panicking::default_hook::_{{closure}}::h89158f66286b674e
   3:     0x564ab56e19d7 - std::panicking::default_hook::h9e30d428ee3b0c43
   4:     0x564ab56e21e8 - std::panicking::rust_panic_with_hook::h2224f33fb7bf2f4c
   5:     0x564ab56e2082 - std::panicking::begin_panic::hcb11a4dc6d779ae5
   6:     0x564ab56e1fb0 - std::panicking::begin_panic_fmt::h310416c62f3935b3
   7:     0x564ab56e1f31 - rust_begin_unwind
   8:     0x564ab571c69f - core::panicking::panic_fmt::hc5789f4e80194729
   9:     0x564ab55fa3fe - core::result::unwrap_failed::h3e79b7b152b9c363
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:29
  10:     0x564ab55be58d - _<core..result..Result<T, E>>::unwrap::hffc222f316ec9d30
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:726
  11:     0x564ab568077b - ipc_channel::test::router_routing_to_new_mpsc_receiver::h0952908428392eae
                        at /home/drobertson/git/servo/test_case/src/test.rs:176
  12:     0x564ab56a756b - _<F as alloc..boxed..FnBox<A>>::call_box::h6d40ac6ff8180289
  13:     0x564ab56a06ff - std::panicking::try::do_call::hf4aa30d569353639
  14:     0x564ab56ea396 - __rust_maybe_catch_panic
  15:     0x564ab56a6f2a - _<F as alloc..boxed..FnBox<A>>::call_box::h116d9064e2ce57af
  16:     0x564ab56e0752 - std::sys::thread::Thread::new::thread_start::he0bf102845911132
  17:     0x7f142e670403 - start_thread
  18:     0x7f142e1988dc - clone
  19:                0x0 - <unknown>

---- test::router_simple stdout ----
    thread 'test::router_simple' panicked at 'called `Result::unwrap()` on an `Err` value: "SendError(..)"', ../src/libcore/result.rs:788
stack backtrace:
   1:     0x564ab56deb09 - std::sys::backtrace::tracing::imp::write::h482d45d91246faa2
   2:     0x564ab56e28ac - std::panicking::default_hook::_{{closure}}::h89158f66286b674e
   3:     0x564ab56e19d7 - std::panicking::default_hook::h9e30d428ee3b0c43
   4:     0x564ab56e21e8 - std::panicking::rust_panic_with_hook::h2224f33fb7bf2f4c
   5:     0x564ab56e2082 - std::panicking::begin_panic::hcb11a4dc6d779ae5
   6:     0x564ab56e1fb0 - std::panicking::begin_panic_fmt::h310416c62f3935b3
   7:     0x564ab56e1f31 - rust_begin_unwind
   8:     0x564ab571c69f - core::panicking::panic_fmt::hc5789f4e80194729
   9:     0x564ab55fb0a2 - core::result::unwrap_failed::h9221b81b86fd1894
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:29
  10:     0x564ab55b93f3 - _<core..result..Result<T, E>>::unwrap::h532b0ba71b2356bc
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/obj/../src/libcore/result.rs:726
  11:     0x564ab567b84b - ipc_channel::router::RouterProxy::add_route::h4bfbd8e9986ce5e2
                        at /home/drobertson/git/servo/test_case/src/router.rs:42
  12:     0x564ab568025e - ipc_channel::test::router_simple::haf5d862c4d6be2a6
                        at /home/drobertson/git/servo/test_case/src/test.rs:162
  13:     0x564ab56a756b - _<F as alloc..boxed..FnBox<A>>::call_box::h6d40ac6ff8180289
  14:     0x564ab56a06ff - std::panicking::try::do_call::hf4aa30d569353639
  15:     0x564ab56ea396 - __rust_maybe_catch_panic
  16:     0x564ab56a6f2a - _<F as alloc..boxed..FnBox<A>>::call_box::h116d9064e2ce57af
  17:     0x564ab56e0752 - std::sys::thread::Thread::new::thread_start::he0bf102845911132
  18:     0x7f142e670403 - start_thread
  19:     0x7f142e1988dc - clone
  20:                0x0 - <unknown>


failures:
    test::router_multiplexing
    test::router_multithreaded_multiplexing
    test::router_routing_to_new_mpsc_receiver
    test::router_simple

test result: FAILED. 52 passed; 4 failed; 0 ignored; 0 measured

error: test failed

@antrik
Contributor

antrik commented Aug 25, 2016

So I looked into this a bit more, and the whole idea of using the FD as a plain integer ID seems fundamentally flawed to me. I don't see a way to fix the leak without introducing a race while keeping this approach.

I believe the macOS backend has the same problem: while I don't know the exact behaviour of port sets -- and the documentation is severely lacking on this point -- I have a strong suspicion the port is leaked there as well. While that's not nearly as expensive as a UNIX FD leak, it's not OK either.

The inprocess implementation appears correct: it explicitly generates and keeps track of IDs, so IDs once used remain "reserved" and no collisions can happen, even though the actual channel is closed (specifically, the receiver is dropped) within select(), before the result with the ID is handled by the caller. A similar approach could probably be adapted for the other backends -- though the double lookup, once within select() and once by the caller, is somewhat inefficient.
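As a sketch (my own illustration, not the inprocess source), the key property is just a monotonic counter: an ID handed out once is never handed out again, so it cannot collide with a result still in flight.

    // Sketch of the inprocess-style ID scheme described above.
    use std::collections::HashMap;

    struct ReceiverSet<R> {
        next_id: u64,
        receivers: HashMap<u64, R>,
    }

    impl<R> ReceiverSet<R> {
        fn new() -> ReceiverSet<R> {
            ReceiverSet { next_id: 0, receivers: HashMap::new() }
        }

        fn add(&mut self, receiver: R) -> u64 {
            let id = self.next_id;
            self.next_id += 1; // ids are never reused, even after close
            self.receivers.insert(id, receiver);
            id
        }
    }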

(The only way I can see to handle this in a fashion that is both robust and efficient is to do the lookup entirely within the backends -- with an interface that doesn't expose IDs, but rather takes and returns some arbitrary payload supplied and used by the caller.)
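Such an interface could be as small as the following (again only a sketch of the shape, with invented names, not a worked-out design):

    // Sketch of a payload-carrying interface: the backend never exposes
    // its internal IDs, it just hands the caller's payload back.
    enum Event<P> {
        Message(P, Vec<u8>), // payload of the channel the message arrived on
        Closed(P),           // payload handed back as ownership returns
    }

    trait ReceiverSet<R, P: Clone> {
        fn add(&mut self, receiver: R, payload: P);
        fn select(&mut self) -> Vec<Event<P>>;
    }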

@antrik
Contributor

antrik commented Aug 25, 2016

(For the record, I didn't invent the approach I'm suggesting here as a better alternative: it's just a safe variant of what is known as "protected payloads" in the IPC mechanisms of many modern micro-kernels.)

@pcwalton
Contributor

@antrik I'm confused: is this patch OK to merge, or is it broken in some way?

@jrmuizel
Contributor Author

This patch exposes a potential race that will cause panics. It's probably better to keep leaking for now, until the race is fixed.

@antrik
Contributor

antrik commented Oct 19, 2016

I guess this can be closed in favour of #105 ?

@bors-servo
Contributor

☔ The latest upstream changes (presumably #102) made this pull request unmergeable. Please resolve the merge conflicts.

dlrobertson added a commit to dlrobertson/ipc-channel that referenced this pull request Nov 8, 2016
For unix OSes make sure to close the file descriptor on ChannelClosed.
The file descriptors should be closed in select to avoid leaking fds
after removal from pollfds (servo#96). After e06edbc this is safe and
avoids the previously seen race condition due to file descriptor
reuse.
dlrobertson added a commit to dlrobertson/ipc-channel that referenced this pull request Nov 10, 2016
@dlrobertson
Contributor

Since #105 has been merged this can probably be closed. @jrmuizel great work discovering this!

@jdm jdm closed this Nov 11, 2016
dlrobertson added a commit to dlrobertson/ipc-channel that referenced this pull request Nov 15, 2016