-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Guarantee that File::write_all
writes all data (or at least tries)
#4316
Guarantee that File::write_all
writes all data (or at least tries)
#4316
Conversation
No need to run CI on this one because I haven't changed any code yet, but I don't think I can disable it myself |
I'm not so sure about making this change for all However on the other hand, tasks spawned just before shutdown when the pool isn't full would be ok to always do it for. That is already behavior that can happen under the current implementation, so it would certainly not be breaking. |
I agree. I wouldn't change this bit of code, which handles that tokio/tokio/src/runtime/blocking/pool.rs Lines 174 to 186 in 4b6bb1d
So this would mean something like: once Note that even in the current implementation, it could be the case that once tokio/tokio/src/runtime/blocking/pool.rs Lines 262 to 267 in 4b6bb1d
and won't notice the In that sense, it's already possible that we'll complete all the blocking tasks in the queue before shutdown, even if all threads were taken at the time of shutdown. |
Interesting. |
I have confirmed this is the case. The following code: #[test]
fn blocking_tasks_block_shutdown() {
let rt = tokio::runtime::Builder::new_multi_thread()
.worker_threads(1)
.max_blocking_threads(1)
.build()
.unwrap();
rt.block_on(async {
for i in 0..10 {
println!(
"TT: {:?} Scheduling task {}",
std::time::Instant::now(),
i);
let j = i;
rt.spawn_blocking(move || {
println!(
"TT: {:?} Initiating task {}",
std::time::Instant::now(),
j
);
std::thread::sleep(
std::time::Duration::from_secs(3)
);
});
}
});
println!(
"TT: {:?} done scheduling tasks",
std::time::Instant::now(),
);
drop(rt);
} produces the following output:
Note that all tasks get executed even 10 seconds after calling |
As for ways forward, I can think of three alternatives:
In my opinion as a tokio noob, option 1 makes more sense. I want to hear what the experts think though. Any thoughts? |
Thoughts:
|
Keeping the blocking nature of
which we know is not true - the method may return before the last batch of data has actually been written to the file. Also, errors originating from the last write will not be returned to the user. The documentation also says that If you agree, I could make a change in https://github.com/tokio-rs/tokio/blob/master/tokio/src/io/util/write_all.rs#L40 and then we could deal with runtime shutdown and blocking tasks in a separate PR |
For one, this would affect all IO resources - they are not customizable. For another, I still think it's a bug for |
But is that a problem? I see that flush is a no-op on UNIX domain sockets and TCP streams. Not sure about other IO resources though.
As in, Or are you saying that |
The
I mean that the runtime should wait for the |
Got it, that makes sense. |
Ok makes sense. What if:
Does this sound like a promising approach? I have also thought about extending the runtime to know about There would still be some open questions:
|
This is more or less what I was thinking the solution would be myself, so it makes sense.
The names are ok for now.
The operations will continue in the background if you use them. If someone exits the process before that, then we lose the writes, but there's not anything we can do about that.
No. |
Cool! I'll look into it |
78931cc
to
6a7da9a
Compare
This is still a work in progress. I've clumsily written what I think is a fix. I have no idea whether I put the I am thinking about how to test the fix. I managed to test it manually using the code in the original issue. Without the fix, that test fails at most in a few hundred iterations. With the fix, I had to ctrl-c the script after 60K thousand iterations, as it wasn't failing anymore. However, this is not a very good test for the automated test suite. I was thinking about using loom::model(|| {
runtime.block_on(async {
let mut file = File::create("foo.txt").await?;
file.write_all(b"some bytes").await?;
Ok(())
});
drop(runtime);
// check file contains "some bytes"
}); but maybe it's not possible to do this... |
You can probably write literally the loom test you posted here. It would go in this directory. Though we might want the test to just use |
6a7da9a
to
dfe6b0d
Compare
I tried writing a loom test but couldn't manage to write one that would fail when using
The approach is not great for some reasons:
On the other hand, the test fails relatively quickly when using What do you think of the testing approach? Is it acceptable? Or should I keep trying to get a loom test that showcases the issue? Once we agree on the testing approach I will proceed to clean-up the implementation (of both the tests and the fix). |
This loom test fails, and if your PR is correct, then it should succeed if changed to use use crate::runtime::tests::loom_oneshot;
#[test]
fn spawn_blocking_should_always_run() {
loom::model(|| {
let rt = Builder::new_current_thread().build().unwrap();
let (tx, rx) = loom_oneshot::channel();
rt.spawn_blocking(|| {});
rt.spawn_blocking(move || {
let _ = tx.send(());
});
drop(rt);
// This call will deadlock if `spawn_blocking` doesn't run.
let () = rx.recv();
});
} Note that loom will catch the deadlock and panic, so the above test doesn't actually hang. You can put the test in one of the files in |
Thank you very much for helping out @Darksonn! That test fails as expected with |
Well there are a few builds failing that I could put more time into fixing, but I wanted to ask first whether the approach seems reasonable. I'm most interested in whether adding the Is it fine to put the |
tokio/src/blocking.rs
Outdated
F: FnOnce() -> R + Send + 'static, | ||
R: Send + 'static, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rustfmt doesn't look inside macro invocations.
F: FnOnce() -> R + Send + 'static, | |
R: Send + 'static, | |
F: FnOnce() -> R + Send + 'static, | |
R: Send + 'static, |
tokio/src/runtime/task/mod.rs
Outdated
/// This type holds two ref-counts. | ||
pub(crate) struct UnownedTask<S: 'static> { | ||
raw: RawTask, | ||
is_mandatory: bool, | ||
_p: PhantomData<S>, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it makes sense to put it here, but maybe we should either rename the struct to something like BlockingTask
or put the boolean in a wrapper struct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've moved things around a bit. I took your advice of defining a wrapper struct that contains a UnownedTask
and a is_mandatory
field. The wrapper struct is defined in pool.rs
, next to where it's used. Unit tests that use unowned
can keep using it without caring about the is_mandatory
field.
(I've also changed is_mandatory: bool
to a mandatory: Mandatory
where Mandatory
is a new enum. I did this to make the call-site of code using this parameter easier to read)
I do think that the approach makes sense. |
Ah, no, I had just missed the notification that you updated the PR. |
let (tx, rx) = loom_oneshot::channel(); | ||
let handle = runtime::spawn_mandatory_blocking(move || { | ||
let _ = tx.send(()); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this test not do an extra spawn_blocking
first like the other test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Each test is covering a different race related to shutting down:
- The first one makes sure that, if a thread in the blocking pool gets awaken and shutdown has been signaled already, all mandatory tasks will get executed. Reaching this scenario requires putting the initial
spawn_blocking_task
, otherwise the thread in the blocking pool will not get to check the shutdown flag before executing the mandatory task. - The second one makes sure that, if the runtime had shutdown by the time the spawning thread was spawning the mandatory task, an error will be communicated to the caller. (Or the contrapositive: if calling spawn_mandatory_blocking doesn't err, the task will be executed).
I've checked that both tests are useful by changing the implementations slightly and seeing them fail.
I could also write a test that covers the combination of both races if you want, but it wouldn't add a lot of value IIUC because both races are independent
Besides my comment on the error, I am fine with this. |
The `UnownedTask` type will have an extra field called `is_mandatory` which will determine whether the tasks get executed at shutdown. Maybe the `RawTask` is a better place to put this field, but I don't know my way around these types yet. I have verified that this code change fixes the problem by running the code in the original issue a hundred thousand times. Without the fix, that code consistently misses a write in the first few hundred executions. With the change, I've never seen it miss the write. I think we might be able to use `loom` to test this. Will try to do so.
This is just a prototype to communicate that I didn't get the loom test to work and to see if this testing approach would be accepted. It's not great, because the test implementation is very coupled to the blocking pool implementation (f.i, spawning a previous blocking task and awaiting on it before launching the offending task). However, it takes 150ms to run on my machine when it succeeds, and fails in the first few attempts when using `spawn_blocking` instead of `spawn_mandatory_blocking`, which is a good sign
The previous approach added it to the `UnownedTask` type. However, the type is also used in contexts where the concept of mandatory doesn't apply (unit tests). This new approach adds the information about "mandatory-ness" right where it makes sense: the blocking pool.
The purpose of this enum is to be just like the `is_mandatory: bool`, only that it makes call-site easier to understand. I have also introduced some methods in the `pool::Task` struct to make the code a bit nicer
There are only two pieces of code that use `spawn_blocking_inner`: 1. `tokio::fs::File`, which uses a mock of `spawn_blocking_inner` in the tests. 2. The loom test that covers it. All other test build result in the function not being used, hence the dead_code
When running `cargo test` on `<repo>/tokio`, you would expect that `rustc` only gets executed against `tokio` using the `--test` flag, which would make the `cfg_attr(test, ...)` conditional attribute apply. That's not true though. When running `cargo test` on `tokio`, the tokio crate gets built *twice*, once without `--test` and once with (this can be verified using strace). This is because, f.i, tokio specifies a dev-dependency on `tokio-test, which specifies back a dependency on `tokio`. So when running a regular test build, tokio will first get built without `--test`. We will not get dead code errors there because for that build, `tokio::fs::File` uses the new `spawn_mandatory_blocking`. For the next build, the one with `--test`, we will not get dead code errors because, even though `tokio::fs::File` uses a mock `spawn_mandatory_blocking`, the `cfg_attr(test` is working as expected. Things are different for loom builds. We will first get a build of tokio without `--test` but with the `--cfg loom` flag. The fact that `tokio::fs::File` uses `spawn_mandatory_blocking` won't save us here, because `tokio:fs` doesn't get compiled in loom builds The solution I can think of is extending the `cfg_attr(test` to `cfg_attr(any(test, loom)`, which is a bit unfortunate because `spawn_mandatory_blocking` *is* used in the loom tests, only not in the regular loom build of the crate, which is triggered by the cyclical dependency between tokio and tokio-test.
On a previous iteration of this PR, there could be a race between calling `spawn_mandatory_blocking` and dropping the runtime. This could result in f.i, a call to `File::write` returning successfully but the actual write not getting scheduled. Now `spawn_mandatory_blocking` will keep track of whether the task was actually scheduled. If not, the `File::write` method will return an error, letting users know that the write did not, and will not happen. I have also added a loom test that checks that the return value of `spawn_mandatory_blocking` can be trusted, even when shutting down from a different thread.
This way it is consistent with `asyncify`
7c9a83d
to
f96fa96
Compare
I believe this is done then? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, seems good.
Thanks! |
I missed this, but I just wanted to say thanks to all and this solution is simpler than anything I had thought of yet. Well done 👍 |
Motivation
Ref: #4296
#4296 (comment)
In some cases, one could time successfully awaited calls to
write_all
(orwrite
) with a shutdown of the runtime,and have the
write
not even be attempted. This can be a bit surprising.The purpose of this PR is to find a way (if possible) to fix that. There would be no guarantee that the
write
actuallysucceeds (any OS error could be hit at the time the write actually gets executed), but at least it would be attempted.
Solution
I have found a sequence of events that leads to
spawn_blocking
tasks being "ignored". I've written a note about itin a comment. I'm not sure if it's intentional that we won't try draining the queue of blocking tasks before shutting down.
Couldn't we tweak the shutdown logic to execute all tasks that were scheduled before the call to
shutdown
?If users are concerned about shutting down the runtime taking a long time because of blocking tasks, they can call
https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#method.shutdown_timeout or
shutdown_background
.The documentation in https://docs.rs/tokio/latest/tokio/runtime/struct.Runtime.html#shutdown says:
So it should already be expected that shutting down a runtime could block to some extent?
Do you think it would make sense to change the shutdown logic to execute all pending tasks? If so, I can figure out how to
do the code change.