
Tokio-uring design proposal #1

Merged: 21 commits, Aug 9, 2022
Conversation

carllerche
Member

@carllerche carllerche commented Mar 30, 2021

Rendered

The RFC proposes a new asynchronous Rust runtime backed by io-uring as a new crate: tokio-uring. The API aims to stay as close to idiomatic Tokio as possible, but will deviate when necessary to provide full access to io-uring's capabilities. It will also be compatible with existing Tokio libraries. The runtime will use an isolated thread-per-core model, and many types will be !Send.

The repo contains a proof-of-concept implementation.

```rust
spawn(async move { ... });
    }
})
```

It'd probably be best for there to be a clear model for how we might move an IO object from one thread to another, should such advanced load balancing become required. For example, a webserver might find that a bunch of high bandwidth clients ended up on a single thread, overloading that thread.

I'm actually of the opinion that an explicit runtime handle might be the way to deal with this. Consider something like this sketch:

```rust
struct LocalRuntime { ... } // !Sync

fn spawn_new_group<R, F: Future<Output = R> + Send>(f: impl FnOnce(&LocalRuntime) -> F) -> GroupHandle<R>;

impl LocalRuntime {
  fn spawn<R, F: Future<Output = R> + Send>(f: impl FnOnce(&LocalRuntime) -> F) -> SpawnHandle<R>;
}

impl TcpStream {
  fn bind_to_runtime<'a>(self, runtime: &'a LocalRuntime) -> LocalTcpStream<'a>;
}
```

Here we're spawning a bunch of tasks in a group, which lives on a single runtime. The tokio runtime can migrate this group, if needed (perhaps when instructed by the application); this might involve tracking the uring tasks issued by this group, and either forwarding results cross-thread or cancelling and reissuing those tasks during migration. Within the group, we can avoid synchronization overhead most of the time - ie, when we're not actively in the middle of a migration - as we know that all of our IO object handles are not shared between threads.

Member Author

Such a system could be implemented as an additional crate or at the application level. I am inclined to punt for now and explore balancing strategies in a "real app".

```rust
impl File {
    async fn read_at(&self, buf: buf::Slice, pos: u64) -> BufResult<usize>;

    async fn write_at(&self, buf: buf::Slice, pos: u64) -> BufResult<usize>;
}
```

Similarly, it should support writing from shared buffers, such as Bytes or Arc<[u8]>. I think using AsRef<[u8]> would actually be the best option here, since it's the most flexible.

Member Author

Bytes is possible, but probably better supported as an IoBuf backend. Supporting a trait is possible but adds complexity/overhead as the trait would probably need to be boxed into a trait object and stored internally while the operation is in-flight.


```rust
impl TcpStream {
    async fn close(self) { ... }
}
```

How does this differ from or relate to AsyncWrite's shutdown? Could these two APIs be unified somehow?


I don't think this is an "instead" question.
For glommio, I always provide shutdown and/or close functions.
However, there are situations in which it is simply not possible to call them. One example is traits that contain consuming functions, but there are also situations where, for instance, an LSM compaction finishes operating on a file stream while there are still readers (you can have logic to wait for the readers, it is just too complex).

The best way is indeed to provide a close function and steer users toward using it, but you'll have to choose one of the mechanisms for close-on-drop.

I heavily prefer async-close-on-drop. The "too many files" issue is not a big problem in practice, because queueing into the ring is always synchronous, and ultimately you have control over when to dispatch.

It's even possible to keep a counter of files that are asked-to-close-but-not-yet-closed, and force an io_uring_submit dispatch when the number becomes too big. The system call itself is non-blocking.


Sorry, I'm not quite sure what you're talking about here - both shutdown and close are asynchronous.

Either way, after reading more into the difference between shutdown and close, I finally understand why they are different: shutdown only shuts down the write end of the TcpStream and doesn't take ownership, whereas close is shutdown followed by actually closing the fd. I still think it's a bit confusing to have two AsyncDrop-like mechanisms, but I don't see an obvious way around it.

```rust
assert_eq!(slice.capacity(), 100);
```

A trait argument for reading and writing may be possible as a future
Member

```rust
// ensures there are a minimum number of workers
// in the runtime that are flagged as with capacity
// to avoid total starvation.
current_worker::wait_for_capacity().await;
```
Member

I like this.


Does this imply that "worker" is a new noun in this API? If so, is spawn_on_each_thread more like "spawn on each worker?" Or should this be current_thread::wait_for_capacity instead?

But also, if the implication is that the only way to saturate your runtime is to always be accepting connections on all threads that are dedicated to it, should spawn_on_each_thread take a FnMut() -> impl Future instead to remove the loop? The loop seems error prone, because if it ever exited (due to panic/etc) you'd have an idle thread spawned by the runtime, whereas with a function the runtime could restart threads.

Member Author

It could be current_thread; this snippet is just to illustrate the concept and is not intended for inclusion in tokio-uring. Either way, we can explore the options outside of the crate.

```rust
///
/// This is implemented as a new type to implement std::ops::Try once
/// the trait is stabilized.
type BufResult<T> = (std::io::Result<T>, buf::Slice);
```
Member

This seems a bit awkward because users cannot directly use ?.

Maybe we need a custom Error type that allows the fd and buffer to be taken out of it, or let the user pass &mut Option<Buf>, like https://github.com/quininer/ritsu/blob/master/src/actions/io.rs#L18

Member

The newtype could implement a method to convert into a Result and drop the buffer on errors, I suppose. let buf = file.read_(0, buf).await.result()?; or something.

Member Author (@carllerche, Apr 13, 2021)

The caller needs to get the buffer back on success. It is awkward, but also isn't intended to be the primary read / write API. The buf stream API (described later) will be more common.

```rust
my_tcp_stream.close().await;
```

The resource must still tolerate the caller dropping it without being explicitly closed.
Member

> In this case, tokio-uring will close the resource in the background, avoiding blocking the runtime.

We can be optimistic that closing will not cause congestion, because we also do this under epoll.

> The drop handler will move ownership of the resource handle to the runtime and submit cancellation requests for any in-flight operation. Once all existing in-flight operations complete, the runtime will submit a close operation.

This sounds quite complicated, and maybe it's easier to use reference counting.

Member Author

"move ownership" is probably a poor choice of words here. It is a logical transfer of ownership. The value is already stored in the runtime's internal storage. The implementation changes the operation state to Ignored, which signals to the runtime that it owns that state now.

socket, to the kernel using the submission queue. The kernel then performs the
operation. On completion, the kernel returns the operation results via the
completion queue and notifies the process. The `io_uring_enter` syscall flushes
the submission queue and acquires any pending completion events. Upon request,

What does it mean for a specific syscall to "acquire a completion event" ? (i.e. The syscall itself cannot own responsibility for handling that event, right? So is it handing off that responsibility to whomever in user space invoked the syscall ?)


(Or, I guess since the syscall is allowed to block the thread, it can have that responsibility? But below you say waiting for a minimum number of completion events; not all of the events... so I'm still confused I guess.)

Member Author

Acquire at the memory ordering level (think atomic load operation w/ Acquire ordering).

This read function takes ownership of the buffer; however, any pointer to the
buffer obtained from a value becomes invalid when the value moves. Storing the
buffer value at a stable location while the operation is in-flight should be
sufficient to satisfy safety.
Member

It's technically safe to return a new slice or pointer on each call to as_mut(), as bonkers as that may be, so it might need something like unsafe trait TrustedAsMut: AsMut<[u8]> {}

Member

If we hold ownership of T and guarantee to call as_mut only once, I think this is trusted.

Member Author

This is correct, but I am pretty sure that as long as as_mut() is called only once and the value is kept at a stable location, it should be fine... but that is also why I am punting 😆

```rust
    kind: Kind,
}

enum Kind {
```
Member

Is this just for concept purposes? Do you plan to instead actually use a vtable design, kind of like in bytes? Otherwise accessing the slice will require a match every single time, which carries the performance downsides of SBO.

Member Author

Might eventually, but I am not worrying about it now. It could be changed to avoid the branch on deref.

Member Author

I mean, eventually, we might "just" use bytes::Bytes.

@withoutboats

Overall there's a lot here that's similar to ringbahn. I'll leave some notes about the differences.

ringbahn is specifically designed to be externally extensible in a number of ways:

  1. A user of ringbahn can define how io_uring_enter gets called; this is done via what's called the Drive trait. ringbahn's code just requests a certain number of SQEs from the implementation of Drive, and then expects the drive implementer to somehow call ringbahn's complete function to wake tasks or clean up resources as events are completed. This is how ringbahn could allow both multi-threaded and single-threaded implementations (note that this means there is a Mutex internal to ringbahn, basically around its equivalent to this document's Lifecycle type, though it should be very low contention). Obviously, end users are not expected to implement this, but it's designed to allow multiple other frameworks built on top of it to share its code.
  2. A user of ringbahn can define their own operations; there's no centralized enum of all the possible operations as appears here in the State enum. This only really requires one modification to the core code: instead of storing the buffers in the State enum, the future type owns any memory associated with the operation, and if it's dropped, the equivalent of the Ignored enum takes ownership and holds a callback to clean up that state. This is beneficial not only because the set of primitive ops supported by io-uring is frequently growing, but because it could in theory allow users of ringbahn to define their own complex multi-step "operations" that submit multiple linked SQEs, tying that state to the completion of the whole chain.
  3. ringbahn's equivalent of TcpStream etc. is not extensible in terms of buffer strategy (i.e. there's no realistic way to use automatic buffer selection with ringbahn), but it is supposed to be, if I were ever able to work on it again. So rather than enumerating the kinds of buffers, it would ideally support arbitrary buffers that implement some trait to be called at the right times. I'm certain this is possible; my dayjob keeps me away from completing the work.

My impression is that extensibility in this manner has never been a priority for tokio, so I'm not surprised to see the differences. It's certainly easier if you don't do this.

The other big difference is that ringbahn's provided IO handles use AsyncBufRead instead of passing ownership in and out. I think any good implementation should support both (ringbahn doesn't provide the ownership passing API, but I believe it would be straightforward to implement on top of its existing APIs).

for the duration of their lifecycle. Each runtime thread will own a dedicated
submission and completion queue pair, and operations are submitted using the
submission queue associated with the current thread. Operation completion
futures will not implement `Send`, guaranteeing that they remain on the thread
@Diggsey Apr 8, 2021

This would have quite a large impact on the ecosystem. Is there a possibility to make the futures be Send, but only incur a significant performance penalty if polled on another thread?

For example, the source of the !Send bound appears to be the necessity to access the driver state when polling:
https://github.com/tokio-rs/tokio-uring/blob/master/src/driver/op.rs#L15

I could imagine a solution where the Op stores an Arc<MagicCell<driver::Inner>>. The idea behind MagicCell is that it would perform a runtime Sync check (i.e. check that the accessing thread is the same as the thread it was created on) before allowing access to the interior.

If this check fails, the Op would do something like this pseudocode:

```rust
let (tx, rx) = Proxy::new();
let original_op = mem::replace(self, rx);

original_thread.send(original_op.and_then(tx));
```

As long as tasks are fairly sticky, then the overhead would only be incurred for in-progress ops when a task is actually moved. Of course, there's the question of whether the overhead of the Arc and MagicCell would outweigh the benefits.

Member Author

It is possible, but part of the goal for tokio-uring is to provide a synchronization-free option.

Related to the "ecosystem", there already are !Send options out there (e.g. glommio), so it already is an issue to deal with.

@Diggsey commented Apr 8, 2021

How do you see this being used? Is the idea that application authors would directly build on top of tokio-uring instead of tokio?

If so, would it make sense to split the new public API (and associated execution model) from the underlying uring-based implementation, so that applications are still building against a platform-independent interface? (Even if the only back-end is linux specific right now)

AIUI io-uring is basically identical to IO completion ports on windows, with the innovation that operations can be submitted without a system-call, so it's plausible that there would be more than one "backend" that would benefit from this execution model. And it would also be desirable to have a fallback implementation based on mio for platforms that cannot support io-uring or other backend directly.

@nikomatsakis left a comment

one quick nit

@Kestrer commented Apr 8, 2021

io_uring is similar to IOCP in that they're both completion-based, but it has one major difference that makes it really hard to abstract over both: the threading model. io_uring naturally lends itself to a ring-per-thread model where each thread is self-contained and manages its own I/O. For Rust, this means thread-local runtimes, little synchronization and lots of !Send types. On the other hand IOCP is designed to work in work-stealing systems: there is typically one global IOCP instance that executes all I/O, and wakes up an essentially arbitrary thread when a completion status is received. For Rust, this means a single multithreaded work-stealing runtime, lots of synchronization and Send types.

One way to reconcile this difference is to always use a runtime-per-thread model, so on Windows there would be an IOCP instance for each thread (I don't know enough about IOCP to say whether this is a bad idea). Alternatively, there could be one global io_uring instance protected by a mutex, or an io_uring instance for each thread where submission queues are shared with a mutex but completion queues are thread-local.

I really don't know which solution is best; they all have disadvantages when run on platforms that don't natively support the model. I think in an ideal world, io_uring would support multiple submission/completion queues on a single io_uring instance so we can use a multithreaded work stealing model everywhere, but that might not ever happen.

Comment on lines +444 to +457
Internally, byte streams submit read operations using the default buffer pool.
Additional methods exist to take and place buffers, supporting zero-copy piping
between two byte streams.

```rust
my_tcp_stream.fill_buf().await?;
let buf: IoBuf = my_tcp_stream.take_read_buf();

// Mutate `buf` if needed here.

my_other_stream.place_write_buf(buf);
my_other_stream.flush().await?;
```

Is there a plan to support IORING_OP_SPLICE or IORING_OP_TEE?

Member Author

Yes, via inherent methods on relevant types. It should fit into the model proposed here.


```rust
// Read the first 1024 bytes of the file, when `std::ops::Try` is
// stable, the `?` can be applied directly on the `BufResult` type.
let BufResult(res, buf) = file.read_at(0, buf.slice(0..1024)).await;
```
@pickfire Apr 9, 2021

Is there any reason why we don't use Index here rather than having a separate slice function? We could do like buf[0..1024] and both should panic when the start or end index is out of the capacity.

I wonder if there is a way to avoid all the bound checking when passing the buffer over and back, because the size is usually very large but at the same time can still be expanded at runtime (Vec).


It's probably because Index is required to return a reference but .slice has to return an owned value.

@raftario commented Apr 9, 2021

I feel like keeping io_uring and IOCP runtimes separate would be the best choice. There's no question about a thread per core architecture being a better fit for io_uring, and while IOCP can be used that way too (and pretty efficiently afaik if you create the completion port with a concurrency value of one), it would still be a bunch of additional code to maintain and logic to generalise. Note that IOCP is also typically used with the threadpool provided by the Windows kernel itself, as illustrated by the .NET TPL (even though that will change in .NET 6).

The way I see it, "vanilla" tokio already does an amazing job at being an efficient and portable runtime. If you're willing to pay the complexity cost of using a thread per core architecture and dealing with completion based IO for the extra performance, you probably already know where that code is going to run. Separate, targeted runtimes seem like the obvious choice in that scenario, especially since depending on io_uring itself isn't particularly portable (a portable runtime without support for BSD and Darwin, and 90% of Linux machines out there (just throwing random numbers here), doesn't sound like the best way to build portable stuff).

Comment on lines +573 to +577
```rust
enum State {
    // An in-flight read operation, `None` when reading into
    // a kernel-owned buffer pool
    Read { buffer: Option<buf::Slice> },
```

Is it better to have two different State variants, where we are generic over the kernel-owned buffer pool and Vec (if they are always the same)? Would one variant always be None and the other always Some for Read?

Member Author

It might be, but before making that tweak I would like to measure the impact. It doesn't materially change the design though so we can explore it in the future.

```rust
}

/// The caller allocates a buffer
let buf = buf::IoBuf::with_capacity(4096);
```

How about letting IoBuf::with_capacity return a Result so that it can support fallible allocations without having to provide separate try_* methods?

Member Author

I assume you mean "without having to provide separate try_* methods"? I would default to following std's lead here with Vec. The question is whether or not supporting fallible allocation should be the primary use case.



```rust
let socket = listener.accept().await;

spawn(async move { ... });
```
@stuhood Apr 11, 2021

The socket: TcpStream is not itself a uring "operation", afaict, so it could be Send. Does that imply that load balancing could still happen between accepting connections and servicing them?

Load balancing of tasks to threads would then happen during spawn. You would spawn N tasks to accept (having all threads accept seems like overkill?), and upon accept they would spawn new tasks (which would likely go to other threads) to service the connections. The runtime would need to decide which thread to target each time it spawned a task, which could be tricky (long lived cheap tasks vs short lived expensive tasks, etc).

From an API perspective, the task would still be !Send, but the spawn function would require Send for its argument(s) in order to move the inputs to the task onto its permanent home thread.


This question is particularly important I think because the effect of current_worker::wait_for_capacity seems to be that users end up creating their own thread pools (i.e. "a pool of threads fed by a queue") with the assistance of the spawn_on_each_thread function. In the accept case, the accept call is what waits on your queue: but other usecases that spawn work recursively would need to wait on an explicit queue of work after their wait_for_capacity call (if they actually wanted to balance across all threads that were available, rather than having all spawned recursion pinned once it enters the runtime).

It's not clear that any extra flexibility provided by this API justifies having applications need to implement their own load-balancing around spawn calls.


Ah, sorry. Just noticed that this is addressed at the end in https://github.com/tokio-rs/tokio-uring/blob/design-doc/DESIGN.md#load-balancing-spawn-function . The idea that you might have load-balanced vs non-load-balanced spawn functions is interesting.

This read function takes ownership of the buffer; however, any pointer to the
buffer obtained from a value becomes invalid when the value moves. Storing the
buffer value at a stable location while the operation is in-flight should be
sufficient to satisfy safety.

as_mut() creates a borrow, so it's already not possible to move the value until read returns; the problem is ensuring the lifetime of T is at least as long as the event's completion (i.e. the hazard is Drop).


If the drop handler must process closing a resource in the background, it will
notify the developer by emitting a warning message using [`tracing`]. In the
future, it may be possible for Rust to provide a `#[must_not_drop]` attribute.

This doesn't seem compatible with panicking, and is needlessly unergonomic for the actual benefit it brings. Consider a TCP proxy that spawns many tasks, each of which splits two TcpStreams into two halves and copies between them concurrently.

In the happy case, you join on the (read A -> write B) and (read B -> write A) tasks then drop the sockets. There are no outstanding operations after joining, so the sockets may close immediately.

In an error case such as the A socket being reset, the (read A) task returns immediately with an error that indicates the (read B) task is doomed to fail too. There are two distinct possibilities:

  1. we propagate the error from (read A) and 'gracefully' terminate the proxying task, awaiting on close() etc. before the task is finally dropped
  2. we panic from (read A) and all the tasks and sockets are necessarily dropped

The runtime must still handle the panicking case; the only difference in (1) is that we keep an application task running for the duration. It's not obvious what accounting benefit this brings - as surely the wait_for_capacity function suggested earlier would be dependent on resource usage rather than task count which may vary significantly between application types.

This comes at the cost of forcing every function that takes ownership of an I/O object to await on close() and therefore be async, or to use some wrapper type with a custom Drop that spawns the close(), or (more likely) to be rewritten using mutable references only.

Member Author

You are correct re: panicking, which means drop must always be safe and correct, though not necessarily ideal. The idea of #[must_not_drop] is that it would be a lint and not enforced.

Just like with #[must_use] lints, if the developer means to drop and let cleanup happen in the background, they can silence the error (via some TBD mechanism).

```rust
// Checked-out from a pool
match my_buffer_pool.checkout() {
    Ok(io_buf) => ...,
    Err(e) => panic!("buffer pool empty"),
}
```

If "the buffer pool is empty" is the only possible error here, I'd prefer Option over Result, since it doesn't really feel like an error condition.

Member Author

Perhaps; there are conflicting patterns in std. I don't have a strong opinion there.


My general observation has been that Option should be used if the alternative case a) carries no information; b) is common, not exceptional; and c) is unlikely to later be propagated as an Err. I think all of those apply here, and therefore that Option is the better choice. The exception would be if you anticipate that checkout() can fail for different reasons as well. But even in that case, I'd prefer Result<Option<Pool>, E>.

thread-pool. The current API would remain, and the implementation would use
io-uring when supported by the operating system. The tokio-uring APIs may form
the basis for a Tokio 2.0 release, though this cannot happen until 2024 at the
earliest. As an intermittent step, the tokio-uring crate could explore
s/intermittent/interim/ :)

operation explicitly. The application also can create its own set of io-uring
queues using the io-uring crate directly.

### Load-balancing spawn function

Regarding load balancing: if we do client-side load balancing, which features should tokio-uring support on the server side? We should mention this.

The queue's single-producer characteristic optimizes for a single thread to own
a given submission queue. Supporting a multi-threaded runtime requires either
synchronizing pushing onto the submission queue or creating a queue pair per
thread. The tokio-uring runtime uses an isolated thread-per-core model,

Does the 'thread-per-core' model here imply CPU affinity as well? I wonder whether sched_setaffinity() is planned, as I could not see related implementation code yet.

Contributor

Thread-per-core in this case refers to its use of a single, current-thread runtime, where, if users want more worker threads, they can spawn more runtimes in parallel.

@Noah-Kennedy (Contributor) left a comment

Approving to get this in. We can have further discussion in new PRs to make changes.

@Noah-Kennedy Noah-Kennedy merged commit c357125 into master Aug 9, 2022
@Noah-Kennedy Noah-Kennedy deleted the design-doc branch August 9, 2022 19:33