object_store: Using io_uring? #4631

Open
JackKelly opened this issue Aug 2, 2023 · 5 comments
Labels
question Further information is requested

Comments

@JackKelly

Which part is this question about
object_store's code.

Describe your question
For Zarr, we may want to read on the order of 1 million parts of files per second (from a single machine). It's possible that the only way to achieve this performance will be to use io_uring to send many IO operations to the Linux kernel using just a single system call.

Would object_store ever consider implementing an async io_uring backend for get_ranges? (I may be able to write the PR, with some hand-holding!)

Additional context
io_uring is a relatively new Linux kernel interface that allows many IO operations - including both local file operations and network operations - to be submitted with a single system call, and completed without extra memory copies. Some database folks seem pretty excited about io_uring. Some benchmarks show that io_uring can deliver almost 20x more IOPS for random reads than the previous approach.
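For reference, here is a minimal sketch of how a Zarr reader might call the existing get_ranges API today (the store prefix, object location, and chunk ranges are made up for illustration; this is the call an io_uring backend could sit behind):

```rust
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};
use std::ops::Range;

#[tokio::main]
async fn main() -> object_store::Result<()> {
    // Hypothetical local store and object, for illustration only.
    let store = LocalFileSystem::new_with_prefix("/data/zarr")?;
    let location = Path::from("my_array/0.0.0");

    // One entry per compressed Zarr chunk we want to read.
    let ranges: Vec<Range<usize>> = vec![0..65_536, 65_536..131_072];

    // As I understand it, the local backend currently services these with
    // ordinary blocking file reads on a thread pool; the question is whether
    // an io_uring backend here would help at ~1 million ranges per second.
    let chunks = store.get_ranges(&location, &ranges).await?;
    println!("fetched {} chunks", chunks.len());
    Ok(())
}
```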

JackKelly added the question label on Aug 2, 2023
@tustvold
Contributor

tustvold commented Aug 2, 2023

I would probably want to see some numbers and go from there. It isn't immediately obvious to me that io_uring would be beneficial for reading immutable chunks of data from disk, especially if the workload is doing any non-trivial computation alongside. The major argument I've heard for it is in systems doing mixed IO, or with custom buffer pooling, neither of which applies to object_store.

@JackKelly
Author

OK, cool, that's good to know. Thank you for your quick reply. No worries at all if object_store isn't the right place for this functionality.

Just to make sure we're on the same page, let me give a little more detail about what I'd ultimately like to do...

First, some context: Zarr has been around for a while. As you probably know, the main idea behind Zarr is very simple: We take a large multi-dimensional array and save it to disk as multi-dimensional, compressed chunks. The user can request an arbitrary slice of the overall array, and Zarr will load the appropriate chunks, decompress them, and merge them into a single ndarray. Zarr-Python, the main implementation of Zarr, is currently single-threaded.

We're now exploring ways to use multiple CPU cores in parallel to load, decompress, and copy each decompressed Zarr chunk into a "final" array, as fast as possible. (Many Zarr users would benefit if Zarr could max out the hardware.)

If we were to implement our own IO backend using io_uring, we might first submit our queue of, say, 1 million read operations to the kernel. Then we'd have a thread pool (or perhaps an async executor) with roughly as many threads as there are logical CPU cores. Each worker thread would run a loop which grabs a completed read from the io_uring completion queue, immediately decompresses that chunk, and then - while the decompressed data is still in the CPU cache - writes the decompressed chunk into the final array in RAM. So the load, decompression, and copy steps would need to happen in very quick succession, and ideally within a single thread per chunk, to keep the code as cache-friendly as possible.
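Roughly, the per-worker loop I have in mind would look something like the sketch below (purely illustrative and untested: the channel stands in for io_uring's completion queue, decompress is a placeholder for the real codec, and a real version would write into disjoint regions of the output array without a lock):

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// A completed read: which chunk it is plus the compressed bytes.
/// (Stand-in for an entry popped from io_uring's completion queue.)
struct CompletedRead {
    chunk_index: usize,
    compressed: Vec<u8>,
}

/// Placeholder for the real Zarr codec (e.g. Blosc or Zstd).
fn decompress(compressed: &[u8]) -> Vec<u8> {
    compressed.to_vec()
}

fn main() {
    let (tx, rx) = mpsc::channel::<CompletedRead>();
    let rx = Arc::new(Mutex::new(rx));

    // The "final" array, one slot per chunk to keep the sketch simple.
    let n_chunks = 8;
    let output: Arc<Mutex<Vec<Vec<u8>>>> =
        Arc::new(Mutex::new(vec![Vec::new(); n_chunks]));

    // Roughly one worker per logical CPU core.
    let n_workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(4);
    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            let output = Arc::clone(&output);
            thread::spawn(move || loop {
                // 1. Grab the next completed read (in reality: an io_uring CQE).
                //    The inner block releases the channel lock before decompressing.
                let msg = { rx.lock().unwrap().recv() };
                let Ok(read) = msg else { break };
                // 2. Decompress immediately, while the bytes are hot in cache.
                let decompressed = decompress(&read.compressed);
                // 3. Copy into the final array on the same thread.
                output.lock().unwrap()[read.chunk_index] = decompressed;
            })
        })
        .collect();

    // Pretend the kernel has completed all the reads.
    for i in 0..n_chunks {
        tx.send(CompletedRead { chunk_index: i, compressed: vec![0u8; 1024] }).unwrap();
    }
    drop(tx); // close the queue so the workers exit

    for w in workers {
        w.join().unwrap();
    }
}
```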

Would you say that object_store isn't the right place to implement this batched, parallel "load-decompress-copy" functionality? Even if object_store implemented an io_uring backend, my guess is that it wouldn't be appropriate to modify object_store to allow processing of chunk n-1 while chunk n is still being loaded. (If that makes sense?!) Instead, we'd first call object_store's get_ranges function, then await the Future it returns, which only yields data once all the chunks have been loaded. So we couldn't decompress chunk n-1 while chunk n is still loading. Is that right?

@tustvold
Contributor

tustvold commented Aug 3, 2023

I would suggest first getting something simple working with tokio::spawn, or some other threadpool abstraction, and the existing APIs, and then going from there. I would recommend against reaching for solutions like io_uring until you have confirmed that simpler solutions are insufficient; from what I understand of your use-case, I'm not sure io_uring would yield tangible benefits.
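To make that concrete, here's a minimal, untested sketch of that approach (the store prefix, chunk ranges, and decompress are placeholders): each chunk gets its own spawned task that fetches via the existing get_range and then decompresses, so decompression of chunk n-1 overlaps with IO for chunk n:

```rust
use object_store::{local::LocalFileSystem, path::Path, ObjectStore};
use std::ops::Range;
use std::sync::Arc;

/// Placeholder for the real Zarr codec.
fn decompress(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}

#[tokio::main]
async fn main() -> object_store::Result<()> {
    // Hypothetical store and chunk layout, for illustration only.
    let store: Arc<dyn ObjectStore> =
        Arc::new(LocalFileSystem::new_with_prefix("/data/zarr")?);
    let location = Path::from("my_array/0.0.0");
    let ranges: Vec<Range<usize>> =
        (0..1000).map(|i| i * 65_536..(i + 1) * 65_536).collect();

    // One task per chunk: fetch, then decompress on the same task, so
    // decompression of earlier chunks overlaps with IO for later ones.
    let handles: Vec<_> = ranges
        .into_iter()
        .enumerate()
        .map(|(i, range)| {
            let store = Arc::clone(&store);
            let location = location.clone();
            tokio::spawn(async move {
                let compressed = store.get_range(&location, range).await?;
                // spawn_blocking keeps CPU-heavy decompression off the IO threads.
                let chunk = tokio::task::spawn_blocking(move || decompress(&compressed))
                    .await
                    .expect("decompression task panicked");
                Ok::<_, object_store::Error>((i, chunk))
            })
        })
        .collect();

    for handle in handles {
        let (_index, _chunk) = handle.await.expect("task panicked")?;
        // Here you'd copy the decompressed chunk into its slot in the final array.
    }
    Ok(())
}
```

(A real version would probably also cap the number of in-flight requests rather than spawning everything at once.)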

@JackKelly
Author

Just a quick update... I am hoping to provide some benchmarks within a few months. More details here: JackKelly/light-speed-io#27

@criccomini
Contributor

I've got no horse in this race, but AnyBlob and their paper are worth a look:

https://github.com/durner/AnyBlob
Exploiting Cloud Object Storage for High-Performance Analytics

They're using io_uring to accelerate cloud object downloads.
