object_store: Using `io_uring`? #4631

JackKelly · 2023-08-02T15:36:07Z

Which part is this question about
object_store's code.

Describe your question
For Zarr, we may want to read on the order of 1 million parts of files per second (from a single machine). It's possible that the only way to achieve this performance will be to use io_uring to send many IO operations to the Linux kernel using just a single system call.

Would object_store ever consider implementing an async io_uring backend for get_ranges? (I may be able to write the PR, with some hand-holding!)

Additional context
io_uring is a newish feature of the Linux kernel that allows for requesting many IO operations with a single system call - including local file operations and network operations - without any memory copying, and with minimal system calls. Some database folks seem pretty excited about io_uring. Some benchmarks show that io_uring can deliver almost 20x more IOPS for random reads than the previous approach.


tustvold · 2023-08-02T18:42:35Z

I would probably want to see some numbers and go from there. It isn't immediately obvious to me that io_uring would be beneficial for reading immutable chunks of data from disk, especially if the workload is doing any non-trivial computation alongside. The major argument I've heard is for systems doing mixed IO, or with custom buffer pooling, neither of which object_store is.

JackKelly · 2023-08-03T11:11:49Z

OK, cool, that's good to know. Thank you for your quick reply. No worries at all if object_store isn't the right place for this functionality.

Just to make sure... please let me give a little more detail about what I'd ultimately like to do...

First, some context: Zarr has been around for a while. As you probably know, the main idea behind Zarr is very simple: We take a large multi-dimensional array and save it to disk as multi-dimensional, compressed chunks. The user can request an arbitrary slice of the overall array, and Zarr will load the appropriate chunks, decompress them, and merge them into a single ndarray. Zarr-Python, the main implementation of Zarr, is currently single-threaded.

We're now exploring ways to use multiple CPU cores in parallel to load, decompress, and copy each decompressed Zarr chunk into a "final" array, as fast as possible. (Many Zarr users would benefit if Zarr could max-out the hardware).

If we were to implement our own IO backend using io_uring, we might first submit our queue of, say, 1 million read operations to the kernel. Then we'd have a thread pool (or perhaps we'd use an async executor) with roughly as many threads as there are logical CPU cores. Each worker thread would run a loop which starts by grabbing data from the io_uring completion queue, then immediately decompresses the chunk, and then - while the decompressed data is still in the CPU cache - writes the decompressed chunk into the final array in RAM. So we'd need the load, decompression, and copy steps to happen in very quick succession, and ideally within a single thread per chunk (to make the code as "cache-friendly" as possible).
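To make the shape of that worker loop concrete, here is a rough sketch using only the Rust standard library. It does not use io_uring at all: an `mpsc` channel stands in for the completion queue, and `decompress` is a placeholder that just copies bytes. `Completion`, `CHUNK_LEN`, and `load_decompress_copy` are names made up for illustration; none of them come from object_store or from any io_uring crate.

```rust
use std::sync::mpsc;
use std::sync::Mutex;
use std::thread;

/// Size of one decompressed chunk in bytes (illustrative value).
const CHUNK_LEN: usize = 4;

/// Hypothetical stand-in for one entry popped off a completion queue.
struct Completion {
    chunk_index: usize,
    compressed: Vec<u8>,
}

/// Placeholder "decompression": a real codec would go here.
fn decompress(compressed: &[u8]) -> Vec<u8> {
    compressed.to_vec()
}

/// Drain `completions` with `n_workers` threads. Each worker decompresses a
/// chunk and immediately copies it into that chunk's slot of `final_array`.
/// Assumes every decompressed chunk is exactly as long as its slot.
fn load_decompress_copy(
    completions: mpsc::Receiver<Completion>,
    final_array: &mut [u8],
    n_workers: usize,
) {
    // One lock per output slot, so workers writing different chunks never contend.
    let slots: Vec<Mutex<&mut [u8]>> =
        final_array.chunks_mut(CHUNK_LEN).map(Mutex::new).collect();
    // The receiver is shared behind a lock; workers take turns waiting on it.
    let queue = Mutex::new(completions);

    thread::scope(|s| {
        for _ in 0..n_workers {
            s.spawn(|| loop {
                // The lock guard is released at the end of this statement.
                let next = queue.lock().unwrap().recv();
                let Ok(c) = next else { break }; // channel closed: no more work
                // Load -> decompress -> copy, all on the same thread.
                let data = decompress(&c.compressed);
                let mut slot = slots[c.chunk_index].lock().unwrap();
                slot.copy_from_slice(&data);
            });
        }
    });
}
```

The point of the sketch is only that decompression and the copy into the final array happen back to back on whichever thread picked up the completion; how completions actually arrive is left open.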

Would you say that object_store isn't the right place to implement this batched, parallel "load-decompress-copy" functionality? Even if object_store implemented an io_uring backend, my guess is that it wouldn't be appropriate to modify object_store to allow for processing to be done on chunk n-1 whilst chunk n is still being loaded. (If that makes sense?!) Instead, we'd first call object_store's get_ranges function. Then we'd await the Future returned by get_ranges, which will only return data when all the chunks have been loaded. So we couldn't simultaneously decompress chunk n-1 whilst loading chunk n. Is that right?

tustvold · 2023-08-03T13:28:17Z

I would suggest first getting something simple working with tokio::spawn, or some other threadpool abstraction, and the existing APIs, and then go from there. I would recommend against reaching for solutions like io_uring until you have confirmed that simpler solutions are insufficient. From what I understand of your use-case, I'm not sure io_uring would yield tangible benefits.
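As one possible reading of "something simple", the sketch below reads byte ranges from a local file in parallel with plain scoped threads and positional reads. It is an assumption-laden illustration rather than object_store's implementation: it uses `std::thread::scope` in place of `tokio::spawn`, relies on the Unix-only `FileExt::read_exact_at`, and `read_ranges_parallel` is a hypothetical name. It also assumes every range is valid (`start <= end`) and lies within the file.

```rust
use std::fs::File;
use std::io;
use std::ops::Range;
use std::os::unix::fs::FileExt; // positional reads (pread); Unix-only
use std::thread;

/// Read every byte range in `ranges` from `file`, splitting the work across
/// up to `n_workers` threads. Results come back in the same order as `ranges`.
fn read_ranges_parallel(
    file: &File,
    ranges: &[Range<u64>],
    n_workers: usize,
) -> io::Result<Vec<Vec<u8>>> {
    if ranges.is_empty() {
        return Ok(Vec::new());
    }
    // Ceiling division so that every range lands in exactly one batch.
    let n = n_workers.max(1);
    let batch_len = (ranges.len() + n - 1) / n;

    thread::scope(|s| -> io::Result<Vec<Vec<u8>>> {
        // Spawn one worker per contiguous batch of ranges.
        let handles: Vec<_> = ranges
            .chunks(batch_len)
            .map(|batch| {
                s.spawn(move || -> io::Result<Vec<Vec<u8>>> {
                    let mut out = Vec::with_capacity(batch.len());
                    for r in batch {
                        let mut buf = vec![0u8; (r.end - r.start) as usize];
                        // Positional read: no shared file cursor, so threads don't interfere.
                        file.read_exact_at(&mut buf, r.start)?;
                        out.push(buf);
                    }
                    Ok(out)
                })
            })
            .collect();

        // Join in spawn order so the output order matches `ranges`.
        let mut all = Vec::with_capacity(ranges.len());
        for h in handles {
            all.extend(h.join().expect("worker thread panicked")?);
        }
        Ok(all)
    })
}
```

A baseline like this gives the numbers to compare against before deciding whether io_uring is worth the extra complexity.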

JackKelly · 2024-01-23T17:27:34Z

Just a quick update... I am hoping to provide some benchmarks within a few months. More details here: JackKelly/light-speed-io#27

criccomini · 2024-08-14T23:49:19Z

I've got no horse in this race, but AnyBlob and their paper are worth a look:

https://github.com/durner/AnyBlob
Exploiting Cloud Object Storage for High-Performance Analytics

They're using io_uring to accelerate cloud object downloads.

JackKelly added the question Further information is requested label Aug 2, 2023

haruband mentioned this issue Dec 8, 2023

2024 Roadmap ingkle-oss/deltaquery#2

Open

10 tasks

JackKelly mentioned this issue Jan 23, 2024

Does LSIO need to exist?! Does object_store already do everything we need? If not, can we extend object_store instead of creating LSIO? JackKelly/light-speed-io#27

Closed

JackKelly mentioned this issue Jan 30, 2024

Share benchmarks with object_store folks. JackKelly/light-speed-io#35

Open

tustvold mentioned this issue Mar 16, 2024

Glommio based IO version for Parquet #5240

Closed
