Allow batched/concurrent (de)compression support #1398
Comments
I think this is a great idea!
Hi @akshaysubr - thanks for sharing this idea, and sorry for missing this notification the first time around. I think your suggestion of a multithreaded download / decompression layer is a really great one. It would be useful not only for the sharded format but also for existing datasets. Today we can achieve something similar by using Zarr together with Dask Array in threaded mode. However, there are many scenarios where a user would want this capability (e.g. in ML workflows, as you described) but Dask would be overkill. So I strongly support pursuing this direction. There is a bit of design work to be done to figure out the best way to implement this feature. It might be helpful to have a synchronous brainstorming session (including @zarr-developers/python-core-devs) on this.
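For reference, here is a minimal sketch of the Dask-based workaround mentioned above. The store path and the computation are illustrative; the point is just that the threaded scheduler already overlaps chunk fetch and decompression:

```python
import dask.array as da

# Open an existing Zarr array lazily as a Dask array; each Zarr chunk
# becomes an independent task that is fetched and decompressed on its own.
x = da.from_zarr("example.zarr")

# The threaded scheduler overlaps chunk download and decompression across
# chunks, which is roughly what a built-in batched layer would provide
# without pulling in Dask.
result = x.mean().compute(scheduler="threads")
```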
@rabernat great, thanks for pointing me here. I'd be happy to help out here because we also need multithreaded chunk fetching. Please loop me in on the brainstorming session.
I would caution against complexity here, at least for the CPU case, for the following reasons:
Having said all that, the focus here is GPUs, where the cost per call is high compared to the throughput. The points are well made! Skirting, for the moment, the problem of IO directly to the GPU, I find the scenario described attractive but perhaps tricky to implement. It sort of points towards an extra layer in the current chain of operations. It ends up similar to calling dask with partition sizes larger than the native zarr chunk sizes. I guess at this point, it would be good to see what a "batch decode" method might look like, and then we can discuss how it might replace or augment the current loop. (A note on #547 - this feels like what kerchunk achieves?)
This sounds really interesting! We have a similar use case to the one outlined in the original post... This might be a bit off-topic, but perhaps it's useful to give some detail on our use case (because I suspect we're not alone!) to help give context to the current discussion. Our use case:
For us, I think the ideal solution would be something like this:
Earlier in the year, I was seriously considering attempting to implement a Zarr library in Rust (with sharding support)... but work is getting increasingly busy, so I'll struggle to get this done... and new developments might solve most of my problems. One last thought: on the question of "how much does concurrency help Zarr read speeds", it might be interesting to benchmark tensorstore / TileDB / caterva (which all implement multithreaded reading of chunks, IIRC). I'm currently hoping to benchmark these things later this year (but no promises!).
I think it is at least worth trying to quantitatively examine this assumption under the new parameters that will be opened up by sharding. Specifically, say we are fetching a lot of really tiny chunks from S3 as part of a single shard: for example, a 10 MB shard containing 1000 10 KB chunks. In this case, we really only need one request to S3, so the opportunity for async concurrency is limited. In our current implementation, we would then have to loop over 1000 10 KB chunks, decompress each one serially, and assemble them into a contiguous array. On a basic level, it's very hard to imagine that parallelism of some sort would not be able to help here. Multithreaded parallelism on the CPU should be effective (provided we can find a way to release the GIL), or parallelism within the GPU. Very knowledgeable people on this thread, with a lot of close-to-the-metal experience with high-performance I/O pipelines, think this is a promising direction to explore. Someone should try to implement `batched_encode` and `batched_decode` to test these ideas.
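As a rough sketch of the experiment suggested above (a hypothetical helper, not an existing numcodecs API), a thread-pool batched decode could look like the following, with the caveat that it only beats the serial loop if the codec releases the GIL inside its compiled decompression call:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_batch_threaded(codec, encoded_chunks, max_workers=8):
    """Decode many small chunks concurrently with a single thread pool.

    Any speedup over the serial loop depends on ``codec.decode`` releasing
    the GIL while the compiled decompressor runs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(codec.decode, encoded_chunks))

def decode_batch_serial(codec, encoded_chunks):
    """Baseline: the per-chunk loop zarr effectively performs today."""
    return [codec.decode(chunk) for chunk in encoded_chunks]
```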
This seems like a pretty glaring problem for performance?
I feel like this must be solvable. Nearly all of the codecs are implemented in a compiled language: C, Cython, etc. If not, we can implement the store / codec pipeline in Rust! 🦀 The NVIDIA folks on this thread don't rely on numcodecs; they have their own GPU-native codecs.
Noting order-of-magnitude timescales, to decompress 10_000 bytes of original data:
Your batched decompress had better release the GIL for the entirety of the process, rather than repeatedly acquiring and releasing it...
Happy to help with this; you might want to check out ...
@rabernat This sounds like a good idea. Please keep me in the loop for this brainstorming discussion.
I agree about having concurrent I/O requests in flight being more important than just threading. In terms of I/O being the bottleneck though, we've found this to not always be true in our ML training use cases for a few reasons, many of which are particular to the specific use case:
We are in the process of implementing this right now. Tagging @Alexey-Kamenev, who is working on a draft implementation currently. Will post some snippets here in a couple of weeks once we have something satisfactory. @JackKelly Your use case sounds astonishingly similar to ours. We also have the exact same issue using zarr in an ML training pipeline in PyTorch. The solution we've been using is to do some initial dataset curation before training and use DALI for asynchronous data loading and prefetching to overlap compute and I/O. We currently turn off compression, mainly to avoid decompression bottlenecks. Adding GPU compression and reading directly from the original zarr dataset would help all parts of the pipeline, from I/O to local cache capacity to higher effective PCIe bandwidth for CPU->GPU transfers.
@rabernat had benchmarked this a while ago in this post. Here is another benchmark that we did internally:
We were actually planning on using numcodecs, similar to what's in kvikio currently (but without batched encode/decode). In nvCOMP, we have some GPU-native codecs like GDeflate, but we also support many standard codecs including LZ4, Snappy, Deflate and Zstandard. One of the big reasons for wanting batched encode and decode is that one can create a zarr store using a standard LZ4 CPU codec and then decompress those same chunks on the GPU. In general, I wanted to provide some context on how GPU decompression works in nvCOMP. The idea is similar to parallel CPU decompression, with multiple chunks being decompressed independently, but we also add fine-grained parallelism so multiple CUDA threads can cooperatively process each chunk. Both these dimensions of parallelism are important to get maximum throughput. Here is an image that describes this better:
Wanted to update here that we now have an implementation of a batched nvCOMP-based codec. Here's the basic structure of the API:

```python
class NvCompBatchCodec(Codec):
    def __init__(
        self,
        algorithm: str,
        options: Optional[Mapping[str, Any]] = None,
        stream: Optional[cp.cuda.Stream] = None,
    ) -> None:
        ...

    def encode(self, buf):
        return self.encode_batch([buf])[0]

    def encode_batch(self, bufs: List[Any]) -> List[Any]:
        ...

    def decode(self, buf, out=None):
        return self.decode_batch([buf], [out])[0]

    def decode_batch(
        self, bufs: List[Any], out: Optional[List[Any]] = None
    ) -> List[Any]:
        ...
```

Here, the regular `encode` and `decode` methods just dispatch to the batched versions with a single-element batch. This relies on adding default `encode_batch` and `decode_batch` methods to the numcodecs `Codec` base class:

```python
class Codec:
    """Codec abstract base class."""

    @abstractmethod
    def encode(self, buf):
        ...

    @abstractmethod
    def decode(self, buf, out=None):
        ...

    def encode_batch(self, bufs: List[Any]) -> List[Any]:
        return [self.encode(b) for b in bufs]

    def decode_batch(
        self, bufs: List[Any], out: Optional[List[Any]] = None
    ) -> List[Any]:
        if out is None:
            out = [None for i in range(len(bufs))]
        return [self.decode(b, o) for b, o in zip(bufs, out)]
```

Then zarr-python can always call into the batched API and have that default to the serial loop, or, if a codec supports it, leverage parallel (de)compression.
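To make the intended call pattern concrete, here is a hypothetical usage sketch built on the classes above (the algorithm name, chunk shapes, and constructor arguments are illustrative, not a confirmed kvikio API):

```python
import numpy as np

# GPU-capable codec from the sketch above: decode_batch is overridden, so a
# whole shard's worth of chunks can be handed over in a single call.
gpu_codec = NvCompBatchCodec("lz4")

chunks = [np.random.rand(64, 64).tobytes() for _ in range(1000)]
encoded = gpu_codec.encode_batch(chunks)
decoded = gpu_codec.decode_batch(encoded)

# A plain CPU codec that never overrides the batch methods still works:
# the Codec base-class defaults simply fall back to the serial loop.
```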
Just to connect up two related conversations... Over in the "Zarr Benchmarking & Performance" group, we've briefly discussed (amongst other things) batched async IO and decompression (possibly implemented in Rust). Of course, we should probably keep any focused discussion about batched decompression to this current issue. But I just wanted to flag up the connection. And it's quite likely that we'll discuss related issues in our twice-monthly calls (starting in Sept), so I'd encourage folks who are interested in batched (de)compression to register their availability for a regular meeting.
@akshaysubr - curious if you feel that the current state of 3.0 concurrent compression is sufficient to close this. Some early performance numbers sure seem to indicate we've really moved the ball on this one.
Yeah, I think we can close this issue as done. The batched Codec APIs, the way they are exposed now, are good, and your performance numbers reflect that as well. If any other concerns come up, they can be raised as separate issues.
I have a proposal for an enhancement and would like to get feedback on this and potentially better ways of achieving the same goal.
In many use cases that are read-I/O-bandwidth bound, like dataloaders for AI training, compression is typically turned off because it would end up being the bottleneck. We can get upwards of 10 GB/s of read throughput either from an object store or from a distributed filesystem like Lustre, but CPU decompression typically maxes out at ~1 GB/s, and that is usually with coarse-grained multithreaded parallelism. A nice solution to this problem would be to decompress data on the GPU, which we can do at > 50 GB/s. This would make decompression no longer the bottleneck while speeding up all other parts of the pipeline: lower use of storage, lower data volume fetched from storage, higher effective cache capacity if caching locally, and faster CPU-GPU transfers. The issue with any kind of parallel decompression is that the API needs to support batched or concurrent decompression rather than calling decompression on chunks serially in a loop.
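To illustrate the bottleneck arithmetic with the numbers above (the 3x compression ratio is an illustrative assumption, not a measured figure):

```python
def pipeline_throughput(read_gbps, decompress_gbps, compression_ratio):
    """Effective throughput of an overlapped read-then-decompress pipeline,
    in GB/s of uncompressed data."""
    # Reading compressed bytes delivers uncompressed data faster by the
    # compression ratio; the slower stage then caps the pipeline.
    return min(read_gbps * compression_ratio, decompress_gbps)

# 10 GB/s storage, 3x ratio, ~1 GB/s CPU decompression -> ~1 GB/s overall.
print(pipeline_throughput(10, 1, 3))
# Same storage, GPU decompression at ~50 GB/s -> storage-bound at ~30 GB/s.
print(pipeline_throughput(10, 50, 3))
```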
Here is some data comparing throughput of some GPU decompression algorithms in nvcomp to multithreaded zstd on the CPU:
This is conceptually similar to #547 with concurrent chunk accesses, but for compression/decompression.
The idea is to allow `Codec`s in numcodecs to implement a `batched_encode` and `batched_decode` in addition to the current `encode` and `decode` methods. When a codec has these methods available, zarr can dispatch a batch of chunks for encode/decode. The codec implementation can then either use a serial loop, multi-threaded parallelism, or parallelize on the GPU using nvCOMP. I'm envisioning this to be quite similar to `getitems` here: `zarr-python/zarr/core.py`, line 2061 at commit 2ff8875.
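A minimal sketch of what that dispatch could look like on the zarr side (the helper name `_decode_chunks` is hypothetical; the real integration point would be the chunk pipeline next to `getitems`):

```python
def _decode_chunks(codec, encoded_chunks):
    """Hypothetical helper: hand the whole batch to the codec when it
    supports it, otherwise fall back to the existing per-chunk loop."""
    if hasattr(codec, "batched_decode"):
        # Parallel-capable codecs (threads or GPU) see all chunks at once.
        return codec.batched_decode(encoded_chunks)
    # Codecs without batch support keep the current serial behaviour.
    return [codec.decode(chunk) for chunk in encoded_chunks]
```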
From a GPU (de)compression standpoint, we have thought about using the sharding transformer format in ZEP0002 as the internal format in nvcomp. But after some consideration, this batched compression approach seems much more useful because:
Would really appreciate any suggestions, concerns or other ideas that I might be missing. Also cc @jakirkham since he mentioned that there might be some intersection with Blosc.