chunks get combined in 4d array reshape #5544
I think your intuition that the intermediate rechunk-merge and split / reshape can be avoided is correct. The trick is probably to figure out when the output chunks align with the input chunks. If no one beats me to it, I can take a look next week. |
I'm curious, if the order of the dimensions is flipped then does this problem persist? I.e. (5, 20, 20) -> (5, 400)? |
My guess is that the implicit C-ordering of data might make the intermediate reshaping actually necessary in some cases, even if we don't care in this case. |
It's the same. Same number of tasks and graph structure. |
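The symmetry can be checked directly. A small sketch (array sizes chosen for illustration, not taken from the original report) that builds both orientations and compares their task graphs:

```python
import dask.array as da

# (20, 20, 5) -> (400, 5): merge the two leading axes
a = da.ones((20, 20, 5), chunks=(10, 10, 5))
b = a.reshape(400, 5)

# flipped, (5, 20, 20) -> (5, 400): merge the two trailing axes
c = da.ones((5, 20, 20), chunks=(5, 10, 10))
d = c.reshape(5, 400)

# per the comment above, both graphs have the same number of tasks
print(len(dict(b.__dask_graph__())), len(dict(d.__dask_graph__())))
```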
I have a similar problem, but perhaps even simpler. I just want to reshape an array with chunk sizes of 1 along one dimension. I would like the reshaped array to have chunksize (1,1).
This is important to me because I want |
cc @TomAugspurger in case he's interested in this from the Anaconda/Pangeo angle |
@nbren12 to confirm, your ideal chunking in the output of |
@TomAugspurger That is correct. I have been running out of memory due to having too big chunks, and was trying to very carefully ensure that "chunks in= chunks_out". |
Just to provide some context, my workflow involves open a list of urls as dask delayed objects which are wrapped with xarray. The url string is parsed into dimension information, which I can represent as a stacked "xarray" dimension. I want to ensure that |
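To make the request concrete, here is a sketch of the split case being described (shapes are illustrative). The ask is that splitting an axis with chunksize 1 keeps chunksize 1 on the new axes:

```python
import dask.array as da

# chunksize 1 along the axis to be split
a = da.ones((4, 6), chunks=(1, 6))

# split (4,) -> (2, 2); the request in this comment is that both new
# axes keep chunksize 1 rather than being merged into larger chunks
b = a.reshape(2, 2, 6)
print(b.chunks)   # ideally ((1, 1), (1, 1), (6,))
```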
I just ran into this as well, using chunksize=1 dimensions and seeing my graphs get complex. In the case of chunksize=1 dimensions, there is the possibility of a shortcut implementation. It seems pretty thorny to get this optimal for all cases. A less trivial case could be demonstrated with:
In this case, the individual chunks are not contiguous and cannot simply be reshaped and stuck end-to-end. That said, the above case gets rechunked in a way
which I find odd. Here's an example of a fastpath implementation for chunksize=1: |
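The fastpath code referenced above is not reproduced in this thread. As a rough illustration only (not dask's actual implementation), a merge of a chunksize-1 leading axis can be emulated with slicing and concatenation, which moves blocks around without any rechunk-merge step:

```python
import numpy as np
import dask.array as da

def reshape_leading_axes(arr):
    """Sketch: merge the first two axes of `arr` when the leading axis has
    chunksize 1, by concatenating the arr[i] slices along axis 0.
    Blocks are only relabeled and stacked; no rechunk-merge is created."""
    assert arr.chunks[0] == (1,) * arr.shape[0]
    pieces = [arr[i] for i in range(arr.shape[0])]   # each is arr[i, :, :]
    return da.concatenate(pieces, axis=0)

x = np.arange(24).reshape(2, 3, 4)
a = da.from_array(x, chunks=(1, 1, 2))
b = reshape_leading_axes(a)          # same values as a.reshape(6, 4)
```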
@nbren12 I think your use-case would be solved by something like #6272. I'll see if I can turn that into a PR today.

```python
In [2]: a = da.ones(10, chunks=1)

In [3]: o = a.reshape(2, 5, inchunks=a.chunks, outchunks=((1, 1), (1, 1, 1, 1, 1)))
```

@chrisroat what's your desired output chunking for
That would be a "zero-communication" reshape / rechunk IIUC. |
@chrisroat I think this is what you would need for your problem

```python
inchunks = ((1, 1, 1, 1), (2, 2))
outchunks = ((2,) * 8,)
out = arr.reshape(16, inchunks=inchunks, outchunks=outchunks)
out.visualize()
```

Figuring out the right inchunks / outchunks isn't the easiest, but I'm not sure how much we can simplify the API while giving total control. |
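The `inchunks` / `outchunks` keywords above are a proposal and not part of dask's released API. The same chunk structure can be produced today with explicit rechunks around the reshape (a sketch using the chunk tuples from the comment above):

```python
import dask.array as da

arr = da.ones((4, 4), chunks=(1, 2))

# emulate the proposed inchunks/outchunks with explicit rechunks
out = (
    arr.rechunk(((1, 1, 1, 1), (2, 2)))   # proposed inchunks (a no-op here)
       .reshape(16)
       .rechunk(((2,) * 8,))              # proposed outchunks
)
print(out.chunks)
```

The trailing rechunk guarantees the requested output chunking regardless of what `reshape` produces in between.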
I agree that that looks pretty good. It seems the method you use here is to find the smallest chunk along a dimension and rechunk that dimension to that chunksize. |
I think that's right, if you want to minimize communication between chunks. It does result in more chunks / tasks, but that might be a good tradeoff in certain cases. |
Could this algorithm be implemented and controlled using a kwarg like |
I was considering that. My concerns were
So I think that if we think people won't really need complete flexibility in the chunk structure then the |
If 1 can be solved then I think the following code isn't too bad. It allows easy opt-in to one of two strategies (default vs "preserve-chunks") and is flexible enough to allow motivated users to have full control (this might even lead to discovery of new strategies?)

```python
def reshape(..., inchunks=None, outchunks=None, strategy=None):
    if strategy and any([inchunks, outchunks]):
        raise ValueError
    if strategy is not None:
        inchunks, outchunks = get_chunks_from_strategy(strategy)
    ...
```
 |
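`get_chunks_from_strategy` above is a hypothetical helper. As a sketch only, here is one way a "preserve-chunks" plan could be computed for merging the two trailing axes of a 3-d array, under the assumption (from the C-ordering discussion above) that the last axis is a single chunk:

```python
def preserve_chunks_plan(inchunks):
    """Sketch: chunk plan for reshaping (a, b, c) -> (a, b*c) while keeping
    every input chunk intact.  Requires the last axis to be one chunk, so
    that each (b_i, c) block is contiguous in the C-ordered output."""
    if len(inchunks[-1]) != 1:
        raise ValueError("last axis must be a single chunk")
    c = inchunks[-1][0]
    # each axis-1 chunk of height b_i flattens to b_i * c contiguous elements
    outchunks = (inchunks[0], tuple(b * c for b in inchunks[1]))
    return inchunks, outchunks

# e.g. a (2, 4, 4) array chunked ((1, 1), (2, 2), (4,))
inchunks, outchunks = preserve_chunks_plan(((1, 1), (2, 2), (4,)))
```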
I'm running into the same issue. Being able to preserve the original chunk size as proposed by @dcherian would solve my problem. After reading the discussion above, I was wondering: is there a difference in performance between rechunking small blocks by combining them and rechunking large blocks by splitting them? In the first case, the strategy of preserving the chunks might be more flexible than you think, because you would still be able to rechunk after the reshape without much overhead. |
I was looking into this again last week. I don't think that "just maintain the original chunksize" will work. There are cases like #5544 (comment) where we need to rechunk the input to smaller chunks.
I'm not sure offhand, but all else equal the more tasks you have, the slower it'll be. In this case we're trading (some) additional tasks for less communication in the hope that it's faster. But we'll want to avoid doing unnecessary rechunking. At a minimum though, I think we'll have a requirement that the fastest-changing dimension (the last with |
I've identified one special case: When reshaping from a larger to smaller number of dimensions (e.g.
Because of the "all low axes have chunksize 1" property, we avoid needing to rechunk the input and we're merely moving blocks around. So I think that for this special case, not rechunking is the strictly superior strategy. For cases like @chrisroat's in #5544 (comment), we need to rechunk the inputs (since the "early" axes aren't all chunksize 1) and so the strategy avoiding rechunk-merge isn't necessarily better. I think it'll depend on the overhead of scheduling additional tasks, the cost of moving data around, maximum memory usage, ... |
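The special case can be demonstrated with the exact chunking from the diagram below, a (2, 3, 4) array chunked ((1, 1), (1, 2), (2, 2)) reshaped to (6, 4). Since the early merged axis has chunksize 1, each input block maps to a contiguous run of output rows and the input chunks survive intact:

```python
import numpy as np
import dask.array as da

x = np.arange(24).reshape(2, 3, 4)

# early (slow-moving) axis has chunksize 1: no input rechunk is needed
a = da.from_array(x, chunks=((1, 1), (1, 2), (2, 2)))
b = a.reshape(6, 4)
print(b.chunks)   # the axis-1 chunks of each a[i] stack into axis 0
```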
When the slow-moving (early) axes in `.reshape` are all size 1, then we can avoid an intermediate rechunk which could cause memory issues.

```
00 01 | 02 03    # a[0, :, :]
----- | -----
04 05 | 06 07
08 09 | 10 11
=============
12 13 | 14 15    # a[1, :, :]
----- | -----
16 17 | 18 19
20 21 | 22 23

-> (3, 4)

00 01 | 02 03
----- | -----
04 05 | 06 07
08 09 | 10 11
----- | -----
12 13 | 14 15
----- | -----
16 17 | 18 19
20 21 | 22 23
```

xref dask#5544, specifically the examples given in dask#5544 (comment).
That's great. This seems to align with my use case. |
Good to hear. Unless I'm mistaken, #5544 (comment) indicates the kind of API / implementation we'd need to solve @chrisroat's problem in #5544 (comment). To do a "zero-communication" / "no-merge" reshape, we need all the early axes to have a chunksize of 1, so that's necessary and sufficient for this optimization to kick in. That also covers @rabernat's original use-case. So if I'm right that having a chunksize of 1 is necessary, then we just need a keyword in reshape:

```python
data = da.ones((20, 20, 5), chunks=(10, 10, 5))
data.reshape((400, 5), strategy="minimize_memory")  # equivalent to `data.rechunk({0: 1}).reshape(400, 5)`
```

Where `strategy` controls the rechunking. Or we could have a boolean flag like |
* Avoid rechunking in reshape with chunksize=1

  When the slow-moving (early) axes in `.reshape` are all size 1, then we can avoid an intermediate rechunk which could cause memory issues.

  ```
  00 01 | 02 03    # a[0, :, :]
  ----- | -----
  04 05 | 06 07
  08 09 | 10 11
  =============
  12 13 | 14 15    # a[1, :, :]
  ----- | -----
  16 17 | 18 19
  20 21 | 22 23

  -> (3, 4)

  00 01 | 02 03
  ----- | -----
  04 05 | 06 07
  08 09 | 10 11
  ----- | -----
  12 13 | 14 15
  ----- | -----
  16 17 | 18 19
  20 21 | 22 23
  ```

  xref #5544, specifically the examples given in #5544 (comment).

* fix conditioni

* remove breakpoint comment

* API: Added merge_chunks to reshape

  Adds a keyword to reshape to control merge / rechunking. See the documentation for an explanation.

* update images
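The merged PR added the `merge_chunks` keyword to `reshape` (see the commit list above). A quick sketch of the resulting API, with illustrative sizes:

```python
import dask.array as da

a = da.ones((20, 20, 5), chunks=(10, 10, 5))

# default: rechunk-merge the input so the output has fewer, larger chunks
merged = a.reshape(400, 5)

# merge_chunks=False: skip the merge, trading more (smaller) tasks
# for lower peak memory per task
split = a.reshape(400, 5, merge_chunks=False)
print(merged.chunks, split.chunks)
```

Both spellings produce the same values; only the chunk structure and task graph differ.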
Love it when a 5-year-old issue gets closed! 💪 |
I want to reshape a 4D dask array in a way that I expect should preserve the original chunk structure. I am finding that `reshape` is instead rechunking my array in a non-optimal way. Things work as expected in 3D:
The chunks simply get stacked into a single column. Examining the graph, however, reveals there is an intermediate merge step:
If I add another dimension at the beginning, things don't look as nice:
Rather than seeing a neat stack of 8 chunks as I expected, the chunks have been fused.
This causes big problems for me when I am trying to do some processing of very large arrays. I can't afford to have the chunks fused, or else I will run out of memory.
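A minimal reproduction of the contrast being described (the array sizes here are assumptions for illustration, not the original report's):

```python
import dask.array as da

# 3-d case: the chunks stack into a single column as expected
a3 = da.ones((4, 4, 4), chunks=(2, 2, 4))
r3 = a3.reshape(16, 4)

# 4-d case: adding a leading dimension is where the chunk fusing was seen
a4 = da.ones((2, 4, 4, 4), chunks=(1, 2, 2, 4))
r4 = a4.reshape(2, 16, 4)
print(r3.chunks, r4.chunks)
```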
Dask version 2.6.0+10.g8179f7f3