performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset #8490
Can we attribute this to only the opening and not the writing?
Hmmmm. Now I'm uncertain of my benchmarking of just the open; I may have messed up. Re-checking, I think you're right, it's actually the write: Times are open / write. m1/m2/m3 are three different months of data.
The resulting zarr file is also larger with .09; for one month it's 37 MB under .09 and 26 MB under .08, though there's some variability in the resulting output size.
I know these are both important and that it's difficult to produce a perfect MCVE (and it's generous of you to have got this far already). But I do think it's worth trying to pin down exactly which part is slow — I think we can still minimize this case by a decent amount, even if it's not perfectly reproducible. If we produce a similarly shaped array and write that, has that also slowed down?
Nope. If I create a kinda-sorta similar dataset synthetically using numpy, populate it directly, and write it, the write takes 0.07 seconds. (I assume some of what's happening is that the original files are only read during the write to zarr.) If I force a .load() along with open_mfdataset, it gets a little weird: under 2023.08, opening the first month takes 20 seconds and writing it takes 0.09 seconds; under 2023.11, opening the first month took 161 seconds (!) and the write then took 0.09 seconds. This still kinda suggests to me that something weird is happening in open_mfdataset.
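The synthetic-dataset snippet did not survive in this copy of the thread; a minimal sketch of that kind of check might look like the following (the dimension sizes, variable name, and output path are illustrative, not the reporter's actual code):

```python
import numpy as np
import xarray as xr

# Roughly one month of hourly, single-level data on a MERRA-2-like grid
# (sizes and the variable name are placeholders).
times = np.arange("2023-01-01", "2023-02-01", dtype="datetime64[h]")
lat = np.linspace(-90, 90, 361)
lon = np.linspace(-180, 179.375, 576)

ds = xr.Dataset(
    {
        "T2M": (
            ("time", "lat", "lon"),
            np.random.rand(times.size, lat.size, lon.size).astype("float32"),
        )
    },
    coords={"time": times, "lat": lat, "lon": lon},
)

# The data here is plain numpy (already in memory), so the zarr write itself
# is fast, which points the finger at reading/chunking the netCDF inputs.
ds.to_zarr("synthetic.zarr", mode="w")
```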
I've tried simplifying this in a few ways; none of the simplifications quite reflect the exact slowdown I'm seeing, but they all point in the same direction. One test is opening a single POWER netcdf file and .load()ing it:
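That single-file timing snippet was also lost; a hedged reconstruction of the kind of check described (the filename is a placeholder) would be:

```python
import time
import xarray as xr

path = "power_merra2_daily_file.nc"  # placeholder for one of the POWER/MERRA-2 files

t0 = time.perf_counter()
ds = xr.open_dataset(path)  # default chunking behaviour under test
ds.load()                   # force the actual read
print(f"open + load took {time.perf_counter() - t0:.2f} s")
```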
I've put a copy of two merra2 datafiles at … (they're each 16 MB; I wasn't sure GitHub would appreciate that). If I return to the full but somewhat simplified example, I see times that aren't quite as bad as with the multi-file example, but it's still slower:
And if I use two .nc files as the source, I see:
So the degree of the problem seems to worsen when there are multiple files handed to open_mfdataset, though it's present to a smaller degree with only one file. This is, thus far, the simplest reproducer I've come up with:
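The reproducer itself is missing from this copy of the thread; based on the surrounding description (open_mfdataset over the two shared .nc files, then forcing the read), it was presumably something along these lines — the filenames and the timing wrapper are assumptions:

```python
import time
import xarray as xr

files = ["merra2_day1.nc", "merra2_day2.nc"]  # placeholders for the two shared sample files

t0 = time.perf_counter()
ds = xr.open_mfdataset(files)  # default chunking; chunks="auto" behaves differently (see below)
ds.load()
print(f"open_mfdataset + load took {time.perf_counter() - t0:.2f} s")
```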
OK, nice progress — so it does seem to be all in the .load().
It seems like it, but I don't know if there are weird performance gotchas with .load(), as I don't use it in the production version of the code. But it seems like a weird regression on its own that open_mfdataset().load() would be that much slower, and I assume it's at least tickling some of the same reasons.
Two more pointers that may help here:
Ah, I think this is getting at it. With this simplified case, I went and checked all of the possible chunk settings again, and here, setting chunks='auto' does fix the regression.
With that result, I've now been able to rewrite the real code behind this example to also use chunks='auto' by moving around some of the other processing. It's now performing in the same ballpark as it was under 2023.08 (actually, in a nice outcome, it's faster, because part of the change was to use drop_variables to filter things at load time). Thank you! I'm not sure whether this is a bug, pure user error, or an unintended performance glitch in the defaults across 2023.08 -> 2023.09, but I'm grateful for the suggestions and help working through this!
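For readers skimming the thread, the fix amounts to passing `chunks='auto'` (and optionally `drop_variables`) when opening. A hedged sketch, with placeholder paths and variable names:

```python
import xarray as xr

unwanted_vars = ["PS", "QV2M"]  # placeholder list of variables to drop at load time

ds = xr.open_mfdataset(
    "merra2_daily_*.nc",          # placeholder glob for one month of daily files
    chunks="auto",                # let dask pick chunk sizes rather than mirroring the files' internal chunks
    drop_variables=unwanted_vars,
)
ds.to_zarr("one_month.zarr", mode="w")
```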
Could you show the "default" chunking behavior you were getting before setting chunks='auto'?
xarray 2023.11.0, auto, DF chunks: Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.8.0, default, DF chunks: Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
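The chunk layouts quoted above can be inspected roughly like this (a sketch; filenames and the variable name are placeholders, and whether `chunksizes` appears in the encoding depends on the backend):

```python
import xarray as xr

ds = xr.open_mfdataset(["merra2_day1.nc", "merra2_day2.nc"])  # placeholder filenames
print(ds.chunks)                             # the Dask chunking xarray ended up with
print(ds["T2M"].encoding.get("chunksizes"))  # on-disk netCDF4/HDF5 chunk shape, if recorded
```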
Right. So this all makes sense now:
The file you shared has an internal chunk shape of …

To the question of whether this is a bug, I don't think it is. The two relevant issues are GH1440 and PR7948. I believe you are getting the expected behavior, even if it's suboptimal for your particular files. What would actually be quite nice is if Xarray did a bit more work for you here and created chunks that were clean multiples of the internal chunks (e.g. #8021). We have an open ticket for that already, so I think this could be closed.
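One way to approximate that behaviour manually today (a sketch, not an xarray feature) is to read the on-disk chunk sizes from a variable's encoding and request Dask chunks that are whole multiples of them; filenames, the variable name, and the multiplier are all assumptions:

```python
import xarray as xr

# Peek at one file to learn the internal (on-disk) chunk sizes.
with xr.open_dataset("merra2_day1.nc") as sample:          # placeholder filename
    dims = sample["T2M"].dims                               # placeholder variable name
    chunksizes = sample["T2M"].encoding.get("chunksizes")   # may be None if the variable is contiguous

if chunksizes is not None:
    # Ask for Dask chunks that are clean multiples of the internal chunks
    # (4x along each dimension here, purely as an illustration).
    chunks = {dim: size * 4 for dim, size in zip(dims, chunksizes)}
else:
    chunks = "auto"

ds = xr.open_mfdataset("merra2_daily_*.nc", chunks=chunks)
```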
Sounds good, and thank you both again. (I guess the question I had was about things changing from 2023.08, but I think the answer is, "I got lucky with 2023.08 and didn't have to think about the chunking of the original file, but I shouldn't have counted on that.") |
What happened?
I'm probably doing something wrong, but I'm seeing a large performance regression from 2023.08 to 2023.09 when opening a set of NASA POWER netcdf files, reducing them to only a subset of variables, and then saving them as a zarr file. Updated, see comments below: the speed difference is actually most apparent on the call to .to_zarr().
Deleted: The regression is apparent just in the time to call open_mfdataset; this operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.
One thing about my setup is that I'm reading the source files over NFS; the output zarr file is going to local fast temporary storage.
This regression coincides with #7948, which changed the chunking for netcdf4 files, but I'm not sure if that's the cause. The performance doesn't change if I use chunks={} or chunks='auto'.
I've tried this with dask 2023.08.0 through 2023.11.0 and there are no changes; I'm using netcdf4 version 1.6.5.
The merra2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:
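The MVCE block below did not survive in this copy of the issue; based on the description, the overall workflow is roughly the following (paths, the variable subset, and the output name are placeholders):

```python
import xarray as xr

keep = ["T2M", "PRECTOT"]  # illustrative subset of the POWER/MERRA-2 variables

# One month of daily, lat/lon-gridded files read over NFS...
ds = xr.open_mfdataset("merra2_2023_01_*.nc")

# ...reduced to a handful of variables and rewritten as a single monthly zarr store.
ds[keep].to_zarr("merra2_2023_01.zarr", mode="w")
```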
What did you expect to happen?
No response
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment
xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None