performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset #8490

Closed · 3 of 5 tasks

dave-andersen opened this issue Nov 29, 2023 · 14 comments

@dave-andersen
dave-andersen commented Nov 29, 2023

What happened?

I'm probably doing something wrong, but I'm seeing a large performance regression from 2023.08 to 2023.09 when opening a set of NASA POWER netcdf files, reducing them to only a subset of variables, and then saving them as a zarr file. Updated, see comments below: the speed difference is actually most apparent on the call to .to_zarr().

Deleted: The regression is apparent just in the time to call open_mfdataset; this operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.

One thing about my setup is that I'm reading the source files over NFS; the output zarr file is going to local fast temporary storage.

This regression coincides with #7948 which changed the chunking for netcdf4 files, but I'm not sure if that's the cause. The performance doesn't change if I use chunks={} or chunks='auto'.

I've tried this with dask 2023.08.0 through 2023.11.0 and there are no changes;
I'm using netcdf4 version 1.6.5.

The merra2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:

netcdf power_901_daily_20230101_merra2_utc {
dimensions:
	lon = 576 ;
	lat = 361 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:_FillValue = -999. ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:standard_name = "longitude" ;
	double lat(lat) ;
		lat:_FillValue = -999. ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:standard_name = "latitude" ;
	double time(time) ;
		time:_FillValue = -999. ;
		time:long_name = "time" ;
		time:units = "days since 2023-01-01 00:00:00" ;
...
        double T2M(time, lat, lon) ;
                T2M:_FillValue = -999. ;
                T2M:least_significant_digit = 2LL ;
                T2M:units = "K" ;
                T2M:long_name = "Temperature at 2 Meters" ;
                T2M:standard_name = "Temperature_at_2_Meters" ;
                T2M:valid_min = 150. ;
                T2M:valid_max = 350. ;
                T2M:valid_range = 150., 350. ;

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import glob

import numpy as np
import xarray

desired_fields = ["PRECTOTCORR", "T2M"]
met_files = glob.glob("power_901_daily_*_merra2_utc.nc")
df = xarray.open_mfdataset(met_files, chunks={'time': 1},
                           parallel=False).astype(np.float32)
pt = df[desired_fields]
pt.to_zarr("out.zarr", consolidated=True)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None

@dave-andersen added the bug and needs triage labels on Nov 29, 2023

welcome bot commented Nov 29, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas added the topic-backends, topic-performance, and regression labels and removed the needs triage label on Nov 29, 2023
@max-sixty
Collaborator

opening a set of NASA POWER netcdf files, reducing them to only a subset of variables, and then saving them as a zarr file.

Can we attribute this to only the opening and not the writing?

@dave-andersen
Author

dave-andersen commented Nov 29, 2023

Hmmmm. Now I'm uncertain of my benchmarking of just the open; I may have messed up. Re-checking, I think you're right, it's actually the write:

Times are open / write. m1/m2/m3 are three different months of data.

      2023.08.0      2023.11.0      2023.10.0

m1    3.29 / 0.62    3.35 / 5.95    3.4 / 6.05
m2    1.84 / 0.47    2.66 / 5.28    1.7 / 5.23
m3    2.07 / 0.52    1.98 / 5.87    2.02 / 5.85

The resulting zarr file is also larger with .09; for one month it's 37 MB under .09 and 26 MB under .08, though there's some variability in the resulting output size.

@dave-andersen changed the title from "performance regression 2023.08 -> 2023.09 open_mfdataset on netcdf files over nfs" to "performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset" on Nov 29, 2023
@max-sixty
Collaborator

I know it's both important and difficult to produce a perfect MCVE for cases like this (and generous of you to get this far already).

But I do think it's important to try and pin down exactly which part is slow — I think we can still minimize this case by a decent amount, even if it's not to something perfectly reproducible. If we produce a similarly shaped array and write that, has that also slowed down?

@dave-andersen
Author

Nope. If I create a kinda sorta similar dataset synthetically using numpy and directly populate it as

import numpy as np
import xarray

grid = (31, 361, 576)  # assumed (time, lat, lon) shape; random stand-in data below
temps, precip = np.random.rand(*grid), np.random.rand(*grid)
ds = xarray.Dataset({"T2M": (["time", "lat", "lon"], temps),
                     "PRECTOTCORR": (["time", "lat", "lon"], precip)},
                    coords={"time": np.arange(grid[0]),
                            "lat": (np.arange(grid[1]) - 180.0)/4.0,
                            "lon": (np.arange(grid[2])/1.6 - 180.0)})

it writes in 0.07 seconds. (I assume some of what's happening is that it's only reading the original files during the write to zarr.)

If I force .load() to happen along with open_mfdataset, it gets a little weird. Under 2023.08 opening the first month takes 20 seconds and writing it takes 0.09. Under 2023.11, it took 161 seconds to open the first month (!), and then the write took 0.09 seconds. This still kinda suggests to me something weird is happening on open_mfdataset.

@dave-andersen
Author

I've tried simplifying this in a few ways; none of the simplifications quite reproduces the exact slowdown I'm seeing, but they all trend in the same direction. One test is opening a single POWER netcdf file and .load()ing it:

local xarray 2023.11.0 Elapsed: 4.925256
NFS xarray 2023.11.0 Elapsed: 4.680344
local xarray 2023.8.0 Elapsed: 0.919569
NFS xarray 2023.8.0 Elapsed: 0.645585
import xarray
import time

ncfile = "power_901_daily_20190801_merra2_utc.nc"

start = time.time()
df = xarray.open_mfdataset(ncfile).load()
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")

I've put a copy of two merra2 data files at

(they're each 16 MB; I wasn't sure GitHub would appreciate them as attachments).

Returning to the full but somewhat simplified example, the times aren't quite as bad as with the multi-file case, but it's still slower:

import xarray
import time

ncfile = "power_901_daily_20190801_merra2_utc.nc"

start = time.time()
df = xarray.open_mfdataset(ncfile)
df = df[["T2M", "PRECTOTCORR"]]
df.to_zarr("output.zarr", consolidated=True)
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")
local xarray 2023.11.0 Elapsed: 0.518345
local xarray 2023.8.0 Elapsed: 0.438258
nfs xarray 2023.11.0 Elapsed: 0.613554
nfs xarray 2023.8.0 Elapsed: 0.484871

And if I use two .nc files as the source, I see:

local 2023.8.0 Elapsed: 0.540353
nfs xarray 2023.8.0 Elapsed: 0.599314
local xarray 2023.11.0 Elapsed: 0.904065
nfs xarray 2023.11.0 Elapsed: 0.952745

So the problem seems to worsen when multiple files are handed to open_mfdataset, though it's present to a smaller degree with only one file. This is, thus far, the simplest reproducer I've come up with:

import xarray
import time

ncfiles = ["power_901_daily_20190801_merra2_utc.nc", "power_901_daily_20190802_merra2_utc.nc"]

start = time.time()
df = xarray.open_mfdataset(ncfiles)
df = df[["T2M", "PRECTOTCORR"]]
df.to_zarr("output.zarr", consolidated=True)
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")

@max-sixty
Collaborator

One test is opening a single POWER netcdf file and .load()ing it:

OK nice progress — so it does seem to be all in the .load() — am I reading that correctly? We don't need .to_zarr to show the problem?

@dave-andersen
Author

It seems like it, but I don't know if there are weird performance gotchas with .load() as I don't use it in the production version of the code. But it seems like a weird regression on its own that open_mfdataset().load() would be that much slower and I assume it's at least tickling some of the same reasons.

@jhamman
Member

jhamman commented Nov 30, 2023

Two more pointers that may help here:

  1. Remove dask from the open_mfdataset case by setting chunks=False (e.g. open_mfdataset(ncfiles, chunks=False)).
  2. Drop encoding directly following each of your open_dataset or open_mfdataset calls, e.g. open_mfdataset(ncfiles, chunks=False).reset_encoding() (see the sketch below).
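Combined, the two suggestions look roughly like this (a sketch based on the reproducer above; the file names and the final to_zarr call are carried over from it):

import xarray

ncfiles = ["power_901_daily_20190801_merra2_utc.nc",
           "power_901_daily_20190802_merra2_utc.nc"]

# 1. chunks=False, as suggested above, to take dask out of the picture.
ds = xarray.open_mfdataset(ncfiles, chunks=False)

# 2. Discard the on-disk encoding immediately after opening.
ds = ds.reset_encoding()

ds[["T2M", "PRECTOTCORR"]].to_zarr("output.zarr", consolidated=True)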

@dave-andersen
Author

Ah, I think this is getting at it. With this simplified case, I went and checked all of the possible chunk settings again, and here, setting chunks='auto' does fix the regression.

xarray 2023.8.0 Elapsed: 0.534117 - chunks={'time': 1} and most other configurations

xarray 2023.11.0 Elapsed: 0.571602 - chunks='auto'
xarray 2023.11.0 Elapsed: 0.905893 - chunks={'time': 1}
xarray 2023.11.0 Elapsed: 0.856743 - chunks={'time':1}, reset_encoding
xarray 2023.11.0 Elapsed: 0.933716 - chunks=False
xarray 2023.11.0 Elapsed: 0.934985 - chunks={}
xarray 2023.11.0 Elapsed: 0.873481 - chunks=False, reset_encoding

With that result, I've now been able to rewrite the real code behind this example to also use chunks='auto' by moving around some of the other processing (roughly as sketched below). It's now performing in the same ballpark as it was under 2023.08 (actually, in a nice outcome, it's faster, because part of the change was to use drop_variables to filter things at load time).
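For reference, the reworked load step looks roughly like this (a sketch; probing one file for the variables to drop is an assumption about how the filtering is done, not the exact production code):

import glob

import xarray

keep = ["PRECTOTCORR", "T2M"]
met_files = sorted(glob.glob("power_901_daily_*_merra2_utc.nc"))

# Work out which variables to discard from one file, then let open_mfdataset
# skip them entirely; chunks='auto' is what avoided the slowdown above.
with xarray.open_dataset(met_files[0]) as probe:
    drop = [v for v in probe.data_vars if v not in keep]

ds = xarray.open_mfdataset(met_files, chunks="auto", drop_variables=drop)
ds.to_zarr("out.zarr", consolidated=True)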

Thank you! I'm not sure whether this is a bug, pure user error, or an intended change whose default happens to have a performance glitch across 2023.08 -> 2023.09, but I'm grateful for the suggestions and help working through this!

@jhamman
Member

jhamman commented Nov 30, 2023

Could you show the "default" chunking behavior you were getting before setting chunks in open_mfdataset? To judge whether there is a bug here, I'd like to see what chunking the default settings produce.

Edit: ds.chunks should be sufficient here.
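Concretely, something like this prints the chunking each setting produces (a minimal sketch reusing the file names from the reproducer above):

import xarray

ncfiles = ["power_901_daily_20190801_merra2_utc.nc",
           "power_901_daily_20190802_merra2_utc.nc"]

for chunks in ({"time": 1}, {}, "auto"):
    ds = xarray.open_mfdataset(ncfiles, chunks=chunks)
    print(chunks, dict(ds.chunks))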

@dave-andersen
Author

xarray 2023.11.0, auto, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.11.0, default, DF chunks Frozen({'time': (1, 1), 'lat': (36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 1), 'lon': (72, 72, 72, 72, 72, 72, 72, 72)})

xarray 2023.8.0, default, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.8.0, auto, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.8.0, chunks={'time': 1}, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})

@jhamman
Member

jhamman commented Nov 30, 2023

Right. So this all makes sense now:

ncdump -hs ~/Downloads/power_901_daily_20190801_merra2_utc.nc
netcdf power_901_daily_20190801_merra2_utc {
dimensions:
	lon = 576 ;
	lat = 361 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:_FillValue = -999. ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:standard_name = "longitude" ;
		lon:_Storage = "contiguous" ;
		lon:_Endianness = "little" ;
	double lat(lat) ;
		lat:_FillValue = -999. ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:standard_name = "latitude" ;
		lat:_Storage = "contiguous" ;
		lat:_Endianness = "little" ;
	double time(time) ;
		time:_FillValue = -999. ;
		time:long_name = "time" ;
		time:units = "days since 2019-08-01 00:00:00" ;
		time:_Storage = "chunked" ;
		time:_ChunkSizes = 512 ;
		time:_Endianness = "little" ;
	double CDD10(time, lat, lon) ;
		CDD10:_FillValue = -999. ;
		CDD10:least_significant_digit = 1LL ;
		CDD10:units = "degree-day" ;
		CDD10:long_name = "Cooling Degree Days Above 10 C" ;
		CDD10:standard_name = "Cooling_Degree_Days_Above_10_C" ;
		CDD10:_Storage = "chunked" ;
		CDD10:_ChunkSizes = 1, 36, 72 ;
		CDD10:_Shuffle = "true" ;
		CDD10:_DeflateLevel = 4 ;
		CDD10:_Endianness = "little" ;

The file you shared has an internal chunk shape of CDD10:_ChunkSizes = 1, 36, 72 ;. These are very small chunks and are not efficient for Dask to work with, so I'm not surprised you get sub-optimal behavior with them.
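The same on-disk chunk shape is visible from Python via the variable encoding (a quick check, assuming the netCDF4 backend; it should report the same (1, 36, 72) for the data variables):

import xarray

ds = xarray.open_dataset("power_901_daily_20190801_merra2_utc.nc")

# HDF5 chunk shape recorded by the backend for this variable.
print(ds["T2M"].encoding.get("chunksizes"))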

To the question of whether this is a bug, I don't think it is. The two relevant issues are GH1440 and PR7948. I believe you are getting the expected behavior, even if it's sub-optimal for your particular files.

What would actually be quite nice is if Xarray did a bit more work for you here and created chunks that were clean multiples of the internal chunks (e.g. #8021). We have an open ticket for that already, so I think this could be closed.
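In the meantime, one workaround for files like these is to ask for dask chunks that line up with the on-disk (1, 36, 72) chunk shape (a sketch; the multiplier for lon is arbitrary and just for illustration):

import xarray

ds = xarray.open_mfdataset(
    ["power_901_daily_20190801_merra2_utc.nc",
     "power_901_daily_20190802_merra2_utc.nc"],
    # -1 keeps lat in a single chunk; 288 = 4 x 72 matches the on-disk lon chunking.
    chunks={"time": 1, "lat": -1, "lon": 72 * 4},
)
print(ds.chunks)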

@dave-andersen
Author

dave-andersen commented Nov 30, 2023

Sounds good, and thank you both again. (I guess the question I had was about things changing from 2023.08, but I think the answer is, "I got lucky with 2023.08 and didn't have to think about the chunking of the original file, but I shouldn't have counted on that.")
