performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset #8490

Closed · 3 of 5 tasks

dave-andersen opened this issue Nov 29, 2023 · 14 comments

@dave-andersen
dave-andersen commented Nov 29, 2023

What happened?

I'm probably doing something wrong, but I'm seeing a large performance regression from 2023.08 to 2023.09 when opening a set of NASA POWER netcdf files, reducing them to only a subset of variables, and then saving them as a zarr file. Updated, see comments below: the speed difference is actually most apparent on the call to .to_zarr().

Deleted: The regression is apparent just in the time to call open_mfdataset; this operation on a month's worth of files went from about 3.5 seconds to 9 seconds between these two versions, and remains slow even with 2023.11.0.

One thing about my setup is that I'm reading the source files over NFS; the output zarr file is going to local fast temporary storage.

This regression coincides with #7948 which changed the chunking for netcdf4 files, but I'm not sure if that's the cause. The performance doesn't change if I use chunks={} or chunks='auto'.

I've tried this with dask 2023.08.0 through 2023.11.0 and there are no changes;
I'm using netcdf4 version 1.6.5.

The merra2 files are all lat/lon gridded and each represents a single day; I'm re-writing them to put multiple days in a one-month file:

netcdf power_901_daily_20230101_merra2_utc {
dimensions:
	lon = 576 ;
	lat = 361 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:_FillValue = -999. ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:standard_name = "longitude" ;
	double lat(lat) ;
		lat:_FillValue = -999. ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:standard_name = "latitude" ;
	double time(time) ;
		time:_FillValue = -999. ;
		time:long_name = "time" ;
		time:units = "days since 2023-01-01 00:00:00" ;
...
        double T2M(time, lat, lon) ;
                T2M:_FillValue = -999. ;
                T2M:least_significant_digit = 2LL ;
                T2M:units = "K" ;
                T2M:long_name = "Temperature at 2 Meters" ;
                T2M:standard_name = "Temperature_at_2_Meters" ;
                T2M:valid_min = 150. ;
                T2M:valid_max = 350. ;
                T2M:valid_range = 150., 350. ;

What did you expect to happen?

No response

Minimal Complete Verifiable Example

import glob

import numpy as np
import xarray

desired_fields = ["PRECTOTCORR", "T2M"]
met_files = glob.glob("power_901_daily_*_merra2_utc.nc")
df = xarray.open_mfdataset(met_files, chunks={'time': 1},
                           parallel=False).astype(np.float32)
pt = df[desired_fields]
pt.to_zarr("out.zarr", consolidated=True)

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

No response

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 5.19.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.2
libnetcdf: 4.9.3-development

xarray: 2023.8.0
pandas: 2.1.2
numpy: 1.26.1
scipy: 1.11.3
netCDF4: 1.6.5
pydap: None
h5netcdf: 1.2.0
h5py: 3.10.0
Nio: None
zarr: 2.16.1
cftime: 1.6.3
nc_time_axis: None
PseudoNetCDF: None
iris: None
bottleneck: None
dask: 2023.11.0
distributed: None
matplotlib: 3.8.1
cartopy: None
seaborn: 0.13.0
numbagg: None
fsspec: 2023.10.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 68.2.2
pip: 23.3.1
conda: None
pytest: 7.4.3
mypy: None
IPython: 8.17.2
sphinx: None

@dave-andersen added the bug and needs triage labels on Nov 29, 2023

welcome bot commented Nov 29, 2023

Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
If you have an idea for a solution, we would really welcome a Pull Request with proposed changes.
See the Contributing Guide for more.
It may take us a while to respond here, but we really value your contribution. Contributors like you help make xarray better.
Thank you!

@TomNicholas added the topic-backends, topic-performance, and regression labels and removed the needs triage label on Nov 29, 2023
@max-sixty
Collaborator

opening a set of NASA POWER netcdf files, reducing them to only a subset of variables, and then saving them as a zarr file.

Can we attribute this to only the opening and not the writing?

@dave-andersen
Author

dave-andersen commented Nov 29, 2023

Hmmmm. Now I'm uncertain of my benchmarking of just the open; I may have messed up. Re-checking, I think you're right, it's actually the write:

Times are open / write. m1/m2/m3 are three different months of data.

      2023.08.0      2023.11.0      2023.10.0

m1    3.29 / 0.62    3.35 / 5.95    3.4 / 6.05
m2    1.84 / 0.47    2.66 / 5.28    1.7 / 5.23
m3    2.07 / 0.52    1.98 / 5.87    2.02 / 5.85

The resulting zarr file is also larger with .09; for one month it's 37 MB under .09 and 26 MB under .08, though there's some variability in the resulting output size.

@dave-andersen changed the title from "performance regression 2023.08 -> 2023.09 open_mfdataset on netcdf files over nfs" to "performance regression 2023.08 -> 2023.09 to_zarr from netcdf4 open_mfdataset" on Nov 29, 2023
@max-sixty
Collaborator

I know it's both important and difficult to produce a perfect MCVE for cases like this (and generous of you to get this far already).

But I do think it's important to try and pin down exactly which part is slow — I think we can still minimize this case by a decent amount, even if it's not to something perfectly reproducible. If we produce a similarly shaped array and write that, has that also slowed down?

@dave-andersen
Author

Nope. If I create a kinda sorta similar dataset synthetically using numpy and directly populate it as

import numpy as np
import xarray

grid = (31, 361, 576)  # assumed (time, lat, lon) shape; random stand-in data below
temps, precip = np.random.rand(*grid), np.random.rand(*grid)
ds = xarray.Dataset({"T2M": (["time", "lat", "lon"], temps),
                     "PRECTOTCORR": (["time", "lat", "lon"], precip)},
                    coords={"time": np.arange(grid[0]),
                            "lat": (np.arange(grid[1]) - 180.0)/4.0,
                            "lon": (np.arange(grid[2])/1.6 - 180.0)})

it writes in 0.07 seconds. (I assume some of what's happening is that it's only reading the original files during the write to zarr.)

If I force .load() to happen along with open_mfdataset, it gets a little weird. Under 2023.08 opening the first month takes 20 seconds and writing it takes 0.09. Under 2023.11, it took 161 seconds to open the first month (!), and then the write took 0.09 seconds. This still kinda suggests to me something weird is happening on open_mfdataset.

@dave-andersen
Author

I've tried simplifying this in a few ways; none of the simplifications quite reproduces the exact slowdown I'm seeing, but they all trend in the same direction. One test is opening a single POWER netcdf file and .load()ing it:

local xarray 2023.11.0 Elapsed: 4.925256
NFS xarray 2023.11.0 Elapsed: 4.680344
local xarray 2023.8.0 Elapsed: 0.919569
NFS xarray 2023.8.0 Elapsed: 0.645585
import xarray
import time

ncfile = "power_901_daily_20190801_merra2_utc.nc"

start = time.time()
df = xarray.open_mfdataset(ncfile).load()
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")

I've put a copy of two merra2 data files at

(they're each 16 MB; I wasn't sure GitHub would appreciate them as attachments).

Returning to the full but somewhat simplified example, the times aren't quite as bad as with the multi-file case, but it's still slower:

import xarray
import time

ncfile = "power_901_daily_20190801_merra2_utc.nc"

start = time.time()
df = xarray.open_mfdataset(ncfile)
df = df[["T2M", "PRECTOTCORR"]]
df.to_zarr("output.zarr", consolidated=True)
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")
local xarray 2023.11.0 Elapsed: 0.518345
local xarray 2023.8.0 Elapsed: 0.438258
nfs xarray 2023.11.0 Elapsed: 0.613554
nfs xarray 2023.8.0 Elapsed: 0.484871

And if I use two .nc files as the source, I see:

local 2023.8.0 Elapsed: 0.540353
nfs xarray 2023.8.0 Elapsed: 0.599314
local xarray 2023.11.0 Elapsed: 0.904065
nfs xarray 2023.11.0 Elapsed: 0.952745

So the problem seems to worsen when multiple files are handed to open_mfdataset, though it's present to a smaller degree with only one file. This is, thus far, the simplest reproducer I've come up with:

import xarray
import time

ncfiles = ["power_901_daily_20190801_merra2_utc.nc", "power_901_daily_20190802_merra2_utc.nc"]

start = time.time()
df = xarray.open_mfdataset(ncfiles)
df = df[["T2M", "PRECTOTCORR"]]
df.to_zarr("output.zarr", consolidated=True)
end = time.time()
print(f"xarray {xarray.__version__} Elapsed: {end-start:02f}")

@max-sixty
Collaborator

One test is opening a single POWER netcdf file and .load()ing it:

OK nice progress — so it does seem to be all in the .load() — am I reading that correctly? We don't need .to_zarr to show the problem?

@dave-andersen
Author

It seems like it, but I don't know if there are weird performance gotchas with .load() as I don't use it in the production version of the code. But it seems like a weird regression on its own that open_mfdataset().load() would be that much slower and I assume it's at least tickling some of the same reasons.

@jhamman
Member

jhamman commented Nov 30, 2023

Two more pointers that may help here:

  1. Remove dask from the open_mfdataset case by setting chunks=False (e.g. open_mfdataset(ncfiles, chunks=False)).
  2. Drop encoding directly following each of your open_dataset or open_mfdataset calls, e.g. open_mfdataset(ncfiles, chunks=False).reset_encoding() (see the sketch below).
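Combined, the two suggestions look roughly like this (a sketch based on the reproducer above; the file names and the final to_zarr call are carried over from it):

import xarray

ncfiles = ["power_901_daily_20190801_merra2_utc.nc",
           "power_901_daily_20190802_merra2_utc.nc"]

# 1. chunks=False, as suggested above, to take dask out of the picture.
ds = xarray.open_mfdataset(ncfiles, chunks=False)

# 2. Discard the on-disk encoding immediately after opening.
ds = ds.reset_encoding()

ds[["T2M", "PRECTOTCORR"]].to_zarr("output.zarr", consolidated=True)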

@dave-andersen
Author

Ah, I think this is getting at it. With this simplified case, I went and checked all of the possible chunk settings again, and here, setting chunks='auto' does fix the regression.

xarray 2023.8.0 Elapsed: 0.534117 - chunks={'time': 1} and most other configurations

xarray 2023.11.0 Elapsed: 0.571602 - chunks='auto'
xarray 2023.11.0 Elapsed: 0.905893 - chunks={'time': 1}
xarray 2023.11.0 Elapsed: 0.856743 - chunks={'time':1}, reset_encoding
xarray 2023.11.0 Elapsed: 0.933716 - chunks=False
xarray 2023.11.0 Elapsed: 0.934985 - chunks={}
xarray 2023.11.0 Elapsed: 0.873481 - chunks=False, reset_encoding

With that result, I've now been able to rewrite the real code behind this example to also use chunks='auto' by moving around some of the other processing (roughly as sketched below). It's now performing in the same ballpark as it was under 2023.08 (actually, in a nice outcome, it's faster, because part of the change was to use drop_variables to filter things at load time).
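For reference, the reworked load step looks roughly like this (a sketch; probing one file for the variables to drop is an assumption about how the filtering is done, not the exact production code):

import glob

import xarray

keep = ["PRECTOTCORR", "T2M"]
met_files = sorted(glob.glob("power_901_daily_*_merra2_utc.nc"))

# Work out which variables to discard from one file, then let open_mfdataset
# skip them entirely; chunks='auto' is what avoided the slowdown above.
with xarray.open_dataset(met_files[0]) as probe:
    drop = [v for v in probe.data_vars if v not in keep]

ds = xarray.open_mfdataset(met_files, chunks="auto", drop_variables=drop)
ds.to_zarr("out.zarr", consolidated=True)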

Thank you! I'm not sure whether this is a bug, pure user error, or an intended change whose default happens to have a performance glitch across 2023.08 -> 2023.09, but I'm grateful for the suggestions and help working through this!

@jhamman
Member

jhamman commented Nov 30, 2023

Could you show the "default" chunking behavior you were getting before setting chunks in open_mfdataset? To judge whether there is a bug here, I'd like to see what chunking the default settings produce.

Edit: ds.chunks should be sufficient here.
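Concretely, something like this prints the chunking each setting produces (a minimal sketch reusing the file names from the reproducer above):

import xarray

ncfiles = ["power_901_daily_20190801_merra2_utc.nc",
           "power_901_daily_20190802_merra2_utc.nc"]

for chunks in ({"time": 1}, {}, "auto"):
    ds = xarray.open_mfdataset(ncfiles, chunks=chunks)
    print(chunks, dict(ds.chunks))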

@dave-andersen
Author

xarray 2023.11.0, auto, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.11.0, default, DF chunks Frozen({'time': (1, 1), 'lat': (36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 1), 'lon': (72, 72, 72, 72, 72, 72, 72, 72)})

xarray 2023.8.0, default, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.8.0, auto, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})
xarray 2023.8.0, chunks={'time': 1}, DF chunks Frozen({'time': (1, 1), 'lat': (361,), 'lon': (576,)})

@jhamman
Member

jhamman commented Nov 30, 2023

Right. So this all makes sense now:

ncdump -hs ~/Downloads/power_901_daily_20190801_merra2_utc.nc
netcdf power_901_daily_20190801_merra2_utc {
dimensions:
	lon = 576 ;
	lat = 361 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:_FillValue = -999. ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
		lon:standard_name = "longitude" ;
		lon:_Storage = "contiguous" ;
		lon:_Endianness = "little" ;
	double lat(lat) ;
		lat:_FillValue = -999. ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
		lat:standard_name = "latitude" ;
		lat:_Storage = "contiguous" ;
		lat:_Endianness = "little" ;
	double time(time) ;
		time:_FillValue = -999. ;
		time:long_name = "time" ;
		time:units = "days since 2019-08-01 00:00:00" ;
		time:_Storage = "chunked" ;
		time:_ChunkSizes = 512 ;
		time:_Endianness = "little" ;
	double CDD10(time, lat, lon) ;
		CDD10:_FillValue = -999. ;
		CDD10:least_significant_digit = 1LL ;
		CDD10:units = "degree-day" ;
		CDD10:long_name = "Cooling Degree Days Above 10 C" ;
		CDD10:standard_name = "Cooling_Degree_Days_Above_10_C" ;
		CDD10:_Storage = "chunked" ;
		CDD10:_ChunkSizes = 1, 36, 72 ;
		CDD10:_Shuffle = "true" ;
		CDD10:_DeflateLevel = 4 ;
		CDD10:_Endianness = "little" ;

The file you shared has an internal chunk shape of CDD10:_ChunkSizes = 1, 36, 72 ;. These are very small chunks and are not efficient for Dask to work with, so I'm not surprised you get sub-optimal behavior with them.
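The same on-disk chunk shape is visible from Python via the variable encoding (a quick check, assuming the netCDF4 backend; it should report the same (1, 36, 72) for the data variables):

import xarray

ds = xarray.open_dataset("power_901_daily_20190801_merra2_utc.nc")

# HDF5 chunk shape recorded by the backend for this variable.
print(ds["T2M"].encoding.get("chunksizes"))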

To the question of whether this is a bug, I don't think it is. The two relevant issues are GH1440 and PR7948. I believe you are getting the expected behavior, even if it's sub-optimal for your particular files.

What would actually be quite nice is if Xarray did a bit more work for you here and created chunks that were clean multiples of the internal chunks (e.g. #8021). We have an open ticket for that already, so I think this could be closed.
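In the meantime, one workaround for files like these is to ask for dask chunks that line up with the on-disk (1, 36, 72) chunk shape (a sketch; the multiplier for lon is arbitrary and just for illustration):

import xarray

ds = xarray.open_mfdataset(
    ["power_901_daily_20190801_merra2_utc.nc",
     "power_901_daily_20190802_merra2_utc.nc"],
    # -1 keeps lat in a single chunk; 288 = 4 x 72 matches the on-disk lon chunking.
    chunks={"time": 1, "lat": -1, "lon": 72 * 4},
)
print(ds.chunks)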

@dave-andersen
Author

dave-andersen commented Nov 30, 2023

Sounds good, and thank you both again. (I guess the question I had was about things changing from 2023.08, but I think the answer is, "I got lucky with 2023.08 and didn't have to think about the chunking of the original file, but I shouldn't have counted on that.")
