rewrite MultiZarrToZarr to always use xr.concat #33
Conversation
Great news @martindurant! Which datasets did you try so far? (I'll try the others)
Only the noaa-nwm-retro-v2.0-pds 10-file example, which is "example_multi" in the hdf module.
Can confirm that it works on my 1 day (144 files) of GOES netcdf data in Azure Blob Storage. It took the processing time down from 55 minutes to 45 minutes, but only 25 of those minutes were CPU time, so I'm not sure what's taking up the rest.
That's the time to create a reference set for each file?
That's the time it takes to create a single reference with:

```python
mzz = MultiZarrToZarr(
    json_list,
    # "zip://jsons/*.json::combined.zip",
    remote_protocol='az',
    remote_options={
        'account_name': 'goeseuwest'
    },
    xarray_open_kwargs={
        'decode_cf': False,
        'mask_and_scale': False,
        'decode_times': False,
        'use_cftime': False,
        'decode_coords': False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t",
    },
)
mzz.translate('combined.json')
```
I'm thinking this might be something to do with the azure support in fsspec. #31 and fsspec/filesystem_spec#681 did help my access time, but opening a filesystem from a single reference JSON pointing at 144 netcdf files still takes a long time:

```python
fs = fsspec.filesystem('reference', fo='combined.json',
                       remote_protocol='az',
                       remote_options={'account_name': 'goeseuwest'})
```

Is that expected for a dataset of this size? Each file is ~300 MB.
Just opening the file only needs reading the JSON file. If you run a profile/snakeviz, you'll probably find that rendering the templates is taking a long time - what counts is the number of chunks, not the size of those chunks. Probably you can pass …
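The profile run suggested above can be sketched with the stdlib profiler; `build_refs` here is a hypothetical stand-in for the expensive reference/template work, not code from this repo:

```python
import cProfile
import io
import pstats

def build_refs(n):
    # hypothetical stand-in for the hot path being profiled; in the real
    # case this would be MultiZarrToZarr / referenceFileSystem rendering
    # one URL per chunk, which is why chunk *count* dominates
    return ["az://goeseuwest/file{:03d}.nc".format(i) for i in range(n)]

prof = cProfile.Profile()
prof.enable()
refs = build_refs(10000)
prof.disable()

out = io.StringIO()
pstats.Stats(prof, stream=out).sort_stats("cumulative").print_stats(10)
report = out.getvalue()  # per-function cumulative times, hottest first
```

Feeding the same profile dump into snakeviz gives the interactive view mentioned above.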
To the earlier question of why it takes so long to do the combination - you should turn on logging in adlfs or referenceFileSystem, to see what remote objects are being fetched. With the "minimal" and "override" options, I would have thought there are not too many, but I might be mistaken. The fact that CPU time is only a fraction of the total time suggests that there is significant contribution from latency, waiting for the remote server. Also note that the opening of the individual filesystems and datasets as input to the combine class could be done in parallel using dask.

I suggest, in any case, that we merge this, so that we can iterate and move on with things such as the docs.
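One way to turn that logging on is via the stdlib `logging` module; the logger names `adlfs` and `fsspec.reference` are assumptions based on fsspec's usual convention of per-module loggers, so verify them against the installed versions:

```python
import logging

# send all records to stderr with the default format
logging.basicConfig(level=logging.DEBUG)

# assumed logger names: fsspec-based packages conventionally log under
# their module names, e.g. "adlfs" and "fsspec.reference"
logging.getLogger("adlfs").setLevel(logging.DEBUG)
logging.getLogger("fsspec.reference").setLevel(logging.DEBUG)

# with these enabled, each remote GET issued during the combine should
# appear in the log, showing which objects are actually fetched
```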
@martindurant and @lsterzinger , yes, let's merge this and keep on testing. |
One optimisation I absolutely should have mentioned is …

Finally, I have the comment in #25 that might be relevant: we make no attempt to join fetches with overlapping/contiguous/near-contiguous ranges. This would matter if several small pieces were being loaded from each of the files.
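The range-joining idea from #25 could look roughly like this sketch; `merge_ranges` is a hypothetical helper to illustrate the technique, not code from this library:

```python
def merge_ranges(ranges, gap=0):
    """Coalesce (start, end) byte ranges that overlap or lie within
    `gap` bytes of each other, so each group becomes one request.

    Hypothetical sketch of the idea in #25, not the library's code.
    """
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            # extend the previous range instead of issuing a new fetch
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# three small chunk reads collapse into two requests
print(merge_ranges([(0, 10), (12, 20), (1000, 1100)], gap=5))
# → [(0, 20), (1000, 1100)]
```

The trade-off is that a non-zero `gap` fetches some unneeded bytes in exchange for fewer round trips, which pays off when per-request latency dominates, as the CPU-vs-wall-time numbers above suggest.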
@martindurant specifying …

How do I turn on logging for adlfs? I'm specifying …

I also ran …
"az" and "abfs" are the same.
ALL of the time is spent in rendering templates with jinja2. Do use `simple_templates=True`.
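A sketch of why a simple-substitution fast path is so much cheaper than jinja2: each chunk reference only needs a literal placeholder swapped in, not a full template render. `simple_render` below is illustrative only, not fsspec's actual implementation:

```python
def simple_render(url, templates):
    # replace literal "{{name}}" placeholders with their values;
    # a sketch of a simple_templates-style fast path, skipping the
    # parse/compile/render machinery a full template engine runs per call
    for name, value in templates.items():
        url = url.replace("{{%s}}" % name, value)
    return url

# hypothetical template mapping in the style of a reference JSON
templates = {"u": "az://goeseuwest/ABI-L2-MCMIPF"}
print(simple_render("{{u}}/2021/file.nc", templates))
# → az://goeseuwest/ABI-L2-MCMIPF/2021/file.nc
```

Since a combined reference set has one URL per chunk, this per-reference cost is multiplied many thousands of times, which is why the engine choice dominates the total time.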
Yeah, I am using `simple_templates=True`. No idea why it's still calling jinja2 with it specified.
@martindurant looks like it's calling jinja2 from …
At least the call stack says it's called from that function, but it seems like it would only go to jinja if there's …
Hah! It looks like that line should simply not be there; it takes time and has no effect. Doubtless this is a bad merge.
Okay yeah, that makes sense. Should I open a PR to remove it?
Removing that line brought the time down from ~20 minutes to just 16 seconds, I think you found the culprit!
Yes please - once you've tried it
I can think of an even faster way to do it, if the 16s seems too long - if indeed most of the time is still spent in string substitution.
@lsterzinger @rsignell-usgs - this applies what we discussed, making the xr.concat call explicit and passing a separate set of options to it. If this feels OK, I'll update the docstrings.
`example_ensamble()` runs fine with this, quite fast (with the "minimal" and "override" arguments).