
rewrite MultiZarrToZarr to always use xr.concat #33

Merged
merged 1 commit from combine_api into fsspec:main on Jun 28, 2021

Conversation

martindurant
Member

martindurant commented Jun 24, 2021

@lsterzinger @rsignell-usgs - this applies what we discussed, making the call to xr.concat explicit and passing a separate set of options to it. If this feels OK, I'll update the docstrings.
example_ensamble() runs fine with this, and quite fast (with the "minimal" and "override" arguments).

@rsignell-usgs
Collaborator

Great news @martindurant! Which datasets have you tried so far? (I'll try the others.)

@martindurant
Member Author

Only the noaa-nwm-retro-v2.0-pds 10-file example, which is "example_multi" in the hdf module.

@lsterzinger
Collaborator

Can confirm that it works on my one day (144 files) of GOES netCDF data in Azure Blob storage. It took the processing time down from 55 minutes to 45 minutes, but only 25 of those minutes were CPU time, so I'm not sure what's taking up the rest.

@martindurant
Member Author

martindurant commented Jun 26, 2021 via email

@lsterzinger
Collaborator

That's the time it takes to create a single reference with MultiZarrToZarr.translate() from 144 existing reference JSONs. I'm using:

mzz = MultiZarrToZarr(
    json_list,
    # "zip://jsons/*.json::combined.zip",
    remote_protocol='az',
    remote_options={
        'account_name': 'goeseuwest'
    },
    xarray_open_kwargs={
        'decode_cf': False,
        'mask_and_scale': False,
        'decode_times': False,
        'use_cftime': False,
        'decode_coords': False,
    },
    xarray_concat_args={
        "data_vars": "minimal",
        "coords": "minimal",
        "compat": "override",
        "join": "override",
        "combine_attrs": "override",
        "dim": "t",
    },
)

mzz.translate('combined.json')

@lsterzinger
Collaborator

I'm thinking this might be something to do with the Azure support in fsspec. #31 and fsspec/filesystem_spec#681 did help my access time, but opening a filesystem from a single reference JSON pointing at 144 netCDF files (combined.json, created with the code above) takes 17 minutes using:

fs = fsspec.filesystem('reference', fo='combined.json',
                       remote_protocol='az', remote_options={'account_name': 'goeseuwest'})

Is that expected for a dataset of this size? Each file is ~300 MB.

@martindurant
Member Author

Just opening the filesystem only needs to read the JSON file. If you run a profile/snakeviz, you'll probably find that rendering the templates is taking a long time: what counts is the number of chunks, not the size of those chunks. You can probably pass simple_templates=True to improve the time significantly. Also, do you have ujson installed?
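
For instance, a minimal sketch of profiling the open under cProfile (assuming the simple_templates keyword is available in your fsspec version):

import cProfile

import fsspec

# profile the filesystem open and dump stats to ./profile for snakeviz;
# simple_templates=True is the option suggested above
cProfile.run(
    "fsspec.filesystem('reference', fo='combined.json', "
    "remote_protocol='az', remote_options={'account_name': 'goeseuwest'}, "
    "simple_templates=True)",
    "profile",
)
# then visualize with: snakeviz ./profile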

@martindurant
Member Author

To the earlier question of why it takes so long to do the combination: you should turn on logging in adlfs or ReferenceFileSystem, to see what remote objects are being fetched. With the "minimal" and "override" options, I would have thought there are not too many, but I might be mistaken. The fact that CPU time is only a fraction of the total time suggests that there is a significant contribution from latency, waiting for the remote server.
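
A sketch of turning that logging on (logger names assumed from the adlfs and fsspec sources):

import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger("adlfs").setLevel(logging.DEBUG)             # Azure requests
logging.getLogger("fsspec.reference").setLevel(logging.DEBUG)  # reference lookups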

Also note that the opening of the individual filesystems and datasets as input to the combine class could be done in parallel using dask, much as the parallel= option does in xarray.open_mfdataset.
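
Roughly like this (a hypothetical sketch, not something the class does today; open_one is an assumed helper and json_list is the same list as in the snippet above):

import dask
import fsspec
import xarray as xr

@dask.delayed
def open_one(ref_json):
    # hypothetical helper: one reference filesystem + dataset per input JSON
    fs = fsspec.filesystem(
        "reference",
        fo=ref_json,
        remote_protocol="az",
        remote_options={"account_name": "goeseuwest"},
    )
    # backend kwargs may vary by xarray version
    return xr.open_dataset(fs.get_mapper(""), engine="zarr", decode_cf=False)

# open all inputs concurrently, much like parallel=True in xr.open_mfdataset
datasets = dask.compute(*[open_one(j) for j in json_list])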

I suggest, in any case, that we merge this so that we can iterate and move on with things such as the docs.

@rsignell-usgs
Collaborator

@martindurant and @lsterzinger , yes, let's merge this and keep on testing.

martindurant merged commit 67ccf71 into fsspec:main on Jun 28, 2021
martindurant deleted the combine_api branch on June 28, 2021 at 15:40
@martindurant
Member Author

One optimisation I absolutely should have mentioned is inline_threshold in SingleHdf5ToZarr. It's hard to know what the best value is here; it's a trade-off between small reads and inflating the output JSON, but I've typically been using 100-500 bytes.
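
For example (a sketch; the URL is a made-up placeholder, and the import path assumes this repo's hdf module):

import fsspec

from fsspec_reference_maker.hdf import SingleHdf5ToZarr

url = "az://goeseuwest/path/to/one_file.nc"  # placeholder, not a real object
with fsspec.open(url, "rb", account_name="goeseuwest") as f:
    # chunks smaller than ~300 bytes get embedded directly in the output JSON,
    # trading a larger reference file for fewer tiny remote reads
    refs = SingleHdf5ToZarr(f, url, inline_threshold=300).translate()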

Finally, I have the comment in #25 that might be relevant: we make no attempt to join fetches with overlapping/contiguous/near-contiguous ranges. This would matter if several small pieces were being loaded from each of the files.

@lsterzinger
Collaborator

@martindurant specifying simple_templates=True in fsspec.filesystem() didn't seem to do much.

How do I turn on logging for adlfs? I'm specifying remote_protocol='az' instead of 'abfs'; is there a difference between the two? It seems like they both use adlfs.

I also ran fsspec.filesystem() with a profiler, but I'm not 100% confident in my ability to interpret the results. I've attached the profile file here; I was able to visualize it with snakeviz ./profile

profile.zip

@martindurant
Member Author

"az" and "abfs" are the same

@martindurant
Member Author

ALL of the time is spent in rendering templates with jinja2. Do use simple_templates=True; this should avoid jinja entirely. I don't know why it's not making a difference for you.
https://github.com/intake/filesystem_spec/blob/master/fsspec/implementations/reference.py#L229

@lsterzinger
Collaborator

lsterzinger commented Jun 28, 2021 via email

@lsterzinger
Collaborator

@martindurant looks like it's calling jinja2 from _process_references1(), where jinja is used to render regardless:

https://github.com/intake/filesystem_spec/blob/1e5263a3f38af4ba64a5dbaf0414707ed937826d/fsspec/implementations/reference.py#L286

@lsterzinger
Collaborator

At least the call stack says it's called from that function, but it seems like it would only go to jinja if there's '{{' in u.
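
i.e. the guard is meant to work like this standalone paraphrase (not the exact fsspec source):

import jinja2

templates = {"u": "az://goeseuwest"}
u = "{{u}}/file_0001.nc"

# only pay the jinja2 cost when the reference actually contains a template;
# plain URLs (no "{{") should skip Template() entirely
if "{{" in u:
    u = jinja2.Template(u).render(**templates)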

@martindurant
Member Author

Hah! It looks like the following line should simply not be there; it takes time and has no effect. Doubtless this is a bad merge.

@lsterzinger
Collaborator

Okay yeah that makes sense. Should I open a PR to remove it?

@lsterzinger
Collaborator

Removing that line brought the time down from ~20 minutes to just 16 seconds. I think you found the culprit!

@martindurant
Member Author

martindurant commented Jun 28, 2021 via email

@martindurant
Member Author

I can think of an even faster way to do it, if the 16s seems too long (if indeed most of the time is still spent in string substitution).
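
For instance, plain str.replace substitution instead of jinja2 (a hypothetical sketch; render_simple is a made-up name, and this only handles literal {{name}} placeholders with no jinja logic):

def render_simple(u, templates):
    # hypothetical: substitute each "{{name}}" placeholder with plain str.replace
    for name, value in templates.items():
        u = u.replace("{{" + name + "}}", value)
    return u

assert render_simple("{{u}}/file_0001.nc", {"u": "az://goeseuwest"}) \
    == "az://goeseuwest/file_0001.nc"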
