docs on specifying chunks in to_zarr encoding arg #6542
Conversation
The structure of the Dataset.to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed), and it took a bit of trial and error to figure out the rules. Hoping this docs block is helpful to others!
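For concreteness, here is a minimal sketch of the pattern the docs block covers: per-variable chunk shapes are passed under a "chunks" key in each variable's entry of the encoding dict (the store path and sizes below are purely illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"foo": (("x", "y"), np.random.rand(100, 100))})

# chunk shapes for each variable go under encoding[<name>]["chunks"]
ds.to_zarr(
    "example.zarr",  # illustrative store path
    mode="w",
    encoding={"foo": {"chunks": (50, 50)}},
)
```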
This looks very well-written @delgadom, thank you! I'll let someone who knows the content better do a review too.
hmmm seems I've messed something up in the docs build. Apologies for the churn - will fix.
sorry I know I said I'd fix this but I'm having a very hard time figuring out what is wrong and how to build the docs. I had to set up a docker image to build them because I couldn't get the ipython directive to use the right conda env on my laptop, and now I'm getting a version error when pandas encounters the default build number on my fork. I'm a bit embarrassed that I can't figure this out, but... I think I might need a hand getting this across the finish line 😕
Does this part of the docs help?
looking at the reported xarray version, i'm curious... 🧐 are you using a shallow git clone of xarray? I'm able to reproduce the version issue via these steps:

git clone --depth 1 [email protected]:pydata/xarray.git
cd xarray
python -m pip install -e .

conda list xarray
# packages in environment at /Users/andersy005/mambaforge/envs/test:
#
# Name       Version              Build     Channel
xarray       0.1.dev1+g126051f    dev_0     <develop>
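A likely fix, assuming the stale 0.1.dev version comes from setuptools-scm being unable to see any release tags in a shallow clone, is to unshallow the repository and reinstall:

git fetch --unshallow --tags
python -m pip install -e .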
!!! @andersy005 thank you so much! yes - I was using a shallow clone inside the docker version of the build. I really appreciate the review and for catching my error. I'll clean this up and push the changes.
@dcherian - yep I was trying to follow that guide closely but was still struggling with building using conda on my laptop's miniconda environment. The sphinx ipython directive kept running in the wrong conda environment, even after deleting sphinx, ipython, and ipykernel on my base env and making sure the env I was running the build in was correct.
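A quick sanity check for that kind of environment mismatch (sketched here as a suggestion; nothing in the thread confirms this was tried) is to have the directive report which interpreter it actually runs under:

```python
# place inside an .. ipython:: python block to see which interpreter
# and environment the sphinx ipython directive actually executes in
import sys

print(sys.executable)  # path to the python binary in use
print(sys.prefix)      # root of the active environment
```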
Co-authored-by: Anderson Banihirwe <[email protected]>
This is fantastic! Thanks so much @delgadom.
Reading these docs makes me think that perhaps we should consider changing the defaults to something more intuitive. But we can address that in the future.
now I'm getting reproducible build timeouts 😭 The docs build alone (just ...). Thanks for all the help everyone!
FWIW there's a similar doc page about chunk size in Dask that may be worth borrowing from
That's not fun at all. FWIW most seem to pass, so you may have been unlucky. I restarted the build by merging main into the branch. Let's hope this completes. Thanks a lot for both the contribution and your patience @delgadom ...
thank you all for being patient with this PR! seems the build failed again for the same reason. I think there might be something wrong with my examples, though it beats me what the issue is. As far as I can tell, most builds come in somewhere in the mid-900s (seconds) on readthedocs, but my branch consistently times out at 1900s. I'll see if I can muster a bit of time to run through the exact rtd build workflow and figure out what's going on but probably won't get to it until this weekend.
@jakirkham were you thinking a reference to the dask docs for more info on optimal chunk sizing and aligning with storage? Or are you suggesting the proposed docs change is too complex? I was trying to address the lack of documentation on specifying chunks within a zarr array for non-dask arrays/coordinates, but also covering the weedsy (but common) case of datasets with a mix of dask & in-memory arrays/coords like in my example. I have been frustrated by zarr stores I've written with a couple dozen array chunks and thousands of coordinate chunks for this reason, but it's definitely a gnarly topic to cover concisely :P
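The mixed case might look like the sketch below (sizes and store path invented for illustration; exact defaults can vary with the xarray/zarr versions): the dask-backed data variable takes its zarr chunks from its dask chunks, while the in-memory coordinates can each be pinned to a single chunk via encoding so they aren't written as thousands of tiny chunks.

```python
import numpy as np
import xarray as xr

# dask-backed data variable, in-memory (non-dask) dimension coordinates
ds = xr.Dataset(
    {"data": (("time", "x"), np.random.rand(1000, 1000))},
    coords={"time": np.arange(1000), "x": np.arange(1000)},
).chunk({"time": 100})

# "data" gets its zarr chunks from dask; pin each coordinate to one chunk
ds.to_zarr(
    "mixed.zarr",  # illustrative store path
    mode="w",
    encoding={"time": {"chunks": (1000,)}, "x": {"chunks": (1000,)}},
)
```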
It could make sense to refer to that page, or if similar ideas come up here, they may be worth mentioning in this change.
Not at all.
If there's anything you need help with or would like to discuss, please don't hesitate to raise a Zarr issue. We also enabled GH discussions over there, so if that fits better feel free to use that 🙂
.. ipython:: python

    ds = xr.tutorial.open_dataset("rasm")
Can we make this an idealized example to avoid having to download the dataset?
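One possible synthetic stand-in (shapes invented here purely for illustration) might be:

```python
import numpy as np
import xarray as xr

# an idealized dataset in place of the "rasm" tutorial data, so the
# docs build never needs a network download
ds = xr.Dataset({"Tair": (("time", "y", "x"), np.random.rand(36, 205, 275))})
```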
I might be wrong, but don't we use this elsewhere, too? In that case, pooch shouldn't re-download this.
that's what I thought - it's definitely used in a handful of places already, including within an ipython directive in the sphinx docs (see e.g. zarr encoding specification).
@jakirkham appreciate that! And just to clarify, my confusion/frustration in the past was simply around the issue I'm documenting here, which I've finally figured out! So hopefully this will help resolve the problem for future users. I agree with Ryan that there might be some changes in the defaults that could be helpful here, though setting intuitive defaults other than zarr's defaults could get messy for complex datasets with a mix of dask and in-memory arrays. But I think that's all on the xarray side?
Thank you, @delgadom!
@andersy005 last I saw this was still not building on readthedocs! I never figured out how to get around the build timeout. Are you sure this PR was good to go?
my bad :) thank you for reporting this in #6720. I'm going to look into what's going on
* main: (129 commits)
  - docs on specifying chunks in to_zarr encoding arg (pydata#6542)
  - [skip-ci] List count under Aggregation (pydata#6711)
  - Add `Dataset.dtypes` property (pydata#6706)
  - try to import `UndefinedVariableError` from the new location (pydata#6701)
  - DOC: note of how `chunks` can be defined (pydata#6696)
  - pdyap version dependent client.open_url call (pydata#6656)
  - use `pytest-reportlog` to generate upstream-dev CI failure reports (pydata#6699)
  - [pre-commit.ci] pre-commit autoupdate (pydata#6694)
  - Bump actions/setup-python from 3 to 4 (pydata#6692)
  - Fix Dataset.where with drop=True and mixed dims (pydata#6690)
  - pass kwargs through from save_mfdataset to to_netcdf (pydata#6686)
  - Docs: indexing.rst finetuning (pydata#6685)
  - use micromamba instead of mamba (pydata#6674)
  - install the development version of `matplotlib` into the upstream-dev CI (pydata#6675)
  - Add whatsnew section for v2022.06.0
  - release notes for 2022.06.0rc0
  - release notes for the pre-release (pydata#6676)
  - more testpypi workflow fixes (pydata#6673)
  - thin: add examples (pydata#6663)
  - Update multidimensional-coords.ipynb (pydata#6672)
  - ...