Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs on specifying chunks in to_zarr encoding arg #6542

Merged
merged 11 commits into from
Jun 23, 2022

Conversation

delgadom
Copy link
Contributor

The structure of the Dataset.to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed) and it took a bit of trial and error to figure out out the rules. Hoping this docs block is helpful to others!

Documentation additions only (no workflow stages)

delgadom and others added 2 commits April 29, 2022 11:36
The structure of the to_zarr encoding argument is particular to xarray (at least, it's not immediately obvious from the zarr docs how this argument gets parsed) and it took a bit of trial and error to figure out out the rules. Hoping this docs block is helpful to others!
@max-sixty
Copy link
Collaborator

This looks very well-written @delgadom , thank you!

I'll let someone who knows the content better do a review too.

@delgadom
Copy link
Contributor Author

hmmm seems I've messed something up in the docs build. apologize for the churn - will fix.

@delgadom
Copy link
Contributor Author

delgadom commented May 1, 2022

sorry I know I said I'd fix this but I'm having a very hard time figuring out what is wrong and how to build the docs. I had to set up a docker image to build them because I couldn't get the ipython directive to use the right conda env on my laptop, and now I'm getting a version error when pandas encounters the default build number on my fork. I'm a bit embarrassed that I can't figure this out, but... I think I might need a hand getting this across the finish line 😕

....
#13 1597.6 reading sources... [ 97%] getting-started-guide/faq
#13 1600.8 reading sources... [ 97%] getting-started-guide/index
#13 1600.8 reading sources... [ 97%] getting-started-guide/installing
#13 1601.0 reading sources... [ 97%] getting-started-guide/quick-overview
#13 1609.3 WARNING:
#13 1609.3 >>>-------------------------------------------------------------------------
#13 1609.3 Exception in /xarray/doc/getting-started-guide/quick-overview.rst at block ending on line 169
#13 1609.3 Specify :okexcept: as an option in the ipython:: block to suppress this message
#13 1609.3 ---------------------------------------------------------------------------
#13 1609.3 ImportError                               Traceback (most recent call last)
#13 1609.3 Input In [40], in <cell line: 1>()
#13 1609.3 ----> 1 series.to_xarray()
#13 1609.3
#13 1609.3 File /srv/conda/envs/xarray-doc/lib/python3.9/site-packages/pandas/core/generic.py:3173, in NDFrame.to_xarray(self)
#13 1609.3    3096 @final
#13 1609.3    3097 def to_xarray(self):
#13 1609.3    3098     """
#13 1609.3    3099     Return an xarray object from the pandas object.
#13 1609.3    3100
#13 1609.3    (...)
#13 1609.3    3171         speed    (date, animal) int64 350 18 361 15
#13 1609.3    3172     """
#13 1609.3 -> 3173     xarray = import_optional_dependency("xarray")
#13 1609.3    3175     if self.ndim == 1:
#13 1609.3    3176         return xarray.DataArray.from_series(self)
#13 1609.3
#13 1609.3 File /srv/conda/envs/xarray-doc/lib/python3.9/site-packages/pandas/compat/_optional.py:164, in import_optional_dependency(name, extra, errors, min_version)
#13 1609.3     162             return None
#13 1609.3     163         elif errors == "raise":
#13 1609.3 --> 164             raise ImportError(msg)
#13 1609.3     166 return module
#13 1609.3
#13 1609.3 ImportError: Pandas requires version '0.15.1' or newer of 'xarray' (version '0.1.dev1+gfdf7303' currently installed).
#13 1609.3
#13 1609.3 <<<-------------------------------------------------------------------------
#13 1609.3
#13 1609.3 Exception occurred:
#13 1609.3   File "/srv/conda/envs/xarray-doc/lib/python3.9/site-packages/IPython/sphinxext/ipython_directive.py", line 584, in process_input
#13 1609.3     raise RuntimeError('Non Expected exception in `{}` line {}'.format(filename, lineno))
#13 1609.3 RuntimeError: Non Expected exception in `/xarray/doc/getting-started-guide/quick-overview.rst` line 169
#13 1609.3 The full traceback has been saved in /tmp/sphinx-err-a01y480j.log, if you want to report the issue to the developers.
#13 1609.3 Please also report this if it was a user error, so that a better error message can be provided next time.
#13 1609.3 A bug report can be filed in the tracker at <https://github.com/sphinx-doc/sphinx/issues>. Thanks!
#13 1611.4 make: *** [Makefile:58: html] Error 2
#13 1611.6 ERROR conda.cli.main_run:execute(41): `conda run /bin/bash -c make html` failed. (See above for error)
#13 ERROR: executor failed running [conda run --no-capture-output -n xarray-doc /bin/bash -c make html]: exit code: 2
------
 > [10/10] RUN make html:
------
executor failed running [conda run --no-capture-output -n xarray-doc /bin/bash -c make html]: exit code: 2

@dcherian dcherian requested a review from rabernat May 2, 2022 19:32
@dcherian
Copy link
Contributor

dcherian commented May 2, 2022

Does this part of the docs help

@andersy005
Copy link
Member

ImportError: Pandas requires version '0.15.1' or newer of 'xarray' (version '0.1.dev1+gfdf7303' currently installed).

looking at the reported xarray version, i'm curious... 🧐 are you using a shallow git clone of xarray? I'm able to reproduce the version issue via these steps:

git clone --depth 1 [email protected]:pydata/xarray.git
cd xarray
python -m pip install -e . 
 conda list xarray                                                                                                                              ─╯
# packages in environment at /Users/andersy005/mambaforge/envs/test:
#
# Name                    Version                   Build  Channel
xarray                    0.1.dev1+g126051f           dev_0    <develop>

doc/user-guide/io.rst Outdated Show resolved Hide resolved
doc/user-guide/io.rst Outdated Show resolved Hide resolved
doc/user-guide/io.rst Outdated Show resolved Hide resolved
doc/user-guide/io.rst Outdated Show resolved Hide resolved
doc/user-guide/io.rst Outdated Show resolved Hide resolved
doc/user-guide/io.rst Outdated Show resolved Hide resolved
@delgadom
Copy link
Contributor Author

delgadom commented May 5, 2022

!!! @andersy005 thank you so much! yes - I was using a shallow clone inside the docker version of the build. I really appreciate the review and for catching my error. I'll clean this up and push the changes.

@delgadom
Copy link
Contributor Author

delgadom commented May 5, 2022

@dcherian - yep I was trying to follow that guide closely but was still struggling with building using conda on my laptop's miniconda environment. The sphinx ipython directive kept running in the wrong conda environment, even after deleting sphinx, ipython, and ipykernel on my base env and making sure the env I was running make from had all the dependencies 🤷🏼‍♂️ . The docker image helped with the conda version issue but then I ran afoul of the xarray version because I used a shallow clone. oopps 😞

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fantastic! Thanks so much @delgadom.

Reading these docs make me think that perhaps we should consider changing the defaults to something more intuitive. But we can address that in the future.

@dcherian dcherian enabled auto-merge (squash) May 6, 2022 14:29
@delgadom
Copy link
Contributor Author

delgadom commented May 6, 2022

now I'm getting reproducible build timeouts 😭

The docs build alone (just make html) takes around an hour on my laptop for xarray/main and just a bit longer on my branch. readthedocs seems to be timing out at 31 minutes. How does this usually work? or am I doing something wrong again?

Thanks for all the help everyone!

@jakirkham
Copy link

FWIW there's a similar doc page about chunk size in Dask that may be worth borrowing from

@max-sixty
Copy link
Collaborator

now I'm getting reproducible build timeouts 😭

The docs build alone (just make html) takes around an hour on my laptop for xarray/main and just a bit longer on my branch. readthedocs seems to be timing out at 31 minutes. How does this usually work? or am I doing something wrong again?

Thanks for all the help everyone!

That's not fun at all. FWIW most seem to pass, so you may have been unlucky.

I restarted the build by merging main into the branch. Let's hope this completes.

Thanks a lot for both the contribution and your patience @delgadom ...

@delgadom
Copy link
Contributor Author

thank you all for being patient with this PR! seems the build failed again for the same reason. I think there might be something wrong with my examples, though it beats me what the issue is. As far as I can tell, most builds come in somewhere in the mid 900s range on readthedocs but my branch consistently times out at 1900s. I'll see if I can muster a bit of time to run through the exact rtd build workflow and figure out what's going on but probably won't get to it until this weekend.

@delgadom
Copy link
Contributor Author

delgadom commented May 10, 2022

@jakirkham were you thinking a reference to the dask docs for more info on optimal chunk sizing and aligning with storage? or are you suggesting the proposed docs change is too complex?

I was trying to address the lack of documentation on specifying chunks within a zarr array for non-dask arrays/coordinates, but also covering the weedsy (but common) case of datasets with a mix of dask & in-memory arrays/coords like in my example. I have been frustrated by zarr stores I've written with a couple dozen array chunks and thousands of coordinate chunks for this reason, but it's definitely a gnarly topic to cover concisely :P

@jakirkham
Copy link

@jakirkham were you thinking a reference to the dask docs for more info on optimal chunk sizing and aligning with storage?

It could make sense to refer to or if similar ideas come up here it may be worth mentioning in this change

or are you suggesting the proposed docs change is too complex?

Not at all.

I was trying to address the lack of documentation on specifying chunks within a zarr array for non-dask arrays/coordinates, but also covering the weedsy (but common) case of datasets with a mix of dask & in-memory arrays/coords like in my example. I have been frustrated by zarr stores I've written with a couple dozen array chunks and thousands of coordinate chunks for this reason, but it's definitely a gnarly topic to cover concisely :P

If there's anything you need help with or would like to discuss, please don't hesitate to raise a Zarr issue. We also enabled GH discussions over there so if that fits better feel free to use that 🙂


.. ipython:: python

ds = xr.tutorial.open_dataset("rasm")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this an idealized example to avoid having to download the dataset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be wrong, but don't we use this elsewhere, too? In that case, pooch shouldn't re-download this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what I thought - it's definitely used in a handful of places already, including within an ipython directive in the sphinx docs (see e.g. zarr encoding specification).

@delgadom
Copy link
Contributor Author

If there's anything you need help with or would like to discuss, please don't hesitate to raise a Zarr issue. We also enabled GH discussions over there so if that fits better feel free to use that 🙂

@jakirkham appreciate that! And just to clarify my confusion/frustration in the past was simply around the issue that I'm documenting here, which I've finally figured out! so hopefully this will help resolve the problem for future users. I agree with Ryan that there might be some changes in the defaults that could be helpful here, though setting intuitive defaults other than zarr's defaults could get messy for complex datasets with a mix of dask and in-memory arrays. But I think that's all on the xarray side?

@Illviljan Illviljan added plan to merge Final call for comments and removed needs review labels Jun 18, 2022
@andersy005
Copy link
Member

Thank you, @delgadom!

@andersy005 andersy005 disabled auto-merge June 23, 2022 21:31
@andersy005 andersy005 merged commit ec41d65 into pydata:main Jun 23, 2022
@andersy005 andersy005 removed the plan to merge Final call for comments label Jun 23, 2022
@delgadom
Copy link
Contributor Author

@andersy005 last I saw this was still not building on readthedocs! I never figured out how to get around the build timeout are you sure this PR was good to go?

@andersy005
Copy link
Member

my bad :) thank you for reporting this in #6720. I'm going to look into what;s going on

dcherian added a commit to bzah/xarray that referenced this pull request Jun 24, 2022
* main: (129 commits)
  docs on specifying chunks in to_zarr encoding arg (pydata#6542)
  [skip-ci] List count under Aggregation (pydata#6711)
  Add `Dataset.dtypes` property (pydata#6706)
  try to import `UndefinedVariableError` from the new location (pydata#6701)
  DOC: note of how `chunks` can be defined (pydata#6696)
  pdyap version dependent client.open_url call (pydata#6656)
  use `pytest-reportlog` to generate upstream-dev CI failure reports (pydata#6699)
  [pre-commit.ci] pre-commit autoupdate (pydata#6694)
  Bump actions/setup-python from 3 to 4 (pydata#6692)
  Fix Dataset.where with drop=True and mixed dims (pydata#6690)
  pass kwargs through from save_mfdataset to to_netcdf (pydata#6686)
  Docs: indexing.rst finetuning (pydata#6685)
  use micromamba instead of mamba (pydata#6674)
  install the development version of `matplotlib` into the upstream-dev CI (pydata#6675)
  Add whatsnew section for v2022.06.0
  release notes for 2022.06.0rc0
  release notes for the pre-release (pydata#6676)
  more testpypi workflow fixes (pydata#6673)
  thin: add examples (pydata#6663)
  Update multidimensional-coords.ipynb (pydata#6672)
  ...
@keewis keewis mentioned this pull request Jun 25, 2022
1 task
@delgadom delgadom deleted the patch-1 branch July 1, 2022 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants