Notebook crashes after calling .to_dask_dataframe #6811
I ran this script in an ipython session. It appears that I'm mistaken in my understanding of xarray: I thought that xarray creates the Dataset/DataArray lazily, but the high memory usage (for a brief time) indicates that this is not the case. This brings up yet another question: how come the memory usage comes back down to 4 GB (from its peak) afterwards? I have been unable to find answers to my questions in the documentation. Can you please point me to docs (user or developer) that can help me clear up my misunderstandings? Thanks in advance!
You can check whether it computes using the following:

```python
from xarray.tests import raise_if_dask_computes

with raise_if_dask_computes():
    ds.to_dask_dataframe()
```

Does that raise an error?
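For reference, a self-contained sketch of running that check on a small dask-backed Dataset; the variable name, dimension names, and sizes are made up for illustration:

```python
import dask.array as da
import xarray as xr
from xarray.tests import raise_if_dask_computes

# Small synthetic dask-backed Dataset (names and sizes are illustrative only).
ds = xr.Dataset(
    {"var": (("x", "y"), da.zeros((1_000, 1_000), chunks=(100, 100)))}
)

# raise_if_dask_computes raises an error if anything inside the block
# triggers a dask compute; if the conversion is lazy, this passes silently.
with raise_if_dask_computes():
    ddf = ds.to_dask_dataframe()
```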
No, I don't think this raised an error. This is what I see in my ipython session (from running the snippet above):
Well, then it isn't computing. Depending on the shape of your DataArray, there's potentially a reshape involved, which can be expensive in parallel: https://docs.dask.org/en/stable/array-chunks.html#reshaping. It does manifest as large memory usage.
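To make the point about reshaping concrete, here is a small sketch (array sizes and chunk shapes are invented) comparing the task graphs dask builds when flattening a blockwise-chunked array versus one whose chunks span whole rows:

```python
import dask.array as da

# Illustrative only: sizes and chunk shapes below are made up.
# When 2-D chunks do not span whole rows, flattening forces dask to rechunk
# and gather pieces from many blocks, inflating the task graph and peak
# memory; chunks spanning the full trailing axis reshape blockwise.
blocky = da.zeros((40_000, 40_000), chunks=(4_000, 4_000))
rowwise = da.zeros((40_000, 40_000), chunks=(4_000, -1))  # full-width chunks

print(len(blocky.reshape(-1).__dask_graph__()))   # noticeably more tasks
print(len(rowwise.reshape(-1).__dask_graph__()))  # roughly one task per row-chunk
```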
Hmm... I read that page! If it (what is the "it" here: dask or xarray?) isn't computing, what are the values for …? I tried rechunking, which chose the chunk size of (4000, 4000).
Using chunks of shape (4000, 4000) isn't very different from what we were using originally (10_000, 10_000), so I'm not surprised the results are the same. After reading https://docs.dask.org/en/stable/array-chunks.html#reshaping I thought we could avoid the problem they discuss by making the chunks take up the entire width of the array.
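A sketch of what that full-width rechunking might look like; the Zarr path and the dimension names ("x", "y") are hypothetical, not from the original script:

```python
import xarray as xr

# Hypothetical path and dimension names, purely for illustration.
ds = xr.open_zarr("data.zarr")

# A chunk size of -1 means "one chunk spanning the whole dimension", so each
# chunk covers the entire width of the array along "y".
ds = ds.chunk({"x": 4_000, "y": -1})

ddf = ds.to_dask_dataframe()
```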
This is surprising. The reshape is the expensive step, but with this chunking it should be operating blockwise.
We mainly want to do the same thing as @lewfish.
(I tried lots of different possible chunkings...)
What happened?
We are trying to convert a 17 GB Zarr dataset to Parquet using xarray by calling xr.to_dask_dataframe and then ddf.to_parquet. When calling to_dask_dataframe, the notebook crashes with "Kernel Restarting: The kernel for debug/minimal.ipynb appears to have died. It will restart automatically." We also find this occurs when using a synthetic dataset of the same size, which we create in the example below.

What did you expect to happen?
We expected a Dask dataframe object to be created lazily and not crash the notebook. We expected the operation to be lazy based on the source code and the following line in the docs: "For datasets containing dask arrays where the data should be lazily loaded, see the Dataset.to_dask_dataframe() method."
Minimal Complete Verifiable Example
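The example code did not survive the copy; below is a reconstruction of the kind of synthetic reproduction described above. The array size, chunk shape, and variable names are assumptions, not the reporter's exact script:

```python
import dask.array as da
import xarray as xr

# Roughly 17 GB of float64 data, backed by dask so nothing is loaded eagerly.
# Sizes, chunk shape, and names are assumptions for illustration.
n = 46_000
data = da.random.random((n, n), chunks=(10_000, 10_000))
ds = xr.Dataset({"var": (("x", "y"), data)})

# This call is expected to be lazy, but once the array is large enough the
# notebook kernel dies here.
ddf = ds.to_dask_dataframe()
```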
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
This operation crashes when the size of the array is above some (presumably machine specific) threshold, and works below it. You may need to play with the array size to replicate this behavior.
Environment
INSTALLED VERSIONS
commit: None
python: 3.9.12 | packaged by conda-forge | (main, Mar 24 2022, 23:25:59)
[GCC 10.3.0]
python-bits: 64
OS: Linux
OS-release: 5.4.196-108.356.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None
xarray: 2022.3.0
pandas: 1.4.2
numpy: 1.22.3
scipy: None
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.12.0
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2022.05.0
distributed: 2022.5.0
matplotlib: 3.5.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.0
cupy: None
pint: None
sparse: None
setuptools: 62.3.1
pip: 22.1
conda: None
pytest: None
IPython: 8.3.0
sphinx: None