Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make xr.corr and xr.map_blocks work without dask #5731

Merged
merged 14 commits into from
Nov 24, 2021
Merged

Conversation

Gijom
Copy link
Contributor

@Gijom Gijom commented Aug 23, 2021

Gijom added 2 commits August 23, 2021 08:46
Calls to dask functions are replaced by calls to the
pycompat functions
@Gijom
Copy link
Contributor Author

Gijom commented Aug 23, 2021

Concerning the tests I added a test to check that lazy correlations have identical results to non-lazy correlations. But this is not directly related to the bug fix. Indeed there is a need to have a test that checks if the non-lazy correlation works without dask installed. The already existing test function test_corr does not have a @requires_dask decorator. But it did not fail as it should have. Not sure why.

About documenting the changes, I am not sure of the whats-new.rst format and I thus did not add it. Should I just add a bullet item on the top of the file ?

xarray/core/parallel.py Outdated Show resolved Hide resolved
@dcherian
Copy link
Contributor

dcherian commented Aug 23, 2021

Thanks @Gijom!

he already existing test function test_corr does not have a @requires_dask decorator. But it did not fail as it should have. Not sure why.

I can't tell either. It would be good to track this down.

I am not sure of the whats-new.rst format and I thus did not add it. Should I just add a bullet item on the top of the file ?

Yes just copy an existing entry and modify it.

Gijom added 2 commits August 23, 2021 17:33
The fonction is used to test for lazy datasets in map_blocks
(parallel.py)
@Gijom
Copy link
Contributor Author

Gijom commented Aug 24, 2021

I included the proposed changes, thanks for the help.

However the new test I implemented does not go through and I fail to understand why. Would you be able to check ? The idea of the test is to ensure that lazy computations give the same results than normal ones. I tracked down the problem to line 1377 of computation.py:

demeaned_da_a = da_a - da_a.mean(dim=dim)  # <-- this one returns nan upon computation (.compute())
demeaned_da_b = da_b - da_b.mean(dim=dim)  # <-- this one returns the correct value although the input is masked with nans in the same way

The values before the mean call are:

da_a.compute() = <xarray.DataArray <this-array> (x: 2, time: 2)>
array([[ 1.,  2.],
       [ 2., nan]])
da_b.compute() = <xarray.DataArray <this-array> (x: 2, time: 2)>
array([[ 1.,  2.],
       [ 1., nan]])

And for the means (dim is None):

da_a.mean(dim=dim).compute()=<xarray.DataArray <this-array> ()>
array(nan)
da_b.mean(dim=dim).compute()=<xarray.DataArray <this-array> ()>
array(1.33333333)

For non lazy computations everything seems fine.

@keewis
Copy link
Collaborator

keewis commented Aug 30, 2021

The already existing test function test_corr does not have a @requires_dask decorator. But it did not fail as it should have. Not sure why.

the reason for this is that dask is imported using importorskip at the top of the module:

dask = pytest.importorskip("dask")
Since most tests already have requires_dask I think it should be fine to just remove the importandskip call.

@github-actions
Copy link
Contributor

github-actions bot commented Aug 30, 2021

Unit Test Results

         6 files           6 suites   49m 53s ⏱️
16 246 tests 14 512 ✔️ 1 734 💤 0 ❌
90 672 runs  82 496 ✔️ 8 176 💤 0 ❌

Results for commit 3dde39b.

♻️ This comment has been updated with latest results.

Gijom added 3 commits August 30, 2021 18:36
The importskip was canceling all tests if dask is not present. Some
tests are relevant to be done without dask.

@require_dask was added to one test since it import dask directly
The tests do not pass for arrays 5 and 6. More work is needed to
indentify why
@Gijom
Copy link
Contributor Author

Gijom commented Aug 31, 2021

@keewis indeed this was the problem. I removed it and corrected one test which did not have the @require_dask decorator.

There is still a test error with the test arrays 5 and 6: the lazy array version do not return the same value than the usual corr. I consider this a different bug and just removed the tests with a TODO in the code.

The last commit pass the tests on my machine and I added the whats-new text to version 0.19.1.

@dcherian
Copy link
Contributor

Thanks @Gijom can you post the error for those tests please.

@Gijom
Copy link
Contributor Author

Gijom commented Aug 31, 2021

I was referring to:
#5731 (comment)

Otherwise I am working on the conflict with the main branch. Should be done soon.

@dcherian
Copy link
Contributor

dcherian commented Aug 31, 2021

Thanks @Gijom, I can't reproduce. Can you post the output of xr.show_versions()?

Here's mine

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.10 | packaged by conda-forge | (default, May 10 2021, 22:58:09) 
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 19.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.19.0
pandas: 1.3.1
numpy: 1.21.1
scipy: 1.5.3
netCDF4: 1.5.6
pydap: None
h5netcdf: 0.11.0
h5py: 3.3.0
Nio: None
zarr: 2.8.3
cftime: 1.5.0
nc_time_axis: 1.3.1
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: 3.0.4
bottleneck: 1.3.2
dask: 2021.07.2
distributed: 2021.07.2
matplotlib: 3.4.2
cartopy: 0.19.0.post1
seaborn: 0.11.1
numbagg: None
pint: 0.17
setuptools: 49.6.0.post20210108
pip: 21.2.3
conda: 4.10.3
pytest: 6.2.4
IPython: 7.26.0
sphinx: 4.1.2

@dcherian
Copy link
Contributor

oh never mind: I can reproduce with

da_a = xr.DataArray([[1, 2], [2, 1]], dims=["x", "time"])
da_b = xr.DataArray([[1, 2], [1, np.nan]], dims=["x", "time"])

dim = None

xr.testing.assert_equal(
    xr.corr(da_a, da_b, dim=dim), xr.corr(da_a.chunk(), da_b.chunk(), dim=dim)
)
xr.testing.assert_equal(
    xr.corr(da_a, da_b, dim=dim), xr.corr(da_a, da_b.chunk(), dim=dim)
)

@dcherian
Copy link
Contributor

Ouch here's the problem: da_a has dtype int if it has no NaNs. this dtype could change but need not change after the masking.

Importantly skipna=False by default for int dtypes (since they can't represent NaNs).

The fix is to specify skipna=True explicitly in the .mean calls.

Alternatively we could stick an .astype(float) in _get_valid_values

@TomNicholas TomNicholas mentioned this pull request Oct 23, 2021
@dcherian dcherian marked this pull request as draft October 24, 2021 09:56
@shoyer
Copy link
Member

shoyer commented Nov 23, 2021

Checking in here -- it seems like this PR surfaced an existing bug (incorrect calculation of correlation when using dask), at the same time as it corrected a different bug, which is removing the hard dependency on dask for calculating correlation?

If I'm understanding correctly, this is still a net win (fixing one bug), so it probably makes sense to merge it now and save the other bug fix for a later issue?

* upstream/main: (86 commits)
  Fixed a mispelling of dimension in dataarray documentation for from_dict (pydata#6020)
  [pre-commit.ci] pre-commit autoupdate (pydata#6014)
  [pre-commit.ci] pre-commit autoupdate (pydata#5990)
  Use set_options for asv bottleneck tests (pydata#5986)
  Fix module name retrieval in `backend.plugins.remove_duplicates()`, plugin tests (pydata#5959)
  Check for py version instead of try/except when importing entry_points (pydata#5988)
  Add "see also" in to_dataframe docs (pydata#5978)
  Alternate method using inline css to hide regular html output in an untrusted notebook (pydata#5880)
  Fix mypy issue with entry_points (pydata#5979)
  Remove pre-commit auto update (pydata#5958)
  Do not change coordinate inplace when throwing error (pydata#5957)
  Create CITATION.cff (pydata#5956)
  Add groupby & resample benchmarks (pydata#5922)
  Fix plot.line crash for data of shape (1, N) in _title_for_slice on format_item (pydata#5948)
  Disable unit test comments (pydata#5946)
  Publish test results from workflow_run only (pydata#5947)
  Generator for groupby reductions (pydata#5871)
  whats-new dev
  whats-new for 0.20.1 (pydata#5943)
  Docs: fix URL for PTSA (pydata#5935)
  ...
@dcherian dcherian marked this pull request as ready for review November 24, 2021 03:17
@dcherian
Copy link
Contributor

so it probably makes sense to merge it now and save the other bug fix for a later issue?

It was easy .Should be good to go if tests pass.

@dcherian dcherian changed the title Fix bug on duck access without dask Make xr.corr and xr.map_blocks work without dask Nov 24, 2021
Copy link
Collaborator

@mathause mathause left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need a test for #3391? @dcherian? Could also add a what's new entry for #3391.

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/tests/test_computation.py Outdated Show resolved Hide resolved
@Gijom
Copy link
Contributor Author

Gijom commented Nov 24, 2021

Sorry for the unresponsiveness. I confirm that using the asfloat conversion also works for me. Should I close the issue or should I let you do it ?

@dcherian
Copy link
Contributor

Thanks @Gijom

@dcherian dcherian merged commit 21e8484 into pydata:main Nov 24, 2021
@shoyer
Copy link
Member

shoyer commented Nov 24, 2021

I have a follow-up PR that simplifies the internal logic a little bit more: #6025

dcherian added a commit to dcherian/xarray that referenced this pull request Dec 2, 2021
* upstream/main:
  fix grammatical typo in docs (pydata#6034)
  Use condas dask-core in ci instead of dask to speedup ci and reduce dependencies (pydata#6007)
  Use complex nan by default when interpolating out of bounds (pydata#6019)
  Simplify missing value handling in xarray.corr (pydata#6025)
  Add pyXpcm to Related Projects doc page (pydata#6031)
  Make xr.corr and xr.map_blocks work without dask (pydata#5731)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dask error on xarray.corr map_blocks doesn't work when dask isn't installed
5 participants