Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray.backends refactor #2261

Merged
merged 45 commits into from
Oct 9, 2018
Merged

xarray.backends refactor #2261

merged 45 commits into from
Oct 9, 2018

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Jul 1, 2018

A major refactor of xarray backend classes:

  • A new least-recently-used cache facilitates keeping track of open files rather than requiring them to be reopened.
  • A number of internal simplifications/clean-ups that should make backends easier to write and maintain:
    • PickleByReconstructionWrapper and DataStorePickleMixin have been eliminated in favor of CachingFIleManager.
    • Backends no longer store any state beyond their wrapped files. This fixes a bug where dask graphs were accidentally being serialized in pickled backend classes (oops!).
    • Locks are now setup by backend classes directly, rather than separately in to_netcdf/open_dataset.

xref #2121
fixes #1738
fixes #2376

  • Tests added
  • Tests passed

Benchmark numbers:

    before     after       ratio
  [66a8f8dd] [c8fbadcc]
+   65.27ms    89.91ms      1.38  dataset_io.IOReadSingleNetCDF3.time_load_dataset_netcdf4
+  866.65ms   960.37ms      1.11  dataset_io.IOReadMultipleNetCDF3Dask.time_load_dataset_netcdf4_with_time_chunks_multiprocessing
-  676.18ms   610.92ms      0.90  dataset_io.IOReadMultipleNetCDF3Dask.time_load_dataset_netcdf4_with_block_chunks
-   61.16ms    55.20ms      0.90  dataset_io.IOReadSingleNetCDF4.time_load_dataset_netcdf4
-  167.54ms   150.41ms      0.90  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_netcdf4_with_time_chunks
-     1.11s   981.78ms      0.89  dataset_io.IOReadMultipleNetCDF3Dask.time_load_dataset_scipy_with_time_chunks
-  247.82ms   219.50ms      0.89  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_netcdf4_with_block_chunks_vindexing
-   81.04ms    71.70ms      0.88  dataset_io.IOReadMultipleNetCDF4Dask.time_open_dataset_netcdf4_with_block_chunks_multiprocessing
-   79.17ms    70.01ms      0.88  dataset_io.IOReadMultipleNetCDF4Dask.time_open_dataset_netcdf4_with_time_chunks_multiprocessing
-  224.41ms   198.10ms      0.88  dataset_io.IOWriteMultipleNetCDF3.time_write_dataset_scipy
-   67.25ms    59.26ms      0.88  dataset_io.IOReadMultipleNetCDF3.time_open_dataset_netcdf4
-  462.51ms   402.37ms      0.87  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_scipy_with_block_chunks
-   80.73ms    68.47ms      0.85  dataset_io.IOReadMultipleNetCDF4Dask.time_open_dataset_netcdf4_with_time_chunks
-  221.97ms   184.93ms      0.83  dataset_io.IOReadSingleNetCDF4Dask.time_load_dataset_netcdf4_with_block_chunks_vindexing
-     1.00s   825.75ms      0.82  dataset_io.IOWriteSingleNetCDF3.time_write_dataset_netcdf4
-  462.15ms   376.22ms      0.81  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_scipy_with_time_chunks
-     3.27s      2.63s      0.81  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_scipy_with_block_chunks_oindexing
-  125.08ms    99.78ms      0.80  dataset_io.IOReadSingleNetCDF3.time_load_dataset_scipy
-  105.12ms    82.64ms      0.79  dataset_io.IOReadMultipleNetCDF3Dask.time_open_dataset_scipy_with_block_chunks
-  112.96ms    88.28ms      0.78  dataset_io.IOReadMultipleNetCDF3.time_open_dataset_scipy
-   25.36ms    19.09ms      0.75  dataset_io.IOReadSingleNetCDF3.time_orthogonal_indexing
-  204.93ms   142.00ms      0.69  dataset_io.IOWriteSingleNetCDF3.time_write_dataset_scipy
-   26.73ms    17.96ms      0.67  dataset_io.IOReadSingleNetCDF3.time_vectorized_indexing
-  102.62ms    68.54ms      0.67  dataset_io.IOReadMultipleNetCDF3Dask.time_open_dataset_scipy_with_time_chunks
-  212.34ms   139.11ms      0.66  dataset_io.IOReadSingleNetCDF3Dask.time_load_dataset_scipy_with_block_chunks_vindexing
-  100.13ms    62.25ms      0.62  dataset_io.IOReadMultipleNetCDF3Dask.time_open_dataset_netcdf4_with_time_chunks
-  822.44ms   467.54ms      0.57  dataset_io.IOReadMultipleNetCDF3.time_load_dataset_scipy
-  666.34ms   274.17ms      0.41  dataset_io.IOReadMultipleNetCDF3.time_load_dataset_netcdf4
-  593.33ms   214.17ms      0.36  dataset_io.IOWriteNetCDFDask.time_write
-     4.15s      1.49s      0.36  dataset_io.IOWriteNetCDFDaskDistributed.time_write

shoyer added 2 commits June 30, 2018 22:55
This is intended to replace both PickleByReconstructionWrapper and
DataStorePickleMixin with something more compartmentalized.

xref GH2121
Callable that opens a given file when called, returning a file
object.
mode : str, optional
If provided, passed to opener as a keyword argument.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W291 trailing whitespace

ExplicitFileManager, LazyFileManager, AutoclosingFileManager,
]

@pytest.mark.parametrize('manager_type', FILE_MANAGERS)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E302 expected 2 blank lines, found 1

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like what I'm seeing here. Mostly API questions for now, I did not review the tests yet.

class FileManager(object):
"""Base class for context managers for managing file objects.

Unlike files, FileManager objects should be safely. They must be explicitly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this description is missing a few words.

manager.close()
"""

def __init__(self, opener, mode=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also support **kwargs here. Or maybe that's all we should support here. Or, perhaps you are thinking opener would be a partial function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of opener as a partial, but I agree that it would probably be easier to understand if args and kwargs are passed directly.

class ExplicitFileManager(FileManager):
"""A file manager that holds a file open until explicitly closed.

This is mostly a reference implementation: must real use cases should use
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

must->most


def __init__(self, opener, mode=_DEFAULT_MODE):
self._opener = opener
# file has already been created, don't override when restoring
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you expand on this a bit? How do we KNOW that the file has already been created? I'm wondering if the mode switch should go after the file open line.

def __init__(self, opener, mode=_DEFAULT_MODE):
self._opener = opener
self._mode = mode
self._lock = threading.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts on allowing other locks to be passed in here? Do we need to support the CombinedLock concept as well?

@shoyer
Copy link
Member Author

shoyer commented Jul 9, 2018

@jhamman thanks for taking a look. I'm going to push another iteration of this shortly (OK, a major rewrite) where there is only a single FileManager object which uses an LRU cache.

@@ -1,5 +1,6 @@
import os
from collections import OrderedDict
import functools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'functools' imported but unused

@@ -0,0 +1,60 @@
import pickle

from xarray.backends.file_manager import FileManager, FILE_CACHE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'xarray.backends.file_manager.FILE_CACHE' imported but unused

manager.close()
"""

def __init__(self, opener, *args, **kwargs):
Copy link
Member Author

@shoyer shoyer Jul 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I want to use dependency injection for the cache (e.g., cache=FILE_CACHE), which unfortunately means that we'll need to change the signature here from using **kwargs.

Any opinions on what this should look like? I'm thinking maybe:

_DEFAULT = object()
def __init__(self, opener, *args, mode=_DEFAULT, kwargs=None, cache=FILE_CACHE)

"""

def __init__(self, opener, *args,
mode=_DEFAULT_MODE,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E999 SyntaxError: invalid syntax

@shoyer
Copy link
Member Author

shoyer commented Jul 11, 2018

OK, this is ready for review.

assert_identical(expected, actual)
with self.roundtrip(expected,
save_kwargs=fmtkw,
open_kwargs={'backend_kwargs': fmtkw}) as actual:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E501 line too long (81 > 79 characters)

@shoyer
Copy link
Member Author

shoyer commented Jul 29, 2018

As an experiment, I rewrote the SciPy netCDF backend to use FileManager:

  • The code is now significantly simpler -- all the ensure_open() business could simply be removed.
  • We used to see a bunch of warnings about not closing memory mapped files ("RuntimeWarning: Cannot close a netcdf_file opened with mmap=True, when netcdf_variables or arrays referring to its data still exist."). These have all gone away!
  • compute=False now magically works (I only had to remove the explicitly raised error!)

Copy link
Member

@jhamman jhamman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shoyer - nice work on this. I was skeptical on this one until I saw how it cleaned up the backend implementations. I'm sold!

I haven't looked at the tests just yet but will get to them this week.

self._mode = 'a'
self._key = self._make_key()
self._cache[self._key] = file
return file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer not to override the builtin file function here. Perhaps we can use fh or something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file is only a builtin on Python 2... are we still concerned about overriding it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this and I'm happy to hear it. (I can't wait to be done with Python 2)

value = self._cache[key]
# On Python 3, could just use: self._cache.move_to_end(key)
del self._cache[key]
self._cache[key] = value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on using the move_to_end here and catching the exception for python2 only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe a little helper function in pycompat that we can cleanup when python 2 is dropped.

def move_to_end(cache, key):
    try:
        cache.move_to_end(key)
    except AttributeError:
        del self._cache[key]
        self._cache[key] = value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, easy enough. Done.

"""
if maxsize < 0:
raise ValueError('maxsize must be non-negative')
self._maxsize = maxsize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we enforce maxsize is an integer? I'm thinking that it may be easy to see None/False as valid values. I think that case is going to break things downstream.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jhamman jhamman mentioned this pull request Jul 29, 2018
1 task
import logging
import multiprocessing
import threading
import time
import traceback
import warnings
import weakref
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'weakref' imported but unused

self._opener = opener
self._mode = mode
if (lock is None and mode != 'r'
and isinstance(filename_or_obj, basestring)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W503 line break before binary operator

@shoyer shoyer changed the title xarray.backends refactor to use an LRU cache xarray.backends refactor Sep 11, 2018
@shoyer shoyer mentioned this pull request Sep 19, 2018
@shoyer
Copy link
Member Author

shoyer commented Sep 27, 2018

I'd love to move this forward. I think it will fix some serious usability and performance issues with distributed reads/writes of netCDF files.

@jhamman
Copy link
Member

jhamman commented Sep 27, 2018

I'd also be happy to see this go in. We could use a review from someone other than me.

@shoyer
Copy link
Member Author

shoyer commented Sep 27, 2018

At some point soon I'm just going to merge this, more review or not! Hopefully a release candidate will catch any major issues.

@pep8speaks
Copy link

pep8speaks commented Oct 8, 2018

Hello @shoyer! Thanks for updating the PR.

Line 139:17: W504 line break after binary operator

Line 2639:33: W504 line break after binary operator
Line 2640:33: W504 line break after binary operator

Comment last updated on October 09, 2018 at 02:31 Hours UTC

@jhamman
Copy link
Member

jhamman commented Oct 9, 2018

based on the arrival of #2476 (!), I suggest we merge this. I think we've had enough review to justify this being put into a release candidate in the relatively near future.

@shoyer
Copy link
Member Author

shoyer commented Oct 9, 2018

Yep, that's my plan. I just did a read through code again and identified a few unreachable lines, which I removed. I'll merge when CI passes.

@shoyer shoyer merged commit 289b377 into pydata:master Oct 9, 2018
@jhamman
Copy link
Member

jhamman commented Oct 9, 2018

Nice work on this @shoyer. Really excited to set this free.

dcherian pushed a commit to maahn/xarray that referenced this pull request Oct 10, 2018
* master: (51 commits)
  xarray.backends refactor (pydata#2261)
  Fix indexing error for data loaded with open_rasterio (pydata#2456)
  Properly support user-provided norm. (pydata#2443)
  pep8speaks (pydata#2462)
  isort (pydata#2469)
  tests shoudn't need to pass for a PR (pydata#2471)
  Replace the last of unittest with pytest (pydata#2467)
  Add python_requires to setup.py (pydata#2465)
  Update whats-new.rst (pydata#2466)
  Clean up _parse_array_of_cftime_strings (pydata#2464)
  plot.contour: Don't make cmap if colors is a single color. (pydata#2453)
  np.AxisError was added in numpy 1.13 (pydata#2455)
  Add CFTimeIndex.shift (pydata#2431)
  Fix FutureWarning in CFTimeIndex.date_type (pydata#2448)
  fix:2445 (pydata#2446)
  Enable use of cftime.datetime coordinates with differentiate and interp (pydata#2434)
  restore ddof support in std (pydata#2447)
  Future warning for default reduction dimension of groupby (pydata#2366)
  Remove incorrect statement about "drop" in the text docs (pydata#2439)
  Use profile mechanism, not no-op mutation (pydata#2442)
  ...
dcherian pushed a commit to dcherian/xarray that referenced this pull request Oct 10, 2018
* master: (21 commits)
  xarray.backends refactor (pydata#2261)
  Fix indexing error for data loaded with open_rasterio (pydata#2456)
  Properly support user-provided norm. (pydata#2443)
  pep8speaks (pydata#2462)
  isort (pydata#2469)
  tests shoudn't need to pass for a PR (pydata#2471)
  Replace the last of unittest with pytest (pydata#2467)
  Add python_requires to setup.py (pydata#2465)
  Update whats-new.rst (pydata#2466)
  Clean up _parse_array_of_cftime_strings (pydata#2464)
  plot.contour: Don't make cmap if colors is a single color. (pydata#2453)
  np.AxisError was added in numpy 1.13 (pydata#2455)
  Add CFTimeIndex.shift (pydata#2431)
  Fix FutureWarning in CFTimeIndex.date_type (pydata#2448)
  fix:2445 (pydata#2446)
  Enable use of cftime.datetime coordinates with differentiate and interp (pydata#2434)
  restore ddof support in std (pydata#2447)
  Future warning for default reduction dimension of groupby (pydata#2366)
  Remove incorrect statement about "drop" in the text docs (pydata#2439)
  Use profile mechanism, not no-op mutation (pydata#2442)
  ...
@shoyer shoyer deleted the file-manager branch October 11, 2018 03:07
Chilipp added a commit to psyplot/psyplot that referenced this pull request Nov 9, 2018
The xarray.backends.api.to_netcdf function has been changed in pydata/xarray#2261
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants