
Fixes OS error arising from too many files open #1198

Merged
merged 1 commit into pydata:master from fix_too_many_open_files on Mar 23, 2017

Conversation

pwolfram
Contributor

@pwolfram pwolfram commented Jan 10, 2017

Previously, DataStore did not judiciously close files, so opening a large number of files could raise an OSError about too many open files. This merge provides a solution for the netCDF, scipy, and h5netcdf backends.
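
For context, a minimal usage sketch of the autoclose option this PR adds (the file paths and the variable name foo are hypothetical):

    import glob
    import xarray as xr

    # hypothetical file paths; 'foo' is an illustrative variable name
    paths = sorted(glob.glob('output/run_*.nc'))

    # autoclose=True (added by this PR, default False) closes each underlying
    # file after it is accessed instead of keeping every handle open at once
    ds = xr.open_mfdataset(paths, autoclose=True)
    print(ds.foo.sum().values)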

@pwolfram
Contributor Author

@pwolfram pwolfram changed the title Fixes OS error arrising from too many files open (netCDF and scripy backends) WIP: Fixes OS error arrising from too many files open Jan 10, 2017
Member

@shoyer shoyer left a comment

It's very nice to see some progress on this!

# netCDF4 only allows closing the root group
while ds.parent is not None:
ds = ds.parent
if ds.isopen():
Member

Maybe put this while loop in a helper function, something like _find_root?

Contributor Author

Sure, sounds good.
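
For reference, a minimal sketch of that helper, lifted directly from the loop in the diff above:

    def _find_root(ds):
        # netCDF4 only allows closing the root group, so walk up to it
        while ds.parent is not None:
            ds = ds.parent
        return ds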

@@ -249,6 +248,8 @@ def maybe_decode_store(store, lock=False):
else:
ds2 = ds

store.close()
Member

Probably guard this behind an option?

self._filename = filename
self._mode = 'a' if mode == 'w' else mode
self._opener = functools.partial(_open_netcdf4_group, filename,
Member

can we reuse the same partial created above for opener, maybe just with a different value for the mode argument? (would be nice to have less code duplication)

Contributor Author

Changed to self._opener = functools.partial(opener, mode=self._mode). Is this the best way to do this, compared with the previous usage of

    self._opener = opener(mode=self._mode,
                          group=group, clobber=clobber,
                          diskless=diskless, persist=persist,
                          format=format)

?

@@ -492,6 +492,21 @@ def create_tmp_file(suffix='.nc', allow_cleanup_failure=False):
if not allow_cleanup_failure:
raise

@contextlib.contextmanager
def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):
Member

Any reason why you can't reuse create_tmp_file internally here?

This would be most straightforwardly done with contextlib.ExitStack, though we would need a backport to Python 2.7:
https://docs.python.org/3.6/library/contextlib.html#contextlib.ExitStack

Contributor Author

Thanks @shoyer, this makes the code cleaner. However, it may run somewhat slower because more context managers are in play. It does remove redundant code, which is always a plus.
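
A minimal sketch of the ExitStack-based reuse being discussed (signatures taken from the test helpers shown above; ExitStack needs a backport on Python 2.7):

    import contextlib  # provides both contextmanager and ExitStack

    @contextlib.contextmanager
    def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):
        # reuse the single-file helper and let ExitStack unwind every
        # temporary file when the block exits
        with contextlib.ExitStack() as stack:
            files = [stack.enter_context(create_tmp_file(suffix,
                                                         allow_cleanup_failure))
                     for _ in range(nfiles)]
            yield files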

@pwolfram pwolfram changed the title WIP: Fixes OS error arrising from too many files open WIP: Fixes OS error arising from too many files open Jan 10, 2017
@pwolfram pwolfram force-pushed the fix_too_many_open_files branch 4 times, most recently from 7070fd7 to 349e7db Compare January 11, 2017 16:39
@pwolfram
Contributor Author

@shoyer, all the checks "pass" but there are still errors in the "allowed" list. If you get a chance, could you please give me some perspective on whether these are errors on my end or not? I'm not exactly sure how to interpret them.

Once I know this code is correct, I plan to address the inline comments you graciously highlighted above. I think we are getting close here, assuming I have enough testing to demonstrate we have accurately fixed the too-many-open-files issue. Any additional ideas you have for tests would be really helpful too.

@shoyer
Member

shoyer commented Jan 11, 2017

@pwolfram the allowed failures are pre-existing, not related to this change.

@pwolfram
Contributor Author

Thanks @shoyer. Does that mean if the checks pass the code is at least minimally correct in terms of not breaking previous design choices? E.g., does this imply that we are ok except for cleanup / implementation details on this PR?

@shoyer
Member

shoyer commented Jan 11, 2017

Does that mean if the checks pass the code is at least minimally correct in terms of not breaking previous design choices? E.g., does this imply that we are ok except for cleanup / implementation details on this PR?

If the checks pass, it means that this doesn't directly break anything we have tests for, which should cover most functionality. However, we'll still need to be careful not to introduce performance regressions -- we don't have any automated performance tests yet.

@pwolfram
Contributor Author

@shoyer, I just realized this might conflict with #1087. Do you foresee this causing problems, and in what order do you plan to merge this PR and #1087 (which obviously predates this one...)? We are running into the snag with #463 in our analysis, and my personal preference would be to get some type of solution in place sooner rather than later. Thanks for considering this request.

Also, I'm not sure of the best way to test performance either. Could we potentially use something like the "toy" test cases for this purpose? Ideally we would have a test case with O(100) files to get a clearer picture of the performance cost of this PR.

Please let me know what you want me to do with this PR -- should I clean it up in anticipation of a merge, or wait for now to see if there are extra things that need to be fixed via additional testing? Note that I have the full scipy, h5netcdf, and pynio implementations ready for review as well; they weren't available when you did your review yesterday.

@pwolfram pwolfram changed the title WIP: Fixes OS error arising from too many files open Fixes OS error arising from too many files open Jan 12, 2017
@shoyer
Member

shoyer commented Jan 12, 2017 via email

@shoyer
Member

shoyer commented Jan 12, 2017

This should be totally fine without performance or compatibility concerns as long as we set autoclose=False by default.

In the long term, it would be nice to handle autoclosing automatically (invoking it when the number of open files exceeds some limit), but we should probably be a little more clever for that.
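
Not part of this PR, but a hypothetical sketch of that longer-term idea (all names here are illustrative): keep a bounded registry of open files and close the least recently used one once a limit is exceeded.

    from collections import OrderedDict

    class OpenFileCache(object):
        """Illustrative only: cap the number of simultaneously open files."""

        def __init__(self, maxfiles=128):
            self.maxfiles = maxfiles
            self._cache = OrderedDict()  # key -> open file handle

        def acquire(self, key, opener):
            # return a cached handle, or open a new one and evict the
            # least recently used handle if the cap is exceeded
            if key in self._cache:
                handle = self._cache.pop(key)
            else:
                handle = opener()
            self._cache[key] = handle  # re-insert as most recently used
            if len(self._cache) > self.maxfiles:
                _, oldest = self._cache.popitem(last=False)
                oldest.close()
            return handle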

@pwolfram
Contributor Author

Thanks @shoyer. This makes sense. I think the path forward for the next round of edits should include making sure the existing tests that use open_mfdataset run with both autoclose options. If we do this, we future-proof against accidentally breaking this new functionality and avoid contaminating existing workflows with performance regressions.

Documentation is also obviously required.

FYI as a heads up, I probably won't be able to get to this until mid-week at the earliest, but it appears we are close to a viable solution.

@PeterDSteinberg

I appreciate your work on this too-many-open-files error -- I think your fixes will add a lot of value to the NetCDF multi-file functionality. In this notebook, which uses K-Means clustering on multi-file NetCDF data sets, I have repeatedly experienced the too-many-open-files error, even with attempts to adjust via ulimit. I can test the notebook again once this PR is finalized.

@pwolfram
Contributor Author

@PeterDSteinberg, did this PR fix the issue for you? I obviously need to update it, but I just wanted to confirm that the current branch resolves the too-many-open-files error. Also, do you have any idea of the performance impact of the changes I'm proposing?

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch from 5acafb6 to 1460a07 Compare January 31, 2017 19:50
@pwolfram
Contributor Author

@shoyer and @PeterDSteinberg I've updated this PR to reflect requested changes.

@pwolfram pwolfram changed the title Fixes OS error arising from too many files open WIP: Fixes OS error arising from too many files open Feb 2, 2017
@pwolfram
Contributor Author

pwolfram commented Feb 2, 2017

There are still a few more issues that need to be ironed out. I'll let you know when I've resolved them.

@vnoel
Contributor

vnoel commented Feb 3, 2017

I'm just chiming in to signify my interest in seeing this issue solved. I have just hit "OSError: Too many open files". The data itself is not even huge, but it's scattered across many files and it's a PITA to revert to manual concatenation -- I've grown used to dask doing the work for me ;-)

@shoyer
Member

shoyer commented Feb 4, 2017

@pwolfram this looks pretty close to me now -- let me know when it's ready for review.

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch from 1460a07 to 73b601d Compare February 5, 2017 05:19
@pwolfram
Contributor Author

pwolfram commented Feb 5, 2017

@shoyer, the pushed code represents my current progress. The initial PR had a bug -- essentially, a calculation couldn't be performed following the load. This fixes that bug and adds a test to ensure it doesn't recur. However, I'm having trouble with h5netcdf, which I'm not very familiar with compared to netCDF. I just need some more time (or even inspiration from you) to sort out this last key issue...

I'm getting the following error:

================================================================================================================== FAILURES ==================================================================================================================
___________________________________________________________________________________________ OpenMFDatasetTest.test_4_open_large_num_files_h5netcdf ___________________________________________________________________________________________

self = <xarray.tests.test_backends.OpenMFDatasetTest testMethod=test_4_open_large_num_files_h5netcdf>

    @requires_dask
    @requires_h5netcdf
    def test_4_open_large_num_files_h5netcdf(self):
>       self.validate_open_mfdataset_large_num_files(engine=['h5netcdf'])

xarray/tests/test_backends.py:1040: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
xarray/tests/test_backends.py:1018: in validate_open_mfdataset_large_num_files
    self.assertClose(ds.foo.sum().values, np.sum(randdata))
xarray/core/dataarray.py:400: in values
    return self.variable.values
xarray/core/variable.py:306: in values
    return _as_array_or_item(self._data)
xarray/core/variable.py:182: in _as_array_or_item
    data = np.asarray(data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/numpy/core/numeric.py:482: in asarray
    return array(a, dtype, copy=False, order=order)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:1025: in __array__
    x = self.compute()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:79: in compute
    return compute(self, **kwargs)[0]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/base.py:179: in compute
    results = get(dsk, keys, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:537: in get_sync
    raise_on_exception=True, **kwargs)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:500: in get_async
    fire_task()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:476: in fire_task
    callback=queue.put)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:525: in apply_sync
    res = func(*args, **kwds)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:268: in execute_task
    result = _execute_task(task, data)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:248: in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in _execute_task
    return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:245: in <listcomp>
    return [_execute_task(a, cache) for a in arg]
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/async.py:249: in _execute_task
    return func(*args2)
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/dask/array/core.py:52: in getarray
    c = a[b]
xarray/core/indexing.py:401: in __getitem__
    return type(self)(self.array[key])
xarray/core/indexing.py:376: in __getitem__
    return type(self)(self.array, self._updated_key(key))
xarray/core/indexing.py:354: in _updated_key
    for size, k in zip(self.array.shape, self.key):
xarray/core/indexing.py:364: in shape
    for size, k in zip(self.array.shape, self.key):
xarray/core/utils.py:414: in shape
    return self.array.shape
xarray/backends/netCDF4_.py:37: in __getattr__
    return getattr(self.datastore.ds.variables[self.var], attr)
../../anaconda/envs/test_env_xarray35/lib/python3.5/contextlib.py:66: in __exit__
    next(self.gen)
xarray/backends/h5netcdf_.py:105: in ensure_open
    self.close()
xarray/backends/h5netcdf_.py:190: in close
    _close_ds(self.ds)
xarray/backends/h5netcdf_.py:70: in _close_ds
    find_root(ds).close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5netcdf/core.py:458: in close
    self._h5file.close()
../../anaconda/envs/test_env_xarray35/lib/python3.5/site-packages/h5py/_hl/files.py:302: in close
    self.id.close()
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
    ???
h5py/h5f.pyx:282: in h5py.h5f.FileID.close (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/h5f.c:3905)
    ???
h5py/_objects.pyx:54: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2840)
    ???
h5py/_objects.pyx:55: in h5py._objects.with_phil.wrapper (/Users/travis/miniconda3/conda-bld/work/h5py-2.6.0/h5py/_objects.c:2798)
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: dictionary changed size during iteration

h5py/_objects.pyx:119: RuntimeError
============================================================================================ 1 failed, 1415 passed, 95 skipped in 116.54 seconds =============================================================================================
Exception ignored in: <function WeakValueDictionary.__init__.<locals>.remove at 0x10f16e598>
Traceback (most recent call last):
  File "/Users/pwolfram/anaconda/envs/test_env_xarray35/lib/python3.5/weakref.py", line 117, in remove
TypeError: 'NoneType' object is not callable

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch from 73b601d to 923473c Compare February 5, 2017 05:39
@shoyer
Member

shoyer commented Feb 5, 2017 via email

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch 3 times, most recently from 6c40f49 to 3ddf9c1 Compare March 22, 2017 19:25
@shoyer
Member

shoyer commented Mar 22, 2017 via email

@pwolfram
Contributor Author

@shoyer, if we generally cover test_backends for autoclose=True, then we should get the pickle testing for free:

xarray/tests/test_backends.py:181:    def test_pickle(self):
xarray/tests/test_backends.py:191:    def test_pickle_dataarray(self):
xarray/tests/test_backends.py:792:    def test_bytesio_pickle(self):

or was there some other test that is needed?

@shoyer
Member

shoyer commented Mar 22, 2017

@shoyer, if we generally cover test_backends for autoclose=True, then we should get the pickle testing for free

Agreed, that should do it.

@pwolfram
Contributor Author

@shoyer, that subclass-based approach you outlined worked (fixture parameters really don't work with classes, as far as I could tell). We now have more comprehensive, named testing. Note that there was one minor point, surfaced by the more rigorous testing, that required more explicit specification:
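
The pattern in question, sketched with a hypothetical base class whose tests all read self.autoclose, so that the subclass re-runs the entire inherited suite with autoclosing enabled:

    from unittest import TestCase  # stand-in for the project's TestCase

    class NetCDF4DataTest(TestCase):
        autoclose = False  # existing default behavior; tests read self.autoclose

    class NetCDF4DataTestAutocloseTrue(NetCDF4DataTest):
        autoclose = True   # every inherited test now runs with autoclose=True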

The scipy backend can handle objects like BytesIO that really aren't file handles and there doesn't appear to be a clean way to close these types of objects. So, at present I'm explicitly setting _autoclose=False if they are encountered in the datastore. If this needs to be changed, particularly since it doesn't affect existing behavior, I'd prefer this be resolved in a separate issue / PR if possible.

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch from beef5af to 8f2fb8c Compare March 23, 2017 17:04
@pwolfram
Contributor Author

@shoyer, I had a minor bug that is now removed. The last caveat is no longer applicable:

The scipy backend can handle objects like BytesIO that really aren't file handles and there doesn't appear to be a clean way to close these types of objects. So, at present I'm explicitly setting _autoclose=False if they are encountered in the datastore. If this needs to be changed, particularly since it doesn't affect existing behavior, I'd prefer this be resolved in a separate issue / PR if possible.

I'll let you know when tests pass and this is ready for your final review.

@pwolfram pwolfram force-pushed the fix_too_many_open_files branch 4 times, most recently from a531b10 to 8f2fb8c Compare March 23, 2017 17:16
@pwolfram
Contributor Author

@shoyer, this is ready for the final review now. Coveralls appears to have hung but other tests pass.

@pwolfram
Contributor Author

@shoyer, all tests (including coveralls) passed. Please let me know if you have additional concerns. If we could merge fairly soon (e.g., because of MPAS-Dev/MPAS-Analysis#151), I would really appreciate it.

from .netCDF4_ import (_nc4_group, _nc4_values_and_dtype,
_extract_nc4_variable_encoding, BaseNetCDF4Array)


class H5NetCDFFArrayWrapper(BaseNetCDF4Array):
Member

spelling: extra F

@@ -96,28 +110,38 @@ def __init__(self, filename_or_obj, mode='r', format=None, group=None,
raise ValueError('invalid format for scipy.io.netcdf backend: %r'
% format)

# if the string ends with .gz, then gunzip and open as netcdf file
if type(filename_or_obj) is str and filename_or_obj.endswith('.gz'):
Member

Use isinstance(filename_or_obj, basestring) instead of type(filename_or_obj) is str.

Also, move this logic into _open_scipy_netcdf -- otherwise it won't work to reopen a gzipped file.

version=version)
except TypeError as e:
# TODO: gzipped loading only works with NetCDF3 files.
if 'is not a valid NetCDF 3 file' in e.message:
Member

This should only be triggered when reading a gzipped file. Right now, it can be triggered whenever an invalid netCDF3 file is read, gzipped or not.

This should be easier to fix when you move gzip.open into this helper function (see my comment below).

Contributor Author

Thanks for pointing this out-- sorry for this sloppiness.
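
A sketch of how both review points might be folded into the helper. The helper's exact signature is assumed; the sketch uses str instead of the Python 2 basestring shim the review asks for, and str(e) instead of the diff's e.message, purely so it runs on Python 3:

    import gzip

    import scipy.io

    def _open_scipy_netcdf(filename, mode, mmap, version):
        # handle gzipped paths inside the opener so reopening a .gz file works
        if isinstance(filename, str) and filename.endswith('.gz'):
            try:
                return scipy.io.netcdf_file(gzip.open(filename), mode=mode,
                                            mmap=mmap, version=version)
            except TypeError as e:
                # TODO: gzipped loading only works with NetCDF3 files; only
                # translate the error here, where we know the file was gzipped
                if 'is not a valid NetCDF 3 file' in str(e):
                    raise ValueError('gzipped file loading only supports '
                                     'NetCDF 3 files.')
                raise
        return scipy.io.netcdf_file(filename, mode=mode, mmap=mmap,
                                    version=version)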

# autoclose = True

class OpenMFDatasetTest(TestCase):
autoclose = True
Member

I don't think you need this class variable, since this test class does not use inheritance.

#class H5NetCDFDataTestAutocloseTrue(H5NetCDFDataTest):
# autoclose = True

class OpenMFDatasetTest(TestCase):
Member

rename to something like OpenMFDatasetManyFilesTest (we have other tests of open_mfdataset)

@@ -1139,6 +1257,8 @@ def test_dataarray_compute(self):
self.assertTrue(computed._in_memory)
self.assertDataArrayAllClose(actual, computed)

class DaskTestAutocloseTrue(DaskTest):
autoclose=True
Member

Please run some sort of PEP8 check, e.g., git diff upstream/master | flake8 --diff. There should be spaces around the = sign here.

Contributor Author

I'm going to ignore PEP8 for xarray/core/pycompat.py because the code in there is essentially a copy / paste.

Member

sounds good


# check that calculation on opened datasets works properly
ds = open_mfdataset(tmpfiles, engine=readengine,
autoclose=self.autoclose)
Member

just set autoclose=True here.

# split into multiple sets of temp files
for ii in original.x.values:
(
original.isel(x=slice(ii, ii+1))
Member

I think this would be slightly more readable with an intermediate variable, or removing the line break on the line above.
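
A sketch of the suggested refactor with an intermediate variable (the to_netcdf call is assumed, since the excerpt above is truncated):

    # split into multiple sets of temp files
    for ii in original.x.values:
        subds = original.isel(x=slice(ii, ii + 1))
        subds.to_netcdf(tmpfiles[ii])  # hypothetical continuation of the loop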

Includes testing to demonstrate an OSError associated
with opening too many files as encountered
using open_mfdataset.

Fixed for the following backends:
 * netCDF4 backend
 * scipy backend
 * pynio backend

Open/close operations with h5netcdf appear to trigger an error
within the h5netcdf library itself, per correspondence with
@shoyer. There are thus still challenges with h5netcdf, so
support for it is currently disabled.

Note, by default `autoclose=False` for open_mfdataset so standard
behavior is unchanged unless `autoclose=True`.

This default favors standard xarray performance over unconditional
avoidance of the OSError associated with opening too many files via
open_mfdataset.
@pwolfram pwolfram force-pushed the fix_too_many_open_files branch from 8f2fb8c to 20c5c3b Compare March 23, 2017 18:50
@shoyer
Member

shoyer commented Mar 23, 2017

OK, in it goes!

@shoyer shoyer merged commit 371d034 into pydata:master Mar 23, 2017
@pwolfram
Contributor Author

Thanks a bunch @shoyer!
