This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Too many open files when opening large datasets #102

Closed
JanisGailis opened this issue Dec 13, 2016 · 8 comments

@JanisGailis
Member

When opening large datasets that consist of 1000+ files using the 'esa_cci_odp' data store, the software crashes with an OSError.

>>> from cate.core.ds import DATA_STORE_REGISTRY
>>> from cate.core.monitor import ConsoleMonitor
>>> import cate.ops as ops
>>> monitor = ConsoleMonitor()
>>> data_store = DATA_STORE_REGISTRY.get_data_store('esa_cci_odp')
>>> sst = ops.open_dataset('esacci.SST.day.L4.SSTDepth.multi-sensor.multi-platform.OSTIA.1-1.r1','2000-01-01','2002-12-31',sync=True, monitor=monitor)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ccitbx/Development/cate-core/cate/ops/io.py", line 53, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/core/ds.py", line 396, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/ds/esa_cci_odp.py", line 510, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/core/ds.py", line 413, in open_xarray_dataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 300, in open_mfdataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 300, in <listcomp>
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 210, in open_dataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 188, in __init__
  File "netCDF4/_netCDF4.pyx", line 1811, in netCDF4._netCDF4.Dataset.__init__ (netCDF4/_netCDF4.c:12262)
OSError: Too many open files

There seems to be an open xarray issue on this: pydata/xarray#463
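
For context, `OSError: Too many open files` is the operating system refusing to give the process another file descriptor (errno `EMFILE`). Below is a minimal sketch for checking the per-process limit that a 1000+ file dataset can run into. It assumes a Unix-like system, where Python's standard `resource` module is available; the helper name `describe_fd_limit` is only illustrative.

```python
import resource


def describe_fd_limit():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft, hard


soft, hard = describe_fd_limit()
print("soft limit:", soft, "hard limit:", hard)
```

On many Linux systems the soft limit defaults to 1024, which would be consistent with the failure showing up at 1000+ files. Raising the limit (for example with `ulimit -n` in the shell) only postpones the error; it does not change how many files are held open.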

@JanisGailis
Member Author

It seems that the file limit is global: I can open datasets consisting of X files separately, but not one after another in the same session.

>>> from cate.core.ds import DATA_STORE_REGISTRY
>>> from cate.core.monitor import ConsoleMonitor
>>> import cate.ops as ops
>>> monitor = ConsoleMonitor()
>>> data_store = DATA_STORE_REGISTRY.get_data_store('esa_cci_odp')
>>> sm = ops.open_dataset('esacci.SOILMOISTURE.day.L3S.SSMV.multi-sensor.multi-platform.COMBINED.02-2.r1','2000-01-01','2001-12-31', sync=True, monitor=monitor)
>>> sst = ops.open_dataset('esacci.SST.day.L4.SSTdepth.multi-sensor.multi-platform.OSTIA.1-1.r1','2000-01-01','2001-12-31', sync=True, monitor=monitor)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ccitbx/Development/cate-core/cate/ops/io.py", line 53, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/core/ds.py", line 396, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/ds/esa_cci_odp.py", line 510, in open_dataset
  File "/home/ccitbx/Development/cate-core/cate/core/ds.py", line 413, in open_xarray_dataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 300, in open_mfdataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 300, in <listcomp>
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/api.py", line 210, in open_dataset
  File "/home/ccitbx/miniconda3/envs/ect_env/lib/python3.5/site-packages/xarray-0.8.2-py3.5.egg/xarray/backends/netCDF4_.py", line 188, in __init__
  File "netCDF4/_netCDF4.pyx", line 1811, in netCDF4._netCDF4.Dataset.__init__ (netCDF4/_netCDF4.c:12262)
OSError: Too many open files
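
The behaviour above fits a per-process descriptor limit: handles held by the first dataset still count when the second one is opened. Here is a minimal stdlib-only sketch of that effect, assuming a Unix-like system. It temporarily lowers the soft limit to keep the demo small; the names `open_many`, `close_all` and `demo`, and the numbers 100 and 60, are illustrative and have nothing to do with cate or xarray internals.

```python
import errno
import os
import resource
import tempfile


def open_many(directory, prefix, count):
    """Open `count` files and keep them open, like a multi-file dataset would.

    Returns (handles, error); error is the OSError that stopped us, or None.
    """
    handles = []
    try:
        for i in range(count):
            path = os.path.join(directory, "%s_%d.tmp" % (prefix, i))
            handles.append(open(path, "w"))
    except OSError as exc:
        return handles, exc
    return handles, None


def close_all(handles):
    """Close every handle so its descriptor is returned to the process."""
    for handle in handles:
        handle.close()


def demo(limit=100, files_per_dataset=60):
    """Show that open handles from one 'dataset' count against the next one."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Temporarily lower the soft limit so the demo needs only a few files.
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    try:
        with tempfile.TemporaryDirectory() as tmp:
            # The first "dataset" fits under the limit on its own.
            first, first_error = open_many(tmp, "a", files_per_dataset)
            # The second one is opened while the first is still open.
            second, second_error = open_many(tmp, "b", files_per_dataset)
            close_all(first)
            close_all(second)
            # With everything closed, the second "dataset" fits on its own.
            alone, alone_error = open_many(tmp, "b", files_per_dataset)
            close_all(alone)
    finally:
        # Always restore the original limit.
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return first_error, second_error, alone_error


first_error, second_error, alone_error = demo()
print("first alone:", first_error)
print("second while first is open:", second_error)
print("second alone:", alone_error)
```

If the sketch behaves as intended, only the middle call reports an error, and its `errno` is `errno.EMFILE`.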

@JanisGailis
Member Author

See also #118

@pwolfram

@JanisGailis, can you please see if pydata/xarray#1198 fixes your problem above?

@JanisGailis
Member Author

@pwolfram Yes it does, great work! This is really good news for us.
The following test:

from cate.core.ds import DATA_STORE_REGISTRY
from cate.util.monitor import ConsoleMonitor
import cate.ops as ops
monitor = ConsoleMonitor()
sst = ops.open_dataset('esacci.SST.day.L4.SSTDepth.multi-sensor.multi-platform.OSTIA.1-1.r1','2000-01-01','2002-12-31',sync=True, monitor=monitor)
print(sst)

sm = ops.open_dataset('esacci.SOILMOISTURE.day.L3S.SSMV.multi-sensor.multi-platform.COMBINED.02-2.r1','2000-01-01','2001-12-31', sync=True, monitor=monitor)
sst = ops.open_dataset('esacci.SST.day.L4.SSTdepth.multi-sensor.multi-platform.OSTIA.1-1.r1','2000-01-01','2001-12-31', sync=True, monitor=monitor)
print(sm)
print(sst)

sm = ops.open_dataset('esacci.SOILMOISTURE.day.L3S.SSMV.multi-sensor.multi-platform.COMBINED.02-2.r1','2000-01-01','2003-12-31', sync=True, monitor=monitor)
print(sm)

yields:

<xarray.Dataset>
Dimensions:                 (bnds: 2, lat: 3600, lon: 7200, time: 1096)
Coordinates:
  * lat                     (lat) float32 -89.975 -89.925 -89.875 -89.825 ...
  * lon                     (lon) float32 -179.975 -179.925 -179.875 ...
  * time                    (time) datetime64[ns] 2000-01-01T12:00:00 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds               (time, bnds) datetime64[ns] 2000-01-01 ...
    lat_bnds                (time, lat, bnds) float32 -90.0 -89.95 -89.95 ...
    lon_bnds                (time, lon, bnds) float32 -180.0 -179.95 -179.95 ...
    analysed_sst            (time, lat, lon) float64 nan nan nan nan nan nan ...
    analysis_error          (time, lat, lon) float64 nan nan nan nan nan nan ...
    sea_ice_fraction        (time, lat, lon) float64 nan nan nan nan nan nan ...
    sea_ice_fraction_error  (time, lat, lon) float64 nan nan nan nan nan nan ...
    mask                    (time, lat, lon) float64 2.0 2.0 2.0 2.0 2.0 2.0 ...
<xarray.Dataset>
Dimensions:         (lat: 720, lon: 1440, time: 731)
Coordinates:
  * lon             (lon) float32 -179.875 -179.625 -179.375 -179.125 ...
  * lat             (lat) float32 89.875 89.625 89.375 89.125 88.875 88.625 ...
  * time            (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
    t0              (time, lat, lon) datetime64[ns] NaT NaT NaT NaT NaT NaT ...
    sm              (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    sm_uncertainty  (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    dnflag          (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    flag            (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    freqband        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    mode            (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    sensor          (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
<xarray.Dataset>
Dimensions:                 (bnds: 2, lat: 3600, lon: 7200, time: 731)
Coordinates:
  * lat                     (lat) float32 -89.975 -89.925 -89.875 -89.825 ...
  * lon                     (lon) float32 -179.975 -179.925 -179.875 ...
  * time                    (time) datetime64[ns] 2000-01-01T12:00:00 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds               (time, bnds) datetime64[ns] 2000-01-01 ...
    lat_bnds                (time, lat, bnds) float32 -90.0 -89.95 -89.95 ...
    lon_bnds                (time, lon, bnds) float32 -180.0 -179.95 -179.95 ...
    analysed_sst            (time, lat, lon) float64 nan nan nan nan nan nan ...
    analysis_error          (time, lat, lon) float64 nan nan nan nan nan nan ...
    sea_ice_fraction        (time, lat, lon) float64 nan nan nan nan nan nan ...
    sea_ice_fraction_error  (time, lat, lon) float64 nan nan nan nan nan nan ...
    mask                    (time, lat, lon) float64 2.0 2.0 2.0 2.0 2.0 2.0 ...
<xarray.Dataset>
Dimensions:         (lat: 720, lon: 1440, time: 1461)
Coordinates:
  * lon             (lon) float32 -179.875 -179.625 -179.375 -179.125 ...
  * lat             (lat) float32 89.875 89.625 89.375 89.125 88.875 88.625 ...
  * time            (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
    t0              (time, lat, lon) datetime64[ns] NaT NaT NaT NaT NaT NaT ...
    sm              (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    sm_uncertainty  (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    dnflag          (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    flag            (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    freqband        (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    mode            (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...
    sensor          (time, lat, lon) float64 nan nan nan nan nan nan nan nan ...

All in all it opened ~4k files. These are also pretty 'difficult' datasets: highly compressed (SST uncompresses to 1GB per file) and with a lot of CF decoding to do (NaN handling and so on). I should also mention that open_mfdataset from the master branch seemed faster to me than the one in xarray 0.9.1. That may be just an impression, though, as I didn't run any tests! When can we expect 0.9.2? :)

All in all, great job! Thanks a lot!
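
The thread does not spell out what pydata/xarray#1198 changes. One general way to stay under the descriptor limit, which may or may not match that PR's actual implementation, is to keep only the file path around and open the file just for the duration of each read. The sketch below shows that idea using only the standard library; `LazyFile` and `read_all` are hypothetical names, not xarray API.

```python
import os
import tempfile


class LazyFile:
    """Remembers a path but holds no open file descriptor between reads."""

    def __init__(self, path):
        self.path = path

    def read(self):
        # Open only for the duration of the read, then release the descriptor.
        with open(self.path, "rb") as handle:
            return handle.read()


def read_all(paths):
    """Read many files while holding at most one descriptor at a time."""
    lazy_files = [LazyFile(path) for path in paths]  # no files opened yet
    return [lazy.read() for lazy in lazy_files]


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2000):  # more files than a typical 1024 soft limit
        path = os.path.join(tmp, "part_%d.bin" % i)
        with open(path, "wb") as handle:
            handle.write(str(i).encode("ascii"))
        paths.append(path)
    chunks = read_all(paths)

print(len(chunks), "files read")
```

The trade-off of this approach is the cost of reopening a file on every access, so it is a sketch of the technique rather than a performance recommendation.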

@mzuehlke
Collaborator

Good news. Thanks @pwolfram for fixing this and @JanisGailis for testing.

@pwolfram

Thanks obviously go to @shoyer too, who provided clutch help! Can this issue be closed now, @mzuehlke and @JanisGailis?

@JanisGailis
Member Author

@pwolfram The xarray issue, sure!

This one I guess we'll close once we have bumped the xarray version!

@JanisGailis
Member Author

Fixed upstream
