data_vars option added to open_mfdataset #1580

Merged Oct 10, 2017 (29 commits)

Changes from 19 commits

Commits
180cf58
add data_vars option to open_mfdataset
guziy Sep 19, 2017
6195fcd
use single quotes
guziy Sep 19, 2017
956fbeb
fix the 'line too long' warning from flake8
guziy Sep 19, 2017
e721620
document the data_vars keyword for open_mfdataset
guziy Sep 19, 2017
34b1004
improve the data_vars record in whats-new
guziy Sep 19, 2017
09d25c6
update my name in wats-new.rst
guziy Sep 19, 2017
e901a37
Start writing the test for the data_vars keyword
guziy Sep 19, 2017
3141ce4
use the data_vars keyword in combine
guziy Sep 20, 2017
8319aa7
address flake8 warnings for test_backend.py
guziy Sep 20, 2017
fdc940e
ignore flake8 warnings concerning whats-new.rst
guziy Sep 20, 2017
96e842e
fix function reference in whats-new.rst
guziy Sep 20, 2017
b033bec
open_mfdataset does not accept dim keyword argument
guziy Sep 20, 2017
b854ce4
use single quotes for strings in the added tests
guziy Sep 20, 2017
787a98b
refactor data_vars related tests
guziy Sep 20, 2017
4d3c685
Use with for opening mfdataset in data_vars related tests
guziy Sep 20, 2017
1823ba3
add @requires_scipy_or_netCDF4 to the data_vars test class
guziy Sep 20, 2017
b47e665
address flake8 warnings about long lines in the data_vars related tests.
guziy Sep 20, 2017
23f0fc6
close opened datasets in case of a ValueError in open_mfdataset, seem…
guziy Sep 20, 2017
05c8391
fix line too long warnings from flake8
guziy Sep 20, 2017
1f0e763
refactor tests and open_mfdataset, to address comments
guziy Sep 21, 2017
fadda83
refactor tests for data_vars keyword in open_mfdataset
guziy Sep 21, 2017
f80fe1f
refactor to address flake8 warnings
guziy Sep 21, 2017
14dee9d
add another example of data_vars usage in open_mfdataset
guziy Sep 21, 2017
f1f9d8b
add coords keyword to open_mfdataset
guziy Sep 21, 2017
f64c9e3
add a memory and performance related observations to the whats-new an…
guziy Sep 21, 2017
633eec3
fixed a grammar mistake
guziy Sep 21, 2017
086cf25
quote variable names referenced in the text
guziy Sep 21, 2017
b0ca228
add tests for coords keyword in the open_mfdataset, along with the si…
guziy Sep 22, 2017
e463e37
split a test into 2 to simplify, introduce context manager for settin…
guziy Sep 23, 2017
13 changes: 13 additions & 0 deletions doc/whats-new.rst
@@ -38,6 +38,19 @@ Backward Incompatible Changes

Enhancements
~~~~~~~~~~~~
- Support for data_vars keyword added to
:py:func:`~xarray.open_mfdataset`
(:issue:`438`):

.. ipython::
:verbatim:
# allows opening multiple files as
ds = xarray.open_mfdataset(paths, chunks={"time": 100}, data_vars="minimal")
# instead of
ds = xarray.concat([xarray.open_dataset(p, chunks={"time": 100}) for p in paths], data_vars="minimal", dim="time")
# when the files contain common coordinate variables that should not be concatenated (e.g. lon, lat)

By `Oleksandr Huziy <https://github.com/guziy>`_.

- Support for `pathlib.Path` objects added to
:py:func:`~xarray.open_dataset`, :py:func:`~xarray.open_mfdataset`,
34 changes: 27 additions & 7 deletions xarray/backends/api.py
@@ -431,7 +431,7 @@ def close(self):

def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
compat='no_conflicts', preprocess=None, engine=None,
lock=None, **kwargs):
lock=None, data_vars='all', **kwargs):
Member:
For completeness, would it also make sense to pass on the coords option at this time?

Contributor Author:
Thanks, @shoyer:

I have added the coords keyword in a similar manner as data_vars.

I'll probably have to add a test for it as well.

Cheers
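
For reference, a minimal sketch of how the two keywords could be combined once coords is passed through as well. Here paths is an assumed list of netCDF files sharing lon/lat coordinates; it is not defined anywhere in this PR:

import xarray as xr

# assumption: paths is a list of netCDF files with a common 'lon'/'lat' grid
ds = xr.open_mfdataset(paths, chunks={'time': 100},
                       data_vars='minimal', coords='minimal')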

"""Open multiple files as a single dataset.

Requires dask to be installed. Attributes from the first dataset file
@@ -487,6 +487,18 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
default, a per-variable lock is used when reading data from netCDF
files with the netcdf4 and h5netcdf engines to avoid issues with
concurrent access when using dask's multithreaded backend.
data_vars : {'minimal', 'different', 'all' or list of str}, optional
These data variables will be concatenated together:
* 'minimal': Only data variables in which the dimension already
appears are included.
* 'different': Data variables which are not equal (ignoring
attributes) across all datasets are also concatenated (as well as
all for which dimension already appears). Beware: this option may
load the data payload of data variables into memory if they are not
already loaded.
* 'all': All data variables will be concatenated.
* list of str: The listed data variables will be concatenated, in
addition to the 'minimal' data variables.
**kwargs : optional
Additional arguments passed on to :py:func:`xarray.open_dataset`.

@@ -516,12 +528,20 @@ def open_mfdataset(paths, chunks=None, concat_dim=_CONCAT_DIM_DEFAULT,
if preprocess is not None:
datasets = [preprocess(ds) for ds in datasets]

if concat_dim is _CONCAT_DIM_DEFAULT:
combined = auto_combine(datasets, compat=compat)
else:
combined = auto_combine(datasets, concat_dim=concat_dim, compat=compat)
combined._file_obj = _MultiFileCloser(file_objs)
combined.attrs = datasets[0].attrs
# close datasets in case of a ValueError
try:
if concat_dim is _CONCAT_DIM_DEFAULT:
combined = auto_combine(datasets, compat=compat,
data_vars=data_vars)
else:
combined = auto_combine(datasets, concat_dim=concat_dim,
compat=compat, data_vars=data_vars)
combined._file_obj = _MultiFileCloser(file_objs)
combined.attrs = datasets[0].attrs
except ValueError as ve:
Member:

Let's only wrap the lines where this could fail -- so this should be moved up two lines, before combined._file_obj is assigned.

Member:
You can just use except ValueError: here and a plain raise below.

for ds in datasets:
ds.close()
raise ve

return combined

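A sketch of the narrower error handling the reviewers ask for above, wrapping only the auto_combine call and re-raising with a bare raise. This illustrates the suggestion and is not the code as merged:

# sketch: wrap only the call that can raise, per the review comments
try:
    if concat_dim is _CONCAT_DIM_DEFAULT:
        combined = auto_combine(datasets, compat=compat,
                                data_vars=data_vars)
    else:
        combined = auto_combine(datasets, concat_dim=concat_dim,
                                compat=compat, data_vars=data_vars)
except ValueError:
    # close any datasets already opened before propagating the error
    for ds in datasets:
        ds.close()
    raise

combined._file_obj = _MultiFileCloser(file_objs)
combined.attrs = datasets[0].attrs
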
11 changes: 7 additions & 4 deletions xarray/core/combine.py
@@ -309,7 +309,7 @@ def _dataarray_concat(arrays, dim, data_vars, coords, compat,
return arrays[0]._from_temp_dataset(ds, name)


def _auto_concat(datasets, dim=None):
def _auto_concat(datasets, dim=None, data_vars='all'):
if len(datasets) == 1:
return datasets[0]
else:
@@ -331,15 +331,15 @@ def _auto_concat(datasets, dim=None):
'supply the ``concat_dim`` argument '
'explicitly')
dim, = concat_dims
return concat(datasets, dim=dim)
return concat(datasets, dim=dim, data_vars=data_vars)


_CONCAT_DIM_DEFAULT = '__infer_concat_dim__'


def auto_combine(datasets,
concat_dim=_CONCAT_DIM_DEFAULT,
compat='no_conflicts'):
compat='no_conflicts', data_vars='all'):
"""Attempt to auto-magically combine the given datasets into one.

This method attempts to combine a list of datasets into a single entity by
@@ -380,6 +380,8 @@ def auto_combine(datasets,
- 'no_conflicts': only values which are not null in both datasets
must be equal. The returned dataset then contains the combination
of all non-null values.
data_vars : {'minimal', 'different', 'all' or list of str}, optional
Details in the documentation of xarray.concat

Returns
-------
@@ -395,7 +397,8 @@ def auto_combine(datasets,
dim = None if concat_dim is _CONCAT_DIM_DEFAULT else concat_dim
grouped = itertoolz.groupby(lambda ds: tuple(sorted(ds.data_vars)),
datasets).values()
concatenated = [_auto_concat(ds, dim=dim) for ds in grouped]
concatenated = [_auto_concat(ds, dim=dim, data_vars=data_vars)
for ds in grouped]
else:
concatenated = datasets
merged = merge(concatenated, compat=compat)
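For illustration, a small self-contained sketch (made-up data mirroring the tests below) of what the data_vars pass-through changes: with 'all' a variable shared identically by every dataset gains the concatenation dimension, while with 'minimal' it keeps its original shape:

import numpy as np
import xarray as xr

x = np.arange(10)
ds1 = xr.Dataset({'v1': (('t', 'x'), np.random.randn(10, 10)),
                  'lon': ('x', 2 * x)},
                 coords={'t': np.arange(10), 'x': x})
ds2 = xr.Dataset({'v1': (('t', 'x'), np.random.randn(10, 10)),
                  'lon': ('x', 2 * x)},
                 coords={'t': np.arange(10, 20), 'x': x})

# data_vars='all' (the default): 'lon' is concatenated along 't' as well
print(xr.concat([ds1, ds2], dim='t', data_vars='all')['lon'].shape)      # (20, 10)

# data_vars='minimal': 'lon' is left alone because it has no 't' dimension
print(xr.concat([ds1, ds2], dim='t', data_vars='minimal')['lon'].shape)  # (10,)
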
127 changes: 127 additions & 0 deletions xarray/tests/test_backends.py
@@ -1267,6 +1267,133 @@ def test_4_open_large_num_files_h5netcdf(self):
self.validate_open_mfdataset_large_num_files(engine=['h5netcdf'])


@requires_scipy_or_netCDF4
class OpenMFDatasetDataVarsKWTest(TestCase):
coord_name = 'lon'
var_name = 'v1'

def gen_datasets_with_common_coord_and_time(self):
# create coordinate data
nx = 10
nt = 10
x = np.arange(nx)
t1 = np.arange(nt)
t2 = np.arange(nt, 2 * nt, 1)

v1 = np.random.randn(nt, nx)
v2 = np.random.randn(nt, nx)

ds1 = Dataset(data_vars={self.var_name: (['t', 'x'], v1),
self.coord_name: ('x', 2 * x)},
coords={
't': (['t', ], t1),
'x': (['x', ], x)
})

ds2 = Dataset(data_vars={self.var_name: (['t', 'x'], v2),
self.coord_name: ('x', 2 * x)},
coords={
't': (['t', ], t2),
'x': (['x', ], x)
})

return ds1, ds2

def test_open_mfdataset_does_same_as_concat(self):
with create_tmp_file() as tmpfile1:
with create_tmp_file() as tmpfile2:
ds1, ds2 = self.gen_datasets_with_common_coord_and_time()

# save data to the temporary files
ds1.to_netcdf(tmpfile1)
ds2.to_netcdf(tmpfile2)

files = [tmpfile1, tmpfile2]
Member:
Put this shared logic in a context manager?

e.g.,

@contextlib.contextmanager
def setup_files(self):
    with create_tmp_file() as tmpfile1:
        with create_tmp_file() as tmpfile2:
            ds1, ds2 = self.gen_datasets_with_common_coord_and_time()

            # save data to the temporary files
            ds1.to_netcdf(tmpfile1)
            ds2.to_netcdf(tmpfile2)

            yield [tmpfile1, tmpfile2]

def test_open_mfdataset_does_same_as_concat(self):
    with self.setup_files() as files:
        ...

setUp/tearDown methods would also work, with ExitStack.enter_context() and .close().

Contributor Author:
Thanks @shoyer:

I like the contextmanager trick a lot. I did feel like there should be a better way to set up tests. Actually, I have never used it before.

Cheers

for opt in ['all', 'minimal']:
with open_mfdataset(files, data_vars=opt) as ds:
kwargs = dict(data_vars=opt, dim='t')
ds_expect = xr.concat([ds1, ds2], **kwargs)

data = ds[self.var_name][:]
data_expect = ds_expect[self.var_name][:]

coord = ds[self.coord_name][:]
coord_expect = ds_expect[self.coord_name][:]

self.assertArrayEqual(data, data_expect)
Member:
Can you make use of self.assertDatasetIdentical() to shorten up these tests a bit?

self.assertArrayEqual(coord, coord_expect)
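
If assertDatasetIdentical is used as suggested, the comparison block above could shrink to something like this (a sketch, assuming the combined datasets compare identical; not the merged code):

for opt in ['all', 'minimal']:
    with open_mfdataset(files, data_vars=opt) as ds:
        ds_expect = xr.concat([ds1, ds2], data_vars=opt, dim='t')
        # compare whole datasets instead of individual arrays
        self.assertDatasetIdentical(ds, ds_expect)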

def test_common_coord_dims_should_change_when_datavars_all(self):
with create_tmp_file() as tmpfile1:
with create_tmp_file() as tmpfile2:
ds1, ds2 = self.gen_datasets_with_common_coord_and_time()

# save data to the temporary files
ds1.to_netcdf(tmpfile1)
ds2.to_netcdf(tmpfile2)

files = [tmpfile1, tmpfile2]
# open the files with the default data_vars='all'
with open_mfdataset(files, data_vars='all') as ds:

coord_shape = ds[self.coord_name].shape
coord_shape1 = ds1[self.coord_name].shape
coord_shape2 = ds2[self.coord_name].shape

var_shape = ds[self.var_name].shape
var_shape1 = ds1[self.var_name].shape
var_shape2 = ds2[self.var_name].shape

self.assertNotEqual(coord_shape1, coord_shape)
self.assertNotEqual(coord_shape2, coord_shape)

self.assertEqual(var_shape[0],
var_shape1[0] + var_shape2[0])

self.assertEqual(var_shape, coord_shape)

def test_common_coord_dims_should_not_change_when_datavars_minimal(self):
Member:
This looks very similar to the last test -- can you maybe consolidate it?

Or you could even potentially drop some of these tests. We have unit tests for concat and open_mfdataset already, so the main thing we need to verify is that the keyword argument gets properly passed on. We don't need to check here that every possible way to use it is handled correctly.

with create_tmp_file() as tmpfile1:
with create_tmp_file() as tmpfile2:
ds1, ds2 = self.gen_datasets_with_common_coord_and_time()

# save data to the temporary files
ds1.to_netcdf(tmpfile1)
ds2.to_netcdf(tmpfile2)

files = [tmpfile1, tmpfile2]
# open the files with data_vars='minimal'
with open_mfdataset(files, data_vars='minimal') as ds:

coord_shape = ds[self.coord_name].shape
coord_shape1 = ds1[self.coord_name].shape
coord_shape2 = ds2[self.coord_name].shape

var_shape = ds[self.var_name].shape
var_shape1 = ds1[self.var_name].shape
var_shape2 = ds2[self.var_name].shape

self.assertEqual(coord_shape1, coord_shape)

self.assertEqual(coord_shape2, coord_shape)
self.assertEqual(var_shape[0],
var_shape1[0] + var_shape2[0])
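
Following the consolidation suggestion above, the two shape tests could share one helper, roughly like this (a sketch only; setup_files is the context manager proposed earlier in the review):

def check_common_coord_shape(self, opt, coord_is_concatenated):
    # sketch: shared body for the data_vars='all' / 'minimal' shape checks
    with self.setup_files() as files:
        with open_mfdataset(files, data_vars=opt) as ds:
            coord_shape = ds[self.coord_name].shape
            var_shape = ds[self.var_name].shape
            if coord_is_concatenated:
                # 'all': the common coordinate gains the concat dimension
                self.assertEqual(var_shape, coord_shape)
            else:
                # 'minimal': the common coordinate keeps its original 1-D shape
                self.assertEqual(len(coord_shape), 1)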

def test_invalid_data_vars_value_should_fail(self):
with self.assertRaises(ValueError):
Member:
move this to only go around the line where you expect the error

with create_tmp_file() as tmpfile1:
with create_tmp_file() as tmpfile2:
ds1, ds2 = self.gen_datasets_with_common_coord_and_time()

# save data to the temporary files
ds1.to_netcdf(tmpfile1)
ds2.to_netcdf(tmpfile2)

files = [tmpfile1, tmpfile2]
with open_mfdataset(files, data_vars='minimum'):
pass
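
A sketch of the tighter scoping asked for in the review above, with assertRaises placed only around the call that is expected to fail (again assuming the setup_files helper from the earlier suggestion):

def test_invalid_data_vars_value_should_fail(self):
    with self.setup_files() as files:
        with self.assertRaises(ValueError):
            # only the open_mfdataset call is expected to raise
            with open_mfdataset(files, data_vars='minimum'):
                pass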


@requires_dask
@requires_scipy
@requires_netCDF4