Attributes from netCDF4 initialization retained #1038

Merged: 1 commit into pydata:master on Mar 31, 2017

Conversation

@pwolfram (Contributor) commented Oct 4, 2016

Ensures that attrs for `open_mfdataset` are now retained.

cc @shoyer

@shoyer (Member) commented Oct 5, 2016

Merge logic for attributes opens a whole big can of worms. I would probably just copy attributes from the first dataset (similar to what we do in concat), unless you want to overhaul the whole thing in a more comprehensive fashion.
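
For reference, `concat` already behaves this way: the result takes the first object's attributes. A minimal illustration, assuming a recent xarray:

```python
import xarray as xr

a = xr.Dataset({"x": ("t", [1, 2])}, attrs={"source": "file_a"})
b = xr.Dataset({"x": ("t", [3, 4])}, attrs={"source": "file_b"})

combined = xr.concat([a, b], dim="t")
print(combined.attrs)  # {'source': 'file_a'} -- the first dataset's attrs win
```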

@pwolfram (Contributor, Author) commented Oct 5, 2016

@shoyer, it sounds like data provenance is an outstanding long-term problem. I'm happy to just copy attributes from the first dataset, but I'm wondering what it would take to do this correctly, i.e., the "overhaul". Any information you have on this would be really helpful. At a minimum, we can do as you suggest to fix the missing attributes (#1037).

@pwolfram (Contributor, Author) commented Oct 5, 2016

@shoyer, I did some more digging and see some of the potential issues, because some of the concatenation / merging is done quasi-automatically, which reduces the number of objects that must be merged (e.g., https://github.com/pydata/xarray/blob/master/xarray/core/combine.py#L391). I'm assuming this is done for performance / simplicity. Is that true?

This is looking like a much larger piece of work the further I look into it, because the information has already been compressed by the time `merge` is called (i.e., `len(dict_like_objects)` is not necessarily equal to the number of input files: https://github.com/pydata/xarray/blob/master/xarray/core/merge.py#L531).

@shoyer (Member) commented Oct 5, 2016

> I did some more digging and see some of the potential issues, because some of the concatenation / merging is done quasi-automatically, which reduces the number of objects that must be merged (e.g., https://github.com/pydata/xarray/blob/master/xarray/core/combine.py#L391). I'm assuming this is done for performance / simplicity. Is that true?

We have two primitive combine operations, `concat` (same variables, different coordinate values) and `merge` (different variables, same coordinate values). `auto_combine` needs to do both in some order.
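
A toy illustration of the two primitives, assuming a recent xarray:

```python
import xarray as xr

# concat: the same variable split across different coordinate values
part1 = xr.Dataset({"x": ("t", [1, 2])}, coords={"t": [0, 1]})
part2 = xr.Dataset({"x": ("t", [3, 4])}, coords={"t": [2, 3]})
along_t = xr.concat([part1, part2], dim="t")  # x now has 4 values along t

# merge: different variables sharing the same coordinate values
temp = xr.Dataset({"temp": ("t", [10.0, 11.0])}, coords={"t": [0, 1]})
wind = xr.Dataset({"wind": ("t", [5.0, 6.0])}, coords={"t": [0, 1]})
both = xr.merge([temp, wind])  # one dataset with both variables
```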

You're right that the order of grouped is not deterministic (it uses a dict). Sorting by key for input into the list comprehension could fix that.

The comprehensive fix would be to pick a merge strategy for attributes, and apply it uniformly in each place where xarray merges variables or datasets (basically, in concat and all the merge variations). Possibly several merge strategies, with a keyword argument to switch between them.
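
One hypothetical shape for such pluggable strategies (`merge_attrs` and the strategy names here are illustrative, not xarray API):

```python
def merge_attrs(all_attrs, strategy="keep_first"):
    """Combine a list of attrs dicts according to a named strategy."""
    if not all_attrs:
        return {}
    if strategy == "keep_first":
        return dict(all_attrs[0])
    if strategy == "drop":
        return {}
    if strategy == "identical":
        first = all_attrs[0]
        if any(attrs != first for attrs in all_attrs[1:]):
            raise ValueError("attributes differ between inputs")
        return dict(first)
    raise ValueError("unknown strategy: %r" % strategy)
```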

@fmaussion (Member) commented:

AFAIC I'd be happy with a `combined.attrs = datasets[0].attrs` added before returning the combined dataset, which would already be better than the current situation...
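
Sketched in place, that suggestion amounts to something like this (a simplified outline of `open_mfdataset`, not the actual source):

```python
def open_mfdataset(paths, **kwargs):
    datasets = [open_dataset(p, **kwargs) for p in paths]
    combined = auto_combine(datasets)
    combined.attrs = datasets[0].attrs  # keep the first file's attributes
    return combined
```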

Do you have time to get back to this, @pwolfram?

@pwolfram (Contributor, Author) commented:

@fmaussion and @shoyer, I'd like to close this PR out if possible. I'm not 100% sure it is worthwhile to complete this in a general fashion, given the ambiguity in how best to handle the issue. My current take would be to go with whatever is simplest / cleanest, at least in the short term, which is @fmaussion's suggestion above. Does this work for you both?

@fmaussion (Member) commented:

Yes, that's good for me. I would mention it somewhere in the docstring, though.

@pwolfram (Contributor, Author) commented:

Note: I would say that `open_mfdataset` is no longer experimental, given its widespread use.

@pwolfram (Contributor, Author) commented:

Provided checks pass, this should be ready to merge, @fmaussion, unless @shoyer has any additional recommended changes.

@fmaussion (Member) commented:

> Note: I would say that `open_mfdataset` is no longer experimental, given its widespread use.

Yes, I also recently updated the IO docs in this respect and removed the "experimental" wording: http://xarray.pydata.org/en/latest/io.html#id6

@shoyer (Member) commented Mar 22, 2017

Yes, this works for me. Can you add a test case that covers this?
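
Such a test might look roughly like this sketch (the test name and `tmp_path` fixture are illustrative, not the PR's actual test; assumes a recent xarray where `combine="by_coords"` is available):

```python
import numpy as np
import xarray as xr

def test_open_mfdataset_keeps_first_attrs(tmp_path):
    # write two small netCDF files with different global attributes
    paths = [tmp_path / "f0.nc", tmp_path / "f1.nc"]
    for i, path in enumerate(paths):
        ds = xr.Dataset(
            {"x": ("t", np.arange(3) + 3 * i)},
            coords={"t": np.arange(3) + 3 * i},
        )
        ds.attrs["history"] = "created file %d" % i
        ds.to_netcdf(path)

    with xr.open_mfdataset([str(p) for p in paths], combine="by_coords") as actual:
        # global attributes should come from the first file only
        assert actual.attrs["history"] == "created file 0"
```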

@pwolfram force-pushed the mfdataset_attrs branch 2 times, most recently from 4d9fd6b to 76978d7 on March 24, 2017 16:54
@pwolfram (Contributor, Author) commented:

@shoyer, added a test as requested.

Commit message: Uses attributes from first file opened by `open_mfdataset` to populate ds.attrs.
@shoyer (Member) commented Mar 24, 2017

It looks like one of the new many-files tests is crashing:

xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_3_open_large_num_files_pynio /home/travis/build.sh: line 62: 1561 Segmentation fault (core dumped) py.test xarray --cov=xarray --cov-report term-missing --verbose

https://travis-ci.org/pydata/xarray/jobs/214722901

@fmaussion (Member) commented:

Yes, it also happened on this PR: #1328

@pwolfram (Contributor, Author) commented:

It happened here too... I just tried it out on my local machine via `conda env create -f ci/requirements-py27-cdat+pynio.yml` and wasn't able to get an error... do any of the crashes give anything more informative than a "Segmentation fault"?

@shoyer (Member) commented Mar 24, 2017

@pwolfram, if we're getting sporadic failures on Travis, it's probably better to skip the test by default. It's important for the test suite not to be flaky.

@pwolfram (Contributor, Author) commented:

@shoyer, should I do a quick "hot fix" and then try to sort out the problem?

@pwolfram (Contributor, Author) commented Mar 24, 2017

I'm continuing to take a look; my tests were not fully set up locally on this branch, and I'll see if I can reproduce the sporadic error on macOS.

@pwolfram (Contributor, Author) commented:

Still passing locally...

xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_1_autoclose_netcdf4 PASSED
xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_1_open_large_num_files_netcdf4 PASSED
xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_2_autoclose_scipy PASSED
xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_2_open_large_num_files_scipy PASSED
xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_3_autoclose_pynio PASSED
xarray/tests/test_backends.py::OpenMFDatasetManyFilesTest::test_3_open_large_num_files_pynio PASSED

Test passes even if I run it multiple times too.

@shoyer (Member) commented Mar 24, 2017

Travis is a shared environment that runs multiple tests concurrently. It's possible that we're running out of file handles due to other users, or even other variants of our same build.

@pwolfram (Contributor, Author) commented:

Is it possible that the test fails when more than one build runs simultaneously on the same node? Could you restart the other tests to verify (restarting at the same time, if possible)?

@shoyer (Member) commented Mar 24, 2017

Just restarted, let's see...

@pwolfram (Contributor, Author) commented:

It crashed in the same place... but when I restarted it via a force push earlier, it passed, which would imply we are running out of resources on Travis.

Maybe the thing to do is just to reset the open-file limit, as @rabernat suggested; that way it provides a factor of safety on Travis.
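
On POSIX systems that reset is a couple of lines with the standard-library `resource` module (the numbers here are illustrative):

```python
import resource

# raise the soft limit on open file descriptors, up to the hard limit
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))
```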

Thoughts on this idea, @shoyer and @fmaussion?

@pwolfram (Contributor, Author) commented:

See #1336 for a fix that disables these tests, which have been acting up because of resource issues.

@pwolfram (Contributor, Author) commented:

@shoyer, tests should be restarted following the merge of #1336, and then this PR should be ready to merge.

@shoyer (Member) commented Mar 31, 2017

OK, going to merge this anyway... the failing tests will be fixed by #1366.

@shoyer merged commit c0178b7 into pydata:master on Mar 31, 2017