Xarray crashes when too many files are opened #49

Closed
xylar opened this issue Nov 29, 2016 · 10 comments

Comments

@xylar
Collaborator

xylar commented Nov 29, 2016

While testing the ACME script on rhea, @milenaveneziani hit the xarray open_mfdataset error: 'too many open files'. This was while trying to open 100 years of monthly files. Not being able to display time series longer than 100 years is a big limitation on ACME analysis.
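
For reference, the failure mode looks roughly like the sketch below; the file pattern is hypothetical and just stands in for the monthly output files:

```python
import glob
import xarray as xr

# Hypothetical file pattern; 100 years of monthly output is ~1200 files.
files = sorted(glob.glob('timeSeriesStatsMonthly.*.nc'))

# Each file stays open in the underlying netCDF library, so once the
# per-process file-descriptor limit (often 1024) is exceeded this raises
# "OSError: [Errno 24] Too many open files".
ds = xr.open_mfdataset(files, concat_dim='Time')
```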

@xylar
Collaborator Author

xylar commented Nov 29, 2016

This issue is related to pydata/xarray#463 and is discussed in #48.

@xylar
Collaborator Author

xylar commented Nov 29, 2016

@milenaveneziani and @pwolfram, I wanted to move discussion of this issue out of #48, since it is not directly related to that PR.

@xylar
Collaborator Author

xylar commented Nov 29, 2016

Following up on #48 (comment)

@pwolfram wrote:

Thanks @kmpaul for following up with us. Ideally we would fix this issue upstream in xarray / dask, but the best path forward is not yet clear to me and I need to look into this further.

@kmpaul wrote:

One could argue that this needs to get fixed upstream of xarray / dask, too. Like in the models themselves...or anywhere the data is generated.

@pwolfram and @milenaveneziani, this is a fair point. Is there a reason we don't generate a large number of streams, each with a small number of variables but with the complete time series of those variables? This is what we often want for analysis. Some of our tools (e.g. the paraview extractor) would need to take a list of files, each with separate variables instead of (or perhaps in addition to) separate time stamps. But I don't see any reason not to consider moving in that direction.

I won't be at this week's ACME ice/ocean meeting but maybe this could be brought up for discussion.

@xylar
Collaborator Author

xylar commented Dec 7, 2016

@milenaveneziani and @pwolfram, has there been any further discussion of this at LANL? Is there a possibility of storing yearly output files instead of monthly, for example? This would be one of the easiest ways around the problem.

@milenaveneziani
Collaborator

@xylar: that is a possibility, although we have to be careful that we don't create output files that are too large. Should be OK with sea-ice output, but with mpas-o 3d variables it may be less desirable depending on the model resolution.
Another short-term fix I thought of for sea-ice is to plot subsets of data at a time: we will want to do this anyway because too many seasonal cycles will clutter the plots and render them useless.

@xylar
Collaborator Author

xylar commented Dec 7, 2016

@milenaveneziani,

Should be OK with sea-ice output, but with mpas-o 3d variables it may be less desirable depending on the model resolution.

A solution there might be to break streams into smaller sub-streams. I think this is rather easy to do -- I've created new streams for land-ice variables, for example. For analysis, it is far better to have many time indices and few variables in a given file than our current layout. (For writing files from the model, presumably the opposite is true.) As I understand it, this is the essence of @kmpaul's suggestion. In the absence of such an approach, we will presumably need a tool like PyReshaper, but that adds an extra step that would be nice to avoid.
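
As an illustration of that layout, here is a minimal sketch of the per-variable splitting that a tool like PyReshaper automates; the file pattern and dimension name are assumptions, and this pass still has to read all of the input files, so it only shows the target layout, not a way around the open-files limit:

```python
import xarray as xr

# Hypothetical input pattern and 'Time' dimension name; each output file
# holds the full time series of a single variable, roughly what PyReshaper
# produces from a set of time-slice files.
ds = xr.open_mfdataset('monthly.*.nc', concat_dim='Time')

for name in ds.data_vars:
    # Write one file per variable, keeping all time indices together.
    ds[name].to_dataset(name=name).to_netcdf('timeseries.{}.nc'.format(name))
```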

Another short-term fix I thought of for sea-ice is to plot subsets of data at a time: we will want to do this anyway because too many seasonal cycles will clutter the plots and render them useless.

This sounds like a reasonable alternative. That is where #48 will actually be important. It should also be possible to read 100 years of data, compute with it, and store it in a temporary array, then do the same with the next 100 years of data. But that'll require some non-trivial editing.
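
A rough sketch of that block-by-block approach, with `file_list` and `compute_timeseries` as hypothetical placeholders for the real inputs and analysis, might look like:

```python
import xarray as xr

# Process the run in blocks of files so only a limited number are open at
# once; 'file_list' and 'compute_timeseries' are hypothetical placeholders.
block_size = 100
results = []
for start in range(0, len(file_list), block_size):
    block = file_list[start:start + block_size]
    with xr.open_mfdataset(block, concat_dim='Time') as ds:
        # load() pulls the reduced result into memory before the files close
        results.append(compute_timeseries(ds).load())

timeseries = xr.concat(results, dim='Time')
```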

@pwolfram
Contributor

@xylar and @milenaveneziani pydata/xarray#1198 should fix this issue, provided that open_mfdataset is called with the argument autoclose=True. I would like to test this with our data prior to it being merged in xarray. Do either of you recall a good case I can use that exposed the OS error related to too many open files?
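
For reference, the usage being described would look roughly like this; the file pattern is again a hypothetical stand-in for the monthly output:

```python
import glob
import xarray as xr

files = sorted(glob.glob('timeSeriesStatsMonthly.*.nc'))  # hypothetical pattern

# With autoclose=True (added by pydata/xarray#1198), xarray closes each file
# after reading from it and reopens it on demand, so the process never holds
# more file descriptors than the OS limit allows.
ds = xr.open_mfdataset(files, concat_dim='Time', autoclose=True)
```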

@milenaveneziani
Collaborator

Thanks @pwolfram: this is great!
Yes, a great case is the ACME beta0 simulation. This is the full path on edison:
/scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/
This was run for 255 years, so you have plenty of files to open there. I was getting the error at ~100 files or more.

@milenaveneziani
Collaborator

Addressed by pydata/xarray#1198 and #151.

@milenaveneziani
Collaborator

We can always re-open if something else needs to be done after the new xarray release.
