Xarray crashes when too many files are opened #49

Closed
xylar opened this issue Nov 29, 2016 · 10 comments

Comments

@xylar
Collaborator

xylar commented Nov 29, 2016

While testing the ACME script on rhea, @milenaveneziani hit the xarray open_mfdataset error: 'too many open files'. This was while trying to open 100 years of monthly files. Not being able to display time series longer than 100 years is a big limitation on ACME analysis.
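
For reference, the failure mode looks roughly like the sketch below; the file pattern is hypothetical and just stands in for the monthly output files:

```python
import glob
import xarray as xr

# Hypothetical file pattern; 100 years of monthly output is ~1200 files.
files = sorted(glob.glob('timeSeriesStatsMonthly.*.nc'))

# Each file stays open in the underlying netCDF library, so once the
# per-process file-descriptor limit (often 1024) is exceeded this raises
# "OSError: [Errno 24] Too many open files".
ds = xr.open_mfdataset(files, concat_dim='Time')
```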

@xylar
Collaborator Author

xylar commented Nov 29, 2016

This issue is related to pydata/xarray#463 and is discussed in #48.

@xylar
Collaborator Author

xylar commented Nov 29, 2016

@milenaveneziani and @pwolfram, I wanted to move discussion of this issue out of #48, since it is not directly related to that PR.

@xylar
Collaborator Author

xylar commented Nov 29, 2016

Following up on #48 (comment)

@pwolfram wrote:

Thanks @kmpaul for following up with us. Ideally we would fix this issue upstream in xarray / dask, but the best path forward is not yet clear to me and I need to look into this further.

@kmpaul wrote:

One could argue that this needs to get fixed upstream of xarray / dask, too. Like in the models themselves...or anywhere the data is generated.

@pwolfram and @milenaveneziani, this is a fair point. Is there a reason we don't generate a large number of streams, each with a small number of variables but with the complete time series of those variables? This is what we often want for analysis. Some of our tools (e.g. the paraview extractor) would need to take a list of files, each with separate variables instead of (or perhaps in addition to) separate time stamps. But I don't see any reason not to consider moving in that direction.

I won't be at this week's ACME ice/ocean meeting but maybe this could be brought up for discussion.

@xylar
Collaborator Author

xylar commented Dec 7, 2016

@milenaveneziani and @pwolfram, has there been any further discussion of this at LANL? Is there a possibility of storing yearly output files instead of monthly, for example? This would be one of the easiest ways around the problem.

@milenaveneziani
Collaborator

@xylar: that is a possibility, although we have to be careful that we don't create output files that are too large. Should be OK with sea-ice output, but with mpas-o 3d variables it may be less desirable depending on the model resolution.
Another short-term fix I thought of for sea-ice is to plot subsets of data at a time: we will want to do this anyway because too many seasonal cycles will clutter the plots and render them useless.

@xylar
Collaborator Author

xylar commented Dec 7, 2016

@milenaveneziani,

Should be OK with sea-ice output, but with mpas-o 3d variables it may be less desirable depending on the model resolution.

A solution there might be to break streams into smaller sub-streams. I think this is rather easy to do -- I've created new streams for land-ice variables, for example. For analysis, it is far better to have many time indices and few variables in a given file than our current layout. (For writing files from the model, presumably the opposite is true.) As I understand it, this is the essence of @kmpaul's suggestion. In the absence of such an approach, we will presumably need a tool like PyReshaper, but that adds an extra step that would be nice to avoid.
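
As an illustration of that layout, here is a minimal sketch of the per-variable splitting that a tool like PyReshaper automates; the file pattern and dimension name are assumptions, and this pass still has to read all of the input files, so it only shows the target layout, not a way around the open-files limit:

```python
import xarray as xr

# Hypothetical input pattern and 'Time' dimension name; each output file
# holds the full time series of a single variable, roughly what PyReshaper
# produces from a set of time-slice files.
ds = xr.open_mfdataset('monthly.*.nc', concat_dim='Time')

for name in ds.data_vars:
    # Write one file per variable, keeping all time indices together.
    ds[name].to_dataset(name=name).to_netcdf('timeseries.{}.nc'.format(name))
```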

Another short-term fix I thought of for sea-ice is to plot subsets of data at a time: we will want to do this anyway because too many seasonal cycles will clutter the plots and render them useless.

This sounds like a reasonable alternative. That is where #48 will actually be important. It should also be possible to read 100 years of data, compute with it, and store it in a temporary array, then do the same with the next 100 years of data. But that'll require some non-trivial editing.
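
A rough sketch of that block-by-block approach, with `file_list` and `compute_timeseries` as hypothetical placeholders for the real inputs and analysis, might look like:

```python
import xarray as xr

# Process the run in blocks of files so only a limited number are open at
# once; 'file_list' and 'compute_timeseries' are hypothetical placeholders.
block_size = 100
results = []
for start in range(0, len(file_list), block_size):
    block = file_list[start:start + block_size]
    with xr.open_mfdataset(block, concat_dim='Time') as ds:
        # load() pulls the reduced result into memory before the files close
        results.append(compute_timeseries(ds).load())

timeseries = xr.concat(results, dim='Time')
```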

@pwolfram
Contributor

@xylar and @milenaveneziani pydata/xarray#1198 should fix this issue, provided that open_mfdataset is called with the argument autoclose=True. I would like to test this with our data prior to it being merged in xarray. Do either of you recall a good case I can use that exposed the OS error related to too many open files?
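
For reference, the usage being described would look roughly like this; the file pattern is again a hypothetical stand-in for the monthly output:

```python
import glob
import xarray as xr

files = sorted(glob.glob('timeSeriesStatsMonthly.*.nc'))  # hypothetical pattern

# With autoclose=True (added by pydata/xarray#1198), xarray closes each file
# after reading from it and reopens it on demand, so the process never holds
# more file descriptors than the OS limit allows.
ds = xr.open_mfdataset(files, concat_dim='Time', autoclose=True)
```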

@milenaveneziani
Collaborator

Thanks @pwolfram: this is great!
Yes, a great case is the ACME beta0 simulation. This is the full path on edison:
/scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/
This was run for 255 years, so you have plenty of files to open there. I was getting the error at ~100 files or more.

@milenaveneziani
Collaborator

Addressed by pydata/xarray#1198 and #151.

@milenaveneziani
Collaborator

We can always re-open if something else needs to be done after the new xarray release.
