
Add autoclose argument to xarray.open_mfdataset call #151

Merged: 3 commits merged into MPAS-Dev:develop from add_xarray_autoclose, Mar 24, 2017

Conversation

xylar (Collaborator) commented Mar 23, 2017

If a sufficiently large number of files is to be opened, the autoclose argument to xarray.open_mfdataset (if available) is set to True. This option should prevent crashes caused by having too many files open.

Also, a bit of PEP8 cleanup in the Nino3.4 index and sea ice time series task. (As usual, I couldn't just let it be...)
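
For readers unfamiliar with the mechanism, here is a minimal sketch of the idea. The wrapper name is hypothetical and is not the project's open_multifile_dataset, and note that xarray's autoclose argument was deprecated in later xarray releases once file handling became automatic:

import xarray

def open_many(fileNames, autocloseFileCount=1024, **kwargs):
    # Hypothetical wrapper: only request autoclose when the number of
    # files could plausibly exhaust the system's open-file limit.
    if (autocloseFileCount is not None and
            len(fileNames) > autocloseFileCount):
        kwargs['autoclose'] = True
    return xarray.open_mfdataset(fileNames, **kwargs)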

xylar (Author) commented Mar 23, 2017

@pwolfram, could you make sure this branch behaves as expected with more than autocloseFileCount files and the new branch of xarray you're trying to get merged? If it does, I think this merge should make us ready for the xarray update when it happens.

The easiest way to test would be to set autocloseFileCount to something small and make sure it really is closing the files (somehow).

@@ -22,8 +22,7 @@
 from ..shared.generalized_reader.generalized_reader \
     import open_multifile_dataset

-from ..shared.timekeeping.utility import get_simulation_start_time, \
-    date_to_days, days_to_datetime
+from ..shared.timekeeping.utility import get_simulation_start_time
xylar (Author): unused functions

@@ -80,7 +80,6 @@ def nino34_index(config, streamMap=None, variableMap=None):
     mainRunName = config.get('runs', 'mainRunName')
     plotsDirectory = buildConfigFullPath(config, 'output', 'plotsSubdirectory')

-    plotTitles = config.getExpression('regions', 'plotTitles')
xylar (Author): unused variable

-    start = days_to_datetime(np.amin(ds.Time.min()), calendar=calendar)
-    end = days_to_datetime(np.amax(ds.Time.max()), calendar=calendar)
+    f, spectra, conf99, conf95, redNoise = compute_nino34_spectra(nino34,
+                                                                  config)
xylar (Author): unused variables and long line

@@ -122,11 +120,12 @@ def nino34_index(config, streamMap=None, variableMap=None):
 def compute_nino34_index(regionSST, config):
     # {{{
     """
-    Computes nino34 index time series. It follow the standard nino34 algorithm, i.e.,
+    Computes nino34 index time series. It follow the standard nino34
+    algorithm, i.e.,
xylar (Author): long line

-      3) Performs a 5 month running mean over the anomalies
+    1. Compute monthly average SST in the region
+    2. Computes anomalous SST
+    3. Performs a 5 month running mean over the anomalies
xylar (Author): better output formatting
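
As an aside, the three steps above map naturally onto xarray operations. A rough sketch, assuming a datetime-like 'Time' coordinate (the project itself uses a calendar-aware compute_monthly_climatology instead):

import xarray as xr

def nino34_sketch(regionSST: xr.DataArray) -> xr.DataArray:
    # 1. monthly climatology of SST over the region
    climatology = regionSST.groupby('Time.month').mean('Time')
    # 2. anomalies relative to that climatology
    anomaly = regionSST.groupby('Time.month') - climatology
    # 3. 5-month centered running mean of the anomalies
    return anomaly.rolling(Time=5, center=True).mean()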

@@ -136,7 +135,8 @@ def compute_nino34_index(regionSST, config):
     regionSST : xarray.dataArray object
         values of SST in the nino region

-    config : instance of the MPAS configParser
+    config : MpasConfigParser object
+        the config options
xylar (Author): better comment

@@ -159,7 +159,8 @@ def compute_nino34_index(regionSST, config):
     calendar = namelist.get('config_calendar_type')

     # Compute monthly average and anomaly of climatology of SST
-    monthlyClimatology = climatology.compute_monthly_climatology(regionSST, calendar)
+    monthlyClimatology = climatology.compute_monthly_climatology(regionSST,
+                                                                 calendar)
     anomalySST = regionSST.groupby('month') - monthlyClimatology
xylar (Author): long line

-                                  variableName,
-                                  mainRunName)
+                                   variableName,
+                                   mainRunName)
xylar (Author): here and below, aligning indentation

                                                  variableName)
    figureNamePolar = '{}/{}.{}_polar.png'.format(plotsDirectory,
                                                  mainRunName,
                                                  variableName)
xylar (Author): aligning indentation and long lines

                                   titleNH,
                                   figureNamePolarNH,
                                   lineStyles=lineStyles,
                                   lineWidths=lineWidths,
                                   titleFontSize=titleFontSize,
                                   calendar=calendar)
-    timeseries_analysis_plot_polar(config, varsSH, movingAveragePoints,
+    timeseries_analysis_plot_polar(config, varsSH,
+                                   movingAveragePoints,
xylar (Author): long line

                                                  variableName)
    figureNamePolar = '{}/{}.{}_polar.png'.format(plotsDirectory,
                                                  mainRunName,
                                                  variableName)
xylar (Author): aligning indentation and long line

    # outside the allowed range
    kwargs['decode_times'] = False
    # files differ along the Time dimension
    kwargs['concat_dim'] = 'Time'
xylar (Author): Moved these here in case a user sets decode_times or concat_dim in the kwargs dictionary. I overwrite them without checking for existing values because the code really won't work if the user sets them to the wrong values. I don't think there needs to be a warning about this, because setting them explicitly is a bit weird anyway...

pwolfram (Contributor): So the warning would occur if the user tries to set them? I agree this would be a good warning (or error) to raise.
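
A sketch of the warn-and-override behavior being discussed; force_reader_kwargs is a hypothetical helper, and the code merged in this PR simply overwrites the values silently:

import warnings

def force_reader_kwargs(kwargs):
    # These values are required by the reader, so overwrite them,
    # warning if the caller tried to set something else.
    forced = {'decode_times': False, 'concat_dim': 'Time'}
    for key, value in forced.items():
        if key in kwargs and kwargs[key] != value:
            warnings.warn('ignoring {}={!r}; the reader requires {!r}'
                          .format(key, kwargs[key], value))
        kwargs[key] = value
    return kwargs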

    kwargs['concat_dim'] = 'Time'

    autoclose = ((autocloseFileCount is not None) and
                 len(fileNames) > autocloseFileCount)
xylar (Author), Mar 23, 2017: autoclose defaults to False because autocloseFileCount=None.

try:
    ds = xarray.open_mfdataset(fileNames,
                               preprocess=preprocess_partial,
                               autoclose=True, **kwargs)
xylar (Author): Note: I don't want to modify kwargs to include autoclose, because I will use kwargs later without autoclose if this call fails. If autoclose is already included in kwargs, this line will fail and the warning below will likely be printed, but then it will try again without explicitly including autoclose, so I think it will work out.
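
Putting together the pieces quoted in this thread, the fallback pattern looks roughly like this; a condensed reconstruction, not the exact merged code:

import xarray

def open_with_autoclose_fallback(fileNames, preprocess_partial, **kwargs):
    try:
        ds = xarray.open_mfdataset(fileNames,
                                   preprocess=preprocess_partial,
                                   autoclose=True, **kwargs)
        openedWithAutoclose = True
    except TypeError as e:
        if 'autoclose' not in str(e):
            raise  # some unrelated bad argument; re-raise it
        # this xarray version does not support autoclose; retry without it
        ds = xarray.open_mfdataset(fileNames,
                                   preprocess=preprocess_partial,
                                   **kwargs)
        openedWithAutoclose = False
    return ds, openedWithAutoclose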

pwolfram (Contributor): This is not intuitive. I would recommend handling all autoclose=True issues in this file and not allowing specification of this option outside it. There should be some options that the reader takes from the config files, and if the reader currently can't read the simulation config files, I would argue that is functionality we ultimately need, because there will likely be other options like autoclose that don't belong in individual analyses but are relevant to the process of reading netCDF files on a given platform.

xylar (Author):

Dude, @pwolfram, you really can be "judgy" about code. "This is not intuitive" is kind of harsh. You mean, I presume, "I would do this differently."

But okay, I will take out kwargs. I will instead pass in the config. This will (at some later point) allow us to support dask chunking, which seems like something we might already need for the MOC calculations, for example.

I will support autoclose internally. I can use the heuristic you are suggesting, if you give me some code for it...

    openedWithAutoclose = True
except TypeError as e:
    if 'autoclose' in str(e):
        # This indicates that xarray version doesn't support autoclose
xylar (Author): Check if autoclose is somewhere in the error message (so we don't print an unhelpful warning when other arguments besides autoclose are the problem).

pwolfram (Contributor): Note, we need to double check the final merged version of pydata/xarray#1198 to make sure this works. Please feel free to take a quick look if you would like to.

xylar (Author): I don't think this exception handling depends on the xarray PR in any way unless the name of the argument changes. If the argument is not supported (e.g., in the current version of xarray), you get a TypeError, which I have already verified is the case.
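
This behavior is easy to verify with any Python function: an unexpected keyword argument raises a TypeError whose message names the offending argument, so the substring check is reliable:

def open_files(fileNames, decode_times=True):
    return None

try:
    open_files(['a.nc'], autoclose=True)
except TypeError as e:
    # the message reads "open_files() got an unexpected keyword
    # argument 'autoclose'", so the substring check succeeds
    assert 'autoclose' in str(e)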

xylar added the priority label Mar 23, 2017
pwolfram (Contributor) left a review:

Why not just use autocloseFileCount internally to the reader and optimally compute it internally? I can't think of a clean reason why this should vary between analyses, although it should vary between platforms. I strongly recommend only including it in the generalized_reader.py file, to keep the signature of open_multifile_dataset clean in terms of analysis task usage. Otherwise, we change a lot of lines of code but don't really add much value, because we shouldn't change the usage of open_multifile_dataset between analyses, just between computational platforms. The guiding principle here is simpler code for the application layer.

Also, thanks for highlighting formatting changes explicitly in the absence of a dedicated commit related to formatting.


# This indicates that xarray version doesn't support autoclose
print 'Warning: open_multifile_dataset is trying to use autoclose=True but\n' \
      'it appears your xarray version doesn\'t support this argument. Will\n' \
      'try again without autoclose argument.'
pwolfram (Contributor): Nice!

config.default (Outdated)
@@ -47,6 +47,10 @@ seaIceStreamsFileName = streams.cice
 # names of ocean and sea ice meshes (e.g. EC60to30, QU240, RRS30to10, etc.)
 mpasMeshName = mesh

+# this is the maximum number of files, beyond which xarray will close files
+# after data has been read to prevent errors related to too many open files.
+autocloseFileCount = 1024
pwolfram (Contributor): I don't think we should necessarily specify this. It may be better to use something like ulimit -a | grep open | awk '{print $4}' and a heuristic to determine whether we need autoclose=True.

xylar (Author): Could you please propose code for this that I can add to open_multifile_dataset?

xylar (Author): It looks like this is the Python equivalent: https://docs.python.org/2/library/resource.html. So the code would be

import resource

resource.getrlimit(resource.RLIMIT_NOFILE)[0]

xylar (Author): What is the heuristic you would suggest? Half this number? This number minus a buffer?

pwolfram (Contributor): Please see the discussion regarding autocloseSF and the need to also consider the number of tasks that are running.

pwolfram (Contributor): The "Don't merge" flag corresponds to final issues being worked out for pydata/xarray#1198, which should be merged first.

xylar (Author) commented Mar 23, 2017

> Why not just use autocloseFileCount internally to the reader and optimally compute it internally? I can't think of a clean reason why this should vary between analyses, although it should vary between platforms. I strongly recommend only including it in the generalized_reader.py file, to keep the signature of open_multifile_dataset clean in terms of analysis task usage. Otherwise, we change a lot of lines of code but don't really add much value, because we shouldn't change the usage of open_multifile_dataset between analyses, just between computational platforms. The guiding principle here is simpler code for the application layer.

Okay, I could pass the config instead of the autocloseFileCount. It still requires changing every call to open_multifile_dataset, and since it's currently the only thing we would read from config, I debated how "smart" to try to make open_multifile_dataset. But I agree that it should try to handle as much as it can under the hood. Similarly, we might want it to handle chunking.

If I understand right, you are advocating not allowing kwargs in open_multifile_dataset, which is fine with me.

xylar (Author) commented Mar 23, 2017

> Also, thanks for highlighting formatting changes explicitly in the absence of a dedicated commit related to formatting.

It's nearly impossible to do dedicated PRs for these formatting changes because they are a constant battle. So this will be my compromise...

xylar force-pushed the add_xarray_autoclose branch from accfc29 to b135bc0 on March 23, 2017 17:15
xylar (Author) commented Mar 23, 2017

@pwolfram, sorry for being grumpy earlier. I guess you're used to it from me by now...

I updated the code, I think, to meet your requests. The heuristic I'm currently using to determine the maximum number of files is softLimit/2, but feel free to suggest another heuristic.

I removed kwargs from open_multifile_dataset so individual tasks don't get to play with that. I added config as an argument in anticipation of wanting to add things like chunking to open_multifile_dataset, though it's not clear to me yet how we would do that...
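
A sketch of that heuristic; the helper and the config section name are illustrative, not the merged implementation:

import resource

def max_allowed_files(config):
    # fraction of the soft limit on open files that xarray may use,
    # e.g. 0.5 for the softLimit/2 heuristic described above
    fraction = config.getfloat('input', 'autocloseFileLimitFraction')
    softLimit, _hardLimit = resource.getrlimit(resource.RLIMIT_NOFILE)
    return int(fraction * softLimit)

# autoclose = len(fileNames) > max_allowed_files(config)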

xylar force-pushed the add_xarray_autoclose branch from b135bc0 to 9d55ba1 on March 23, 2017 17:19
-## This file contains the default values of all possible configuration options
-## used to run analysis tasks. Do not modify options in this file direct.
+## This file contains the default values of all possible configuration options
+## used to run analysis tasks. Do not modify options in this file direct.
xylar (Author): Here and elsewhere in this file, it is just automatically deleting trailing whitespace. No other changes were made.

xylar (Author) commented Mar 23, 2017

Testing

I ran the full range of tasks on my laptop using both the QU240 and EC60to30 results. I also tested the QU240 test case with autocloseFileLimitFraction = 1e-3, which resulted in warnings about autoclose not being supported, as expected.

xylar force-pushed the add_xarray_autoclose branch from 8912d31 to d15e7c0 on March 23, 2017 20:43
# default, xarray attempts to open all files in a data set simultaneously.
# A new option allows files to be automatically closed as a data set is being
# read to prevent hitting this limit. Here, you can set what fraction of the
# system limit of open files an analysis task is allowed to use. Note: In the
Collaborator: Is there a typo somewhere on the line above? I do not understand the sentence starting with "Here, you can set...".

xylar (Author): Looks fine to me, but I'm happy to clarify. autocloseFileLimitFraction is multiplied by the system limit on the number of files that can be open. The result is the maximum number of files that we allow xarray to open before we use autoclose. Would the following be clearer? "Here, you can set the fraction of the system limit that xarray is allowed to use to open data sets at any given time." If not, please suggest an alternative wording.

Collaborator: Thanks for clarifying. Yes, the new wording is clearer.

try:
    ds = xarray.open_mfdataset(fileNames,
                               preprocess=preprocess_partial,
                               autoclose=autoclose, **kwargs)
pwolfram (Contributor), Mar 23, 2017: Please update the date stamp on this function (line 116 above).

pwolfram (Contributor) left a review: @xylar, if we could test loading a dataset and performing computations on it (e.g., a climatology) with autoclose=False and autoclose=True, that would be really good: it would give us early-warning notification via CI if something gets broken (e.g., if the name for autoclose were changed). Essentially, this protects this work against upstream breaks in the API.

pwolfram (Contributor): @xylar, a few more requests following another read-through of this code, especially related to adding enhanced testing to make sure both autoclose cases behave as expected. One idea would be to compute a simple climatology, varying autoclose, and make sure the computed result is identical.

xylar (Author) commented Mar 23, 2017

@pwolfram, I will agree to add some CI. Given how we set this up, it's going to be tricky to test autoclose=True and autoclose=False directly, since we don't allow access to them, unless you're suggesting not using open_multifile_dataset, which I am not in favor of.

I don't think we have an appropriate data set for CI testing that would be able to test reading from multiple files and something like performing a climatology. Even the lower-res data sets we have are too big for that. Creating fake data sets or stripping data out of existing files for CI is very tedious, but I'll do it if it's really a good use of my time...

xylar (Author) commented Mar 23, 2017

I suppose it could be a good way to test climatology as well as the generalized reader.

pwolfram (Contributor): I was thinking you'd have a function that you run to cache the result. You'd set autocloseFileLimitFraction=0 so that autoclose=True and cache the result, then set autocloseFileLimitFraction=1e10 (or some other large number) so that autoclose=False, rerun, and cache the second result. Then you would compare the two. Note, you could check that ._autoclose is correct for each dataset's datastore to make sure you are getting the expected results. The test would then essentially have three assert statements, although you could break it out into two tests that are used in a third composite test.
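
A sketch of that test; the fixture names, config section, and open_multifile_dataset signature are assumptions, and ._autoclose only exists on xarray versions that support the feature:

def test_autoclose(self):
    config = self.setup_config()  # hypothetical test fixture

    # force autoclose=True by allowing (effectively) zero open files
    config.set('input', 'autocloseFileLimitFraction', '0.0')
    dsAutoclose = open_multifile_dataset(self.fileNames, config=config)
    if hasattr(dsAutoclose, '_autoclose'):
        self.assertTrue(dsAutoclose._autoclose)

    # force autoclose=False with an absurdly large limit
    config.set('input', 'autocloseFileLimitFraction', '1e10')
    dsPlain = open_multifile_dataset(self.fileNames, config=config)
    if hasattr(dsPlain, '_autoclose'):
        self.assertFalse(dsPlain._autoclose)

    # both load paths must yield identical data
    self.assertTrue(dsAutoclose.identical(dsPlain))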

pwolfram (Contributor): Also, I don't think we need a ton of datasets. A single combined dataset should be sufficient to cover most of the cases, because we aren't interested in breaking the autoclose=False case, so fewer than 10 files would be fine. Three files is probably optimal.

xylar (Author) commented Mar 23, 2017

@pwolfram, so I'm working on a test. I'm not terribly keen on going through the tedium of installing the trunk of xarray directly from GitHub, because I ran into issues the last time I tried that. I have asserts for _autoclose if the data set has such an attribute, but obviously I can't check that this works correctly using the current version of xarray.

Once I have the test written and working with the current version of xarray, maybe you're in a position to make sure it will also work with the updated version?

xylar (Author) commented Mar 23, 2017

New test added.

pwolfram (Contributor) commented Mar 23, 2017

Testing

Tested via https://gist.github.com/5871100a798ab8963dffafd676d245f7 using xarray master to yield

time python run_analysis.py ../MPAS-Analysis/configs/edison/config.20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison                                                             

Plotting OHC time series...
Reading files /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0001-12-01.nc through /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0255-12-01.nc
  Read in depth and compute specific depth indexes...
  Load ocean data...
  Compute temperature anomalies...
  Load in OHC from preprocessed reference run...
  Compute OHC and make plots...

real    52m2.516s
user    44m49.656s
sys     2m57.699s

with autocloseFileLimitFraction = 0.5

and also with autocloseFileLimitFraction = 0.001, so that the open-file limit of 65536 on Edison is exceeded for this test case, requiring autoclose=True:

time python run_analysis.py ../MPAS-Analysis/configs/edison/config.20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison 

Plotting OHC time series...
Reading files /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0001-12-01.nc through /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0255-12-01.nc
  Read in depth and compute specific depth indexes...
  Load ocean data...
  Compute temperature anomalies...
  Load in OHC from preprocessed reference run...
  Compute OHC and make plots...

real    79m14.254s
user    65m51.151s
sys     13m10.397s

The important thing here is that both tests run without error and the runtimes are of similar order.

Using these runtimes to gauge performance is more challenging because of variable system loads, system file caching, etc., and is not recommended.

xylar added 2 commits March 24, 2017 00:33

Commit 1: We set autoclose to true if the number of files is too close to the maximum number of files that can be open at a given time. A new option, autocloseFileLimitFraction, has been added that determines what fraction of the system limit of open files is allowed. The default is 50%. Also, a bit of PEP8 clean up.

Commit 2: This test loads a 3-file data set both with and without autoclose. Also, added more time series files to the interpolate test, which are then used by generalized_reader as well.
xylar force-pushed the add_xarray_autoclose branch from 15068a6 to b775c24 on March 23, 2017 23:33
xylar (Author) commented Mar 23, 2017

I just rebased and fixed the time stamp on generalized_reader, as requested. This shouldn't affect any testing results, as all changes are only in comments and docstrings.

pwolfram (Contributor): @xylar, I just added a few fixes to the test to ensure that we are actually exercising both cases, autoclose=True and autoclose=False.

pwolfram (Contributor): Following the merge of xylar#1, this PR should be ready to merge.

pwolfram (Contributor): Last double check (done in ~1 hr) and then it is in, unless there are objections!

pwolfram merged commit d64016a into MPAS-Dev:develop Mar 24, 2017
pwolfram (Contributor): Thanks everyone!

xylar deleted the add_xarray_autoclose branch March 24, 2017 19:09