Add autoclose argument to xarray.open_mfdataset call #151
Conversation
@pwolfram, could you make sure this branch behaves as expected with more than […]? The easiest way to test would be to set […]
@@ -22,8 +22,7 @@
 from ..shared.generalized_reader.generalized_reader \
     import open_multifile_dataset

-from ..shared.timekeeping.utility import get_simulation_start_time, \
-    date_to_days, days_to_datetime
+from ..shared.timekeeping.utility import get_simulation_start_time
unused functions
@@ -80,7 +80,6 @@ def nino34_index(config, streamMap=None, variableMap=None):
     mainRunName = config.get('runs', 'mainRunName')
     plotsDirectory = buildConfigFullPath(config, 'output', 'plotsSubdirectory')

-    plotTitles = config.getExpression('regions', 'plotTitles')
unused variable
    start = days_to_datetime(np.amin(ds.Time.min()), calendar=calendar)
    end = days_to_datetime(np.amax(ds.Time.max()), calendar=calendar)
    f, spectra, conf99, conf95, redNoise = compute_nino34_spectra(nino34,
                                                                  config)
unused variables and long line
@@ -122,11 +120,12 @@ def nino34_index(config, streamMap=None, variableMap=None):
 def compute_nino34_index(regionSST, config):
     # {{{
     """
-    Computes nino34 index time series. It follow the standard nino34 algorithm, i.e.,
+    Computes nino34 index time series. It follow the standard nino34
+    algorithm, i.e.,
long line
-    3) Performs a 5 month running mean over the anomalies
+    1. Compute monthly average SST in the region
+    2. Computes anomalous SST
+    3. Performs a 5 month running mean over the anomalies
better output formatting
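As an aside for readers: the three steps in that docstring map naturally onto xarray operations. Here is a minimal sketch, assuming `regionSST` is an `xarray.DataArray` of monthly-mean SST with a decoded datetime coordinate named `Time` (the function name and details are illustrative, not the project's actual implementation):

```python
def nino34_index_sketch(regionSST):
    # 1. monthly climatology of the average SST in the Nino 3.4 region
    climatology = regionSST.groupby('Time.month').mean('Time')
    # 2. anomalous SST relative to that climatology
    anomaly = regionSST.groupby('Time.month') - climatology
    # 3. 5-month centered running mean over the anomalies
    return anomaly.rolling(Time=5, center=True).mean()
```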
@@ -136,7 +135,8 @@ def compute_nino34_index(regionSST, config):
     regionSST : xarray.dataArray object
         values of SST in the nino region

-    config : instance of the MPAS configParser
+    config : MpasConfigParser object
+        the config options
better comment
@@ -159,7 +159,8 @@ def compute_nino34_index(regionSST, config):
     calendar = namelist.get('config_calendar_type')

     # Compute monthly average and anomaly of climatology of SST
-    monthlyClimatology = climatology.compute_monthly_climatology(regionSST, calendar)
+    monthlyClimatology = climatology.compute_monthly_climatology(regionSST,
+                                                                 calendar)
     anomalySST = regionSST.groupby('month') - monthlyClimatology
long line
-    variableName,
-    mainRunName)
+    variableName,
+    mainRunName)
here and below, aligning indentation
                                                  variableName)
    figureNamePolar = '{}/{}.{}_polar.png'.format(plotsDirectory,
                                                  mainRunName,
                                                  variableName)
aligning indentation and long lines
                                    titleNH,
                                    figureNamePolarNH,
                                    lineStyles=lineStyles,
                                    lineWidths=lineWidths,
                                    titleFontSize=titleFontSize,
                                    calendar=calendar)
-    timeseries_analysis_plot_polar(config, varsSH, movingAveragePoints,
+    timeseries_analysis_plot_polar(config, varsSH,
+                                   movingAveragePoints,
long line
                                                  variableName)
    figureNamePolar = '{}/{}.{}_polar.png'.format(plotsDirectory,
                                                  mainRunName,
                                                  variableName)
aligning indentation and long line
    # outside the allowed range
    kwargs['decode_times'] = False
    # files differ along the Time dimension
    kwargs['concat_dim'] = 'Time'
Moved these here in case a user sets `decode_times` or `concat_dim` in the `kwargs` dictionary. I overwrite them without checking for existing values because the code really won't work if the user sets them to the wrong values. I don't think there needs to be a warning about this because setting them explicitly is a bit weird anyway...
So the warning would occur if the user tries to set them? I agree this would be a good warning (or error) to raise.
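For illustration, such a warning could look something like this (a sketch of the suggestion, not code from the PR; `kwargs` is the dictionary from the surrounding function):

```python
import warnings

# warn if the caller supplied values that the reader must override
for key, required in [('decode_times', False), ('concat_dim', 'Time')]:
    if key in kwargs and kwargs[key] != required:
        warnings.warn('Overriding {}={!r}: open_multifile_dataset '
                      'requires {}={!r}'.format(key, kwargs[key],
                                                key, required))
    kwargs[key] = required
```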
    kwargs['concat_dim'] = 'Time'

    autoclose = ((autocloseFileCount is not None) and
                 len(fileNames) > autocloseFileCount)
Defaults to `False` because `autocloseFileCount=None`.
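To make the short-circuit explicit (a toy illustration, not project code):

```python
# with the default autocloseFileCount=None, the `is not None` test is
# False and the `and` short-circuits, so autoclose is always False
autocloseFileCount = None
fileNames = ['file{}.nc'.format(i) for i in range(10000)]
autoclose = ((autocloseFileCount is not None) and
             len(fileNames) > autocloseFileCount)
assert autoclose is False
```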
    try:
        ds = xarray.open_mfdataset(fileNames,
                                   preprocess=preprocess_partial,
                                   autoclose=True, **kwargs)
Note: I don't want to modify `kwargs` to include `autoclose` because I will use it later without `autoclose` if this fails. If `autoclose` is included in `kwargs`, this line will fail and the warning below will likely be printed, but then it will try again below without explicitly including `autoclose`, so I think it will work out.
This is not intuitive. I would recommend handling all `autoclose=True` issues in this file and not allowing specification of this option outside this file. There should be some options that the reader takes from the config files, and if we currently don't have the option to read the simulation config files in the reader, I would argue this is functionality we ultimately need: there will likely be other options like `autoclose` that don't belong in individual analyses but are relevant to the process of reading netCDF files on given platforms.
Dude, @pwolfram, you really can be "judgy" about code. "This is not intuitive" is kind of harsh. You mean, I presume, "I would do this differently."
But okay, I will take out kwargs. I will instead pass in the config. This will (at some later point) allow us to support dask chunking, which seems like something we might already need for the MOC calculations, for example.
I will support `autoclose` internally. I can use the heuristic you are suggesting, if you give me some code for it...
        openedWithAutoclose = True
    except TypeError as e:
        if 'autoclose' in str(e):
            # This indicates that xarray version doesn't support autoclose
Check if `autoclose` is somewhere in the error message (so we don't print an unhelpful warning when other arguments besides `autoclose` are the problem).
Note, we need to double-check the final merged version of pydata/xarray#1198 to make sure this works. Please feel free to take a quick look if you would like to.
I don't think this exception handling is dependent on the xarray PR in any way unless you change the name of the argument. If the argument is not supported (e.g., in the current version of xarray), you get a `TypeError`, which I have already verified is the case.
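Assembled from the fragments above, the fallback pattern under discussion looks roughly like this (a sketch; `fileNames`, `preprocess_partial`, and `kwargs` come from the surrounding function, and the retry branch is paraphrased rather than copied from the PR):

```python
import xarray

try:
    ds = xarray.open_mfdataset(fileNames,
                               preprocess=preprocess_partial,
                               autoclose=True, **kwargs)
    openedWithAutoclose = True
except TypeError as e:
    if 'autoclose' not in str(e):
        # some other argument caused the error; don't mask it
        raise
    print('Warning: open_multifile_dataset is trying to use autoclose=True '
          'but your xarray version does not support this argument. Trying '
          'again without it.')
    ds = xarray.open_mfdataset(fileNames,
                               preprocess=preprocess_partial, **kwargs)
    openedWithAutoclose = False
```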
Why not just use `autocloseFileCount` internally to the reader and compute it optimally there? I can't think of a clean reason why this should vary between analyses, although it should vary between platforms. I strongly recommend only including it in the `generalized_reader.py` file to keep the signature of `open_multifile_dataset` clean in terms of analysis-task usage. Otherwise, we change a lot of lines of code but don't really add much value, because we shouldn't change the usage of `open_multifile_dataset` between analyses, just between computational platforms. The guiding principle here is simpler code for the application layer.
Also, thanks for highlighting formatting changes explicitly in the absence of a dedicated commit related to formatting.
            # This indicates that xarray version doesn't support autoclose
            print 'Warning: open_multifile_dataset is trying to use autoclose=True but\n' \
                  'it appears your xarray version doesn\'t support this argument. Will\n' \
                  'try again without autoclose argument.'
Nice!
config.default (outdated):
@@ -47,6 +47,10 @@ seaIceStreamsFileName = streams.cice
 # names of ocean and sea ice meshes (e.g. EC60to30, QU240, RRS30to10, etc.)
 mpasMeshName = mesh

+# this is the maximum number of files, beyond which xarray will close files
+# after data has been read to prevent errors related to too many open files.
+autocloseFileCount = 1024
I don't think we should specify this necessarily. It may be better to use something like `ulimit -a | grep open | awk '{print $4}'` and a heuristic to determine if we need `autoclose=True`.
Could you please propose code for this that I can add to `open_multifile_dataset`?
It looks like this is the Python equivalent: https://docs.python.org/2/library/resource.html. So the code would be `resource.getrlimit(resource.RLIMIT_NOFILE)[0]`.
What is the heuristic you would suggest? Half this number? This number minus a buffer?
Please see the discussion regarding `autocloseSF`; we also need to consider the number of tasks that are running.
The "Don't merge" flag corresponds to final issues being worked out for pydata/xarray#1198, which should be merged first.
Okay, I could pass the config instead of the `kwargs`. If I understand right, you are advocating not allowing `autoclose` to be specified outside the reader.
It's nearly impossible to do dedicated PRs for these formatting changes because they are a constant battle. So this will be my compromise...
Force-pushed from accfc29 to b135bc0.
@pwolfram, sorry for being grumpy earlier. I guess you're used to it from me by now... I updated the code, I think, to meet your requests. The heuristic I'm currently using to determine the maximum number of files is softLimit/2, but feel free to suggest another heuristic. I removed kwargs from `open_multifile_dataset`.
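In Python, that heuristic amounts to something like the following sketch (`fileNames` is assumed from the surrounding reader code):

```python
import resource

# soft limit on the number of files this process may have open at once
softLimit, hardLimit = resource.getrlimit(resource.RLIMIT_NOFILE)

# heuristic from the discussion above: let xarray use up to half the limit
autocloseFileCount = softLimit // 2
autoclose = len(fileNames) > autocloseFileCount
```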
Force-pushed from b135bc0 to 9d55ba1.
-## This file contains the default values of all possible configuration options
-## used to run analysis tasks. Do not modify options in this file direct.
+## This file contains the default values of all possible configuration options
+## used to run analysis tasks. Do not modify options in this file direct.
Here and elsewhere in this file, it is just automatically deleting trailing whitespace. No other changes were made.
Testing: I ran the full range of tasks on my laptop using both the QU240 and EC60to30 results. I also tested the QU240 test case with […]
Force-pushed from 8912d31 to d15e7c0.
# default, xarray attempts to open all files in a data set simultaneously.
# A new option allows files to be automatically closed as a data set is being
# read to prevent hitting this limit. Here, you can set what fraction of the
# system limit of open files an analysis task is allowed to use. Note: In the
Is there a typo somewhere on the line above? I do not understand the sentence starting with "Here, you can set...".
Looks fine to me, but I'm happy to clarify. autocloseFileLimitFraction is multiplied by the system limit on the number of files that can be opened. The result is the maximum number of files that we allow xarray to open before we use autoclose. Would the following be clearer? "Here, you can set the fraction of the system limit that xarray is allowed to use to open data sets at any given time." If not, please suggest an alternative wording.
Thanks for clarifying. Yes the new wording is clearer.
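Concretely, the option works out to something like this sketch (the config section name and accessor are assumptions for illustration; the project's actual config interface may differ):

```python
import resource

# hypothetical accessor and section name, for illustration only
fraction = config.getfloat('input', 'autocloseFileLimitFraction')  # e.g. 0.5

# system soft limit on simultaneously open files
softLimit = resource.getrlimit(resource.RLIMIT_NOFILE)[0]

# the maximum number of files xarray may open before autoclose kicks in
autocloseFileCount = int(fraction * softLimit)
```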
    try:
        ds = xarray.open_mfdataset(fileNames,
                                   preprocess=preprocess_partial,
                                   autoclose=autoclose, **kwargs)
Please update date stamp on this function (line 116 above).
@xylar, if we could test loading a dataset and performing computations on it (e.g., a climatology) with `autoclose=False` and `autoclose=True`, that would be really good: we would get early-warning notification via CI if something gets broken (e.g., if the name of `autoclose` were changed). Essentially, this protects this work against upstream breaks in the API.
@xylar, a few more requests following another read-through of this code, especially related to adding enhanced testing to make sure both autoclose cases behave as expected. One idea would be to compute a simple climatology, varying […]
@pwolfram, I will agree to add some CI. Given how we set this up, it's going to be tricky to test […]. I don't think we have an appropriate data set for CI testing that would be able to test reading from multiple files and something like performing a climatology. Even the lower-res data sets we have are too big for that. Creating fake data sets or stripping out data from existing files for CI is very tedious, but I'll do it if it's really a good use of my time...
I suppose it could be a good way to test […]
I was thinking you'd have a function that you run to cache the result. You'd change the config option for […]
Also, I don't think we need a ton of datasets. A single combined dataset should be sufficient to get most of the cases because we aren't interested in breaking the […]
@pwolfram, so I'm working on a test. I'm not terribly keen on going through the tedium of installing the trunk of xarray directly from github because I ran into issues the last time I tried that. I have asserts for […]. Once I have the test written and working with the current version of xarray, maybe you're in a position to make sure it will also work with the updated version?
New test added.
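The test itself isn't shown in this conversation, but at the xarray level the check might look like this sketch (file names are illustrative; the real test lives in the repository and exercises the project's reader directly):

```python
import xarray

def test_autoclose_equivalence(fileNames=('example.0001.nc',
                                          'example.0002.nc',
                                          'example.0003.nc')):
    # open the same 3-file data set with autoclose ...
    dsAuto = xarray.open_mfdataset(list(fileNames), concat_dim='Time',
                                   decode_times=False, autoclose=True)
    # ... and without it
    dsRef = xarray.open_mfdataset(list(fileNames), concat_dim='Time',
                                  decode_times=False)
    # the two datasets should be indistinguishable
    assert dsRef.identical(dsAuto)
```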
Testing: Tested via https://gist.github.com/5871100a798ab8963dffafd676d245f7 using xarray master to yield:

time python run_analysis.py ../MPAS-Analysis/configs/edison/config.20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison
Plotting OHC time series...
Reading files /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0001-12-01.nc through /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0255-12-01.nc
Read in depth and compute specific depth indexes...
Load ocean data...
Compute temperature anomalies...
Load in OHC from preprocessed reference run...
Compute OHC and make plots...

real    52m2.516s
user    44m49.656s
sys     2m57.699s

with […] and also with […]:

time python run_analysis.py ../MPAS-Analysis/configs/edison/config.20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison
Plotting OHC time series...
Reading files /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0001-12-01.nc through /scratch2/scratchdirs/golaz/ACME_simulations/20161117.beta0.A_WCYCL1850S.ne30_oEC_ICG.edison/run/mpaso.hist.am.timeSeriesStatsMonthly.0255-12-01.nc
Read in depth and compute specific depth indexes...
Load ocean data...
Compute temperature anomalies...
Load in OHC from preprocessed reference run...
Compute OHC and make plots...

real    79m14.254s
user    65m51.151s
sys     13m10.397s

The important thing here is that both tests run without error and the runtimes are of a similar order. Using these results to gauge performance is more challenging because of variable system loads, system file caching, etc., and is not recommended.
We set autoclose to true if the number of files is too close to the maximum number of files that can be open at a given time. A new option, autocloseFileLimitFraction, has been added that determines what fraction of the system limit of open files is allowed. The default is 50%. Also, a bit of PEP8 cleanup.
This test loads a 3-file data set both with and without autoclose. Also, added more time series files to the interpolate test, which are then used by generalized_reader as well.
Force-pushed from 15068a6 to b775c24.
I just rebased and fixed the time stamp on generalized_reader, as requested. Shouldn't affect any testing results, as all changes are only in comments and docstrings.
@xylar, just added a few fixes to the test to ensure that we are actually using both cases of […]
Following merge of xylar#1 this PR should be ready to be merged.
Last double check (done in ~1 hr) and then it is in, unless there are objections!
Thanks everyone!
If a sufficiently large number of files is to be opened, the `autoclose` argument to `xarray.open_mfdataset` (if available) is set to `True`. This option should prevent crashes with too many files open.

Also, a bit of PEP8 cleanup in the Nino3.4 index and sea ice time series task. (As usual, I couldn't just let it be...)