Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tweaks for opening datasets #895

Closed
wants to merge 2 commits into from
Closed

Conversation

tsupinie
Copy link

I tweaked the open_dataset() and open_mfdataset() functions for better performance with the PyNIO engine.

  1. I use a lot of HDF4 files with what I guess I'll call "malformed names," where the file name does not end with ".hdf". PyNIO is able to figure out the format of the file, but it prints an annoying warning. As far as I can tell, the only way to shut off the warning is to tell it the format of the file, which I've added a format option for.
  2. I added an option called only_variables which specifies which variables to load from the dataset in the event you don't want to load all variables. Say, for example, I have a dataset with 47 variables in it, but I only need 3 of them. If the data are not cached, then only loading the 3 I need cuts the I/O time in half. If they are cached, then loading only the 3 takes 20% of the time to load the full dataset. The only_variables option behaves pretty similarly to drop_variables. The default is to load all variables.

tsupinie added 2 commits July 1, 2016 20:10
When using the PyNIO engine, if you pass open_dataset() a file name that doesn't
end with the extension, it will complain. The complaint can't be shut off using
the built-in warnings filter. Passing PyNIO a format turns off the complaint.
Loading only the variables you'll need speeds up I/O for large
datasets.
@mathause
Copy link
Collaborator

I like (2) - I once needed this functionality and had to infer the drop_vars.

def get_drop_vars(variables, ds):

    if isinstance(variables, basestring):
        variables = [variables]
    else:
        variables = list(variables)

    # all variables
    drop_var = set(ds.variables.keys())

    # do not drop: coordinates, variables and time bounds
    keep_var = ds.coords.keys() + variables + ['time_bounds', 'time_bnds']

    drop_var.difference_update(keep_var)

    return drop_var`


@fmaussion
Copy link
Member

Maybe keep_variables would be a more appropriate name? ('keep' vs 'drop').

@@ -81,7 +81,8 @@ def check_name(name):
def open_dataset(filename_or_obj, group=None, decode_cf=True,
mask_and_scale=True, decode_times=True,
concat_characters=True, decode_coords=True, engine=None,
chunks=None, lock=None, drop_variables=None):
chunks=None, lock=None, drop_variables=None,
only_variables=None, format=''):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default should be format=None.

@shoyer
Copy link
Member

shoyer commented Jul 12, 2016

This looks like nice functionality!

The main thing it needs is tests. Let me know if you have any questions about how to go about writing those.

I also like the name keep_variables better than than only_variables.

@jhamman
Copy link
Member

jhamman commented Jul 13, 2017

@tsupinie - are you interested in finishing this up? Beyond a few changes requested by @shoyer, we just need some test coverage and a whatsnew entry.

@max-sixty
Copy link
Collaborator

Would anyone like to take this up? I was going to close as stale, but it's fairly close!

@mathause
Copy link
Collaborator

mathause commented Feb 1, 2019

Once #2732 is merged (#2380) the first issue from the list should be resolved. I'd still be interested to get No 2 in.Maybe it would be better to open a new PR?

@max-sixty
Copy link
Collaborator

Great, @mathause - all yours, given no response from OP

@mathause
Copy link
Collaborator

mathause commented Nov 8, 2020

I am closing this as stale. Half of this OR is implemented (#2732) the other half is #1754.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants