Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert pandas dataframes to netCDF DSG files #1074

Closed
wants to merge 2 commits into from

Conversation

zbruick
Copy link
Contributor

@zbruick zbruick commented Jun 27, 2019

New module in metpy.io to convert DataFrames to netCDF files, with intention to make them CF-compliant DSG files if enough user input is given. This was tested against METAR DataFrames parsed by @mgrover1, but can be used for time series, profiles, and trajectories. I'm open to suggestions on revisions from anyone, as this is a first attempt at writing these files. I also wrote it as a class - not sure if that is correct form. Again, first time doing that as well.

Note: currently, this only supports writing the files, although the intention is to make them appendable. However, through Xarray, this is not supported currently (see: pydata/xarray#1672).

@zbruick zbruick requested a review from dopplershift as a code owner June 27, 2019 18:19
@jthielen
Copy link
Collaborator

jthielen commented Jun 28, 2019

A couple of questions to ask related to this, if that is okay:

  • Would it ever be useful to have access to the in-memory DSG Dataset without writing to it NetCDF? If so, it could make sense to separate this into two steps (pandas DataFrame -> DSG xarray Dataset, then DSG Dataset -> DSG NetCDF file).
  • Are there plans for the inverse (DSG Dataset -> pandas DataFrame)?

Also, I think it would make more sense to just be a function, rather than a class with only an __init__ method.

@zbruick
Copy link
Contributor Author

zbruick commented Jun 28, 2019

Appreciate all the questions and input! I don't know if anyone would want the DSG Dataset, but that would be a pretty easy change. I would guess for things like METAR/Station plotting, they would just stick with the DataFrame, but that's not a guarantee. If we go down that route (or if we add the inverse), then it would make more sense to keep the class and move everything into functions within it, correct? Otherwise, I agree that switching it just to a function is the better way to go. I'll wait to see what @dopplershift's thoughts are on this before proceeding.

Copy link
Member

@dopplershift dopplershift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've only done a rough review to this point, so there may be some other things lurking.

I agree that this should be a function. We really only need a class if there's some kind of state that needs passing around together to multiple functions, and that needs to be kept in sync together. I'm not seeing that here.

As far as multiple functions go, I'm split:

  • On one hand, functions should do one thing, and if there are two logical steps, those could be two functions
  • On the other hand, that makes the common case require users to put together two things rather than just know and use one. Also, that means we have to support, "forever", the two things.

I lean towards a single function until we identify such a use case for needing individual parts. Can anyone identify a use case for wanting the in-memory (xarray) for the DSG? I can maybe see that for testing or something, but I'm not sure if that's enough...

metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
@jthielen
Copy link
Collaborator

jthielen commented Jul 4, 2019

The two use cases I initially had in mind were

  • adding additional attributes before writing to netCDF
  • using a zarr store instead of netCDF (which seems to handle appending easier)

But, on second thought, I'm not sure if having a "CF-compliant DSG zarr store" is really a needed use case. With that, I wouldn't think the extra attributes case is worth the extra complications to the most common use case, so unless someone has another practical use case, I'd lean toward the single function as well.

(Regardless, I still think it would be great to have an inverse function at some point, especially since it was sounding like pandas DataFrames would be MetPy's data model of choice for point data.)

@zbruick
Copy link
Contributor Author

zbruick commented Jul 5, 2019

From the comments above, I'll move forward with one function for now and we can revisit this if/when we identify other use cases that justify the separation of steps.

@zbruick
Copy link
Contributor Author

zbruick commented Jul 17, 2019

Updated the function based on the review, lint, and reducing complexity (hopefully) by moving metadata assignment to a private function.

@zbruick zbruick force-pushed the pandas_netcdf branch 2 times, most recently from 37de73c to 5ea96d6 Compare July 19, 2019 18:51
Copy link
Member

@dopplershift dopplershift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found just a few minor things.

metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
metpy/io/__init__.py Outdated Show resolved Hide resolved
metpy/io/__init__.py Outdated Show resolved Hide resolved
Copy link
Member

@dopplershift dopplershift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. This started as finding a typo, then I had a question about adding to staticdata...

metpy/io/__init__.py Outdated Show resolved Hide resolved
metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
metpy/io/tests/test_pandas_to_netcdf.py Outdated Show resolved Hide resolved
@zbruick
Copy link
Contributor Author

zbruick commented Aug 8, 2019

If there are ideas on how to get the remaining uncovered diffs (two error messages) to be tested, let me know. I haven't come up with anything, and that's the only remaining issue here.

Copy link
Member

@dopplershift dopplershift left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't noticed the multiple unlimited dimensions before, which I don't think is actually CF-compliant--though that's subtle and does not seem to be well-documented.

metpy/io/pandas_to_netcdf.py Outdated Show resolved Hide resolved
path_to_save = str(path_to_save)

if check_netcdf4 is not None:
unlimited_dimensions = ['samples', 'observations']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I've always been told CF doesn't support multiple unlimited dimensions--though I can't find that in the spec anywhere, only many references to "the unlimited dimension". @ethanrd @lesserwhirls can you weigh in here?

I'm thinking we may need to (optionally) support the "ragged array" options that are described in the CF spec to collapse the observation and sample dimensions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The single unlimited dimension is a netCDF Classic Model limitation:

An unlimited dimension has a length that can be expanded at any time, as more data are written to it. NetCDF files can contain at most one unlimited dimension.

but not for the netCDF Enhanced Data Model (netCDF-4):

The Enhanced Data Model supports the classic model in a completely backward-compatible way, while allowing access to new features such as groups, multiple unlimited dimensions, and new types, including user-defined types.

See https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html

Now, as far as I understand, CF does not explicitly say which data model is follows (classic vs enhanced), although classic is implied in several places, such as the discussion on supported data types, examples that only use one unlimited dimension, lack of any mention of the use groups, etc. That does not, however, mean CF only applies to the Classic Data Model. See cf-convention/cf-conventions#191.

So, are multiple unlimited dimensions supported by CF?

dunno

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad I wasn't the only one confused by the CF specs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this back to one unlimited dimension only if desired. My thought for two dimensions is if we get this to appending mode in the future, we would need to append along the "station" and "time" dimensions for METARs for example. Right now, given that we aren't including the ability to append due to Xarray's limitations, we could just go with one unlimited dimension and update later when needed/if the CF specs are clarified at any point.

@zbruick zbruick force-pushed the pandas_netcdf branch 2 times, most recently from db0c6ba to 06ea202 Compare August 8, 2019 21:49
@zbruick
Copy link
Contributor Author

zbruick commented Aug 9, 2019

Just to clarify the issues with appending currently: Xarray overwrites variables if they already exist - it doesn't allow for appending along unlimited dimensions without overwriting. This again is documented in pydata/xarray#1672.

Two paths forward that I see/plan to investigate.

  • Use netcdf4-python to append (if that doesn't overwrite too)
  • If appending is desired, pull out existing data in netCDF as a xarray.Dataset and merge with the dataframe/Dataset to be appended, remove old file, and rewrite new file.

I plan to investigate both to see which makes more sense, as this is a desired feature and is likely easier than trying to fix Xarray at this point in time.

@zbruick
Copy link
Contributor Author

zbruick commented Aug 15, 2019

I've now added appending ability, but in doing so, I've officially added a pandas dependency. This is a first cut at doing this, and I'm happy to refactor it. The current method is possibly the most computationally expensive, and I should investigate more if xarray could do the same work here (depends on how much of the needed pandas support has been built in).

@zbruick
Copy link
Contributor Author

zbruick commented Aug 15, 2019

As a reference as to why I went back to a Dataframe and haven't figured out how to do this all as a Dataset: I think it's easier to reset the index and then re-groupby the observations so that all obs are collected for each station/profile/trajectory, and then reassigned to the dataset. It isn't clear to me if the xarray functionality for reset_index and groupby will accomplish this task as desired, as the dimensions need to be operated on/updated and that doesn't seem readily doable in xarray.

@zbruick
Copy link
Contributor Author

zbruick commented Aug 22, 2019

Pandas dependency docs have been updated. Not sure if the 2.7 test will pass, since the current version of pandas (0.25.0) does not support 2.7 anymore, and I didn't pin the pandas version maximum specifically for 2.7.

@dopplershift dopplershift added this to the 0.12 milestone Oct 2, 2019
Intended for conversion to DSG netCDF format, but if not enough
parameters are set by the user, a non-CF compliant netCDF will be
written, but warned about. This should work for time series, profiles,
or trajectories, with main testing done against METAR dataframes.
@zbruick
Copy link
Contributor Author

zbruick commented Oct 8, 2019

Dropped the pandas dependency since that was added already. Rebased and pushed to see what the current state of things are.

@dopplershift dopplershift modified the milestones: 0.12, 1.0 Dec 23, 2019
@dopplershift dopplershift modified the milestones: 1.0, 1.1 Jan 10, 2020
Base automatically changed from master to main February 22, 2021 22:39
@dopplershift dopplershift modified the milestones: 1.1.0, 1.2.0 Aug 2, 2021
@jthielen
Copy link
Collaborator

With xarray-contrib/cf-xarray#122 and xarray-contrib/cf-xarray#257, would it be worth leaving this functionality (and #1121) for cf-xarray to handle upstream?

@dcherian
Copy link

dcherian commented Sep 5, 2021

I could use some help with reviewing xarray-contrib/cf-xarray#260 if someone here is up for it

@jthielen jthielen removed this from the 1.2.0 milestone Oct 8, 2021
@jthielen
Copy link
Collaborator

jthielen commented Oct 8, 2021

See #1121 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants