Convert pandas dataframes to netCDF DSG files #1074

zbruick · 2019-06-27T18:19:05Z

New module in metpy.io to convert DataFrames to netCDF files, with the intention of making them CF-compliant DSG files if enough user input is given. This was tested against METAR DataFrames parsed by @mgrover1, but can be used for time series, profiles, and trajectories. I'm open to suggestions on revisions from anyone, as this is a first attempt at writing these files. I also wrote it as a class, though I'm not sure that is the correct form, since this is my first time doing that as well.

Note: this currently only supports writing the files, although the intention is to make them appendable. Appending is not currently supported through Xarray (see: pydata/xarray#1672).

jthielen · 2019-06-28T20:41:33Z

A couple of questions to ask related to this, if that is okay:

Would it ever be useful to have access to the in-memory DSG Dataset without writing it to NetCDF? If so, it could make sense to separate this into two steps (pandas DataFrame -> DSG xarray Dataset, then DSG Dataset -> DSG NetCDF file).
Are there plans for the inverse (DSG Dataset -> pandas DataFrame)?
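
To make the two-step idea concrete, here is a minimal sketch. It is not the implementation in this PR: the helper names `dataframe_to_dsg_dataset` and `write_dsg_netcdf`, the column names, and the attribute values are all assumptions for illustration, and setting `featureType` alone does not make the result CF-compliant.

```python
import pandas as pd
import xarray as xr


def dataframe_to_dsg_dataset(df, station_col="station", time_col="time") -> xr.Dataset:
    """Step 1 (hypothetical helper): DataFrame -> in-memory xarray Dataset."""
    # Index by station and time so each becomes a dimension of the Dataset.
    ds = df.set_index([station_col, time_col]).to_xarray()
    # Illustrative global attribute only; real CF DSG metadata needs more.
    ds.attrs["featureType"] = "timeSeries"
    return ds


def write_dsg_netcdf(ds: xr.Dataset, path) -> None:
    """Step 2 (hypothetical helper): Dataset -> netCDF file on disk."""
    ds.to_netcdf(path)


obs = pd.DataFrame({
    "station": ["KDEN", "KDEN", "KBOU"],
    "time": pd.to_datetime(
        ["2019-06-27 18:00", "2019-06-27 19:00", "2019-06-27 18:00"]
    ),
    "temperature": [25.0, 26.5, 24.0],
})
dataset = dataframe_to_dsg_dataset(obs)
# Having the Dataset in hand allows changes before anything is written.
dataset.attrs["title"] = "Example METAR subset"
# write_dsg_netcdf(dataset, "metar_dsg.nc")  # requires a netCDF backend
```

With this split, the in-memory Dataset is available for inspection or extra attributes, at the cost of exposing two public functions instead of one.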

Also, I think it would make more sense to just be a function, rather than a class with only an __init__ method.

zbruick · 2019-06-28T21:59:19Z

Appreciate all the questions and input! I don't know if anyone would want the DSG Dataset, but that would be a pretty easy change. I would guess for things like METAR/Station plotting, they would just stick with the DataFrame, but that's not a guarantee. If we go down that route (or if we add the inverse), then it would make more sense to keep the class and move everything into functions within it, correct? Otherwise, I agree that switching it just to a function is the better way to go. I'll wait to see what @dopplershift's thoughts are on this before proceeding.

dopplershift

I've only done a rough review to this point, so there may be some other things lurking.

I agree that this should be a function. We really only need a class if there's some kind of state that needs passing around together to multiple functions, and that needs to be kept in sync together. I'm not seeing that here.

As far as multiple functions go, I'm split:

On one hand, functions should do one thing, and if there are two logical steps, those could be two functions
On the other hand, that makes the common case require users to put together two things rather than just know and use one. Also, that means we have to support, "forever", the two things.

I lean towards a single function until we identify such a use case for needing individual parts. Can anyone identify a use case for wanting the in-memory (xarray) for the DSG? I can maybe see that for testing or something, but I'm not sure if that's enough...

metpy/io/pandas_to_netcdf.py

jthielen · 2019-07-04T01:23:59Z

The two use cases I initially had in mind were

adding additional attributes before writing to netCDF
using a zarr store instead of netCDF (which seems to handle appending easier)

But, on second thought, I'm not sure if having a "CF-compliant DSG zarr store" is really a needed use case. With that, I wouldn't think the extra attributes case is worth the extra complications to the most common use case, so unless someone has another practical use case, I'd lean toward the single function as well.

(Regardless, I still think it would be great to have an inverse function at some point, especially since it was sounding like pandas DataFrames would be MetPy's data model of choice for point data.)

zbruick · 2019-07-05T15:17:16Z

From the comments above, I'll move forward with one function for now and we can revisit this if/when we identify other use cases that justify the separation of steps.

zbruick · 2019-07-17T18:42:32Z

Updated the function based on the review, lint, and reducing complexity (hopefully) by moving metadata assignment to a private function.

dopplershift

Found just a few minor things.

metpy/io/pandas_to_netcdf.py

metpy/io/__init__.py

dopplershift

Sorry. This started as finding a typo, then I had a question about adding to staticdata...

metpy/io/__init__.py

metpy/io/pandas_to_netcdf.py

metpy/io/tests/test_pandas_to_netcdf.py

zbruick · 2019-08-08T16:36:13Z

If there are ideas on how to get the remaining uncovered diffs (two error messages) to be tested, let me know. I haven't come up with anything, and that's the only remaining issue here.

dopplershift

I hadn't noticed the multiple unlimited dimensions before, which I don't think is actually CF-compliant--though that's subtle and does not seem to be well-documented.

metpy/io/pandas_to_netcdf.py

dopplershift · 2019-08-08T17:10:52Z

metpy/io/pandas_to_netcdf.py

+    path_to_save = str(path_to_save)
+
+    if check_netcdf4 is not None:
+        unlimited_dimensions = ['samples', 'observations']


So I've always been told CF doesn't support multiple unlimited dimensions--though I can't find that in the spec anywhere, only many references to "the unlimited dimension". @ethanrd @lesserwhirls can you weigh in here?

I'm thinking we may need to (optionally) support the "ragged array" options that are described in the CF spec to collapse the observation and sample dimensions.
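
For anyone unfamiliar with the ragged array idea, here is a rough sketch of the contiguous ragged layout in plain Python: every observation sits along a single sample dimension, and a per-station count says how many consecutive samples belong to each station. The function name `to_contiguous_ragged` and the `(station, time, value)` record layout are assumptions for illustration, not code from this PR.

```python
from itertools import groupby
from operator import itemgetter


def to_contiguous_ragged(records):
    """Collapse (station, time, value) records onto one sample dimension.

    Hypothetical helper: returns the station list, a row_size count per
    station, and the flattened time/value arrays in station order.
    """
    # Sort so that each station's samples are contiguous and time-ordered.
    ordered = sorted(records, key=itemgetter(0, 1))
    stations, row_size, times, values = [], [], [], []
    for station, group in groupby(ordered, key=itemgetter(0)):
        rows = list(group)
        stations.append(station)
        row_size.append(len(rows))  # number of samples for this station
        times.extend(row[1] for row in rows)
        values.extend(row[2] for row in rows)
    return {"station": stations, "row_size": row_size,
            "time": times, "value": values}


ragged = to_contiguous_ragged([
    ("KDEN", 2, 10.0),
    ("KBOU", 1, 5.0),
    ("KDEN", 1, 9.0),
])
```

In a file written this way, only the sample dimension would need to grow as observations arrive, so the count variable replaces the second two-dimensional axis.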

lesserwhirls · 2019-08-08

The single unlimited dimension is a netCDF Classic Model limitation:

An unlimited dimension has a length that can be expanded at any time, as more data are written to it. NetCDF files can contain at most one unlimited dimension.

but not for the netCDF Enhanced Data Model (netCDF-4):

The Enhanced Data Model supports the classic model in a completely backward-compatible way, while allowing access to new features such as groups, multiple unlimited dimensions, and new types, including user-defined types.

See https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html

Now, as far as I understand, CF does not explicitly say which data model it follows (classic vs enhanced), although classic is implied in several places, such as the discussion on supported data types, examples that only use one unlimited dimension, lack of any mention of the use of groups, etc. That does not, however, mean CF only applies to the Classic Data Model. See cf-convention/cf-conventions#191.

So, are multiple unlimited dimensions supported by CF?

zbruick · 2019-08-08

Glad I wasn't the only one confused by the CF specs...

I can change this back to one unlimited dimension only if desired. My thought for two dimensions is if we get this to appending mode in the future, we would need to append along the "station" and "time" dimensions for METARs for example. Right now, given that we aren't including the ability to append due to Xarray's limitations, we could just go with one unlimited dimension and update later when needed/if the CF specs are clarified at any point.

zbruick · 2019-08-09T18:50:56Z

Just to clarify the issues with appending currently: Xarray overwrites variables if they already exist - it doesn't allow for appending along unlimited dimensions without overwriting. This again is documented in pydata/xarray#1672.

Two paths forward that I see/plan to investigate.

Use netcdf4-python to append (if that doesn't overwrite too)
If appending is desired, pull out the existing data in the netCDF file as an xarray.Dataset, merge it with the DataFrame/Dataset to be appended, remove the old file, and write a new file.

I plan to investigate both to see which makes more sense, as this is a desired feature and is likely easier than trying to fix Xarray at this point in time.
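
The merge step of the second path could look roughly like the sketch below, assuming both the existing file contents and the new observations are available as flat DataFrames. The helper name `merge_for_append` and the key columns are hypothetical, and reading the old file and rewriting the new one are left out.

```python
import pandas as pd


def merge_for_append(existing, incoming, keys=("station", "time")):
    """Combine old and new observations (hypothetical helper).

    When the same station/time pair appears in both frames, the incoming
    row is kept.
    """
    key_list = list(keys)
    combined = pd.concat([existing, incoming], ignore_index=True)
    # keep="last" keeps the incoming row, since it was concatenated second.
    combined = combined.drop_duplicates(subset=key_list, keep="last")
    # Sort so that all observations for a station are stored together.
    return combined.sort_values(key_list).reset_index(drop=True)


existing = pd.DataFrame({
    "station": ["KDEN", "KDEN"],
    "time": pd.to_datetime(["2019-08-09 18:00", "2019-08-09 19:00"]),
    "temperature": [25.0, 26.5],
})
incoming = pd.DataFrame({
    "station": ["KDEN", "KBOU"],
    "time": pd.to_datetime(["2019-08-09 19:00", "2019-08-09 18:00"]),
    "temperature": [27.0, 24.0],
})
merged = merge_for_append(existing, incoming)
```

The merged frame would then go through the normal write path, replacing the old file rather than appending to it in place.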

zbruick · 2019-08-15T21:02:36Z

I've now added appending ability, but in doing so, I've officially added a pandas dependency. This is a first cut at doing this, and I'm happy to refactor it. The current method is possibly the most computationally expensive, and I should investigate more if xarray could do the same work here (depends on how much of the needed pandas support has been built in).

zbruick · 2019-08-15T22:12:31Z

For reference, here is why I went back to a DataFrame and haven't figured out how to do this all as a Dataset: I think it's easier to reset the index and then re-groupby the observations so that all obs are collected for each station/profile/trajectory, and then reassign them to the dataset. It isn't clear to me whether the xarray functionality for reset_index and groupby will accomplish this as desired, since the dimensions need to be operated on/updated and that doesn't seem readily doable in xarray.

zbruick · 2019-08-22T15:31:22Z

Pandas dependency docs have been updated. Not sure if the 2.7 test will pass, since the current version of pandas (0.25.0) does not support 2.7 anymore, and I didn't pin the pandas version maximum specifically for 2.7.

Intended for conversion to the DSG netCDF format. If the user does not set enough parameters, a non-CF-compliant netCDF file is written and a warning is issued. This should work for time series, profiles, or trajectories, with the main testing done against METAR DataFrames.

zbruick · 2019-10-08T14:46:30Z

Dropped the pandas dependency since that was added already. Rebased and pushed to see what the current state of things is.

jthielen · 2021-08-23T16:43:18Z

With xarray-contrib/cf-xarray#122 and xarray-contrib/cf-xarray#257, would it be worth leaving this functionality (and #1121) for cf-xarray to handle upstream?

dcherian · 2021-09-05T17:01:21Z

I could use some help with reviewing xarray-contrib/cf-xarray#260 if someone here is up for it

jthielen · 2021-10-08T21:37:33Z

See #1121 (comment).

zbruick requested a review from dopplershift as a code owner June 27, 2019 18:19

dopplershift requested changes Jul 3, 2019

zbruick force-pushed the pandas_netcdf branch from 9a6af29 to 2bf5af6 Compare July 17, 2019 18:41

zbruick force-pushed the pandas_netcdf branch 2 times, most recently from 37de73c to 5ea96d6 Compare July 19, 2019 18:51

dopplershift requested changes Aug 1, 2019

zbruick force-pushed the pandas_netcdf branch from 5ea96d6 to 3b89e85 Compare August 2, 2019 14:19

dopplershift requested changes Aug 5, 2019

zbruick force-pushed the pandas_netcdf branch from 3b89e85 to 8281e71 Compare August 6, 2019 15:53

dopplershift requested changes Aug 8, 2019

zbruick mentioned this pull request Aug 8, 2019

Create Pandas Dataframe from DSG netCDF #1121

Closed

zbruick force-pushed the pandas_netcdf branch 2 times, most recently from db0c6ba to 06ea202 Compare August 8, 2019 21:49

zbruick force-pushed the pandas_netcdf branch from 16c9b7d to d93ea81 Compare August 22, 2019 15:30

dopplershift added this to the 0.12 milestone Oct 2, 2019

zbruick force-pushed the pandas_netcdf branch from d93ea81 to 7160246 Compare October 8, 2019 14:45

Add appending capability to pandas_to_netcdf

4dad48f

zbruick force-pushed the pandas_netcdf branch from 7160246 to 4dad48f Compare October 8, 2019 15:33

dopplershift modified the milestones: 0.12, 1.0 Dec 23, 2019

dopplershift modified the milestones: 1.0, 1.1 Jan 10, 2020

This was referenced Apr 21, 2020

Handle different grid coordinate formats and naming JiaweiZhuang/xESMF#74

Open

Discussion of a "cf-xarray" package pangeo-data/pangeo#771

Closed

Base automatically changed from master to main February 22, 2021 22:39

dopplershift requested a review from dcamron as a code owner February 22, 2021 22:39

dopplershift modified the milestones: 1.1.0, 1.2.0 Aug 2, 2021

jthielen removed this from the 1.2.0 milestone Oct 8, 2021

jthielen closed this Oct 8, 2021

dopplershift mentioned this pull request May 3, 2022

Going from TDSCatalog to pandas via xarray uses excessive memory Unidata/siphon#304

Closed
