Convert pandas dataframes to netCDF DSG files #1074
A couple of questions related to this, if that's okay:
Also, I think it would make more sense for this to just be a function, rather than a class with only an …
Appreciate all the questions and input! I don't know if anyone would want the DSG Dataset, but that would be a pretty easy change. I would guess for things like METAR/Station plotting, they would just stick with the DataFrame, but that's not a guarantee. If we go down that route (or if we add the inverse), then it would make more sense to keep the class and move everything into functions within it, correct? Otherwise, I agree that switching it just to a function is the better way to go. I'll wait to see what @dopplershift's thoughts are on this before proceeding.
I've only done a rough review to this point, so there may be some other things lurking.
I agree that this should be a function. We really only need a class if there's some kind of state that needs passing around together to multiple functions, and that needs to be kept in sync together. I'm not seeing that here.
As far as multiple functions go, I'm split:
- On one hand, functions should do one thing, and if there are two logical steps, those could be two functions
- On the other hand, that makes the common case require users to put together two things rather than just know and use one. Also, that means we have to support, "forever", the two things.
I lean towards a single function until we identify a use case for needing the individual parts. Can anyone identify a use case for wanting the in-memory (xarray) representation of the DSG? I can maybe see that for testing or something, but I'm not sure if that's enough...
The two use cases I initially had in mind were
But, on second thought, I'm not sure if having a "CF-compliant DSG zarr store" is really a needed use case. With that, I wouldn't think the extra attributes case is worth the extra complications to the most common use case, so unless someone has another practical use case, I'd lean toward the single function as well. (Regardless, I still think it would be great to have an inverse function at some point, especially since it was sounding like pandas DataFrames would be MetPy's data model of choice for point data.)
From the comments above, I'll move forward with one function for now and we can revisit this if/when we identify other use cases that justify the separation of steps.
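A rough sketch of what the single-function interface could look like (the name `dataframe_to_netcdf` and its parameters are hypothetical here, not the PR's final API):

```python
import pandas as pd
import xarray as xr


def dataframe_to_netcdf(df, path, dataset_type=None):
    """Write a DataFrame to a netCDF file, optionally tagging it as a DSG feature type."""
    ds = xr.Dataset.from_dataframe(df)
    if dataset_type is not None:
        # CF discrete sampling geometries are identified by the global featureType attribute
        ds.attrs['featureType'] = dataset_type
    ds.to_netcdf(path)


obs = pd.DataFrame({'station': ['KDEN', 'KBOU'], 'temperature': [15.0, 14.2]})
dataframe_to_netcdf(obs, 'obs.nc', dataset_type='timeSeries')
```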
Updated the function based on the review, lint, and reducing complexity (hopefully) by moving metadata assignment to a private function.
Found just a few minor things.
Sorry. This started as finding a typo, then I had a question about adding to staticdata
...
If there are ideas on how to get the remaining uncovered diff lines (two error messages) tested, let me know. I haven't come up with anything, and that's the only remaining issue here.
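One generic pattern for covering an error-message branch is to feed deliberately bad input and match on the message. The import path and function name below are assumptions based on the module touched in this PR, not a confirmed API:

```python
import pandas as pd
import pytest

from metpy.io.pandas_to_netcdf import dataframe_to_netcdf  # hypothetical name/location


def test_empty_dataframe_raises():
    """Exercise one of the error messages by passing an obviously invalid DataFrame."""
    with pytest.raises(ValueError, match='DataFrame'):
        dataframe_to_netcdf(pd.DataFrame(), 'out.nc')
```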
I hadn't noticed the multiple unlimited dimensions before, which I don't think are actually CF-compliant--though that's subtle and does not seem to be well-documented.
metpy/io/pandas_to_netcdf.py (outdated diff excerpt):

    path_to_save = str(path_to_save)

    if check_netcdf4 is not None:
        unlimited_dimensions = ['samples', 'observations']
So I've always been told CF doesn't support multiple unlimited dimensions--though I can't find that in the spec anywhere, only many references to "the unlimited dimension". @ethanrd @lesserwhirls can you weigh in here?
I'm thinking we may need to (optionally) support the "ragged array" options that are described in the CF spec to collapse the observation and sample dimensions.
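For reference, a minimal sketch (not this PR's code) of the CF contiguous ragged array representation: a single `obs` dimension plus a per-feature count variable carrying the `sample_dimension` attribute, instead of separate station/observation dimensions:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        'temperature': (('obs',), np.array([15.0, 14.2, 13.8, 16.1, 15.5])),
        # Count variable: how many consecutive obs belong to each station
        'row_size': (('station',), np.array([3, 2], dtype='int32'),
                     {'sample_dimension': 'obs'}),
        'station_id': (('station',), np.array(['KDEN', 'KBOU'])),
    },
    attrs={'featureType': 'timeSeries'},
)
ds.to_netcdf('ragged.nc')
```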
The single unlimited dimension is a netCDF Classic Model limitation:
An unlimited dimension has a length that can be expanded at any time, as more data are written to it. NetCDF files can contain at most one unlimited dimension.
but not for the netCDF Enhanced Data Model (netCDF-4):
The Enhanced Data Model supports the classic model in a completely backward-compatible way, while allowing access to new features such as groups, multiple unlimited dimensions, and new types, including user-defined types.
See https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html
Now, as far as I understand, CF does not explicitly say which data model it follows (classic vs. enhanced), although classic is implied in several places, such as the discussion on supported data types, examples that only use one unlimited dimension, the lack of any mention of the use of groups, etc. That does not, however, mean CF only applies to the Classic Data Model. See cf-convention/cf-conventions#191.
So, are multiple unlimited dimensions supported by CF?
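For reference, a minimal netCDF4-python sketch of the data-model difference described above (illustrative only):

```python
from netCDF4 import Dataset

# Enhanced data model: multiple unlimited dimensions are allowed
with Dataset('enhanced.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('samples', None)        # unlimited
    nc.createDimension('observations', None)   # also unlimited, and valid here

# Classic model: only one unlimited dimension per file
with Dataset('classic.nc', 'w', format='NETCDF4_CLASSIC') as nc:
    nc.createDimension('samples', None)
    # nc.createDimension('observations', None)  # a second unlimited dimension errors here
```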
Glad I wasn't the only one confused by the CF specs...
I can change this back to only one unlimited dimension if desired. My thought behind two dimensions was that if we add appending in the future, we would need to append along both the "station" and "time" dimensions (for METARs, for example). Right now, given that we aren't including the ability to append due to Xarray's limitations, we could just go with one unlimited dimension and update later when needed or if the CF specs are clarified at some point.
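For context, the unlimited dimension(s) are requested through xarray's writer; a minimal sketch of both options (dimension names match the diff above):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'temperature': (('samples', 'observations'), np.zeros((2, 3)))})

# Single unlimited dimension (safer with respect to the CF question above)
ds.to_netcdf('dsg.nc', unlimited_dims=['samples'])

# Multiple unlimited dimensions are accepted by the netCDF-4 format:
# ds.to_netcdf('dsg.nc', unlimited_dims=['samples', 'observations'])
```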
Just to clarify the current issues with appending: Xarray overwrites variables if they already exist; it doesn't allow appending along unlimited dimensions without overwriting. This is documented in pydata/xarray#1672. There are two paths forward that I see and plan to investigate:
I plan to investigate both to see which makes more sense, as this is a desired feature and is likely easier than trying to fix Xarray at this point in time.
I've now added appending ability, but in doing so, I've officially added a pandas dependency. This is a first cut at doing this, and I'm happy to refactor it. The current method is possibly the most computationally expensive option, and I should investigate further whether xarray could do the same work here (it depends on how much of the needed pandas support has been built in).
As a reference as to why I went back to a DataFrame and haven't figured out how to do this all as a Dataset: I think it's easier to reset the index and then re-groupby the observations so that all obs are collected for each station/profile/trajectory, and then reassigned to the dataset. It isn't clear to me whether the xarray functionality for reset_index and groupby will accomplish this task as desired, since the dimensions need to be operated on/updated, and that doesn't seem readily doable in xarray.
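Roughly, the pandas-based approach described above looks like the following (an assumed sketch, not the PR's exact code): read the existing file back to a flat DataFrame, concatenate the new observations, renumber the observations within each feature, and rebuild and rewrite the Dataset.

```python
import pandas as pd
import xarray as xr

# Existing DSG file back to a flat DataFrame
existing = xr.load_dataset('dsg.nc').to_dataframe().reset_index()

new_obs = pd.DataFrame({'station': ['KDEN'],
                        'time': [pd.Timestamp('2019-08-01 01:00')],
                        'temperature': [14.5]})

combined = pd.concat([existing, new_obs], ignore_index=True)
# Regroup so each station's observations are numbered consecutively again
combined['observations'] = combined.groupby('station').cumcount()

ds = xr.Dataset.from_dataframe(combined.set_index(['station', 'observations']))
ds.to_netcdf('dsg.nc')  # overwrite, since xarray cannot append along unlimited dims
```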
Pandas dependency docs have been updated. Not sure if the 2.7 test will pass, since the current version of pandas (0.25.0) does not support Python 2.7 anymore, and I didn't pin a maximum pandas version specifically for 2.7.
This is intended to produce DSG netCDF files, but if not enough parameters are set by the user, a non-CF-compliant netCDF file will be written and a warning issued. This should work for time series, profiles, or trajectories, with most testing done against METAR DataFrames.
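A hedged sketch of that warn-but-write behavior (the helper name, check, and message are illustrative, not the PR's exact code):

```python
import warnings


def _warn_if_not_cf(sampling_var, dataset_type):
    """Warn, but do not fail, when there isn't enough metadata for a CF-compliant DSG."""
    if sampling_var is None or dataset_type is None:
        warnings.warn('Not enough metadata was supplied for a CF-compliant DSG file; '
                      'a plain (non-CF) netCDF file will be written instead.')
```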
Dropped the pandas dependency since that was already added. Rebased and pushed to see what the current state of things is.
With xarray-contrib/cf-xarray#122 and xarray-contrib/cf-xarray#257, would it be worth leaving this functionality (and #1121) for cf-xarray to handle upstream?
I could use some help with reviewing xarray-contrib/cf-xarray#260 if someone here is up for it.
See #1121 (comment).
New module in metpy.io to convert DataFrames to netCDF files, with the intention of making them CF-compliant DSG files if enough user input is given. This was tested against METAR DataFrames parsed by @mgrover1, but it can be used for time series, profiles, and trajectories. I'm open to suggestions on revisions from anyone, as this is a first attempt at writing these files. I also wrote it as a class; I'm not sure if that is the correct form, as this is the first time doing that as well.
Note: currently, this only supports writing the files, although the intention is to make them appendable. However, through Xarray, this is not currently supported (see pydata/xarray#1672).