WIP: revise top-level package description #2430
Thanks, @rabernat
Agreed! I remember that when I first encountered xarray, I didn't really understand what the term 'labelled data' meant.
doc/index.rst
Outdated
Labelled multi-dimensional (a.k.a. N-dimensional) arrays are encountered in
many fields, especially physical sciences, engineering, and finance.
But multi-dimensional data doesn't fit neatly into pandas_, python's most
If we are going to contrast directly with Pandas, I think we need to say what Pandas is first. Maybe also provide an example of what Pandas does (tabular data structures).
I agree that we can't assume that readers know what Pandas is - I certainly didn't. I think that users coming from a more data science background will have used Pandas but those coming from a more low-level array-based numpy/MATLAB/Fortran/C++ point-of-view won't have (e.g. all the physicists I work with).
I also think including an explicit example of a labelled data structure in this explanation would go a long way; the printable representation of an xarray Dataset gives a good idea of how it labels the data it contains.
As a reference, we recently wrote some similar prose for xarray's numfocus page: https://numfocus.org/project/xarray
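To make the suggestion about the printable representation concrete, here is a minimal sketch of a labelled Dataset. The variable name, coordinate values, and numbers are made up for illustration and are not taken from the proposed docs; it assumes NumPy and xarray are installed.

```python
import numpy as np
import xarray as xr

# Hypothetical data: temperatures on a 2 x 3 (time, location) grid.
temperature = np.array([[11.2, 14.8, 9.5], [12.1, 15.3, 10.0]])

ds = xr.Dataset(
    data_vars={"temperature": (("time", "location"), temperature)},
    coords={
        "time": ["2019-01-01", "2019-01-02"],
        "location": ["IA", "IN", "IL"],
    },
)

# The printed representation lists dimensions, coordinates, and data
# variables by name, which is what "labelled" refers to here.
print(ds)

# Select a value by its labels rather than by integer position.
value = ds["temperature"].sel(time="2019-01-02", location="IN").item()
print(value)
```

The same lookup on the raw array would be `temperature[1, 1]`, which only works if the reader remembers which axis is which.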
I had trouble with the phrase "labelled data" too. I've added an example that maybe helps clear that bit up.
doc/index.rst
Outdated
popular data analysis package focused on label tabular data.
Xarray provides a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays.
Our approach adopts the `Common Data Model`_ for self-
describing scientific data in widespread use in the Earth sciences:
``xarray.Dataset`` is an in-memory representation of a netCDF file.
This is not completely accurate: an xarray.Dataset represents a netCDF-3 or netCDF-4 classic file, but only one of the Groups in a netCDF-4 file that uses the new netCDF-4 Data Model https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/Nc4-uml.html (compatible but not identical with the cited Common Data Model). This may sound pedantic at this level, but I found the subtleties of the netCDF 3/4 data models very hard to grasp once I had the mental map between an xarray.Dataset and a netCDF-4 File.
IMHO the best option is to keep the reference to the Unidata Common Data Model, as xarray uses the extended type system, and add a quick reference to the CDM concept of a Group.
Co-Authored-By: rabernat <[email protected]>
Given this recent twitter thread, I think we should revive and finish this off.
Based on the comments I received, I have written a second draft of a revised top-level description.
Xarray also provides a large and growing library of functions for advanced
analytics and visualization with these data structures.
Xarray was inspired by and borrows heavily from pandas_, a highly popular data
analysis package focused on labelled tabular data.
It would be nice to still see the words "netCDF" somewhere (or maybe that's implicit in our mentioning of the "Common Data Model"?).
Roughly speaking we have three audiences here:
- NumPy users who want labels
- pandas users who want to work with higher-dimensional data
- netCDF users who want good in-memory data-structures
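As a rough illustration of the first audience (NumPy users who want labels), the sketch below compares a positional NumPy reduction with the equivalent named-dimension reduction in xarray. The dimension names and values are hypothetical and chosen only for the example; it assumes NumPy and xarray are installed.

```python
import numpy as np
import xarray as xr

# A raw 2 x 3 array: with NumPy alone, the meaning of each axis lives
# only in the programmer's head.
raw = np.arange(6.0).reshape(2, 3)
numpy_mean = raw.mean(axis=0)

# The same data with hypothetical dimension names and coordinate labels.
da = xr.DataArray(
    raw,
    dims=("time", "space"),
    coords={"space": ["a", "b", "c"]},
)

# Reduce over a named dimension instead of an axis number.
labelled_mean = da.mean(dim="time")

# Look up a result by label instead of by position.
mean_at_b = labelled_mean.sel(space="b").item()
print(mean_at_b)
```

Both reductions compute the same numbers; the xarray version records which dimension was averaged away and keeps the `space` labels attached to the result.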
I removed it in response to @alexamici's comments. But in retrospect I agree that it belongs there. (I personally had never heard of CDM before xarray.)
I would prioritize mentioning netCDF over the CDM and maybe drop CDM entirely from the brief intro. I don't think many people know what the "common data model" refers to, and worse it seems to be a heavily overloaded term, even in technical contexts (e.g., the top hit from Google is something unrelated from Microsoft).
Roughly speaking we have three audiences here:
* NumPy users who want labels
* pandas users who want to work with higher-dimensional data
* netCDF users who want good in-memory data-structures
This seems key enough that I might even put this somewhere in the docs?
and
* pandas users who want to work with higher-dimensional data
->
* pandas users who want to work with higher-dimensional data and an explicit, production-capable API
This might be good stuff to add to the “Why xarray” page.
doc/index.rst
Outdated
are an essential part of computational science.
They are encountered in a wide range of fields, including physics, astronomy,
geoscience, bioinformatics, engineering, finance, and deep learning.
In python, numpy_ provides the fundamental data structure and API for
nit: numpy -> NumPy
doc/index.rst
Outdated
However, real-world datasets are usually more than just raw numbers;
they have "labels" which encode information about how the array values map
to locations in space, time, etc.
By adopting the the `Common Data Model`_ for self-describing scientific data,
the the -> the
Maybe (I'm not actually sure this is better):
By adopting the self-describing data model of the netCDF file format
This looks great now. Could you also kindly copy it into our |
@@ -2,19 +2,33 @@ xarray: N-D labeled arrays and datasets in Python
=================================================

**xarray** (formerly **xray**) is an open source project and Python package
@shoyer can we drop the reference to xray? The set of people that know the old xray and don't know the new xarray name is probably next to empty.
Sadly, just today in the twitter thread under discussion, someone referenced xray and linked to the v0.2 documentation. 🤦♂️
doc/index.rst
Outdated
In python, numpy_ provides the fundamental data structure and API for
working with raw ND arrays.
However, real-world datasets are usually more than just raw numbers;
they have "labels" which encode information about how the array values map
Not sure we need " around labels?
removed
This is looking great!
Ready I think.
doc/index.rst
Outdated
that makes working with labelled multi-dimensional arrays simple,
efficient, and fun!

Multi-dimensional (a.k.a. N-dimensional, ND) arrays (somtimes called "tensors")
somtimes -> sometimes
@rabernat I pushed some minor tweaks to your branch, please take a look!
@shoyer - all your changes are 👍 with me.
* revise main package description
* Update doc/index.rst Co-Authored-By: rabernat <[email protected]>
* Update doc/index.rst Co-Authored-By: rabernat <[email protected]>
* Update doc/index.rst Co-Authored-By: rabernat <[email protected]>
* next draft
* add mention of netCDF
* eliminate CDM reference
* update README and setup.py
* Split long paragraph, minor rewordings
I spent a few more hours working on this this afternoon -- please take a look at #2657!
* master:
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
  Improve test for GH 2649 (pydata#2654)
  revise top-level package description (pydata#2430)
  Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  Use pycodestyle for lint checks. (pydata#2642)
  Switch whats-new for 0.11.2 -> 0.11.3
  DOC: document v0.11.2 release
  Use built-in interp for interpolation with resample (pydata#2640)
  BUG: pytest-runner no required for setup.py (pydata#2643)
I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users.
This is a first draft of a revised top-level description. Feedback from the community is very much needed here.