WIP: revise top-level package description #2430

rabernat · 2018-09-22T15:35:47Z

I have often complained that xarray's top-level package description assumes that the user knows all about pandas. I think this alienates many new users.

This is a first draft of a revised top-level description. Feedback from the community is very much needed here.

doc/index.rst

fujiisoup · 2018-09-23T01:19:30Z

Thanks, @rabernat

I have often complained that xarray's top-level package description assumes that the user knows all about pandas.

Agreed!

I remember that when I first encountered xarray, I didn't really understand what the term 'labelled data' meant.
I don't think this terminology is very common.
It might be nice to have a more explicit definition, something like 'xarray provides a data structure to handle a data array and its coordinates consistently'.

jhamman · 2018-09-23T04:41:16Z

doc/index.rst

+
+Labelled multi-dimensional (a.k.a. N-dimensional) arrays are encountered in
+many fields, especially physical sciences, engineering, and finance.
+But multi-dimensional data doesn't fit neatly into pandas_, python's most


If we are going to contrast directly with Pandas, I think we need to say what Pandas is first. Maybe also provide an example of what Pandas does (tabular data structures).

I agree that we can't assume that readers know what Pandas is - I certainly didn't. I think that users coming from a more data science background will have used Pandas but those coming from a more low-level array-based numpy/MATLAB/Fortran/C++ point-of-view won't have (e.g. all the physicists I work with).

I also think that including an explicit example of a labelled data structure in this explanation would go a long way; the printable representation of an xarray Dataset gives a good idea of how it labels the data it contains.
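
As a rough illustration of the kind of example suggested here, the following is a minimal sketch and not text from the PR. The variable names, city names, dates, and temperature values are all made up for the illustration; only the xarray calls (`DataArray`, `sel`, `mean`, `to_dataset`) are real API.

```python
import numpy as np
import xarray as xr

# Hypothetical temperature readings: 3 days x 2 cities (values are made up).
temperature = xr.DataArray(
    np.array([[12.0, 18.0], [14.0, 20.0], [10.0, 16.0]]),
    dims=("time", "city"),
    coords={
        "time": ["2019-01-01", "2019-01-02", "2019-01-03"],
        "city": ["Boston", "Austin"],
    },
    name="temperature",
)

# Select by label instead of remembering that "Austin" is column 1.
austin = temperature.sel(city="Austin")

# Reduce over a named dimension instead of a positional axis number.
mean_per_city = temperature.mean(dim="time")

# A Dataset groups one or more named arrays that share coordinates;
# printing it lists the dimensions, coordinates, and data variables.
ds = temperature.to_dataset()
print(ds)
```

With a raw NumPy array, the same operations would be written as `data[:, 1]` and `data.mean(axis=0)`, which only make sense if the reader already knows what each axis and position stands for.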

jhamman · 2018-09-23T04:44:07Z

As a reference, we recently wrote some similar prose for xarray's numfocus page:

https://numfocus.org/project/xarray

Xarray is an open source library providing high-level, easy-to-use data structures and analysis tools for working with multidimensional labeled datasets and arrays in Python.

Xarray is a Python library that provides data structures and tools for working with multidimensional labeled datasets and arrays. Xarray enables users to perform operations on complex datasets. Xarray interoperates with many of the core libraries in the scientific Python ecosystem making it a powerful high-level tool for data analysis.

Xarray has been used in a wide variety of academic and industry contexts for applications as varied as weather/climate, computational physics, astronomy, biology, econometrics, machine learning and finance. It is a core component of Pangeo, a community platform for Big Data geoscience.

Examples of results enabled by xarray include:

modeling the environmental and socioeconomic impacts of climate change
understanding the life cycle of viruses from single-cell RNA sequencing data
measuring the speed of galaxies in a telescope survey

dcherian

I had trouble with the phrase "labelled data" too. I've added an example that maybe helps clear that bit up.

doc/index.rst

alexamici · 2018-10-22T17:44:30Z

doc/index.rst

+popular data analysis package focused on label tabular data.
+Xarray provides a pandas-like and pandas-compatible toolkit for
+analytics on multi-dimensional arrays.
+Our approach adopts the `Common Data Model`_ for self-
 describing scientific data in widespread use in the Earth sciences:
 ``xarray.Dataset`` is an in-memory representation of a netCDF file.


This is not completely accurate: an xarray.Dataset represents a netCDF-3 or netCDF-4 classic file, but only one of the Groups in a netCDF-4 file that uses the new netCDF-4 Data Model https://www.unidata.ucar.edu/software/netcdf/workshops/2011/datamodels/Nc4-uml.html (compatible but not identical with the cited Common Data Model). This may sound pedantic at this level, but I found the subtleties of the netCDF 3/4 data models very hard to grasp once I had the mental map between an xarray.Dataset and a netCDF-4 File.

IMHO the best option is to keep the reference to the Unidata Common Data Model, since xarray uses the extended type system, and to add a quick reference to the CDM concept of a Group.

Co-Authored-By: rabernat <[email protected]>

rabernat · 2019-01-04T20:09:31Z

Given this recent twitter thread, I think we should revive and finish this off.

rabernat · 2019-01-04T20:59:07Z

Based on the comments I received, I have written a second draft of a revised top-level description.

shoyer · 2019-01-04T21:05:33Z

doc/index.rst

+Xarray also provides a large and growing library of functions for advanced
+analytics and visualization with these data structures.
+Xarray was inspired by and borrows heavily from pandas_, a highly popular data
+analysis package focused on labelled tabular data.


It would be nice to still see the words "netCDF" somewhere (or maybe that's implicit in our mentioning of the "Common Data Model"?).

Roughly speaking we have three audiences here:

NumPy users who want labels

pandas users who want to work with higher-dimensional data

netCDF users who want good in-memory data-structures

I removed it in response to @alexamici's comments. But in retrospect I agree that it belongs there. (I personally had never heard of CDM before xarray.)

I would prioritize mentioning netCDF over the CDM and maybe drop CDM entirely from the brief intro. I don't think many people know what the "common data model" refers to, and worse it seems to be a heavily overloaded term, even in technical contexts (e.g., the top hit from Google is something unrelated from Microsoft).

Roughly speaking we have three audiences here:

* NumPy users who want labels
* pandas users who want to work with higher-dimensional data
* netCDF users who want good in-memory data-structures

This seems key enough that I might even put this somewhere in the docs?

and

* pandas users who want to work with higher-dimensional data
->
* pandas users who want to work with higher-dimensional data and an explicit, production-capable API

This might be good stuff to add to the “Why xarray” page.

shoyer · 2019-01-04T21:46:35Z

doc/index.rst

+are an essential part of computational science.
+They are encountered in a wide range of fields, including physics, astronomy,
+geoscience, bioinformatics, engineering, finance, and deep learning.
+In python, numpy_ provides the fundamental data structure and API for


nit: numpy -> NumPy

shoyer · 2019-01-04T21:47:44Z

doc/index.rst

+However, real-world datasets are usually more than just raw numbers;
+they have "labels" which encode information about how the array values map
+to locations in space, time, etc.
+By adopting the the `Common Data Model`_ for self-describing scientific data,


the the -> the

Maybe (I'm not actually sure this is better):
By adopting the self-describing data model of the netCDF file format

shoyer · 2019-01-04T21:52:52Z

This looks great now. Could you also kindly copy it into our setup.py and README.rst files?

alexamici · 2019-01-04T22:25:10Z

doc/index.rst

@@ -2,19 +2,33 @@ xarray: N-D labeled arrays and datasets in Python
 =================================================

 **xarray** (formerly **xray**) is an open source project and Python package


@shoyer can we drop the reference to xray? The set of people that know the old xray and don't know the new xarray name is probably next to empty.

Sadly, just today in the twitter thread under discussion, someone referenced xray and linked to the v0.2 documentation. 🤦‍♂️

max-sixty · 2019-01-05T06:28:31Z

doc/index.rst

+In python, numpy_ provides the fundamental data structure and API for
+working with raw ND arrays.
+However, real-world datasets are usually more than just raw numbers;
+they have "labels" which encode information about how the array values map


Not sure we need " around labels?

max-sixty · 2019-01-05T06:31:17Z

This is looking great!

rabernat · 2019-01-05T16:20:06Z

Ready I think.

spencerkclark · 2019-01-05T17:49:47Z

doc/index.rst

+that makes working with labelled multi-dimensional arrays simple,
+efficient, and fun!
+
+Multi-dimensional (a.k.a. N-dimensional, ND) arrays (somtimes called "tensors")


somtimes -> sometimes

shoyer · 2019-01-05T20:06:59Z

@rabernat I pushed some minor tweaks to your branch, please take a look!

rabernat · 2019-01-06T00:01:14Z

@shoyer - all your changes are 👍 with me.

* revise main package description
* Update doc/index.rst
  Co-Authored-By: rabernat <[email protected]>
* Update doc/index.rst
  Co-Authored-By: rabernat <[email protected]>
* Update doc/index.rst
  Co-Authored-By: rabernat <[email protected]>
* next draft
* add mention of netCDF
* eliminate CDM reference
* update README and setup.py
* Split long paragraph, minor rewordings

shoyer · 2019-01-07T01:04:19Z

I spent a few more hours working on this this afternoon -- please take a look at #2657!

* master:
  Remove broken Travis-CI builds (pydata#2661)
  Type checking with mypy (pydata#2655)
  Added Coarsen (pydata#2612)
  Improve test for GH 2649 (pydata#2654)
  revise top-level package description (pydata#2430)
  Convert ref_date to UTC in encode_cf_datetime (pydata#2651)
  Change an `==` to an `is`. Fix tests so that this won't happen again. (pydata#2648)
  ENH: switch Dataset and DataArray to use explicit indexes (pydata#2639)
  Use pycodestyle for lint checks. (pydata#2642)
  Switch whats-new for 0.11.2 -> 0.11.3
  DOC: document v0.11.2 release
  Use built-in interp for interpolation with resample (pydata#2640)
  BUG: pytest-runner no required for setup.py (pydata#2643)

revise main package description

2df8de5

rabernat requested review from shoyer and jhamman September 22, 2018 15:35

max-sixty reviewed Sep 22, 2018

jhamman reviewed Sep 23, 2018

dcherian reviewed Oct 23, 2018

alexamici reviewed Oct 23, 2018

dcherian and others added 3 commits January 4, 2019 21:07

Update doc/index.rst

ca8aa33

Co-Authored-By: rabernat <[email protected]>

Update doc/index.rst

40854dd

Co-Authored-By: rabernat <[email protected]>

Update doc/index.rst

9901f1e

Co-Authored-By: rabernat <[email protected]>

next draft

253baf9

shoyer reviewed Jan 4, 2019

rabernat added 2 commits January 4, 2019 22:26

add mention of netCDF

6f84e5a

eliminate CDM reference

ec11b01

shoyer reviewed Jan 4, 2019

alexamici reviewed Jan 4, 2019

max-sixty reviewed Jan 5, 2019

update README and setup.py

96ac31d

spencerkclark reviewed Jan 5, 2019

Split long paragraph, minor rewordings

085a5dd

dcherian merged commit a0bbea8 into pydata:master Jan 6, 2019

shoyer mentioned this pull request Jan 7, 2019

DOC: refresh "Why xarray" and shorten top-level description #2657

Merged

jhamman mentioned this pull request Feb 2, 2019

netCDF reading is not prominent in the docs #1154

Closed

rabernat mentioned this pull request Feb 26, 2019

description of xarray assumes knowledge of pandas #1282

Closed
