
API: Gather and other inspiration from tidyr #10109

Closed
datnamer opened this issue May 12, 2015 · 17 comments
Labels
Enhancement · Needs Discussion (Requires discussion from core team before further action) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@datnamer

http://connor-johnson.com/2014/08/28/tidyr-and-pandas-gather-and-melt/

In the spirit of the excellent assign method, wondering if there is support for some tidyr style transformations?

@datnamer changed the title from "API: inspiration from tidyr" to "API: Gather and other inspiration from tidyr" on May 12, 2015
@shoyer added the Reshaping, API Design, and Effort Medium labels on May 12, 2015
@shoyer modified the milestones: 0.17.0, Next Major Release on May 12, 2015
@shoyer
Member

shoyer commented May 12, 2015

Hadley Wickham is brilliant at API design, so I'm always happy to use his work for inspiration. Concrete suggestions would be helpful.

At the very least, gather looks like a broadly useful method to add to pandas dataframes. It seems like an improved version of melt, which we already borrowed from Hadley. So +1 from me, though I'd wait for a few other core devs to chime in (@TomAugspurger?) before you start writing your PR.
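
For reference, tidyr's gather maps fairly directly onto the existing melt; a minimal sketch of the correspondence (the column and variable names are illustrative, not from this thread):

import pandas as pd

df = pd.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# tidyr: gather(df, key = "treatment", value = "heartrate", a, b)
long = pd.melt(df, id_vars=["names"], value_vars=["a", "b"],
               var_name="treatment", value_name="heartrate")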

@jreback
Contributor

jreback commented May 12, 2015

This already exists (sort of) as wide_to_long. +1 on changing the name to gather and giving it a better API (the current one uses regexes to 'infer' the variable names).

In [26]: df
Out[26]: 
    a   b    names
0  67  56   Wilbur
1  80  90  Petunia
2  64  50  Gregory

In [27]: df2 = df.rename(columns={'a' : 'A1', 'b' : 'B2'})

In [28]: df2
Out[28]: 
   A1  B2    names
0  67  56   Wilbur
1  80  90  Petunia
2  64  50  Gregory

In [29]: pd.wide_to_long(df2,['A|B'],'names','heartrate')
Out[29]: 
                   A|B
names   heartrate     
Wilbur  1           67
Petunia 1           80
Gregory 1           64
Wilbur  2           56
Petunia 2           90
Gregory 2           50
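
For anyone puzzled by the regex: wide_to_long(df, stubnames, i, j) splits each matching column name into a stub and a suffix, so the stub pattern 'A|B' matches A1 and B2, and the leftover suffixes 1 and 2 become the values of the new heartrate level. The same call with keyword arguments (a sketch against the pandas API of that era):

# 'A|B' is a stub pattern; the suffixes (1 and 2) become the j values.
pd.wide_to_long(df2, stubnames=['A|B'], i='names', j='heartrate')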

@jorisvandenbossche
Member

I just wanted to give the same reference to wide_to_long.
But I think that is exactly an example of how not to do it (in the sense of "I have a specific case, let's add a function for that"). To be clear: at least not within the scope of pandas (or at least not in its top-level namespace).

In that sense, I am -1 on just adding yet another reshape-like function before thinking it through a bit more (though I fully agree that the current reshape functionality, melt and wide_to_long, could use some love to make it more flexible and user friendly):

  • Is it possible to improve melt without breaking backwards compatibility? (I don't know that I find the name 'gather' that much better.)
  • How do we keep the interface somewhat consistent between the 'opposite' functions (pivot, pivot_table; spread in tidyr)? See the round-trip sketch after this list.
  • Maybe someone should rather write a 'pandas-tidy' package that implements these functions on top of pandas first?
  • Go through the different problem cases of reshaping, see how each can currently be done in pandas, what is good, what is lacking, and how we can improve it?
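
To make the melt/pivot symmetry concrete, here is a minimal round-trip sketch (the names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# melt plays the role of tidyr's gather ...
long = pd.melt(df, id_vars="names", var_name="treatment",
               value_name="heartrate")

# ... and pivot is its rough inverse, like tidyr's spread.
wide = (long.pivot(index="names", columns="treatment", values="heartrate")
            .reset_index())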

The problem is a bit that pandas is becoming a monolithic package. Hadley Wickham indeed has very nice APIs (and we can learn a lot from them as something to strive for in pandas). But when he has a new idea, he just starts a new package. For example, the current melt function in pandas is based on his older reshape(2) package. Now there is tidyr, and we could take inspiration from it. But then maybe next there is another package, and we can't keep adding functions ... (or we can, but the question is whether we want to).
But of course, this is a much more general discussion ...

@datnamer
Author

-1 on my own suggestion and further overloading the pandas namespace, and +1 on using simple, composable building-block abstractions and maybe starting a new package. Pydata is great for consistency, but R has more rapid, diffuse, iterative innovation... I wonder if we can help foster the latter.

@kay1793

kay1793 commented May 12, 2015

That's well put, but I'm forced to strongly disagree. While, as you say, R unarguably benefits from rapid, diffuse, iterative innovation, if you really examine the issue closely you must realize that pydata tools tend to intentionally embrace a more focused decentralized convergent behaviour-driven amalgamation approach, one that is inherently aspect-oriented and in line with the best-of-breed theories of cloud-first which reign supreme over this exciting new age of "stuff".

@datnamer
Author

Sorry, I don't understand what that means.

@shoyer
Member

shoyer commented May 12, 2015

Indeed, a separate package is probably a better place to start. The only unfortunate bit about doing this outside of pandas is that users can't do method chaining with third party packages.

@kay1793 Please don't troll.
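
(Note for later readers: pandas has since added DataFrame.pipe, in 0.16.2, which lets third-party functions participate in method chains. A sketch; the gather helper here is hypothetical:)

import pandas as pd

def gather(df, key, value, cols):
    # Hypothetical tidyr-style helper built on pd.melt.
    id_vars = [c for c in df.columns if c not in cols]
    return pd.melt(df, id_vars=id_vars, value_vars=list(cols),
                   var_name=key, value_name=value)

df = pd.DataFrame({"names": ["Wilbur", "Petunia", "Gregory"],
                   "a": [67, 80, 64], "b": [56, 90, 50]})

# .pipe lets the third-party helper join a method chain:
tidy = df.pipe(gather, "treatment", "heartrate", ["a", "b"])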

@datnamer
Author

But what good trolling it was :) No, I'm just kidding; it was rude, and it had me going because he almost had a point there.

This method chaining issue perfectly illustrates my somewhat densely articulated point: in R, new packages spring up all the time, iterate on other packages, and are connected using pipes. In Python (statsmodels, for example), users go through a long and arduous PR process to get code included in packages, which increases the maintenance burden and decreases the contributor's motivation and sense of ownership to maintain their own code.

While the code quality is more variable and the APIs are less consistent, the vibrant package landscape, bound together by CRAN and piping, makes up for it in a sense. Sure, we can write our own libraries in Python, but without a CRAN-like thing they end up languishing unmaintained in corners of GitHub. In the meantime, it's harder to push to the primary packages, and innovation there diminishes as maintenance takes up a higher proportion of reviewer time.

Forgive me for the digression, but I think this is tangentially related and critical for pydata. The end result is that the landscape in R seems to be advancing much faster (aided in no small part by Dr. Wickham, of course). Other variables include additional R MOOCs, but my point stands regardless.

@shoyer
Member

shoyer commented May 12, 2015

@datnamer I agree. I recently made a similar argument as part of a push by @mrocklin for adding macros to Python: https://mail.python.org/pipermail/python-ideas/2015-March/032822.html

There may also be less extreme ways to achieve the same result... if you have ideas about things we can do, I'm all ears.

@datnamer
Author

@shoyer: Hmmm... I really think the core pandas devs, Matt Rocklin, Travis, etc. need to get together for a serious brainstorming session on this if Python is to keep up in the near and distant future.

We need to encourage innovation, modularity and ease of use.

I think the low-hanging fruit is to improve the sense of connectivity, idea dispersion, and utilization of third-party packages in the pydata community. Some sort of pydata blogger network and a CRAN-like task-view database would be important.

Regarding the chaining issue... I'm not so up on the technical details... but this looks promising: https://github.com/dalejung/naginpy Is there any reason it can't be built out and/or work interactively? Could context managers be used in some way?

Is Matt Rocklin still pursuing this macro idea?

@shoyer
Member

shoyer commented May 12, 2015

This seems like a great topic for a BoF session at SciPy 2015... anyone else interested in co-organizing?

@mrocklin
Contributor

> Is Matt Rocklin still pursuing this macro idea?

In my free time, which is to say "Not at the moment". But I was pleasantly surprised by a warm response to the idea by a number of people at PyCon. I'll send out feelers to see if anyone is gung ho about pushing it forward. I think that the next step is to spec out a design and actually implement a proof of concept. CPython hackers welcome.

@datnamer
Author

Interesting. I wonder if @dalejung has thoughts on this?

@jreback
Contributor

jreback commented May 13, 2015

> While the code quality is more variable and the APIs are less consistent, the vibrant package landscape, bound together by CRAN and piping, makes up for it in a sense. Sure, we can write our own libraries in Python, but without a CRAN-like thing they end up languishing unmaintained in corners of GitHub. In the meantime, it's harder to push to the primary packages, and innovation there diminishes as maintenance takes up a higher proportion of reviewer time.

Ever heard of PyPI? You are WAY underestimating the importance of consistent APIs. R has succeeded to some extent IN SPITE of this major, major problem. In fact, I would argue that they are moving more and more toward curated types of packages (e.g. dplyr) that HAVE a consistent API scheme. If something is worthwhile and popular, then you would expect that it would eventually be included in a mainstream package.

pandas is exactly this model.

The point of a 'curated' model is that not only do you get consistency, you also get best practices and one way to do it. You don't have to search around and figure out 'how do I do X'. You get support and bug fixes. How many one-of-a-kind R packages have this? Sure, they may have some value to a small group of people; great. But truly, is this a package system that you would actually want to rely upon?

The biggest benefit, however, of a package like pandas is that you get distribution. Once a feature is accepted into pandas, it immediately becomes available to a pretty large community, is announced at release time, and is supported. I think you'd have a hard time saying the same about virtually any grass-roots package (in R or Python), unless it is more 'mainstream'.

my 2c. (And I do agree that statsmodels is not iterating fast enough for the community, but there also is not a lot of community support, compared to, say, scikit-learn or many R packages.)

@dalejung
Contributor

I've gone the way of adding the wackiness through integrated tooling. I'm not sure what the likelihood is of Python adding macro capabilities, and even then I imagine they'd be too sensible for my tastes. The features I want out of a lab environment are commonly bad practice for library development :/

@pwwang

pwwang commented Mar 17, 2022

For latecomers, here is the transformation with datar:

>>> from datar.all import c, f, tibble, pivot_longer
>>> df = tibble(
...   name = c("Wilbur", "Petunia", "Gregory"),
...   a = c(67, 80, 64),
...   b = c(56, 90, 50)
... )
>>> df
      name       a       b
  <object> <int64> <int64>
0   Wilbur      67      56
1  Petunia      80      90
2  Gregory      64      50
>>> df >> pivot_longer(~f.name, names_to="TREATMENT", values_to="HEART RATE")
      name TREATMENT  HEART RATE
  <object>  <object>     <int64>
0   Wilbur         a          67
1  Petunia         a          80
2  Gregory         a          64
3   Wilbur         b          56
4  Petunia         b          90
5  Gregory         b          50

@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@jbrockmendel
Member

Discussed on today's dev call; the consensus was that if wide_to_long already handles this, we don't want another function. Closing.
