
API: Gather and other inspiration from tidyr #10109

Closed
datnamer opened this issue May 12, 2015 · 17 comments
Labels
Enhancement · Needs Discussion (Requires discussion from core team before further action) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

Comments

@datnamer

http://connor-johnson.com/2014/08/28/tidyr-and-pandas-gather-and-melt/

In the spirit of the excellent assign method, wondering if there is support for some tidyr style transformations?

@datnamer changed the title from "API: inspiration from tidyr" to "API: Gather and other inspiration from tidyr" on May 12, 2015
@shoyer added the Reshaping, API Design, and Effort Medium labels on May 12, 2015
@shoyer modified the milestones: 0.17.0, Next Major Release on May 12, 2015
@shoyer
Member

shoyer commented May 12, 2015

Hadley Wickham is brilliant at API design, so I'm always happy to use his work for inspiration. Concrete suggestions would be helpful.

At the very least, gather looks like a broadly useful method to add to pandas dataframes. It seems like an improved version of melt, which we already borrowed from Hadley. So +1 from me, though I'd wait for a few other core devs to chime in (@TomAugspurger?) before you start writing your PR.
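
For reference, tidyr's gather maps fairly directly onto the existing melt; a minimal sketch of the correspondence (the column and variable names are illustrative, not from this thread):

import pandas as pd

df = pd.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# tidyr: gather(df, key = "treatment", value = "heartrate", a, b)
long = pd.melt(df, id_vars=["names"], value_vars=["a", "b"],
               var_name="treatment", value_name="heartrate")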

@jreback
Contributor

jreback commented May 12, 2015

This already exists (sort of) as wide_to_long. +1 on changing the name to gather and giving it a better API (the current one uses regexes to 'infer' the variable names).

In [26]: df
Out[26]: 
    a   b    names
0  67  56   Wilbur
1  80  90  Petunia
2  64  50  Gregory

In [27]: df2 = df.rename(columns={'a' : 'A1', 'b' : 'B2'})

In [28]: df2
Out[28]: 
   A1  B2    names
0  67  56   Wilbur
1  80  90  Petunia
2  64  50  Gregory

In [29]: pd.wide_to_long(df2,['A|B'],'names','heartrate')
Out[29]: 
                   A|B
names   heartrate     
Wilbur  1           67
Petunia 1           80
Gregory 1           64
Wilbur  2           56
Petunia 2           90
Gregory 2           50
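
For anyone puzzled by the regex: wide_to_long(df, stubnames, i, j) splits each matching column name into a stub and a suffix, so the stub pattern 'A|B' matches A1 and B2, and the leftover suffixes 1 and 2 become the values of the new heartrate level. The same call with keyword arguments (a sketch against the pandas API of that era):

# 'A|B' is a stub pattern; the suffixes (1 and 2) become the j values.
pd.wide_to_long(df2, stubnames=['A|B'], i='names', j='heartrate')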

@jorisvandenbossche
Member

I just wanted to give the same reference to wide_to_long.
But I think that is exactly an example of how not to do it (in the sense of "I have a specific case, let's add a function for that"). To be clear: at least not within the scope of pandas (or at least not in its top-level namespace).

In that sense, I am -1 on just adding yet another reshape-like function before thinking it through a bit more (though I fully agree that the current reshape functionality, melt and wide_to_long, could use some love to make it more flexible and user friendly):

  • Is it possible to improve melt without breaking backwards compatibility? (I don't know that I find the name 'gather' that much better.)
  • How do we keep the interface somewhat consistent between the 'opposite' functions (pivot, pivot_table; spread in tidyr)? See the round-trip sketch after this list.
  • Maybe someone should rather write a 'pandas-tidy' package that implements these functions on top of pandas first?
  • Go through the different problem cases of reshaping, see how each can currently be done in pandas, what is good, what is lacking, and how we can improve it?
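
To make the melt/pivot symmetry concrete, here is a minimal round-trip sketch (the names are illustrative):

import pandas as pd

df = pd.DataFrame({
    "names": ["Wilbur", "Petunia", "Gregory"],
    "a": [67, 80, 64],
    "b": [56, 90, 50],
})

# melt plays the role of tidyr's gather ...
long = pd.melt(df, id_vars="names", var_name="treatment",
               value_name="heartrate")

# ... and pivot is its rough inverse, like tidyr's spread.
wide = (long.pivot(index="names", columns="treatment", values="heartrate")
            .reset_index())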

The problem is a bit that pandas is becoming a monolithic package. Hadley Wickham indeed has very nice APIs (and we can learn a lot from them as something to strive for in pandas). But when he has a new idea, he just starts a new package. For example, the current melt function in pandas is based on his older reshape(2) package. Now there is tidyr, and we could take inspiration from it. But then maybe next there is another package, and we can't keep adding functions ... (or we can, but the question is whether we want to).
But of course, this is a much more general discussion ...

@datnamer
Author

-1 on my own suggestion and further overloading the pandas namespace, and +1 on using simple, composable building-block abstractions and maybe starting a new package. Pydata is great for consistency, but R has more rapid, diffuse, iterative innovation... I wonder if we can help foster the latter.

@kay1793

kay1793 commented May 12, 2015

That's well put, but I'm forced to strongly disagree. While, as you say, R unarguably benefits from rapid, diffuse, iterative innovation, if you really examine the issue closely you must realize that pydata tools tend to intentionally embrace a more focused decentralized convergent behaviour-driven amalgamation approach, one that is inherently aspect-oriented and in line with the best-of-breed theories of cloud-first which reign supreme over this exciting new age of "stuff".

@datnamer
Author

Sorry, I don't understand what that means.

@shoyer
Member

shoyer commented May 12, 2015

Indeed, a separate package is probably a better place to start. The only unfortunate bit about doing this outside of pandas is that users can't do method chaining with third party packages.

@kay1793 Please don't troll.
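
(Note for later readers: pandas has since added DataFrame.pipe, in 0.16.2, which lets third-party functions participate in method chains. A sketch; the gather helper here is hypothetical:)

import pandas as pd

def gather(df, key, value, cols):
    # Hypothetical tidyr-style helper built on pd.melt.
    id_vars = [c for c in df.columns if c not in cols]
    return pd.melt(df, id_vars=id_vars, value_vars=list(cols),
                   var_name=key, value_name=value)

df = pd.DataFrame({"names": ["Wilbur", "Petunia", "Gregory"],
                   "a": [67, 80, 64], "b": [56, 90, 50]})

# .pipe lets the third-party helper join a method chain:
tidy = df.pipe(gather, "treatment", "heartrate", ["a", "b"])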

@datnamer
Author

But what good trolling it was :) No, I'm just kidding; it was rude, and it had me going because he almost had a point there.

This method chaining issue perfectly illustrates my somewhat densely articulated point: in R, new packages spring up all the time, iterate on other packages, and are connected using pipes. In Python (statsmodels, for example), users go through a long and arduous PR process to get code included in packages, which increases the maintenance burden and decreases the contributor's motivation and sense of ownership to maintain their own code.

While the code quality is more variable and the APIs are less consistent, the vibrant package landscape, bound together by CRAN and piping, makes up for it in a sense. Sure, we can write our own libraries in Python, but without a CRAN-like thing they end up languishing unmaintained in corners of GitHub. In the meantime, it's harder to push to the primary packages, and innovation there diminishes as maintenance takes up a higher proportion of reviewer time.

Forgive me for the digression, but I think this is tangentially related and critical for pydata. The end result is that the landscape in R seems to be advancing much faster (aided in no small part by Dr. Wickham, of course). Other variables include additional R MOOCs, but my point stands regardless.

@shoyer
Member

shoyer commented May 12, 2015

@datnamer I agree. I recently made a similar argument as part of a push by @mrocklin for adding macros to Python: https://mail.python.org/pipermail/python-ideas/2015-March/032822.html

There may also be less extreme ways to achieve the same result... if you have ideas about things we can do, I'm all ears.

@datnamer
Author

@shoyer: Hmmm... I really think the core pandas devs, Matt Rocklin, Travis, etc. need to get together for a serious brainstorming session on this if Python is to keep up in the near and distant future.

We need to encourage innovation, modularity and ease of use.

I think the low-hanging fruit is to improve the sense of connectivity, idea dispersion, and utilization of third-party packages in the pydata community. Some sort of pydata blogger network and a CRAN-like task-view database would be important.

Regarding the chaining issue... I'm not so up on the technical details... but this looks promising: https://github.com/dalejung/naginpy Is there any reason it can't be built out and/or work interactively? Could context managers be used in some way?

Is Matt Rocklin still pursuing this macro idea?

@shoyer
Member

shoyer commented May 12, 2015

This seems like a great topic for a BoF session at SciPy 2015... anyone else interested in co-organizing?

@mrocklin
Contributor

> Is Matt Rocklin still pursuing this macro idea?

In my free time, which is to say "Not at the moment". But I was pleasantly surprised by a warm response to the idea by a number of people at PyCon. I'll send out feelers to see if anyone is gung ho about pushing it forward. I think that the next step is to spec out a design and actually implement a proof of concept. CPython hackers welcome.

@datnamer
Author

Interesting. I wonder if @dalejung has thoughts on this?

@jreback
Contributor

jreback commented May 13, 2015

> While the code quality is more variable and the APIs are less consistent, the vibrant package landscape, bound together by CRAN and piping, makes up for it in a sense. Sure, we can write our own libraries in Python, but without a CRAN-like thing they end up languishing unmaintained in corners of GitHub. In the meantime, it's harder to push to the primary packages, and innovation there diminishes as maintenance takes up a higher proportion of reviewer time.

Ever heard of PyPI? You are WAY underestimating the importance of consistent APIs. R has succeeded to some extent IN SPITE of this major, major problem. In fact, I would argue that they are moving more and more toward curated types of packages (e.g. dplyr) that HAVE a consistent API scheme. If something is worthwhile and popular, then you would expect that it would eventually be included in a mainstream package.

pandas is exactly this model.

The point of a 'curated' model is that not only do you get consistency, you also get best practices and one way to do it. You don't have to search around and figure out 'how do I do X'. You get support and bug fixes. How many one-of-a-kind R packages have this? Sure, they may have some value to a small group of people; great. But truly, is this a package system that you would actually want to rely upon?

The biggest benefit, however, of a package like pandas is that you get distribution. Once a feature is accepted into pandas, it immediately becomes available to a pretty large community, is announced at release time, and is supported. I think you'd have a hard time saying the same about virtually any grass-roots package (in R or Python), unless it is more 'mainstream'.

my 2c. (And I do agree that statsmodels is not iterating fast enough for the community, but there also is not a lot of community support, compared to, say, scikit-learn or many R packages.)

@dalejung
Contributor

I've gone the way of adding the wackiness through integrated tooling. I'm not sure what the likelihood is of Python adding macro capabilities, and even then I imagine they'd be too sensible for my tastes. The features I want out of a lab environment are commonly bad practice for library development :/

@pwwang

pwwang commented Mar 17, 2022

For latecomers, here is the transformation with datar:

>>> from datar.all import c, f, tibble, pivot_longer
>>> df = tibble(
...   name = c("Wilbur", "Petunia", "Gregory"),
...   a = c(67, 80, 64),
...   b = c(56, 90, 50)
... )
>>> df
      name       a       b
  <object> <int64> <int64>
0   Wilbur      67      56
1  Petunia      80      90
2  Gregory      64      50
>>> df >> pivot_longer(~f.name, names_to="TREATMENT", values_to="HEART RATE")
      name TREATMENT  HEART RATE
  <object>  <object>     <int64>
0   Wilbur         a          67
1  Petunia         a          80
2  Gregory         a          64
3   Wilbur         b          56
4  Petunia         b          90
5  Gregory         b          50

@mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022
@jbrockmendel
Member

Discussed on today's dev call; the consensus was that if wide_to_long already handles this, we don't want another function. Closing.
