Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

Closed
sl-solution opened this issue Apr 23, 2021 · 5 comments

Comments

@sl-solution
Copy link

sl-solution commented Apr 23, 2021

I am not sure if this has been discussed or not, but I thought it may be an interesting feature to add.
The stack() and unstack() functions are very good tools for reshaping a DataFrame, however, whenever I am going to use these two functions I scratch my head about how to use them. The permutedims() is ok but not very flexible. In the following I am purposing a new feature for transposing a DataFrame, call it T function, that may simplify the reshaping of a DataFrame.

Let set the first argument of this function to be a DataFrame (df), and the second argument be a list of variables ([:v1,...,:vp]) that I am interested to transpose. In practice these variables should be somehow homogenous (i.e. all be numeric or all be strings). It means that df[!, [:v1,...,:vp]] is a rectangle array of homogenous data (in some sense).

In the simplest form T(df, [:v1,...,:vp]) return a new DataFrame which the rows of df[!, [:v1,...,:vp]] become its columns and the columns of df[!, [:v1,...,:vp]] become its rows. Since df doesn't have row names the column names for the new DataFrame can be generated, e.g. [:c1,..., :cq] (or whatever). However, the new DataFrame has an extra column which includes the name of the transposed variables. E.g. if df is

v1 v2 v3
12 1 14
20 3 6

T(df,[:v1,:v2,:v3]) is

variables c1 c2
v1 12 20
v2 1 3
v3 14 6

Now suppose that my data are grouped by some variables, e.g. in the following data my data are grouped by variable :Country.

Country Sex Pop2000 Pop2010 Pop2020
c1 male 100 120 140
c1 female 110 125 130
c2 male 200 210 215
c2 female 200 220 230

For this example T accept a third argument which is the grouping variable(s), and it does the same thing as the simplest case but within each group and has an extra column which keeps track of the group information, i.e. the output of T(df,[:Pop2000,:Pop2010,:Pop2020], [:Country]) is

Group variables c1 c2
c1 Pop2000 100 110
c1 Pop2010 120 125
c1 Pop2020 140 130
c2 Pop2000 200 200
c2 Pop2010 210 220
c2 Pop2020 215 230

However, for this example :c1 and :c2 are actually meaningful, they are the population for male and female. So T can accept an optional argument which is the name to use instead of c1, ..... I.e. the output of T(df,[:Pop2000,:Pop2010,:Pop2020], [:Country],:name) is

Group variables male female
c1 Pop2000 100 110
c1 Pop2010 120 125
c1 Pop2020 140 130
c2 Pop2000 200 200
c2 Pop2010 210 220
c2 Pop2020 215 230

When a group has less rows just the output table fills it with missings.

Actually, T can handle most of what stack, unstack and permutedims do.

Stack

Let see how it works when we want to stack. I borrow the following example from help

a b c d e
1 1 1 1 a
1 1 1 2 b
2 1 1 3 c
2 2 1 4 d
3 2 1 5 e
3 2 1 6 f

stack(df,[:c,:d]) is similar to T(df, [:c,:d],[:a,:b]), i.e. I want to transpose :c and :d in each group constructed by [:a, :b].

unstack

Another example from help, let df (I reorder :id an :a for the sake of demonstration)

id a variable value
1 1 b 1.0
1 1 c 1.0
1 1 d 1.0
2 1 b 1.0
2 1 c 1.0
2 1 d 1.0
... ... ... ...

unstack(long, :id, :variable, :value) is similar to T(df,[:value],[:id,:a], :variable) i.e. transpose :value within each group constructed by [:id,:a].

Another example

Hospital Patient visits measure
H1 id1 1 12
H1 id1 2 13
H1 id1 3 14
H1 id2 1 13
H1 id2 2 11
H2 id3 1 14
H2 id3 2 15

If I want to have information of each patient in one row I can use T(df, [:measure],[:Hospital,:Patient], :visits), i.e. transpose every measures of a given patient and put missing when a group has less rows than the others.

@bkamins bkamins added this to the 1.x milestone Apr 23, 2021
@bkamins
Copy link
Member

bkamins commented Apr 24, 2021

If I understand your example correctly you want this (I am using your data):

combine(groupby(df, :Country), sdf -> permutedims(select(sdf, Not(:Country)), :Sex, :Name))

Is this correct? (except that this pattern will not automatically fill missing levels with missing)

@sl-solution
Copy link
Author

If I understand your example correctly you want this (I am using your data):

combine(groupby(df, :Country), sdf -> permutedims(select(sdf, Not(:Country)), :Sex, :Name))

Is this correct? (except that this pattern will not automatically fill missing levels with missing)

That is correct, simple transposing within each group. Just as you mentioned missing should be taking care of and also the permutedims() function needs its second argument to beSymbol/String which is restrictive. The good thing is that, the new functionality can replace stack, unstack and permutedims.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2021

permutedims function needs its second argument to beSymbol/String which is restrictive.

This is intentional. In particular in your examples above I was not clear about the rules how column names should be generated in case an explicit information about column name is missing.

@sl-solution
Copy link
Author

... In particular in your examples above I was not clear about the rules how column names should be generated in case an explicit information about column name is missing.

I think some generic names can be used, for example c1, c2, ....

@bkamins
Copy link
Member

bkamins commented Apr 25, 2021

Anyway - I think the best way to move forward in this case is:

  1. if you would be willing you can implement such a method in a separate package (there are several such packages already in the ecosystem) or just in a PR
  2. then we can get exposed to the community (therefore a separate package is preferred as it would be easier to use) - and wait for the feedback
  3. if it gets wider acceptance then we can discuss including it in the main DataFrames.jl package

When developing the functionality ideally you can think how your proposal plays with the following earlier related requests: #1181, #2698, #2422, #2215, #2205, #2148, #1839

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants