requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

sl-solution · 2021-04-23T00:14:50Z

I am not sure if this has been discussed or not, but I thought it may be an interesting feature to add.
The stack() and unstack() functions are very good tools for reshaping a DataFrame, however, whenever I am going to use these two functions I scratch my head about how to use them. The permutedims() is ok but not very flexible. In the following I am purposing a new feature for transposing a DataFrame, call it T function, that may simplify the reshaping of a DataFrame.

Let set the first argument of this function to be a DataFrame (df), and the second argument be a list of variables ([:v1,...,:vp]) that I am interested to transpose. In practice these variables should be somehow homogenous (i.e. all be numeric or all be strings). It means that df[!, [:v1,...,:vp]] is a rectangle array of homogenous data (in some sense).

In the simplest form T(df, [:v1,...,:vp]) return a new DataFrame which the rows of df[!, [:v1,...,:vp]] become its columns and the columns of df[!, [:v1,...,:vp]] become its rows. Since df doesn't have row names the column names for the new DataFrame can be generated, e.g. [:c1,..., :cq] (or whatever). However, the new DataFrame has an extra column which includes the name of the transposed variables. E.g. if df is

v1	v2	v3
12	1	14
20	3	6

T(df,[:v1,:v2,:v3]) is

variables	c1	c2
v1	12	20
v2	1	3
v3	14	6

Now suppose that my data are grouped by some variables, e.g. in the following data my data are grouped by variable :Country.

Country	Sex	Pop2000	Pop2010	Pop2020
c1	male	100	120	140
c1	female	110	125	130
c2	male	200	210	215
c2	female	200	220	230

For this example T accept a third argument which is the grouping variable(s), and it does the same thing as the simplest case but within each group and has an extra column which keeps track of the group information, i.e. the output of T(df,[:Pop2000,:Pop2010,:Pop2020], [:Country]) is

Group	variables	c1	c2
c1	Pop2000	100	110
c1	Pop2010	120	125
c1	Pop2020	140	130
c2	Pop2000	200	200
c2	Pop2010	210	220
c2	Pop2020	215	230

However, for this example :c1 and :c2 are actually meaningful, they are the population for male and female. So T can accept an optional argument which is the name to use instead of c1, ..... I.e. the output of T(df,[:Pop2000,:Pop2010,:Pop2020], [:Country],:name) is

Group	variables	male	female
c1	Pop2000	100	110
c1	Pop2010	120	125
c1	Pop2020	140	130
c2	Pop2000	200	200
c2	Pop2010	210	220
c2	Pop2020	215	230

When a group has less rows just the output table fills it with missings.

Actually, T can handle most of what stack, unstack and permutedims do.

Stack

Let see how it works when we want to stack. I borrow the following example from help

a	b	c	d	e
1	1	1	1	a
1	1	1	2	b
2	1	1	3	c
2	2	1	4	d
3	2	1	5	e
3	2	1	6	f

stack(df,[:c,:d]) is similar to T(df, [:c,:d],[:a,:b]), i.e. I want to transpose :c and :d in each group constructed by [:a, :b].

unstack

Another example from help, let df (I reorder :id an :a for the sake of demonstration)

id	a	variable	value
1	1	b	1.0
1	1	c	1.0
1	1	d	1.0
2	1	b	1.0
2	1	c	1.0
2	1	d	1.0
...	...	...	...

unstack(long, :id, :variable, :value) is similar to T(df,[:value],[:id,:a], :variable) i.e. transpose :value within each group constructed by [:id,:a].

Another example

Hospital	Patient	visits	measure
H1	id1	1	12
H1	id1	2	13
H1	id1	3	14
H1	id2	1	13
H1	id2	2	11
H2	id3	1	14
H2	id3	2	15

If I want to have information of each patient in one row I can use T(df, [:measure],[:Hospital,:Patient], :visits), i.e. transpose every measures of a given patient and put missing when a group has less rows than the others.

The text was updated successfully, but these errors were encountered:

bkamins · 2021-04-24T23:26:05Z

If I understand your example correctly you want this (I am using your data):

combine(groupby(df, :Country), sdf -> permutedims(select(sdf, Not(:Country)), :Sex, :Name))

Is this correct? (except that this pattern will not automatically fill missing levels with missing)

sl-solution · 2021-04-25T00:35:10Z

If I understand your example correctly you want this (I am using your data):
combine(groupby(df, :Country), sdf -> permutedims(select(sdf, Not(:Country)), :Sex, :Name))
Is this correct? (except that this pattern will not automatically fill missing levels with missing)

That is correct, simple transposing within each group. Just as you mentioned missing should be taking care of and also the permutedims() function needs its second argument to beSymbol/String which is restrictive. The good thing is that, the new functionality can replace stack, unstack and permutedims.

bkamins · 2021-04-25T07:14:01Z

permutedims function needs its second argument to beSymbol/String which is restrictive.

This is intentional. In particular in your examples above I was not clear about the rules how column names should be generated in case an explicit information about column name is missing.

sl-solution · 2021-04-25T07:29:29Z

... In particular in your examples above I was not clear about the rules how column names should be generated in case an explicit information about column name is missing.

I think some generic names can be used, for example c1, c2, ....

bkamins · 2021-04-25T08:25:58Z

Anyway - I think the best way to move forward in this case is:

if you would be willing you can implement such a method in a separate package (there are several such packages already in the ecosystem) or just in a PR
then we can get exposed to the community (therefore a separate package is preferred as it would be easier to use) - and wait for the feedback
if it gets wider acceptance then we can discuss including it in the main DataFrames.jl package

When developing the functionality ideally you can think how your proposal plays with the following earlier related requests: #1181, #2698, #2422, #2215, #2205, #2148, #1839

bkamins added the feature label Apr 23, 2021

bkamins added this to the 1.x milestone Apr 23, 2021

bkamins added the reshaping label Apr 25, 2021

sl-solution closed this as completed May 1, 2021

sl-solution mentioned this issue May 1, 2021

Transposing DataFrame #2743

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

sl-solution commented Apr 23, 2021 •

edited

Loading

bkamins commented Apr 24, 2021

sl-solution commented Apr 25, 2021

bkamins commented Apr 25, 2021

sl-solution commented Apr 25, 2021

bkamins commented Apr 25, 2021

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

Comments

sl-solution commented Apr 23, 2021 • edited Loading

bkamins commented Apr 24, 2021

sl-solution commented Apr 25, 2021

bkamins commented Apr 25, 2021

sl-solution commented Apr 25, 2021

bkamins commented Apr 25, 2021

sl-solution commented Apr 23, 2021 •

edited

Loading