Feature request: unstack multiple :values columns #2215

akdor1154 · 2020-04-28T08:51:30Z

in Pandas:

df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})

	paddockId	color	count	weight
0	0	red	3	0.2
1	0	blue	4	0.3
2	1	red	3	0.2
3	1	blue	4	0.3
4	2	red	3	0.2
5	2	blue	4	0.2

Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

	count		weight
paddockId	red	blue	red	blue
0	4	3	0.3	0.2
1	4	3	0.3	0.2
2	4	3	0.2	0.2

The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

The text was updated successfully, but these errors were encountered:

akdor1154 · 2020-04-28T09:06:30Z

naïve implementation:

function DataFrames.unstack(df::DataFrame, rowKeys::AbstractVector{Symbol}, colKey::Symbol, valueCols::AbstractVector{Symbol}, renameCols)
    unstacked = [
        unstack(df, rowKeys, colKey, valueCol, renamecols=(c) -> renameCols(c, valueCol))
        for valueCol in valueCols
    ]
    return join(unstacked..., on=rowKeys)
end

bkamins · 2020-04-28T09:54:54Z

This is a duplicate of #2148, so I am closing this issue. Please comment there if you find something more to add. Thank you for reporting this.

akdor1154 · 2020-04-28T10:17:56Z

I don't think it's a duplicate - I read that issue as wanting to pass multiple columns as colKey. Here I want to pass multiple columns to value, which is different (unless I am missing some deep symmetry to this operation).

bkamins · 2020-04-28T11:29:43Z

OK - you are right.

lkapelevich · 2020-10-20T17:58:19Z

+1, I've wanted to do this many times

sl-solution · 2021-05-01T23:41:03Z

in Pandas:
df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})
paddockId color count weight
0 0 red 3 0.2
1 0 blue 4 0.3
2 1 red 3 0.2
3 1 blue 4 0.3
4 2 red 3 0.2
5 2 blue 4 0.2
Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

count weight
paddockId red blue red blue
0 4 3 0.3 0.2
1 4 3 0.3 0.2
2 4 3 0.2 0.2
The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

#2743 is trying to give a solution for some of similar problems.

jonas-schulze · 2022-02-28T10:56:12Z

This would be very useful. Has there been progress on this in the meantime?

bkamins · 2022-02-28T11:14:15Z

No. But let us start with defining what we want exactly on a working example. Is this what you want:

julia> df = DataFrame(row=[1,1,2,2], col=[1,2,1,2], value1=1:4, value2=11:14)
4×4 DataFrame
 Row │ row    col    value1  value2
     │ Int64  Int64  Int64   Int64
─────┼──────────────────────────────
   1 │     1      1       1      11
   2 │     1      2       2      12
   3 │     2      1       3      13
   4 │     2      2       4      14

julia> unstack(select(df, :row, :col, AsTable(r"v") => Tables.rowtable => :value), :row, :col, :value)
2×3 DataFrame
 Row │ row    1                          2
     │ Int64  NamedTup…?                 NamedTup…?
─────┼─────────────────────────────────────────────────────────────
   1 │     1  (value1 = 1, value2 = 11)  (value1 = 2, value2 = 12)
   2 │     2  (value1 = 3, value2 = 13)  (value1 = 4, value2 = 14)

jonas-schulze · 2022-02-28T14:47:09Z

Almost. Instead of taking the data from a "stacked column" ((:col, :value) from your inner select), I would like to use several "unstacked columns" (:count and :weight). Using the opening example, I was hoping for unstack(df, :color, Cols(:count, :weight)) to yield df4 as in


julia> df = DataFrame(
           paddockId = [0, 0, 1, 1, 2, 2],
           color = repeat([:red, :blue], 3),
           count = repeat([3, 4], 3),
           weight = [0.2, 0.3, 0.2, 0.3, 0.2, 0.2],
       )
6×4 DataFrame
 Row │ paddockId  color   count  weight  
     │ Int64      Symbol  Int64  Float64 
─────┼───────────────────────────────────
   1 │         0  red         3      0.2
   2 │         0  blue        4      0.3
   3 │         1  red         3      0.2
   4 │         1  blue        4      0.3
   5 │         2  red         3      0.2
   6 │         2  blue        4      0.2

julia> df2 = select(
           df,
           :paddockId,
           :color,
           AsTable([:count, :weight]) => Tables.rowtable => :value,
       )
6×3 DataFrame
 Row │ paddockId  color   value                     
     │ Int64      Symbol  NamedTup…                 
─────┼──────────────────────────────────────────────
   1 │         0  red     (count = 3, weight = 0.2)
   2 │         0  blue    (count = 4, weight = 0.3)
   3 │         1  red     (count = 3, weight = 0.2)
   4 │         1  blue    (count = 4, weight = 0.3)
   5 │         2  red     (count = 3, weight = 0.2)
   6 │         2  blue    (count = 4, weight = 0.2)

julia> df3 = unstack(df2, :color, :value)
3×3 DataFrame
 Row │ paddockId  red                        blue                      
     │ Int64      NamedTup…?                 NamedTup…?                
─────┼─────────────────────────────────────────────────────────────────
   1 │         0  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   2 │         1  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   3 │         2  (count = 3, weight = 0.2)  (count = 4, weight = 0.2)

julia> df4 = select(df3,
           Not([:red, :blue]),
           :red => [:red_count, :red_weight],
           :blue => [:blue_count, :blue_weight],
       )
3×5 DataFrame
 Row │ paddockId  red_count  red_weight  blue_count  blue_weight 
     │ Int64      Int64      Float64     Int64       Float64     
─────┼───────────────────────────────────────────────────────────
   1 │         0          3         0.2           4          0.3
   2 │         1          3         0.2           4          0.3
   3 │         2          3         0.2           4          0.2

modulo the order of the columns, maybe. I'll describe that in the next comment to allow people to give separate feedback.

jonas-schulze · 2022-02-28T14:54:19Z

Ideally, I would say:

the new columns [red|blue]_* should take the position of the now removed color column.
*_[count|weight] should be ordered as requested (i.e. as in AsTable([:count, :weight])) or as they are ordered in df (if the selector does not impose an order).
It would probably be best to configure the order red_*, blue_* via a keyword argument. Sensible options include "order of first appearance" (default), lexicographically (for String, Symbol, Real, Complex, and Tuple) and as in a given Vector containing (a superset of) the unique values of color.

All but the default for (3.) could be implemented later, though.

jonas-schulze · 2022-02-28T14:57:48Z

Having some sort of "multi-level headers" oder indices would be nice, but I don't know what would be a nice user interface for that. Simply concatenating the column names would suffice for me (I would write the data to CSV and format the table header directly in LaTeX, I guess). Maybe this could be added on top later.

bkamins · 2022-02-28T15:05:14Z

"multi-level headers" are not possible to be supported in any near future (as opposed to pandas). We need to generate column names.

I was not rushing with the implementation of this request because we need to make the following design decisions. In general users want:

multiple value columns
multiple column key columns
multiple row key columns (we already have them and this is not problematic)

This requires us to decide on:

how do we name columns when unstack is on multiple columns and multiple values
handling of renamecols if multiple key columns are requested
handling of allowmissing if multiple key columns are requested
handling of allowduplicates if multiple key columns are requested
handling of fill kwarg if multiple value columns are requested
handling of valuestransform kwarg if multiple value columns are requested

and all these decisions need to be made before making any changes to make sure we will not have to make breaking changes later.

Lincoln-Hannah · 2022-04-14T01:46:10Z

Example - unstack multiple value columns (Please remove if not helpful)

DF = DataFrame( A=[1,2,2,3], B=[10,10,10,20], C=[missing,1,1,missing], D=['W','X','Y','Z'], E=['w','x','y','z'] )

unstack( DF, :A, :B, [:C,:D,:E] ,allowduplicates=true)

should work like

hcat( 
                unstack( DF, :A, :B, :C  ,allowduplicates=true),
    select!( unstack( DF, :A, :B, :D  ,allowduplicates=true), Not(:A) ),       
    select!( unstack( DF, :A, :B, :E  ,allowduplicates=true), Not(:A) ),
    makeunique=true
)

producing

Row │ A      10       20       10_1     20_1     10_2     20_2
─────┼──────────────────────────────
   1 │     1  missing  missing  W        missing  w        missing
   2 │     2        1  missing  Y        missing  y        missing
   3 │     3  missing  missing  missing  Z        missing  z

Though as mentioned by others the column suffix should be given by the value field giving column names

Row│ A 10_C 20_C 10_D 20_D 10_E 20_E

I think the duplicate handling for the single value field implementation should work equally well for multiple value fields

bkamins · 2022-12-05T11:52:18Z

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

bkamins closed this as completed Apr 28, 2020

bkamins reopened this Apr 28, 2020

bkamins added decision feature labels May 5, 2020

bkamins added this to the 1.x milestone May 5, 2020

bkamins mentioned this issue Sep 2, 2020

Transform long data set to wide with values in multiple columns #2405

Closed

bkamins added the reshaping label Apr 25, 2021

bkamins mentioned this issue Apr 25, 2021

requesting new feature which covers stack, unstack and permutedims in a simpler way (at least conceptually) #2732

Closed

sl-solution mentioned this issue May 1, 2021

Transposing DataFrame #2743

Closed

bkamins modified the milestones: 1.x, 1.4 Feb 28, 2022

bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022

bkamins mentioned this issue Dec 5, 2022

Redesign of stack and unstack #3237

Open

bkamins closed this as completed Dec 5, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: unstack multiple :values columns #2215

Feature request: unstack multiple :values columns #2215

akdor1154 commented Apr 28, 2020

akdor1154 commented Apr 28, 2020

bkamins commented Apr 28, 2020

akdor1154 commented Apr 28, 2020

bkamins commented Apr 28, 2020

lkapelevich commented Oct 20, 2020

sl-solution commented May 1, 2021

jonas-schulze commented Feb 28, 2022

bkamins commented Feb 28, 2022

jonas-schulze commented Feb 28, 2022 •

edited

Loading

jonas-schulze commented Feb 28, 2022

jonas-schulze commented Feb 28, 2022

bkamins commented Feb 28, 2022 •

edited

Loading

Lincoln-Hannah commented Apr 14, 2022 •

edited

Loading

bkamins commented Dec 5, 2022

Feature request: unstack multiple :values columns #2215

Feature request: unstack multiple :values columns #2215

Comments

akdor1154 commented Apr 28, 2020

akdor1154 commented Apr 28, 2020

bkamins commented Apr 28, 2020

akdor1154 commented Apr 28, 2020

bkamins commented Apr 28, 2020

lkapelevich commented Oct 20, 2020

sl-solution commented May 1, 2021

jonas-schulze commented Feb 28, 2022

bkamins commented Feb 28, 2022

jonas-schulze commented Feb 28, 2022 • edited Loading

jonas-schulze commented Feb 28, 2022

jonas-schulze commented Feb 28, 2022

bkamins commented Feb 28, 2022 • edited Loading

Lincoln-Hannah commented Apr 14, 2022 • edited Loading

bkamins commented Dec 5, 2022

jonas-schulze commented Feb 28, 2022 •

edited

Loading

bkamins commented Feb 28, 2022 •

edited

Loading

Lincoln-Hannah commented Apr 14, 2022 •

edited

Loading