Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: unstack multiple :values columns #2215

Closed
akdor1154 opened this issue Apr 28, 2020 · 14 comments
Closed

Feature request: unstack multiple :values columns #2215

akdor1154 opened this issue Apr 28, 2020 · 14 comments

Comments

@akdor1154
Copy link

in Pandas:

df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})
  paddockId color count weight
0 0 red 3 0.2
1 0 blue 4 0.3
2 1 red 3 0.2
3 1 blue 4 0.3
4 2 red 3 0.2
5 2 blue 4 0.2

Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

count weight
paddockId red blue red blue
0 4 3 0.3 0.2
1 4 3 0.3 0.2
2 4 3 0.2 0.2

The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

@akdor1154
Copy link
Author

naïve implementation:

function DataFrames.unstack(df::DataFrame, rowKeys::AbstractVector{Symbol}, colKey::Symbol, valueCols::AbstractVector{Symbol}, renameCols)
    unstacked = [
        unstack(df, rowKeys, colKey, valueCol, renamecols=(c) -> renameCols(c, valueCol))
        for valueCol in valueCols
    ]
    return join(unstacked..., on=rowKeys)
end

@bkamins
Copy link
Member

bkamins commented Apr 28, 2020

This is a duplicate of #2148, so I am closing this issue. Please comment there if you find something more to add. Thank you for reporting this.

@bkamins bkamins closed this as completed Apr 28, 2020
@akdor1154
Copy link
Author

I don't think it's a duplicate - I read that issue as wanting to pass multiple columns as colKey. Here I want to pass multiple columns to value, which is different (unless I am missing some deep symmetry to this operation).

@bkamins bkamins reopened this Apr 28, 2020
@bkamins
Copy link
Member

bkamins commented Apr 28, 2020

OK - you are right.

@lkapelevich
Copy link

+1, I've wanted to do this many times

@sl-solution
Copy link

in Pandas:

df = pd.DataFrame({
    'paddockId': [0, 0, 1, 1, 2, 2],
    'color': ['red', 'blue', 'red', 'blue', 'red', 'blue'],
    'count': [3, 4, 3, 4, 3, 4],
    'weight': [0.2, 0.3, 0.2, 0.3, 0.2, 0.2]
})

  paddockId color count weight
0 0 red 3 0.2
1 0 blue 4 0.3
2 1 red 3 0.2
3 1 blue 4 0.3
4 2 red 3 0.2
5 2 blue 4 0.2
Given this, you can unstack on multiple value columns by just passing e.g. df.set_index(['paddockId', 'color']).unstack('color')

count weight
paddockId red blue red blue
0 4 3 0.3 0.2
1 4 3 0.3 0.2
2 4 3 0.2 0.2
The equivalent operation seems pretty hard in DataFrames.jl at the moment (unstack n times then join the results?). API-wise, it would be good if the values argument could take a vector of values. Result formatting could be difficult, this particular operation is one that makes MultiIndexes shine... I guess renameCols could become a function of (colKey value, valueColumnName) (so in this instance, (color, valueCol) -> ... e.g. ('red', :count') -> ....?

#2743 is trying to give a solution for some of similar problems.

@jonas-schulze
Copy link
Contributor

This would be very useful. Has there been progress on this in the meantime?

@bkamins
Copy link
Member

bkamins commented Feb 28, 2022

No. But let us start with defining what we want exactly on a working example. Is this what you want:

julia> df = DataFrame(row=[1,1,2,2], col=[1,2,1,2], value1=1:4, value2=11:14)
4×4 DataFrame
 Row │ row    col    value1  value2
     │ Int64  Int64  Int64   Int64
─────┼──────────────────────────────
   1 │     1      1       1      11
   2 │     1      2       2      12
   3 │     2      1       3      13
   4 │     2      2       4      14

julia> unstack(select(df, :row, :col, AsTable(r"v") => Tables.rowtable => :value), :row, :col, :value)
2×3 DataFrame
 Row │ row    1                          2
     │ Int64  NamedTup…?                 NamedTup…?
─────┼─────────────────────────────────────────────────────────────
   1 │     1  (value1 = 1, value2 = 11)  (value1 = 2, value2 = 12)
   2 │     2  (value1 = 3, value2 = 13)  (value1 = 4, value2 = 14)

@bkamins bkamins modified the milestones: 1.x, 1.4 Feb 28, 2022
@jonas-schulze
Copy link
Contributor

jonas-schulze commented Feb 28, 2022

Almost. Instead of taking the data from a "stacked column" ((:col, :value) from your inner select), I would like to use several "unstacked columns" (:count and :weight). Using the opening example, I was hoping for unstack(df, :color, Cols(:count, :weight)) to yield df4 as in


julia> df = DataFrame(
           paddockId = [0, 0, 1, 1, 2, 2],
           color = repeat([:red, :blue], 3),
           count = repeat([3, 4], 3),
           weight = [0.2, 0.3, 0.2, 0.3, 0.2, 0.2],
       )
6×4 DataFrame
 Row │ paddockId  color   count  weight  
     │ Int64      Symbol  Int64  Float64 
─────┼───────────────────────────────────
   1 │         0  red         3      0.2
   2 │         0  blue        4      0.3
   3 │         1  red         3      0.2
   4 │         1  blue        4      0.3
   5 │         2  red         3      0.2
   6 │         2  blue        4      0.2

julia> df2 = select(
           df,
           :paddockId,
           :color,
           AsTable([:count, :weight]) => Tables.rowtable => :value,
       )
6×3 DataFrame
 Row │ paddockId  color   value                     
     │ Int64      Symbol  NamedTup…                 
─────┼──────────────────────────────────────────────
   1 │         0  red     (count = 3, weight = 0.2)
   2 │         0  blue    (count = 4, weight = 0.3)
   3 │         1  red     (count = 3, weight = 0.2)
   4 │         1  blue    (count = 4, weight = 0.3)
   5 │         2  red     (count = 3, weight = 0.2)
   6 │         2  blue    (count = 4, weight = 0.2)

julia> df3 = unstack(df2, :color, :value)
3×3 DataFrame
 Row │ paddockId  red                        blue                      
     │ Int64      NamedTup…?                 NamedTup…?                
─────┼─────────────────────────────────────────────────────────────────
   1 │         0  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   2 │         1  (count = 3, weight = 0.2)  (count = 4, weight = 0.3)
   3 │         2  (count = 3, weight = 0.2)  (count = 4, weight = 0.2)

julia> df4 = select(df3,
           Not([:red, :blue]),
           :red => [:red_count, :red_weight],
           :blue => [:blue_count, :blue_weight],
       )
3×5 DataFrame
 Row │ paddockId  red_count  red_weight  blue_count  blue_weight 
     │ Int64      Int64      Float64     Int64       Float64     
─────┼───────────────────────────────────────────────────────────
   1 │         0          3         0.2           4          0.3
   2 │         1          3         0.2           4          0.3
   3 │         2          3         0.2           4          0.2

modulo the order of the columns, maybe. I'll describe that in the next comment to allow people to give separate feedback.

@jonas-schulze
Copy link
Contributor

Ideally, I would say:

  1. the new columns [red|blue]_* should take the position of the now removed color column.
  2. *_[count|weight] should be ordered as requested (i.e. as in AsTable([:count, :weight])) or as they are ordered in df (if the selector does not impose an order).
  3. It would probably be best to configure the order red_*, blue_* via a keyword argument. Sensible options include "order of first appearance" (default), lexicographically (for String, Symbol, Real, Complex, and Tuple) and as in a given Vector containing (a superset of) the unique values of color.

All but the default for (3.) could be implemented later, though.

@jonas-schulze
Copy link
Contributor

Having some sort of "multi-level headers" oder indices would be nice, but I don't know what would be a nice user interface for that. Simply concatenating the column names would suffice for me (I would write the data to CSV and format the table header directly in LaTeX, I guess). Maybe this could be added on top later.

@bkamins
Copy link
Member

bkamins commented Feb 28, 2022

"multi-level headers" are not possible to be supported in any near future (as opposed to pandas). We need to generate column names.

I was not rushing with the implementation of this request because we need to make the following design decisions. In general users want:

  • multiple value columns
  • multiple column key columns
  • multiple row key columns (we already have them and this is not problematic)

This requires us to decide on:

  • how do we name columns when unstack is on multiple columns and multiple values
  • handling of renamecols if multiple key columns are requested
  • handling of allowmissing if multiple key columns are requested
  • handling of allowduplicates if multiple key columns are requested
  • handling of fill kwarg if multiple value columns are requested
  • handling of valuestransform kwarg if multiple value columns are requested

and all these decisions need to be made before making any changes to make sure we will not have to make breaking changes later.

@Lincoln-Hannah
Copy link

Lincoln-Hannah commented Apr 14, 2022

Example - unstack multiple value columns (Please remove if not helpful)

DF = DataFrame( A=[1,2,2,3], B=[10,10,10,20], C=[missing,1,1,missing], D=['W','X','Y','Z'], E=['w','x','y','z'] )

unstack( DF, :A, :B, [:C,:D,:E] ,allowduplicates=true)

should work like

hcat( 
                unstack( DF, :A, :B, :C  ,allowduplicates=true),
    select!( unstack( DF, :A, :B, :D  ,allowduplicates=true), Not(:A) ),       
    select!( unstack( DF, :A, :B, :E  ,allowduplicates=true), Not(:A) ),
    makeunique=true
)

producing

Row │ A      10       20       10_1     20_1     10_2     20_2
─────┼──────────────────────────────
   1 │     1  missing  missing  W        missing  w        missing
   2 │     2        1  missing  Y        missing  y        missing
   3 │     3  missing  missing  missing  Z        missing  z

Though as mentioned by others the column suffix should be given by the value field giving column names

Row│ A 10_C 20_C 10_D 20_D 10_E 20_E

I think the duplicate handling for the single value field implementation should work equally well for multiple value fields

@bkamins bkamins modified the milestones: 1.4, 1.5 Jun 7, 2022
@bkamins
Copy link
Member

bkamins commented Dec 5, 2022

I am closing this in favor of #3237 (to have a single place to discuss all related issues)

@bkamins bkamins closed this as completed Dec 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants