add AsTable wrapper, disallow NamedTuple in ByRow #2183

bkamins · 2020-04-11T13:20:43Z

Fixes #2121

I still need to write tests and update manual and documentation

bkamins · 2020-04-11T19:02:45Z

@nalimilan - the PR should be good to review. Thank you!

bkamins · 2020-04-12T20:21:11Z

Traditionally only coveralls fail

docs/src/man/getting_started.md

nalimilan

Thanks. My main remark is about showing more useful examples. Otherwise people may not understand the point of this selector.

I'd also like to be sure we won't want to change the behavior when the function returns a named tuple.

docs/src/man/getting_started.md

docs/src/man/split_apply_combine.md

src/groupeddataframe/splitapplycombine.jl

src/other/utils.jl

bkamins · 2020-04-13T17:13:13Z

You mean that they are not destructured into columns? By the way, maybe we should reserve this in case we want to allow creating multiple columns by returning a named tuple in the future?

Let me start with the example on v0.20:

julia> using DataFrames

julia> df = DataFrame(g=1:2);

julia> by(df, :g, :g => x -> [(a=x,),])
2×2 DataFrame
│ Row │ g     │ g_function │
│     │ Int64 │ NamedTup…  │
├─────┼───────┼────────────┤
│ 1   │ 1     │ (a = [1],) │
│ 2   │ 2     │ (a = [2],) │

in other words now we are reproducing the behavior we already have.

Also on master select will do the same without ByRow:

julia> df = DataFrame(g=1:2)
2×1 DataFrame
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df = DataFrame(g=1:2);

julia> select(df, :g => x -> [(a=x,)])
1×1 DataFrame
│ Row │ g_function    │
│     │ NamedTuple…   │
├─────┼───────────────┤
│ 1   │ (a = [1, 2],) │

In summary ByRow returns a vector so it is not unwrapped as in other cases where we do not unwrap vectors.

If we disallowed this then there would be no possibility to return a produce a vector of NamedTuples from ByRow.

Note that we disallow returning NamedTuple (and other types but let us focus on NamedTuples) if it is not a ByRow operation. This is reserved for the case you describe in the future. However, note - that in this case this does not disallow producing a DataFrame holding NamedTuples as you can wrap them in a vector (as in the example above) or in Ref and you have no problem.

In summary: we currently treat a NamedTuple of objects as a table (so we disallow it) but treat a vector of NamedTuples just as a vector - so we allow it. Therefore my point is that we are consistent now with the design.
Also I do not see too much use of returning multiple columns from a single row of data (it is easy enough to do with multiple ByRow without performance loss as we are operating on single values anyway)

But if you feel that NamedTuple destruction from ByRow would be a potentially useful case then we can add this as an exception.

bkamins · 2020-04-13T17:15:42Z

My main remark is about showing more useful examples.

I will try to think of something, but I really believe that the default API is more useful most of the time (except for row aggregations - I will try to come up with some better cases).

@pdeffebach - can you please suggest something as I know that you had many good ideas here.

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins · 2020-04-13T19:15:05Z

Regarding _filter_helper family:

I fixed the signatures (but with cols... it can be either multiple vectors or a single DataFrameRows object so I decided to leave it undefined)
I could not find a good place to move them and keep consistency in the code layout and readability, so I left them below the functions that use them

pdeffebach · 2020-04-13T21:09:14Z

@pdeffebach - can you please suggest something as I know that you had many good ideas here.

Sum and mean by row are the two use-cases, I think. A cool thing about AsTable would be working with the names super easily.

function findwinner(nt)
    m = maximum(nt)
    findall(x -> x == m, nt)
end

This returns a vector of the symbols equal to the max.

Also, I agree with Milan that unwrapping a named tuple would be convenient. A whole workflow around named-tuple to named-tuple functions seems very useful. I think I understand the logic of your objections though and they are valid points.

bkamins · 2020-04-13T21:49:45Z

Thank you for the suggestion. I will think of something along these lines and add it.

unwrapping a named tuple would be convenient

It will be allowed in the future. I just did not plan for it in ByRow (this can be changed as an exception if we want it). Do you have some example where AsTable(cols) => ByRow(fun) would usefully return a NamedTuple?

pdeffebach · 2020-04-13T22:09:11Z

Do you have some example where AsTable(cols) => ByRow(fun) would usefully return a NamedTuple?

With the kinds of surveys I worked with the past two years, maybe something like

function add_mean(nt)
    m = mean(skipmissing(nt))
    countmissing = sum(ismissing, nt)
    return (rowmean = m, countmissing = countmissing)
end

I can't say with complete confidence how much I would use this (I am not analyzing survey data on a day-to-day basis any more). But it seems like a natural workflow that is consistent with using AsTable without ByRow.

bkamins · 2020-04-14T06:42:48Z

Yeah - so it is essentially, when you want to conditionally on row filter out some columns, and based on these filtered columns produce multiple outputs. This is a valid use case. The question is how often it is needed (as a normal way to do such things would be to transform from wide to long format and then do aggregation).

As a side note a Vector of NamedTuples is not that bad data structure - it adds a layer of grouping variables in a DataFrame and I tend to like it (this is something e.g. BigQuery provides).

bkamins · 2020-04-14T08:17:09Z

This returns a vector of the symbols equal to the max.

Added such an example to getting started section.

nalimilan · 2020-04-14T09:59:55Z

Thanks for the explanation. I'm not sure the consistency is so visible for users, as they may also expect that returning a NamedTuple always creates multiple columns -- ignoring the fact that the result of ByRow is put into a vector.

I wonder what kind of solution we could use to allow both behaviors in the future if needed. Wrapping named tuples in Ref? That would mean that if you really want to return a Ref with ByRow, you would have to wrap it in another Ref. Maybe that's not too bad, and that would be consistent with the pseudo-broadcasting we apply when ByRow isn't used.

Anyway for now I'd err on the safe side and throw an error (as usual), returning named tuples from ByRow isn't the most urgent need.

docs/src/man/getting_started.md

docs/src/man/split_apply_combine.md

src/abstractdataframe/selection.jl

bkamins · 2020-04-14T10:37:23Z

Anyway for now I'd err on the safe side and throw an error (as usual

OK. Later you will be able to pass (a = (x=1,y=2),) to get your NamedTuple unwrapped into column :a.

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins · 2020-04-14T15:39:28Z

CI passes here (nightly fails due to #2184).

docs/src/man/getting_started.md

src/abstractdataframe/selection.jl

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins · 2020-04-14T20:11:15Z

Thank you!

stillyslalom · 2020-05-13T16:44:41Z

@bkamins
Yeah - so it is essentially, when you want to conditionally on row filter out some columns, and based on these filtered columns produce multiple outputs. This is a valid use case. The question is how often it is needed (as a normal way to do such things would be to transform from wide to long format and then do aggregation).

I have a use case where I'm doing something like a coordinate transformation (it can't be treated as an aggregation). In v0.20, I was able to do something like the following (boiled down considerably):

using Statistics, DataFrames, Polynomials

function forcecalc(df, weight_offsets)
    α = mean(df.α)  # angle of attack
    AF = mean(df.AF) - weight_offsets[1](α)    # axial force
    NF = mean(df.NF) - weight_offsets[2](α)    # normal force
    (L = -NF*cos(α) - AF*sin(α),    # lift
     D = -NF*sin(α) + AF*cos(α))    # drag
end

α = vcat([fill(v, 30) for v in -2:2:16]...)
AF =  10sind.(α) .+ rand(length(α))
NF = 100cosd.(α) .+ rand(length(α))
df = DataFrame(α = α, AF = AF, NF = NF)

wt = [Polynomial(rand(3)), Polynomial(rand(3))] # weight offsets
forces = by(df, :α, [:α, :AF, :NF] => df -> forcecalc(df, wt))

I'm subtracting a weight offset polynomial from raw force measurements (:AF and :NF), then rotating the force vector. I haven't found an elegant way to do this in v0.21, but my desired syntax is something like the following:

function forcecalc(nt::NamedTuple, weight_offsets)
    α = nt.α  # angle of attack
    AF = nt.AF - weight_offsets[1](α)    # axial force
    NF = nt.NF - weight_offsets[2](α)    # normal force
    (L = -NF*cos(α) - AF*sin(α),    # lift
     D = -NF*sin(α) + AF*cos(α))    # drag
end

varnames = [:AF, :NF]
gdf = groupby(df, :α)
measurements = combine(gdf, varnames .=> mean .=> varnames)
forces = transform(measurements, [:α, :AF, :NF] => ByRow(v -> forcecalc(v, wt)))

Note that in the former case, I'm both averaging measurements and applying the transformation in the 'apply' step of split-apply-combine, whereas in the latter case, I'm applying the transformation after s-a-c.

bkamins · 2020-05-13T17:23:50Z

I have not analyzed your code in detail, but given I know the design assumptions :) the code that should work without changing anything in your 0.20 code except the last line is:

forces = combine(AsTable([:α, :AF, :NF]) => nt -> forcecalc(nt, wt), groupby(nt, :α))

Can you please confirm that it works as expected?

(still maybe it is possible to write it in a cleaner way - I am just doing code transformation that should be equivalent between 0.20 and 0.21)

stillyslalom · 2020-05-13T17:47:00Z

Save for a typo (should be groupby(df, :α) instead of groupby(nt, :α)), it works! I think I was confused by the documentation for combine, which seems to imply that a GroupedDataFrame can be passed as either the first or second argument.

combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true)
combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true)

Arguments passed as args... can be:

Column transformation operations using the Pair notation that is described below and vectors of such pairs.

That said, I would prefer to perform reduction and transformation in two separate steps (as in my "v0.22" example), since the broader workflow involves applying the same reduction operation, followed by a variety of transformations.

bkamins · 2020-05-13T20:39:38Z

which seems to imply that a GroupedDataFrame can be passed as either the first or second argument.

This is indeed allowed and has a different behavior:

GroupedDataFrame passed as a first argument requires all following arguments to be Pairs and return only one columns
GroupedDataFrame passed as a second argument allows the first argument to be a Pair or a function (to allow do block syntax) and in this case you are allow to return one column or a table (i.e. multiple columns in one shot)

stillyslalom · 2020-05-13T22:16:27Z

Does the second behavior suffer the same performance impact for Pairs as it does for functions?

bkamins · 2020-05-13T22:32:38Z

No for Pair it is type-stable, and if you pass a pair like :a => sum that uses aggregation functions it will use the fastest code path available.

add AsTable wrapper

b1856e5

bkamins added non-breaking The proposed change is not breaking grouping labels Apr 11, 2020

bkamins added this to the 1.0 milestone Apr 11, 2020

bkamins added 2 commits April 11, 2020 18:09

add tests and documentation

4964f6a

add more tests

6dc91a3

bkamins added 2 commits April 12, 2020 19:22

add AsTable support to filter and filter!

2de5679

one more test

19a0899

pdeffebach reviewed Apr 12, 2020

View reviewed changes

docs/src/man/getting_started.md Outdated Show resolved Hide resolved

bkamins added 2 commits April 13, 2020 00:25

add more examples and make AsTable working with broadcasting

8507279

fix 1.0 tests

592fd0b

nalimilan reviewed Apr 13, 2020

View reviewed changes

bkamins and others added 4 commits April 13, 2020 20:46

changes after code review

0893e91

Apply suggestions from code review

c2c629d

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

Merge remote-tracking branch 'origin/add_astable' into add_astable

48d3515

additional fixes after code review

9897cbb

add argmax example

4027eff

nalimilan reviewed Apr 14, 2020

View reviewed changes

bkamins and others added 2 commits April 14, 2020 14:02

Merge branch 'master' into add_astable

1318121

Apply suggestions from code review

bf72b0c

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins changed the title ~~add AsTable wrapper~~ add AsTable wrapper, disallow NamedTuple in ByRow Apr 14, 2020

disallow NamedTuple in ByRow

462ed13

nalimilan reviewed Apr 14, 2020

View reviewed changes

docs/src/man/getting_started.md Outdated Show resolved Hide resolved

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved

src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved

Apply suggestions from code review

0ca1910

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

nalimilan approved these changes Apr 14, 2020

View reviewed changes

bkamins merged commit a931cb3 into JuliaData:master Apr 14, 2020

bkamins deleted the add_astable branch April 14, 2020 20:11

pdeffebach mentioned this pull request Apr 14, 2020

Emulating Stata's rowtotal #2161

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add AsTable wrapper, disallow NamedTuple in ByRow #2183

add AsTable wrapper, disallow NamedTuple in ByRow #2183

bkamins commented Apr 11, 2020

bkamins commented Apr 11, 2020

bkamins commented Apr 12, 2020

nalimilan left a comment

bkamins commented Apr 13, 2020

bkamins commented Apr 13, 2020

bkamins commented Apr 13, 2020

pdeffebach commented Apr 13, 2020

bkamins commented Apr 13, 2020

pdeffebach commented Apr 13, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

nalimilan commented Apr 14, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

stillyslalom commented May 13, 2020

bkamins commented May 13, 2020

stillyslalom commented May 13, 2020 •

edited

Loading

bkamins commented May 13, 2020

stillyslalom commented May 13, 2020

bkamins commented May 13, 2020

add AsTable wrapper, disallow NamedTuple in ByRow #2183

add AsTable wrapper, disallow NamedTuple in ByRow #2183

Conversation

bkamins commented Apr 11, 2020

bkamins commented Apr 11, 2020

bkamins commented Apr 12, 2020

nalimilan left a comment

Choose a reason for hiding this comment

bkamins commented Apr 13, 2020

bkamins commented Apr 13, 2020

bkamins commented Apr 13, 2020

pdeffebach commented Apr 13, 2020

bkamins commented Apr 13, 2020

pdeffebach commented Apr 13, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

nalimilan commented Apr 14, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

bkamins commented Apr 14, 2020

stillyslalom commented May 13, 2020

bkamins commented May 13, 2020

stillyslalom commented May 13, 2020 • edited Loading

bkamins commented May 13, 2020

stillyslalom commented May 13, 2020

bkamins commented May 13, 2020

stillyslalom commented May 13, 2020 •

edited

Loading