Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add AsTable wrapper, disallow NamedTuple in ByRow #2183

Merged
merged 16 commits into from
Apr 14, 2020

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Apr 11, 2020

Fixes #2121

I still need to write tests and update manual and documentation

@bkamins bkamins added non-breaking The proposed change is not breaking grouping labels Apr 11, 2020
@bkamins bkamins added this to the 1.0 milestone Apr 11, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 11, 2020

@nalimilan - the PR should be good to review. Thank you!

@bkamins
Copy link
Member Author

bkamins commented Apr 12, 2020

Traditionally only coveralls fail

Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. My main remark is about showing more useful examples. Otherwise people may not understand the point of this selector.

I'd also like to be sure we won't want to change the behavior when the function returns a named tuple.

docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Outdated Show resolved Hide resolved
src/groupeddataframe/splitapplycombine.jl Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
src/other/utils.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Apr 13, 2020

You mean that they are not destructured into columns? By the way, maybe we should reserve this in case we want to allow creating multiple columns by returning a named tuple in the future?

Let me start with the example on v0.20:

julia> using DataFrames

julia> df = DataFrame(g=1:2);

julia> by(df, :g, :g => x -> [(a=x,),])
2×2 DataFrame
│ Row │ g     │ g_function │
│     │ Int64 │ NamedTup…  │
├─────┼───────┼────────────┤
│ 1   │ 1     │ (a = [1],) │
│ 2   │ 2     │ (a = [2],) │

in other words now we are reproducing the behavior we already have.

Also on master select will do the same without ByRow:

julia> df = DataFrame(g=1:2)
2×1 DataFrame
│ Row │ g     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df = DataFrame(g=1:2);

julia> select(df, :g => x -> [(a=x,)])
1×1 DataFrame
│ Row │ g_function    │
│     │ NamedTuple…   │
├─────┼───────────────┤
│ 1   │ (a = [1, 2],) │

In summary ByRow returns a vector so it is not unwrapped as in other cases where we do not unwrap vectors.

If we disallowed this then there would be no possibility to return a produce a vector of NamedTuples from ByRow.

Note that we disallow returning NamedTuple (and other types but let us focus on NamedTuples) if it is not a ByRow operation. This is reserved for the case you describe in the future. However, note - that in this case this does not disallow producing a DataFrame holding NamedTuples as you can wrap them in a vector (as in the example above) or in Ref and you have no problem.

In summary: we currently treat a NamedTuple of objects as a table (so we disallow it) but treat a vector of NamedTuples just as a vector - so we allow it. Therefore my point is that we are consistent now with the design.
Also I do not see too much use of returning multiple columns from a single row of data (it is easy enough to do with multiple ByRow without performance loss as we are operating on single values anyway)

But if you feel that NamedTuple destruction from ByRow would be a potentially useful case then we can add this as an exception.

@bkamins
Copy link
Member Author

bkamins commented Apr 13, 2020

My main remark is about showing more useful examples.

I will try to think of something, but I really believe that the default API is more useful most of the time (except for row aggregations - I will try to come up with some better cases).

@pdeffebach - can you please suggest something as I know that you had many good ideas here.

@bkamins
Copy link
Member Author

bkamins commented Apr 13, 2020

Regarding _filter_helper family:

  1. I fixed the signatures (but with cols... it can be either multiple vectors or a single DataFrameRows object so I decided to leave it undefined)
  2. I could not find a good place to move them and keep consistency in the code layout and readability, so I left them below the functions that use them

@pdeffebach
Copy link
Contributor

@pdeffebach - can you please suggest something as I know that you had many good ideas here.

Sum and mean by row are the two use-cases, I think. A cool thing about AsTable would be working with the names super easily.

function findwinner(nt)
    m = maximum(nt)
    findall(x -> x == m, nt)
end

This returns a vector of the symbols equal to the max.

Also, I agree with Milan that unwrapping a named tuple would be convenient. A whole workflow around named-tuple to named-tuple functions seems very useful. I think I understand the logic of your objections though and they are valid points.

@bkamins
Copy link
Member Author

bkamins commented Apr 13, 2020

Thank you for the suggestion. I will think of something along these lines and add it.

unwrapping a named tuple would be convenient

It will be allowed in the future. I just did not plan for it in ByRow (this can be changed as an exception if we want it). Do you have some example where AsTable(cols) => ByRow(fun) would usefully return a NamedTuple?

@pdeffebach
Copy link
Contributor

Do you have some example where AsTable(cols) => ByRow(fun) would usefully return a NamedTuple?

With the kinds of surveys I worked with the past two years, maybe something like

function add_mean(nt)
    m = mean(skipmissing(nt))
    countmissing = sum(ismissing, nt)
    return (rowmean = m, countmissing = countmissing)
end

I can't say with complete confidence how much I would use this (I am not analyzing survey data on a day-to-day basis any more). But it seems like a natural workflow that is consistent with using AsTable without ByRow.

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

Yeah - so it is essentially, when you want to conditionally on row filter out some columns, and based on these filtered columns produce multiple outputs. This is a valid use case. The question is how often it is needed (as a normal way to do such things would be to transform from wide to long format and then do aggregation).

As a side note a Vector of NamedTuples is not that bad data structure - it adds a layer of grouping variables in a DataFrame and I tend to like it (this is something e.g. BigQuery provides).

@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

This returns a vector of the symbols equal to the max.

Added such an example to getting started section.

@nalimilan
Copy link
Member

Thanks for the explanation. I'm not sure the consistency is so visible for users, as they may also expect that returning a NamedTuple always creates multiple columns -- ignoring the fact that the result of ByRow is put into a vector.

I wonder what kind of solution we could use to allow both behaviors in the future if needed. Wrapping named tuples in Ref? That would mean that if you really want to return a Ref with ByRow, you would have to wrap it in another Ref. Maybe that's not too bad, and that would be consistent with the pseudo-broadcasting we apply when ByRow isn't used.

Anyway for now I'd err on the safe side and throw an error (as usual), returning named tuples from ByRow isn't the most urgent need.

docs/src/man/getting_started.md Show resolved Hide resolved
docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/getting_started.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Outdated Show resolved Hide resolved
docs/src/man/split_apply_combine.md Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

Anyway for now I'd err on the safe side and throw an error (as usual

OK. Later you will be able to pass (a = (x=1,y=2),) to get your NamedTuple unwrapped into column :a.

@bkamins bkamins changed the title add AsTable wrapper add AsTable wrapper, disallow NamedTuple in ByRow Apr 14, 2020
@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

CI passes here (nightly fails due to #2184).

docs/src/man/getting_started.md Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
src/abstractdataframe/selection.jl Outdated Show resolved Hide resolved
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
@bkamins bkamins merged commit a931cb3 into JuliaData:master Apr 14, 2020
@bkamins bkamins deleted the add_astable branch April 14, 2020 20:11
@bkamins
Copy link
Member Author

bkamins commented Apr 14, 2020

Thank you!

@stillyslalom
Copy link

@bkamins
Yeah - so it is essentially, when you want to conditionally on row filter out some columns, and based on these filtered columns produce multiple outputs. This is a valid use case. The question is how often it is needed (as a normal way to do such things would be to transform from wide to long format and then do aggregation).

I have a use case where I'm doing something like a coordinate transformation (it can't be treated as an aggregation). In v0.20, I was able to do something like the following (boiled down considerably):

using Statistics, DataFrames, Polynomials

function forcecalc(df, weight_offsets)
    α = mean(df.α)  # angle of attack
    AF = mean(df.AF) - weight_offsets[1](α)    # axial force
    NF = mean(df.NF) - weight_offsets[2](α)    # normal force
    (L = -NF*cos(α) - AF*sin(α),    # lift
     D = -NF*sin(α) + AF*cos(α))    # drag
end

α = vcat([fill(v, 30) for v in -2:2:16]...)
AF =  10sind.(α) .+ rand(length(α))
NF = 100cosd.(α) .+ rand(length(α))
df = DataFrame= α, AF = AF, NF = NF)

wt = [Polynomial(rand(3)), Polynomial(rand(3))] # weight offsets
forces = by(df, , [, :AF, :NF] => df -> forcecalc(df, wt))

I'm subtracting a weight offset polynomial from raw force measurements (:AF and :NF), then rotating the force vector. I haven't found an elegant way to do this in v0.21, but my desired syntax is something like the following:

function forcecalc(nt::NamedTuple, weight_offsets)
    α = nt.α  # angle of attack
    AF = nt.AF - weight_offsets[1](α)    # axial force
    NF = nt.NF - weight_offsets[2](α)    # normal force
    (L = -NF*cos(α) - AF*sin(α),    # lift
     D = -NF*sin(α) + AF*cos(α))    # drag
end

varnames = [:AF, :NF]
gdf = groupby(df, )
measurements = combine(gdf, varnames .=> mean .=> varnames)
forces = transform(measurements, [, :AF, :NF] => ByRow(v -> forcecalc(v, wt)))

Note that in the former case, I'm both averaging measurements and applying the transformation in the 'apply' step of split-apply-combine, whereas in the latter case, I'm applying the transformation after s-a-c.

@bkamins
Copy link
Member Author

bkamins commented May 13, 2020

I have not analyzed your code in detail, but given I know the design assumptions :) the code that should work without changing anything in your 0.20 code except the last line is:

forces = combine(AsTable([:α, :AF, :NF]) => nt -> forcecalc(nt, wt), groupby(nt, :α))

Can you please confirm that it works as expected?

(still maybe it is possible to write it in a cleaner way - I am just doing code transformation that should be equivalent between 0.20 and 0.21)

@stillyslalom
Copy link

stillyslalom commented May 13, 2020

Save for a typo (should be groupby(df, :α) instead of groupby(nt, :α)), it works! I think I was confused by the documentation for combine, which seems to imply that a GroupedDataFrame can be passed as either the first or second argument.

combine(gd::GroupedDataFrame, args...; keepkeys::Bool=true, ungroup::Bool=true)
combine(pair::Pair, gd::GroupedDataFrame; keepkeys::Bool=true, ungroup::Bool=true)

Arguments passed as args... can be:

  • Column transformation operations using the Pair notation that is described below and vectors of such pairs.

That said, I would prefer to perform reduction and transformation in two separate steps (as in my "v0.22" example), since the broader workflow involves applying the same reduction operation, followed by a variety of transformations.

@bkamins
Copy link
Member Author

bkamins commented May 13, 2020

which seems to imply that a GroupedDataFrame can be passed as either the first or second argument.

This is indeed allowed and has a different behavior:

  1. GroupedDataFrame passed as a first argument requires all following arguments to be Pairs and return only one columns
  2. GroupedDataFrame passed as a second argument allows the first argument to be a Pair or a function (to allow do block syntax) and in this case you are allow to return one column or a table (i.e. multiple columns in one shot)

@stillyslalom
Copy link

Does the second behavior suffer the same performance impact for Pairs as it does for functions?

@bkamins
Copy link
Member Author

bkamins commented May 13, 2020

No for Pair it is type-stable, and if you pass a pair like :a => sum that uses aggregation functions it will use the fastest code path available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
grouping non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a wrapper type for passing named tuples to functions when transforming
5 participants