Accepting array element in rows specificed by named tuples, in `combine` #3335

jonalm · 2023-05-22T18:51:51Z

Creating a DataFrame from a list of named tuple works fine, even when the nt elements contain arrays

DataFrame([
    (; foo="bar", boo=[1,2]),
    (; foo="baz", boo=[1,2])
]
)

But that is not the case if the DataFrame is created by a combine operations

combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
    return (;
        foo="foo",
        boo=[1,2]
    )
end

gives

    nested task error: ArgumentError: mixing single values and vectors in a named tuple is not allowed

The above example would work as expected if one removes these two lines:

DataFrames.jl/src/groupeddataframe/complextransforms.jl

Line 52 in fcd98c6

    
           throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))

and

DataFrames.jl/src/groupeddataframe/callprocessing.jl

Line 69 in fcd98c6

    
           throw(ArgumentError("mixing single values and vectors in a named tuple is not allowed"))

An obvious use-case for needing this is when you want some ordered indices or values from the group, in a single row for each group.

Would you be willing to reconsider these exceptions, and allow arrays as named tuple elements when using combine?

The text was updated successfully, but these errors were encountered:

bkamins · 2023-05-22T22:53:55Z

Would you be willing to reconsider these exceptions, and allow arrays as named tuple elements when using combine?

Please correct me if I am wrong, but this would imply that (; x=[1, 2, 3], y=[2, 3, 4]) would be interpreted as a named tuple defining a single row. Whereas currently, it defines three rows.

jonalm · 2023-05-23T07:27:47Z

@bkamins I guess there are implicitly two suggestions here for the output of the function argument to combine:

(Non breaking) remove the above mentioned error, and allow named tuple outputs with a combinations of scalar and array values, and interpret those as rows in the resulting DataFrame.
(Breaking) change the behaviour such that named tuple output always produce a row in the resulting DataFrame.

The rational for 2 is that it is easy to explicitly wrap the output (of the function argument to combine) into a DataFrame, if you want the combined DataFrame to be a stack of DataFrames. But it is not clear (at least not to me) how to return a row with both scalar and array elements. It would be natural to achieve that with a named tuple output (always interpreted as a row). 1 would be a non-breaking quick fix to have this functionality for named tuples with mixed (scalar/array) elements, but may render the behaviour unexpected to some users.

My preference is shaped by my usage of combine: in the majority of cases I want to stack row outputs for each group.

bkamins · 2023-05-23T10:36:12Z

But it is not clear (at least not to me) how to return a row with both scalar and array elements.

It was discussed on Slack.
If you have f() = (; x=1, y=[1,2])
You should return instead of:

julia> f()
(x = 1, y = [1, 2])

rather wrap all in additional vectors

julia> map(x -> [x], f())
(x = [1], y = [[1, 2]])

or equivalently:

julia> map(Base.vect, f())
(x = [1], y = [[1, 2]])

(or directly in definition of f wrap vectors in vectors)

I am aware that it is a bit inconvenient.

We will not make the option 2. (breaking change) because it is breaking.

Option 1. could be acceptable, but it is problematic as the rule determining how named tuple is handled would become even more complex than it is now (in general: users already find combine processing rule to have many cases)

Indeed in the code when have these two errors thrown because we wanted to leave the freedom in the future to decide what to do in that case (@nalimilan style 😄). So the conclusion is that we can discuss it, but we want to reach a consensus before making such a fundamental change. One of the reasons for this is that typically users could expect that (; x=1, y=[1, 2]) should be expanded to DataFrame(x=[1,1], y=[1,2]) instead (as discussed - this is what normally would DataFrame constructor do).

jonalm · 2023-05-23T12:06:16Z

So if I understand you correctly @bkamins: the best way of having a single row output from the combining function, which is robust against the potential content of that row (in the above mentioned case: arrays), is to wrap that row into a DataFrame with one single row, or to wrap it into a named tuple which would be expanded into a single rowed DataFrame in the DataFrame constructor?

If that is the case: wouldn't it be more natural to have a convention for explicitly returning a row? If named tuples are out of the question, could we have a Row-like type (DataFrameRow?) which acts like a named tuple?

bkamins · 2023-05-23T12:25:31Z

DataFrameRow - you can use it already. The only limitation is that it is not convenient to create it currently.

Indeed, though, it would be useful to add support for:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

where your fun is returning a named tuple.

Do you think it would be useful to add something like this?

CC @nalimilan, @pdeffebach, @ararslan

The change it would require in the minilanguage is to add support for Tables.AbstractRow, just like we did it for pushing in #3245.

I think it would be OK to add it.

pdeffebach · 2023-05-23T13:36:03Z

I wrote a package for this purpose at some point. I didn't make it very far clearly, but the bones are here in DataRows.jl.

But party of the reason it was abandoned was precisely because combine(f, gd) doesn't play well with Tables.jl. combine(f, gd) doesn't check if something is Tables.istable and behave accordingly. We are stuck with NamedTuples, which as you mention have an ambiguity in the case of mixed types.

I mentioned this on Slack, but I do think the long term solution is 2.0, where combine(f, df) is re-written to match combine(gd, src => fun => dest).

As for the current behavior, I think it's not a big deal to require the use of arrays. If performance is important, Refs are always available.

bkamins · 2023-05-23T15:29:12Z

doesn't check if something is Tables.istable and behave accordingly.

The reason is that Tables.istable is not a very reliable function. I prefer not to use it. Having said that, here we are talking about Tables.AbstractRow subtypes, which is a well defined interface.

As for the current behavior, I think it's not a big deal to require the use of arrays

Still, I think recognizing Tables.AbstractRow is a simple fix that should not be problematic.

My question is, for 1.x release branch of DataFrames.jl if we feel that adding Tables.AbstractRow support is OK (and robust for 2.x release)?

bkamins · 2023-05-23T15:30:38Z

@pdeffebach - to expand. Tables.istable is trait based, which is fragile. Tables.AbstractRow rule is dispatch based, which makes it more robust.

jonalm · 2023-05-24T11:46:01Z

Indeed, though, it would be useful to add support for:
combine(gdf, src => Tables.Row ∘ fun => AsTable)
where your fun is returning a named tuple.

Do you think it would be useful to add something like this?

@bkamins I'm not sure I understand the implications of your suggestion. How I would use combine with the proposed solution?

Also, perhaps a bit tangential, why do you consider a trait based approach to be fragile?

bkamins · 2023-05-24T11:56:06Z

How I would use combine with the proposed solution?

Just as I have written above:

combine(gdf, src => Tables.Row ∘ fun => AsTable)

passing return value of fun to Tables.Row forces the value returned by fun to be treated as one row.

why do you consider a trait based approach to be fragile?

For two reasons:

major: if Tables.istable returns true you are sure that you are working with a table; but if it returns false it does not mean that it is not a table (it only means that it might not be a table)
minor: with type-based dispatch we are sure that the behavior cannot change dynamically; with trait based dispatch the behavior can be changed by the user (i.e. the user could define a method for Tables.istable and change the behavior of DataFrames.jl - indeed sometimes it could be useful, but in general I prefer to be able to determine statically how DataFrames.jl will evaluate some expression in cases it is possible)

(but this is unrelated to your issue, as table-row is not a table, so Tables.istable will not help here)

jonalm · 2023-05-24T13:09:49Z

How I would use combine with the proposed solution?

Just as I have written above:
combine(gdf, src => Tables.Row ∘ fun => AsTable)

sorry for being an idiot, but I don't get what src would be in your example?

bkamins · 2023-05-24T13:20:25Z

src is a specification of source columns that should be passed to fun. See "Basic transformations" section (and sections that follow it) in https://bkamins.github.io/julialang/2020/12/24/minilanguage.html.

jonalm · 2023-05-24T13:39:03Z

@bkamins Thanks, and thanks for elaborating on the trait issue.

I think your proposal would be very useful.

pdeffebach · 2023-05-24T13:39:55Z

My only comment is that a DataFrames-specific type, like AsRow might be more useful.

bkamins · 2023-05-24T19:08:45Z

What would be the benefit of AsRow over Tables.Row that is already defined in the ecosystem?

The definition would be AsRow(fun) = Tables.Row ∘ fun. (so it would be a function not a type)

sprmnt21 · 2023-05-25T13:18:38Z

I did not understand if such a workaround was considered and deemed unsatisfactory for some reason.
When in doubt, I propose it, so I better understand why it's not good for the case.

combine(groupby(DataFrame(:a=>[1,2,3,3]),:a)) do g
    DataFrame([(;
        b="foo",
        c=[1,2]
    )])
end

bkamins · 2023-05-26T01:39:28Z

I better understand why it's not good for the case.

OP wants (; b="foo", c=[1,2]) to be interpreted as a single row not as a table that would expand :b value into two rows.

sprmnt21 · 2023-05-26T03:57:51Z

I get that you're looking for this, isn't it?

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           DataFrame([(;
               foo="foo",
               boo=[1,2]
           )])
       end
2×3 DataFrame
 Row │ group  foo     boo    
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     2  foo     [1, 2]`

jonalm · 2023-05-26T06:53:23Z

@sprmnt21 : Yes. This issue is a request to avoid having to construct a full DataFrame to get a row.

bkamins · 2023-05-26T10:06:28Z

Ah - I missed [ in @sprmnt21 proposal. This makes sense, so what currently works is:

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group), :group => (g ->
                  [(;
                      foo="foo",
                      boo=[1,2]
                  )]) => AsTable)
2×3 DataFrame
 Row │ group  foo     boo
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     2  foo     [1, 2]

and (this is a different output, but the same intention)

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group), :group => ByRow(g ->
                  (;
                      foo="foo",
                      boo=[1,2]
                  )) => AsTable)
4×3 DataFrame
 Row │ group  foo     boo
     │ Int64  String  Array…
─────┼───────────────────────
   1 │     1  foo     [1, 2]
   2 │     1  foo     [1, 2]
   3 │     2  foo     [1, 2]
   4 │     2  foo     [1, 2]

What we would need to change is the "legacy" syntax (i.e. taking a single function), where the current rule is that we only recognize:

a data frame, a matrix, a NamedTuple, or a DataFrameRow

(and this is a closed list of recognized objects)

sprmnt21 · 2023-05-26T14:54:02Z

If I'm not getting too off topic, I'd like to try to put the question in a more general way.
What I'd find easiest to follow as a mindset would be one in which the result of cols =>fun whatever is, is intended to occupy a "cell" without being implicitly "interpreted" (expanded, flattened, ...).

Here are some fictitious examples:

julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           (; foo="foo", boo=[1,2] )
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  NamedTup…
─────┼────────────────────────────────────
   1 │     1  (foo = "foo", boo = [1, 2])
   2 │     2  (foo = "foo", boo = [1, 2])



julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           DataFrame([(;
               foo="foo",
               boo=[1,2]
           )])
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  DataFrame
─────┼──────────────────────
   1 │     1  1×2 DataFrame
   2 │     2  1×2 DataFrame


julia> combine(groupby(DataFrame(:group=>[1,1,2,2]), :group)) do g
           [
               (; foo="bar", boo=[1,2]),
               (; foo="baz", boo=[1,2])
           ]
       end
2×2 DataFrame
 Row │ group  x1
     │ Int64  Array…
─────┼──────────────────────────────────────────
   1 │     1  NamedTuple{(:foo, :boo), Tuple{S…
   2 │     2  NamedTuple{(:foo, :boo), Tuple{S…

The "expansion" of a NamedTuple or a DataFrame or any other "complex" structure should be explicitly required/constructed by the user.

ararslan · 2023-05-27T01:30:11Z

What would be the benefit of AsRow over Tables.Row that is already defined in the ecosystem?

The definition would be AsRow(fun) = Tables.Row ∘ fun. (so it would be a function not a type)

AsRow(f) permits the use of do blocks, e.g.

combine(gdf, src => AsRow() do x
            # some great stuff happens here
        end => AsTable)

With Tables.Row ∘ fun, the closest to this you can get is

combine(gdf, src => Tables.Row ∘ (x -> begin
            # some other stuff happens here
        end) => AsTable)

which is a bit more awkward to read and to write.

nalimilan · 2023-05-28T11:24:35Z

Adding support for Tables.Row seems uncontroversial. Maybe we can try that and see whether AsRow(fun) = Tables.Row ∘ fun is really needed? The fact that you would write AsRow() and AsTable in the same transformation sounds a bit weird to me. :-)

bkamins added this to the 1.6 milestone May 22, 2023

bkamins added the decision label May 22, 2023

bkamins mentioned this issue Jun 25, 2023

Correctly handle Tables.AbstractRow in operation specficiation #3348

Merged

bkamins closed this as completed in #3348 Jul 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Accepting array element in rows specificed by named tuples, in `combine` #3335

Accepting array element in rows specificed by named tuples, in `combine` #3335

jonalm commented May 22, 2023

bkamins commented May 22, 2023

jonalm commented May 23, 2023 •

edited

Loading

bkamins commented May 23, 2023

jonalm commented May 23, 2023

bkamins commented May 23, 2023

pdeffebach commented May 23, 2023 •

edited

Loading

bkamins commented May 23, 2023

bkamins commented May 23, 2023

jonalm commented May 24, 2023

bkamins commented May 24, 2023 •

edited

Loading

jonalm commented May 24, 2023

bkamins commented May 24, 2023 •

edited

Loading

jonalm commented May 24, 2023

pdeffebach commented May 24, 2023 •

edited

Loading

bkamins commented May 24, 2023 •

edited

Loading

sprmnt21 commented May 25, 2023

bkamins commented May 26, 2023

sprmnt21 commented May 26, 2023

jonalm commented May 26, 2023

bkamins commented May 26, 2023

sprmnt21 commented May 26, 2023

ararslan commented May 27, 2023

nalimilan commented May 28, 2023

Accepting array element in rows specificed by named tuples, in combine #3335

Accepting array element in rows specificed by named tuples, in combine #3335

Comments

jonalm commented May 22, 2023

bkamins commented May 22, 2023

jonalm commented May 23, 2023 • edited Loading

bkamins commented May 23, 2023

jonalm commented May 23, 2023

bkamins commented May 23, 2023

pdeffebach commented May 23, 2023 • edited Loading

bkamins commented May 23, 2023

bkamins commented May 23, 2023

jonalm commented May 24, 2023

bkamins commented May 24, 2023 • edited Loading

jonalm commented May 24, 2023

bkamins commented May 24, 2023 • edited Loading

jonalm commented May 24, 2023

pdeffebach commented May 24, 2023 • edited Loading

bkamins commented May 24, 2023 • edited Loading

sprmnt21 commented May 25, 2023

bkamins commented May 26, 2023

sprmnt21 commented May 26, 2023

jonalm commented May 26, 2023

bkamins commented May 26, 2023

sprmnt21 commented May 26, 2023

ararslan commented May 27, 2023

nalimilan commented May 28, 2023

Accepting array element in rows specificed by named tuples, in `combine` #3335

Accepting array element in rows specificed by named tuples, in `combine` #3335

jonalm commented May 23, 2023 •

edited

Loading

pdeffebach commented May 23, 2023 •

edited

Loading

bkamins commented May 24, 2023 •

edited

Loading

bkamins commented May 24, 2023 •

edited

Loading

pdeffebach commented May 24, 2023 •

edited

Loading

bkamins commented May 24, 2023 •

edited

Loading