error when combining a grouped empty dataframe using first #3426

Closed
ctarn opened this issue Mar 1, 2024 · 6 comments

ctarn commented Mar 1, 2024

It would be better if combine returned an empty dataframe with the same columns instead of raising an error. Thanks!

This works:

df = DataFrames.DataFrame(x=Int[1, 1, 2, 2], y=Int[1, 2, 3, 4])
gd = DataFrames.groupby(df, :x)
DataFrames.combine(gd, :y => Ref)
"""
2×2 DataFrame
 Row │ x      y_Ref     
     │ Int64  SubArray… 
─────┼──────────────────
   1 │     1  [1, 2]
   2 │     2  [3, 4]
"""

df = DataFrames.DataFrame(x=Int[], y=Int[])
gd = DataFrames.groupby(df, :x)
DataFrames.combine(gd, :y => Ref)
"""
0×2 DataFrame
 Row │ x      y_Ref     
     │ Int64  SubArray… 
─────┴──────────────────
"""

This raises an error:

df = DataFrames.DataFrame(x=Int[1, 1, 2, 2], y=Int[1, 2, 3, 4])
gd = DataFrames.groupby(df, :x)
DataFrames.combine(gd, :y => first) |> display
"""
2×2 DataFrame
 Row │ x      y_first 
     │ Int64  Int64   
─────┼────────────────
   1 │     1        1
   2 │     2        3
"""

df = DataFrames.DataFrame(x=Int[], y=Int[])
gd = DataFrames.groupby(df, :x)
DataFrames.combine(gd, :y => first) |> display
"""
ERROR: BoundsError: attempt to access 0-element view(::Vector{Int64}, Int64[]) with eltype Int64 at index [1]
Stacktrace:
 [1] _combine(gd::DataFrames.GroupedDataFrame{…}, cs_norm::Vector{…}, optional_transform::Vector{…}, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:755
 [2] _combine_prepare_norm(gd::DataFrames.GroupedDataFrame{…}, cs_vec::Vector{…}, keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:87
 [3] _combine_prepare(gd::DataFrames.GroupedDataFrame{…}, ::Base.RefValue{…}; keepkeys::Bool, ungroup::Bool, copycols::Bool, keeprows::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:52
 [4] _combine_prepare
   @ ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:26 [inlined]
 [5] combine(gd::DataFrames.GroupedDataFrame{…}, args::Union{…}; keepkeys::Bool, ungroup::Bool, renamecols::Bool, threads::Bool)
   @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/groupeddataframe/splitapplycombine.jl:857
 [6] top-level scope
   @ Untitled-1:7
...

bkamins (Member) commented Mar 1, 2024

This is an error raised by first, not by DataFrames.jl. You need to write e.g.:

julia> DataFrames.combine(gd, :y => v -> isempty(v) ? v : first(v))
0×2 DataFrame
 Row │ x      y_function
     │ Int64  Int64
─────┴───────────────────
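
A sketch of the same workaround with a named helper, so that the result column can be given a stable name. The helper name first_or_empty and the output column name y_first are illustrative assumptions and not part of DataFrames.jl:

```julia
using DataFrames

# Hypothetical helper (not part of DataFrames.jl): return the first element
# of a group, or the empty input itself when there is nothing to take.
first_or_empty(v) = isempty(v) ? v : first(v)

# Empty input: there are no groups, so the result has zero rows
# but still carries both columns.
empty_df = DataFrame(x=Int[], y=Int[])
empty_res = combine(groupby(empty_df, :x), :y => first_or_empty => :y_first)

# Non-empty input: the helper behaves like plain `first`.
full_df = DataFrame(x=[1, 1, 2, 2], y=[1, 2, 3, 4])
full_res = combine(groupby(full_df, :x), :y => first_or_empty => :y_first)
```

The guard only matters on the empty path; for non-empty groups the result should match what :y => first produces.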

bkamins closed this as completed Mar 1, 2024

ctarn (Author) commented Mar 1, 2024

Thank you very much! May I ask how the type of y_function is determined? Is Int64 the default type?

bkamins (Member) commented Mar 1, 2024

No, it is determined by the type of the :y column:

julia> @code_warntype (v -> isempty(v) ? v : first(v))([1,2,3])
MethodInstance for (::var"#5#6")(::Vector{Int64})
  from (::var"#5#6")(v) @ Main REPL[3]:1
Arguments
  #self#::Core.Const(var"#5#6"())
  v::Vector{Int64}
Body::Union{Int64, Vector{Int64}}
1 ─ %1 = Main.isempty(v)::Bool
└──      goto #3 if not %1
2 ─      return v
3 ─ %4 = Main.first(v)::Int64
└──      return %4

As you can see, the compiler can infer what is needed.
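
The same point can be checked without DataFrames by applying the function to an empty view, which is the kind of input named in the BoundsError message in the original report. This is a minimal sketch, and the variable names are illustrative:

```julia
f = v -> isempty(v) ? v : first(v)

# An empty view over an Int vector, like the one in the BoundsError message.
empty_int = view(Int[], Int[])
out_int = f(empty_int)       # the empty branch returns the view itself
println(eltype(out_int))     # element type follows the source column

# The same function on a Float64 column carries a Float64 element type.
out_float = f(view(Float64[], Int[]))
println(eltype(out_float))
```

Because the empty branch returns the input itself, the element type of the source column carries over to the result, so nothing defaults to Int64.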

ctarn (Author) commented Mar 1, 2024

Can we process combine in the following way?

result = initial empty dataframe with the specified (empty, typed) cols
for group in groups
    row = process(group)
    add row to result
end
return result

We can also process it col by col instead of row by row.

Since the types of all cols can be inferred no matter whether a grouped dataframe is empty or not, generating an initial empty dataframe of consistent types would be possible, and users don't have to use other functions to handle specific cases.
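
A rough Julia sketch of this idea, for a single grouping column and a single transformation. The function name combine_sketch is hypothetical, and this is not how DataFrames.jl implements combine. It relies on Base.promote_op to ask the compiler for the return type, which is an assumption that can fall back to a wider type when inference fails:

```julia
using DataFrames

# Hypothetical illustration of the proposal; not the DataFrames.jl implementation.
function combine_sketch(gd::GroupedDataFrame, col::Symbol, f, outcol::Symbol)
    parent_df = parent(gd)
    key = only(groupcols(gd))  # assumption: exactly one grouping column
    # Infer the output element type without ever calling `f`.
    arg_type = typeof(view(parent_df[!, col], Int[]))
    T = Base.promote_op(f, arg_type)
    # Start from an empty, typed result and append one row per group.
    result = DataFrame(key => eltype(parent_df[!, key])[], outcol => T[])
    for g in gd
        push!(result, (g[1, key], f(g[!, col])))
    end
    return result
end

empty_gd = groupby(DataFrame(x=Int[], y=Int[]), :x)
empty_out = combine_sketch(empty_gd, :y, first, :y_first)

full_gd = groupby(DataFrame(x=[1, 1, 2, 2], y=[1, 2, 3, 4]), :x)
full_out = combine_sketch(full_gd, :y, first, :y_first)
```

With no groups the loop body never runs, so first is never applied to an empty vector and no BoundsError can occur. The trade-off is that the column type then depends on compiler inference rather than on an actual return value.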

ctarn (Author) commented Mar 1, 2024

Since the grouped dataframe is empty, functions such as first are never actually called on any group, and thus such an error should not be raised?

bkamins (Member) commented Mar 2, 2024

> Can we process combine in the following way?

yes

> thus such an error should not be raised?

yes
