Skip to content

Commit

Permalink
Respect copycols when building DataFrame from Tables.CopiedColumns (#…
Browse files Browse the repository at this point in the history
…2656)

* Respect copycols when building DataFrame from Tables.CopiedColumns

Fixes apache/arrow-julia#146. The issue here
was we were just ignoring `copycols` entirely when building DataFrame
from `Tables.CopiedColumns`. Which is mostly fine because
`CopiedColumns` implies DataFrames can assume ownership of the columns.
The wrinkle comes in when you consider cases like `Arrow.Table`, which
wraps columns in `Tables.CopiedColumns`, not necessarily because copies
have been made, but because the data is immutable, so it should be safe
for other tables to assume ownership. The problem is when users want to
_mutate_ those columns, they need to make copies of them, so naturally
users try `DataFrame(::Arrow.Table; copycols=true)`, but then that
doesn't actually make copies.

The proposal here just intercepts the top-level
`DataFrame(::CopiedColumns)` constructor and passes `copycols=false` as
the default, so if users pass `copycols=true`, it will still be
respected.

* Update src/other/tables.jl

Co-authored-by: Bogumił Kamiński <[email protected]>
  • Loading branch information
quinnj and bkamins authored Mar 12, 2021
1 parent 9d87bcf commit 140a3e0
Show file tree
Hide file tree
Showing 2 changed files with 13 additions and 5 deletions.
13 changes: 8 additions & 5 deletions src/other/tables.jl
Original file line number Diff line number Diff line change
Expand Up @@ -32,16 +32,19 @@ Tables.getcolumn(dfr::DataFrameRow, nm::Symbol) = dfr[nm]

getvector(x::AbstractVector) = x
getvector(x) = [x[i] for i = 1:length(x)]
# note that copycols is ignored in this definition (Tables.CopiedColumns implies copies have already been made)
fromcolumns(x::Tables.CopiedColumns, names; copycols::Bool=true) =
DataFrame(AbstractVector[getvector(Tables.getcolumn(x, nm)) for nm in names],
Index(names),
copycols=false)

fromcolumns(x, names; copycols::Bool=true) =
DataFrame(AbstractVector[getvector(Tables.getcolumn(x, nm)) for nm in names],
Index(names),
copycols=copycols)

# note that copycols is false by default in this definition (Tables.CopiedColumns
# implies copies have already been made) but if `copycols=true`, a copy will still be
# made; this is useful for scenarios where the input is immutable so avoiding copies
# is desirable, but you may still want a copy for mutation (Arrow.Table is like this)
DataFrame(x::Tables.CopiedColumns; copycols::Bool=false) =
DataFrame(Tables.source(x); copycols=copycols)

function DataFrame(x::T; copycols::Bool=true) where {T}
if !Tables.istable(x) && x isa AbstractVector && !isempty(x)
# here we handle eltypes not specific enough to be dispatched
Expand Down
5 changes: 5 additions & 0 deletions test/tables.jl
Original file line number Diff line number Diff line change
Expand Up @@ -187,6 +187,11 @@ end
@test Tables.columntable(df3) == nt
@test Tables.columntable(df3) !== nt

df4 = DataFrame(Tables.CopiedColumns(nt))
@test df4.a === nt.a
df4 = DataFrame(Tables.CopiedColumns(nt), copycols=true)
@test df4.a !== nt.a

v = [(a=1, b=2), (a=3, b=4)]
df = DataFrame(v)
@test size(df) == (2, 2)
Expand Down

0 comments on commit 140a3e0

Please sign in to comment.