Broadcasting in DataFrames #1804

bkamins · 2019-05-06T21:47:25Z

This PR is the result of a discussion about taking a staged approach to setindex!/broadcasting in DataFrames.jl.
We start with broadcasting.

This PR invalidates #1643 and #1646, but I will keep them open for reference until everything that was discussed there has become irrelevant.

In general we treat AbstractDataFrame as a matrix in broadcasting and DataFrameRow as a vector.

The tricky part is that we want to allow adding columns using broadcasting. What I proposed works as expected, but is not entirely clean and uses Base internals a lot. Also if you review this PR please have a look at the tests as they expose corner cases (which are consistent with Base, although sometimes surprising).

@mbauman - we have discussed with @nalimilan that it would be great if you commented on this PR as the design of broadcasting in Base is complex and I am not sure if I have taken the best path. Thank you!

mbauman

Just a few thoughts after a quick read-through.

bkamins · 2019-05-06T22:45:05Z

@nalimilan - this should be good to look at, although it fails CI. There is some problem with materialize! on Julia 1.0 only (I will look into the reason tomorrow), but otherwise the PR is good.

bkamins · 2019-05-07T07:04:42Z

@mbauman thank you for your time and having a look at these issues.

The crucial part for me is whether we need to materialize (especially in the aliasing case) and how best to handle the problem with view that you have highlighted. So in general the learning question on my side is how one should safely use a lazy bc.

mbauman · 2019-05-07T18:12:56Z

With respect to materialize, it's not really something that I intended for folks to call directly and I don't think there's a need to overload it except in very special cases.

The key thing to remember about broadcast is that it's entirely based upon indexing and axes. The typical way to use a bc object is to just iterate over eachindex and iteratively index into it. Note that what comes out of a dot-expression might not be a Broadcasted — it could be an immediately re-computed range, for example. With regards to alias-safety, we don't have a very friendly API to unalias broadcasted expression trees, but you could crib off of how Base preprocesses the broadcasted tree to do this.
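To make the indexing-and-axes point concrete, here is a minimal sketch that uses only Base and no DataFrames.jl code. The variable names are illustrative, and the loop is only one possible way to consume a lazy broadcast object; it is not necessarily how this PR implements `copyto!`.

```julia
using Base.Broadcast: Broadcasted, broadcasted, instantiate

# Build the lazy object behind `[1, 2] .+ [10 20]` without materializing it.
bc = instantiate(broadcasted(+, [1, 2], [10 20]))

# Consume it as described above: iterate over eachindex and index into it.
dest = similar(bc, Int)          # 2x2 destination with the axes of `bc`
for I in eachindex(bc)           # two-dimensional `bc` yields CartesianIndex values
    dest[I] = bc[I]
end

# A dot-expression does not always produce a Broadcasted:
# adding a scalar to a range is recomputed immediately as another range.
r = broadcasted(+, 1:3, 1)
is_lazy = r isa Broadcasted      # false
```

Here `dest` ends up equal to `[11 21; 12 22]`, and `r` is an ordinary range equal to `2:4`, so code that receives the result of a dot-expression should not assume it is a `Broadcasted`.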

bkamins · 2019-05-07T23:06:39Z

Thank you for the tips on use of bc. I will try to rewrite the code in this way.

Regarding aliasing - the additional difficulty in DataFrames.jl is that a DataFrame is itself a nested structure that can contain aliases (e.g. if the user really wants to, one can store the same vector several times under different names). But thinking about it, we can probably simply document that the behavior of broadcasting is unspecified when aliasing is a problem and that this is not something that should be done (actually, we recently had to add a similar comment for the push! and append! functions, which are also not aliasing-safe).
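The hazard can be illustrated with plain vectors. This is a sketch using only Base; `Base.mightalias` is an internal, unexported helper, so relying on it is an assumption that may not hold across Julia versions.

```julia
# Hand-written copy loop where source and destination share memory.
naive = [1, 2, 3]
rev = view(naive, 3:-1:1)        # reversed view that aliases `naive`
for i in eachindex(naive)
    naive[i] = rev[i]            # later reads observe earlier writes
end
# naive is now [3, 2, 3], not the intended [3, 2, 1]

# Base's broadcasting copies an aliased source before writing to the destination.
x = [1, 2, 3]
x .= view(x, 3:-1:1) .* 1
# x is now [3, 2, 1]

# Internal check (unexported) for whether two arrays may share memory.
aliased = Base.mightalias(x, view(x, 1:2))
```

A data frame that stores the same vector under two column names is the nested version of the first case, which is why leaving the behavior unspecified is the simpler option.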

bkamins · 2019-05-08T05:57:50Z

@nalimilan In Base broadcasting does not provide a cleanup on error, e.g.:

julia> x = ones(2)
2-element Array{Float64,1}:
 1.0
 1.0

julia> y = [2, "2"]
2-element Array{Any,1}:
 2
  "2"

julia> x .= y
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64

julia> x
2-element Array{Float64,1}:
 2.0
 1.0

Do you think it is acceptable that we do the same, or should we make sure that we do not leave a data frame half-modified after an error?
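For comparison, one way to avoid a half-modified destination is to broadcast into a temporary first and copy only on success. The helper below is hypothetical (it exists in neither Base nor DataFrames.jl) and is only a sketch of the trade-off: it costs one extra allocation per assignment.

```julia
# Hypothetical helper: write into a scratch vector, then copy on success.
function assign_all_or_nothing!(dest::AbstractVector, src)
    tmp = similar(dest)
    tmp .= src                   # may throw; `dest` has not been touched yet
    copyto!(dest, tmp)
    return dest
end

x = ones(2)
failed = try
    assign_all_or_nothing!(x, Any[2, "2"])
    false
catch err
    err isa MethodError          # the String cannot be converted to Float64
end
# x is still [1.0, 1.0]
```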

bkamins · 2019-05-08T07:19:59Z

@mbauman I have two design questions for you:

1. What are the guarantees on the values of the elements produced by eachindex called on a bc object? In particular (this is a relevant, possibly problematic case because data frames do not support linear indexing): is it possible that eachindex(bc) will return a linear index if bc is two-dimensional?
2. I wanted to make sure that I understand this correctly: the method https://github.com/JuliaData/DataFrames.jl/pull/1804/files#diff-57cbbe702db7443003fc8c590f239d70R48 (similar to one in Base) is needed only for performance. We could remove it and all should work. Is this right?

Thank you.

bkamins · 2019-05-09T21:30:04Z

@nalimilan I am holding off on the changes in this PR, as I would welcome your comment on what, given the options discussed, we should do about column creation when broadcasting:

1. create columns both on df[cols] .= v and df[col] .= v
2. create columns only on df[col] .= v (this will simplify the design)
3. never create columns and retain implicit broadcasting in df[col] = v and df[cols] = v
4. never create columns and add explicit broadcasting via df[col] = Iterators.repeated(v) and df[cols] = Iterators.repeated(v) (we might decide to use some shorter alias for Iterators.repeated if we wanted)

Thank you!

nalimilan · 2019-05-10T07:47:55Z

I'd say 2 is fine if implementation isn't too complex. We can always implement df[cols] .= v later if we want.

bkamins · 2019-05-10T07:51:04Z

OK. I will go for 2 then (it should be easier as we only have to handle a single special case so the general mechanics in all other cases will not be affected).

bkamins · 2019-05-10T07:56:28Z

Just one more question. What should happen in this case:

df = DataFrame()
df[:a] .= 1

Currently I create :a but keep df at 0 rows, but maybe we should create 1 row for convenience (and for consistency with the DataFrame constructor and insertcols; BTW: we should rethink when we want to retain automatic broadcasting in these two cases).

I understand that in this case:

df = DataFrame()
df[:a] .= [1,2,3]

we should copy the vector and create 3 rows.

nalimilan · 2019-05-10T08:12:09Z

OK so we would special-case data frames with no columns and automatically resize them as needed? Then indeed it makes sense for df[:a] .= 1 to create one row, as we do in other places.

bkamins · 2019-05-10T08:19:26Z

We could also disallow broadcasting into a DataFrame having zero columns (this would be simpler implementation-wise). But I just want to make sure that we discuss all corner cases before I start implementing it.

nalimilan · 2019-05-10T09:18:32Z

As you prefer. I guess it's safer from an API standpoint to throw an error when there are no columns, that way we can make a decision later based on the collected experience with .=.

bkamins · 2019-05-10T13:44:18Z

@mbauman + @nalimilan I have cleaned up the code following all the comments. Given the simplifications we allow now it is much cleaner and uses only broadcastable, maybeview and copyto! which I hope is the expected way to do it.

I have three comments:

1. We allow adding columns using broadcasting only with the df[col] .= v syntax, and only if !isempty(df).
2. In general we could consider in the future adding support for something like copyto!(df[1:2, 1:2], rand(2,2)), which will now throw an error, as I expect that the source in copyto! supports eachindex and that the produced values are CartesianIndex (which is not true for non-broadcasted matrices, as they use linear indexing). @nalimilan - do you think we should add this support or leave it for later?
3. In copyto! I do not check whether the source and destination have the same dimensions (this is done earlier by the broadcasting infrastructure); however, this means that if someone uses copyto! directly, one can corrupt the destination (if the source is too large; here Base errors) or silently produce an unexpected result (if the source is too small; here the same happens in Base). Also please comment if you have any perspective on this.
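The Base behavior referred to in the last point can be reproduced with plain vectors. This is a sketch only; `strict_copyto!` is a hypothetical name showing the kind of guard one could add, not a function from Base or from this PR.

```julia
# Source shorter than destination: copyto! succeeds and leaves the tail untouched.
dest = zeros(3)
copyto!(dest, [1.0, 2.0])
# dest is now [1.0, 2.0, 0.0]

# Source longer than destination: copyto! throws.
small = zeros(2)
threw = try
    copyto!(small, [1.0, 2.0, 3.0])
    false
catch err
    err isa BoundsError
end

# Hypothetical stricter variant that rejects any shape mismatch up front.
function strict_copyto!(dest::AbstractArray, src::AbstractArray)
    axes(dest) == axes(src) ||
        throw(DimensionMismatch("destination and source axes differ"))
    return copyto!(dest, src)
end
```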

nalimilan · 2019-05-11T15:57:00Z

* We allow adding columns using broadcasting only using `df[col] .= v` syntax if `!isempty(df)`.

OK.

* In general we could consider in the future to add support for something like `copyto!(df[1:2, 1:2], rand(2,2))`, which will now throw an error as I expect that the source in `copyto!` supports `eachindex` and the produced values are `CartesianIndex` (which is not true for not-broadcasted matrices as they use linear indexing). @nalimilan - do you think we should add this support or leave it for later?

Why not use CartesianIndices instead of eachindex? That shouldn't be slower. Anyway (as I noted in a comment) I think for performance we should iterate column-wise.

* in `copyto!` I do not check if the source and destination have the same dimensions (this is done by broadcasting infrastructure earlier); however, this means that if someone uses `copyto!` directly one can corrupt a destination (if source is too large - here Base errors) or silently produce an unexpected result (if source is too small - here the same happens in Base). Also please comment if you have any perspective on this.

If possible I think the signatures should be stricter to ensure they are only used for broadcasting.

bkamins · 2019-05-11T16:19:00Z

Why not use CartesianIndices instead of eachindex?

Because only eachindex is guaranteed to be defined by broadcasting infrastructure (again - if I understand @mbauman correctly). CartesianIndices do not have to be defined.

bkamins · 2019-05-29T11:46:48Z

Thank you for the comments. I have rewritten the tests to be more robust (it would not hurt to have a second pair of eyes look at them, although unfortunately they are now really cumbersome to check). I have left four of your comments unresolved, as you might have some further comments on these issues.

nalimilan

I think it's fine if you confirm the compiler optimizes the eltype computation and you apply the suggestion. Great work!

bkamins · 2019-05-31T10:34:10Z

I have added tests, fixed one type instability in axes definition and added a "fast" path for most common simple case of writing df[:col] .= scalar.

Here are some benchmarks:

julia> function f1()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] .= 1
               df[:b] .= 1
               df[:c] .= 1
               df[:d] .= 1
               df[:e] .= 1
               df[:f] .= 1
               df[:g] .= 1
               df[:h] .= 1
               df[:i] .= 1
               df[:j] .= 1
           end
           nothing
       end
f1 (generic function with 1 method)

julia> function f2()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] = 1
               df[:b] = 1
               df[:c] = 1
               df[:d] = 1
               df[:e] = 1
               df[:f] = 1
               df[:g] = 1
               df[:h] = 1
               df[:i] = 1
               df[:j] = 1
           end
           nothing
       end
f2 (generic function with 1 method)

julia> function f3()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] .= 1:10^6
               df[:b] .= 1:10^6
               df[:c] .= 1:10^6
               df[:d] .= 1:10^6
               df[:e] .= 1:10^6
               df[:f] .= 1:10^6
               df[:g] .= 1:10^6
               df[:h] .= 1:10^6
               df[:i] .= 1:10^6
               df[:j] .= 1:10^6
           end
           nothing
       end
f3 (generic function with 1 method)

julia> function f4()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] = 1:10^6
               df[:b] = 1:10^6
               df[:c] = 1:10^6
               df[:d] = 1:10^6
               df[:e] = 1:10^6
               df[:f] = 1:10^6
               df[:g] = 1:10^6
               df[:h] = 1:10^6
               df[:i] = 1:10^6
               df[:j] = 1:10^6
           end
           nothing
       end
f4 (generic function with 1 method)

julia> @benchmark f1()
BenchmarkTools.Trial:
  memory estimate:  839.33 MiB
  allocs estimate:  1960
  --------------
  minimum time:     1.289 s (51.17% GC)
  median time:      1.342 s (52.85% GC)
  mean time:        1.348 s (52.60% GC)
  maximum time:     1.418 s (54.07% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f2()
BenchmarkTools.Trial:
  memory estimate:  839.28 MiB
  allocs estimate:  860
  --------------
  minimum time:     1.300 s (50.47% GC)
  median time:      1.347 s (52.75% GC)
  mean time:        1.347 s (52.59% GC)
  maximum time:     1.394 s (54.26% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f3()
BenchmarkTools.Trial:
  memory estimate:  839.33 MiB
  allocs estimate:  1960
  --------------
  minimum time:     1.424 s (46.23% GC)
  median time:      1.490 s (47.96% GC)
  mean time:        1.487 s (47.94% GC)
  maximum time:     1.543 s (49.47% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f4()
BenchmarkTools.Trial:
  memory estimate:  839.28 MiB
  allocs estimate:  760
  --------------
  minimum time:     1.302 s (50.31% GC)
  median time:      1.358 s (52.73% GC)
  mean time:        1.357 s (52.53% GC)
  maximum time:     1.411 s (54.19% GC)
  --------------
  samples:          4
  evals/sample:     1

The conclusion is that in the "fast common path" we are on par with what we had. If there is a need of mapreduce there is less than 10% penalty, but I guess this will not be a common use case (in such a case you can still use the df[:col] = vector assignment anyway).

bkamins · 2019-06-01T15:25:47Z

@nalimilan I know it is approved, but the change is significant, so please confirm that you are OK with this being merged 😄.

bkamins · 2019-06-04T16:31:56Z

I am going to merge it and move forward to further broadcasting/setindex! issues if there are no more comments on this.

bkamins · 2019-06-07T14:57:01Z

CI will fail now. I will retrigger it after JuliaRegistries/General#1254 is merged.

bkamins · 2019-06-07T18:57:52Z

As noted on Slack it works by using similar(Vector{T}, nrow(lazydf.df)) which is strange as actually sometimes we do not get Vector but some other container type as a result.

nalimilan · 2019-06-07T20:07:49Z

Yes that's kind of strange, but OTOH you're not supposed to store CategoricalValue in an Array. CategoricalArray{T} behaves like Array{CategoricalValue{T}} for all purposes.

bkamins · 2019-06-07T22:22:18Z

I have merged this PR with master to make sure all is in sync.

bkamins added 2 commits May 6, 2019 23:36

broadcasting in DataFrames

aaa9b92

add two new files to the repo

1477b5b

bkamins closed this May 6, 2019

bkamins reopened this May 6, 2019

bkamins added 2 commits May 7, 2019 00:17

fix copyto!

43d0e2e

try to fix Julia 1.0 error

428e219

mbauman reviewed May 6, 2019

small fixes

554ea3b

new design after the review

bf2a5f1

nalimilan reviewed May 11, 2019

Update src/other/broadcasting.jl

2f27a3d

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

Update src/other/broadcasting.jl

9efdb29

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

bkamins mentioned this pull request May 11, 2019

Allow data frame and DataFrameRow to take part in broadcasting #1806

Closed

corrections after code review

d7ccdb3

bkamins added 2 commits May 29, 2019 13:42

improve tests

974466f

Merge remote-tracking branch 'origin/broadcasting_dataframe' into broadcasting_dataframe

0ea04e9

nalimilan reviewed May 30, 2019

nalimilan approved these changes May 30, 2019

bkamins and others added 3 commits May 31, 2019 11:45

Update test/broadcasting.jl

06d86ce

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

tests of single column matrix in broadcasting

1ce5d2f

performance improvements

09af7d9

nalimilan reviewed Jun 4, 2019

bkamins added 5 commits June 5, 2019 20:39

improved copyto!

e0db92e

add another fast branch

4917012

some additional tests

9e9aaad

fix typo

181c5fb

use new categoricalarrays broadcasting support

0c6843b

bkamins added 2 commits June 7, 2019 17:24

minor code improvements after the review

e3c5d57

fix similar

7cdee94

nalimilan reviewed Jun 7, 2019

bkamins and others added 3 commits June 7, 2019 22:27

Update src/other/broadcasting.jl

1c11e70

Co-Authored-By: Milan Bouchet-Valat <[email protected]>

up CatArrays version

43a383a

Merge branch 'master' into broadcasting_dataframe

0caa3ca

nalimilan approved these changes Jun 8, 2019

bkamins merged commit 514017d into JuliaData:master Jun 8, 2019

bkamins deleted the broadcasting_dataframe branch June 8, 2019 13:08
