Broadcasting in DataFrames #1804

Merged
merged 32 commits into from
Jun 8, 2019

bkamins
Member

@bkamins bkamins commented May 6, 2019

This PR is a result of a discussion to take a staged approach at setindex!/broadcasting in DataFrames.jl.
We start with broadcasting.

This PR invalidates #1643 and #1646, but I will keep them open for reference until everything that was discussed there becomes irrelevant.

In general we treat AbstractDataFrame as a matrix in broadcasting and DataFrameRow as a vector.

The tricky part is that we want to allow adding columns using broadcasting. What I proposed works as expected, but is not entirely clean and uses Base internals a lot. Also if you review this PR please have a look at the tests as they expose corner cases (which are consistent with Base, although sometimes surprising).

@mbauman - we have discussed with @nalimilan that it would be great if you commented on this PR as the design of broadcasting in Base is complex and I am not sure if I have taken the best path. Thank you!

@bkamins bkamins closed this May 6, 2019
@bkamins bkamins reopened this May 6, 2019

@mbauman mbauman left a comment

Just a few thoughts after a quick read-through.

@bkamins
Member Author

bkamins commented May 6, 2019

@nalimilan - this should be good to look at, although it fails CI. There is some problem with materialize! on Julia 1.0 only (I will look into the cause tomorrow), but otherwise the PR is good.

@bkamins
Member Author

bkamins commented May 7, 2019

@mbauman thank you for your time and having a look at these issues.

The crucial part for me is whether we need to materialize (especially in the aliasing case) and how best to handle the problem with views that you have highlighted. So in general, what I am trying to learn is how one should safely use a lazy bc.

@mbauman

mbauman commented May 7, 2019

With respect to materialize, it's not really something that I intended for folks to call directly and I don't think there's a need to overload it except in very special cases.

The key thing to remember about broadcast is that it's entirely based upon indexing and axes. The typical way to use a bc object is to just iterate over eachindex and iteratively index into it. Note that what comes out of a dot-expression might not be a Broadcasted — it could be an immediately re-computed range, for example. With regards to alias-safety, we don't have a very friendly API to unalias broadcasted expression trees, but you could crib off of how Base preprocesses the broadcasted tree to do this.
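
To make the indexing-and-axes view concrete, here is a minimal sketch that uses only `Base.Broadcast`. The variable names are illustrative and are not taken from the PR. It builds a lazy `Broadcasted` object, consumes it by iterating over `eachindex`, and shows a dot-expression that yields a range instead of a `Broadcasted`.

```julia
using Base.Broadcast: Broadcasted, broadcasted, instantiate

# Lazy representation of `[1, 2, 3] .+ 10`; `instantiate` computes and checks the axes.
bc = instantiate(broadcasted(+, [1, 2, 3], 10))
@assert bc isa Broadcasted

# Consume it the generic way: iterate over eachindex and index into the object.
out = Vector{Int}(undef, length(eachindex(bc)))
for I in eachindex(bc)
    out[I] = bc[I]
end

# A two-dimensional case: a 2x1 column combined with a 1x2 row gives 2x2 axes.
bc2 = instantiate(broadcasted(*, reshape([1, 2], 2, 1), [10 20]))
out2 = Matrix{Int}(undef, map(length, axes(bc2)))
for I in eachindex(bc2)
    out2[I] = bc2[I]
end

# Not every dot-expression is lazy: range arithmetic is computed eagerly
# and comes back as a range rather than a Broadcasted.
r = broadcasted(+, 1:3, 1)
```

In the Julia versions this sketch was written against, `eachindex(bc2)` is a `CartesianIndices` object, so a destination that lacks linear indexing can still be filled as long as it accepts `CartesianIndex`.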

@bkamins
Member Author

bkamins commented May 7, 2019

Thank you for the tips on use of bc. I will try to rewrite the code in this way.

Regarding aliasing - the additional difficulty in DataFrames.jl is that a DataFrame is itself a nested structure that can contain aliases (e.g. if the user really wants to, one can store the same vector several times under different names). But thinking about it, we can probably simply document that the behavior of broadcasting is unspecified when aliasing is present and that this is not something that should be done (actually, we recently had to add a similar note for the push! and append! functions, which are also not aliasing safe).
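
The nested-aliasing concern can be illustrated without DataFrames.jl. In this sketch a plain `NamedTuple` stands in for a data frame (an assumption made only for illustration), and `Base.mightalias` and `Base.unalias` are unexported Base helpers, so their behavior may differ between Julia versions.

```julia
v = [1, 2, 3]
# Stand-in for a data frame that stores the same vector under two names.
cols = (a = v, b = v)

# Writing through one column silently changes the other.
cols.a[1] = 100
@assert cols.b[1] == 100

# Unexported Base helpers can detect memory shared between two arrays.
shared = Base.mightalias(cols.a, view(cols.b, 1:2))
separate = Base.mightalias(cols.a, copy(cols.b))

# `Base.unalias(dest, src)` returns `src` itself when it shares no memory
# with `dest`, and a copy of `src` otherwise.
safe_src = Base.unalias(cols.a, cols.b)
```

These helpers compare one pair of arrays at a time, which is part of why a container holding many possibly shared columns is harder to make alias-safe than a single array.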

@bkamins
Member Author

bkamins commented May 8, 2019

@nalimilan In Base broadcasting does not provide a cleanup on error, e.g.:

julia> x = ones(2)
2-element Array{Float64,1}:
 1.0
 1.0

julia> y = [2, "2"]
2-element Array{Any,1}:
 2
  "2"

julia> x .= y
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Float64

julia> x
2-element Array{Float64,1}:
 2.0
 1.0

Do you think it is acceptable we do the same, or we should make sure that we do not leave a data frame half-modified after an error?
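
For comparison, one way to avoid leaving the destination half-modified is to broadcast into a temporary first. This is only a sketch for plain vectors: `assign_or_rollback!` is a hypothetical helper name, not something in Base or DataFrames.jl, and it costs one extra allocation per assignment.

```julia
# Hypothetical helper: an all-or-nothing version of `dest .= src`.
function assign_or_rollback!(dest::AbstractVector, src)
    tmp = similar(dest)   # same element type and length as dest
    tmp .= src            # a conversion error is thrown here, before dest is touched
    copyto!(dest, tmp)    # element types match, so no conversion can fail here
    return dest
end

x = ones(2)
y = Any[2, "2"]

# The assignment fails on the second element, but x keeps its old contents.
failed = try
    assign_or_rollback!(x, y)
    false
catch err
    err isa MethodError
end
```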

@bkamins
Member Author

bkamins commented May 8, 2019

@mbauman I have two design questions for you:

  • What guarantees are there on the values produced by eachindex called on a bc object? In particular (this is a relevant and possibly problematic case, because data frames do not support linear indexing): is it possible that eachindex(bc) will return linear indices if bc is two-dimensional?
  • I want to make sure that I understand this correctly: the method https://github.com/JuliaData/DataFrames.jl/pull/1804/files#diff-57cbbe702db7443003fc8c590f239d70R48 (similar to one in Base) is needed only for performance. We could remove it and everything should still work. Is this right?

Thank you.

@bkamins
Member Author

bkamins commented May 9, 2019

@nalimilan I am holding off on further changes in this PR, as I would welcome your comment on which of the discussed options we should choose for column creation when broadcasting:

  1. create columns both on df[cols] .= v and df[col] .= v
  2. create columns only on df[col] .= v (this will simplify the design)
  3. never create columns and retain implicit broadcasting in df[col] = v and df[cols] = v
  4. never create columns and add explicit broadcasting via df[col] = Iterators.repeated(v) and df[cols] = Iterators.repeated(v) (we might decide to use some shorter alias for Iterators.repeated if we wanted)

Thank you!

@nalimilan
Member

I'd say 2 is fine if implementation isn't too complex. We can always implement df[cols] .= v later if we want.

@bkamins
Member Author

bkamins commented May 10, 2019

OK. I will go for 2 then (it should be easier as we only have to handle a single special case so the general mechanics in all other cases will not be affected).

@bkamins
Member Author

bkamins commented May 10, 2019

Just one more question. What should happen in this case:

df = DataFrame()
df[:a] .= 1

Currently I create :a but keep df with 0 rows, but maybe we should create 1 row for convenience (and for consistency with the DataFrame constructor and insertcols; BTW: we should rethink when we want to retain automatic broadcasting in these two cases).

I understand that in this case:

df = DataFrame()
df[:a] .= [1,2,3]

we should copy the vector and create 3 rows.

@nalimilan
Member

OK so we would special-case data frames with no columns and automatically resize them as needed? Then indeed it makes sense for df[:a] .= 1 to create one row, as we do in other places.

@bkamins
Member Author

bkamins commented May 10, 2019

We could also disallow broadcasting into a DataFrame having zero columns (this would be simpler implementation-wise). But I just want to make sure that we discuss all corner cases before I start implementing it.

@nalimilan
Member

As you prefer. I guess it's safer from an API standpoint to throw an error when there are no columns, that way we can make a decision later based on the collected experience with .=.

@bkamins
Member Author

bkamins commented May 10, 2019

@mbauman + @nalimilan I have cleaned up the code following all the comments. Given the simplifications we allow now it is much cleaner and uses only broadcastable, maybeview and copyto! which I hope is the expected way to do it.

I have three comments:

  • We allow adding columns using broadcasting only using df[col] .= v syntax if !isempty(df).
  • In general we could consider in the future to add support for something like copyto!(df[1:2, 1:2], rand(2,2)), which will now throw an error as I expect that the source in copyto! supports eachindex and the produced values are CartesianIndex (which is not true for not-broadcasted matrices as they use linear indexing). @nalimilan - do you think we should add this support or leave it for later?
  • In copyto! I do not check if the source and destination have the same dimensions (this is done by the broadcasting infrastructure earlier); however, this means that if someone uses copyto! directly, one can corrupt a destination (if the source is too large - here Base errors) or silently produce an unexpected result (if the source is too small - here the same happens in Base). Also please comment if you have any perspective on this.
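
The column-creation rule agreed on in this thread can be sketched on a toy container. `ToyTable`, `nrows` and `broadcast_column!` are hypothetical names invented for this illustration, and the sketch assumes "empty" means "has no columns"; it is not the code in src/other/broadcasting.jl.

```julia
# Hypothetical stand-in for a data frame: column name => vector, all of equal length.
struct ToyTable
    cols::Dict{Symbol,Vector}
end

nrows(t::ToyTable) = isempty(t.cols) ? 0 : length(first(values(t.cols)))

# Sketch of `t[name] .= v` for a new or an existing column.
function broadcast_column!(t::ToyTable, name::Symbol, v)
    isempty(t.cols) &&
        throw(ArgumentError("broadcasting into a table with no columns is not allowed"))
    # Broadcast `v` against the row axis: scalars are repeated, vectors of the
    # right length are copied, and any other length throws DimensionMismatch.
    t.cols[name] = broadcast((x, _) -> x, v, 1:nrows(t))
    return t
end

t = ToyTable(Dict{Symbol,Vector}(:x => [1, 2, 3]))
broadcast_column!(t, :a, 1)          # scalar fills all three rows
broadcast_column!(t, :b, [4, 5, 6])  # vector is copied, not stored by reference
```

Letting Base's broadcasting machinery do the shape check mirrors the point above that dimension checks happen before `copyto!` is reached.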

Co-Authored-By: Milan Bouchet-Valat <[email protected]>
@nalimilan
Member

* We allow adding columns using broadcasting only using `df[col] .= v` syntax if `!isempty(df)`.

OK.

* In general we could consider in the future to add support for something like `copyto!(df[1:2, 1:2], rand(2,2))`, which will now throw an error as I expect that the source in `copyto!` supports `eachindex` and the produced values are `CartesianIndex` (which is not true for not-broadcasted matrices as they use linear indexing). @nalimilan - do you think we should add this support or leave it for later?

Why not use CartesianIndices instead of eachindex? That shouldn't be slower. Anyway (as I noted in a comment) I think for performance we should iterate column-wise.

* In `copyto!` I do not check if the source and destination have the same dimensions (this is done by the broadcasting infrastructure earlier); however, this means that if someone uses `copyto!` directly, one can corrupt a destination (if the source is too large - here Base errors) or silently produce an unexpected result (if the source is too small - here the same happens in Base). Also please comment if you have any perspective on this.

If possible I think the signatures should be stricter to ensure they are only used for broadcasting.

Co-Authored-By: Milan Bouchet-Valat <[email protected]>
@bkamins
Member Author

bkamins commented May 11, 2019

Why not use CartesianIndices instead of eachindex?

Because only eachindex is guaranteed to be defined by broadcasting infrastructure (again - if I understand @mbauman correctly). CartesianIndices do not have to be defined.
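
As an illustration of the column-wise traversal suggested earlier, the sketch below walks a two-dimensional lazy broadcast one column at a time. It assumes only that the `Broadcasted` object provides `axes` and accepts `CartesianIndex` in `getindex`; the destination is a plain vector of vectors standing in for data frame columns.

```julia
using Base.Broadcast: broadcasted, instantiate

# Lazy `[1 2; 3 4] .+ [10, 20]`, never materialized as a matrix.
bc = instantiate(broadcasted(+, [1 2; 3 4], [10, 20]))
rows, cols = axes(bc)

# One destination vector per column, as in a column-oriented table.
dest = [Vector{Int}(undef, length(rows)) for _ in cols]

# Outer loop over columns, inner loop over rows, so each destination
# vector is looked up once and then filled contiguously.
for j in cols
    col = dest[j]
    for i in rows
        col[i] = bc[CartesianIndex(i, j)]
    end
end
```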

@bkamins
Member Author

bkamins commented May 29, 2019

Thank you for the comments. I have rewritten the tests to be more robust (it would not hurt to have a second pair of eyes look at them, although unfortunately they are now really cumbersome to check). I have left four of your comments unresolved, as you might have some further comments on these issues.

Member

@nalimilan nalimilan left a comment

I think it's fine if you confirm the compiler optimizes the eltype computation and you apply the suggestion. Great work!

@bkamins
Member Author

bkamins commented May 31, 2019

I have added tests, fixed one type instability in axes definition and added a "fast" path for most common simple case of writing df[:col] .= scalar.

Here are some benchmarks:

julia> function f1()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] .= 1
               df[:b] .= 1
               df[:c] .= 1
               df[:d] .= 1
               df[:e] .= 1
               df[:f] .= 1
               df[:g] .= 1
               df[:h] .= 1
               df[:i] .= 1
               df[:j] .= 1
           end
           nothing
       end
f1 (generic function with 1 method)

julia> function f2()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] = 1
               df[:b] = 1
               df[:c] = 1
               df[:d] = 1
               df[:e] = 1
               df[:f] = 1
               df[:g] = 1
               df[:h] = 1
               df[:i] = 1
               df[:j] = 1
           end
           nothing
       end
f2 (generic function with 1 method)

julia> function f3()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] .= 1:10^6
               df[:b] .= 1:10^6
               df[:c] .= 1:10^6
               df[:d] .= 1:10^6
               df[:e] .= 1:10^6
               df[:f] .= 1:10^6
               df[:g] .= 1:10^6
               df[:h] .= 1:10^6
               df[:i] .= 1:10^6
               df[:j] .= 1:10^6
           end
           nothing
       end
f3 (generic function with 1 method)

julia> function f4()
           for i in 1:10
               df = DataFrame(x=1:10^6)
               df[:a] = 1:10^6
               df[:b] = 1:10^6
               df[:c] = 1:10^6
               df[:d] = 1:10^6
               df[:e] = 1:10^6
               df[:f] = 1:10^6
               df[:g] = 1:10^6
               df[:h] = 1:10^6
               df[:i] = 1:10^6
               df[:j] = 1:10^6
           end
           nothing
       end
f4 (generic function with 1 method)

julia> @benchmark f1()
BenchmarkTools.Trial:
  memory estimate:  839.33 MiB
  allocs estimate:  1960
  --------------
  minimum time:     1.289 s (51.17% GC)
  median time:      1.342 s (52.85% GC)
  mean time:        1.348 s (52.60% GC)
  maximum time:     1.418 s (54.07% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f2()
BenchmarkTools.Trial:
  memory estimate:  839.28 MiB
  allocs estimate:  860
  --------------
  minimum time:     1.300 s (50.47% GC)
  median time:      1.347 s (52.75% GC)
  mean time:        1.347 s (52.59% GC)
  maximum time:     1.394 s (54.26% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f3()
BenchmarkTools.Trial:
  memory estimate:  839.33 MiB
  allocs estimate:  1960
  --------------
  minimum time:     1.424 s (46.23% GC)
  median time:      1.490 s (47.96% GC)
  mean time:        1.487 s (47.94% GC)
  maximum time:     1.543 s (49.47% GC)
  --------------
  samples:          4
  evals/sample:     1

julia> @benchmark f4()
BenchmarkTools.Trial:
  memory estimate:  839.28 MiB
  allocs estimate:  760
  --------------
  minimum time:     1.302 s (50.31% GC)
  median time:      1.358 s (52.73% GC)
  mean time:        1.357 s (52.53% GC)
  maximum time:     1.411 s (54.19% GC)
  --------------
  samples:          4
  evals/sample:     1

The conclusion is that in the "fast common path" we are on par with what we had. If there is a need for mapreduce, there is less than a 10% penalty, but I guess this will not be a common use case (in such a case you can still use the df[:col] = vector assignment anyway).

@bkamins
Member Author

bkamins commented Jun 1, 2019

@nalimilan I know it is approved, but the change is significant, so please confirm that you are OK with this being merged 😄.

@bkamins
Member Author

bkamins commented Jun 4, 2019

I am going to merge it and move forward to further broadcasting/setindex! issues if there are no more comments on this.

@bkamins
Member Author

bkamins commented Jun 7, 2019

CI will fail now. I will retrigger it after JuliaRegistries/General#1254 is merged.

@bkamins
Member Author

bkamins commented Jun 7, 2019

As noted on Slack, it works by using similar(Vector{T}, nrow(lazydf.df)), which is strange, as sometimes we actually do not get a Vector but some other container type as a result.

@nalimilan
Member

Yes that's kind of strange, but OTOH you're not supposed to store CategoricalValue in an Array. CategoricalArray{T} behaves like Array{CategoricalValue{T}} for all purposes.

@bkamins
Member Author

bkamins commented Jun 7, 2019

I have merged this PR with master to make sure all is in sync.

@bkamins bkamins merged commit 514017d into JuliaData:master Jun 8, 2019
@bkamins bkamins deleted the broadcasting_dataframe branch June 8, 2019 13:08