-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add broadcasting of AbstractDataFrame #1840
Add broadcasting of AbstractDataFrame #1840
Conversation
@nalimilan - the tests pass on Julia 1.1, but fail on Julia 1.0 due to a check against |
Fortunately all works after all. It was another change in broadcasting between 1.0 and 1.1 that caused the problem but it is unrelated to this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nalimilan - in particular have a look at how
CategoricalVector
is transformed in broadcasting using currentsimilar
rules. This is visible in the test starting withdf3 = coalesce.(df, nothing)
line
That sounds almost too good to be true! It's really great that broadcasting preserves CategoricalArray
when the function returns a CategoricalValue
or Missing
, but returns a plain Array{T}
when it returns CategoricalValue
or Nothing
(and not Array{Union{CategoricalValue{T},Nothing}}
, which isn't very useful).
That works because promote_type(Nothing, CategoricalString{UInt32}) === Union{Nothing,String}
. Unfortunately, that's not completely consistent with what happens with standard broadcasting over arrays, since it uses promote_typejoin
, which returns Union{Nothing,CategoricalString{UInt32}}
. Maybe we should discuss separately whether we should keep using promote_type
instead of promote_typejoin
(the idea was that you almost always want homogeneous column types when working with tabular data, contrary to general programming with arrays which is harder to decide).
You have the most experience here (but as you say we can leave it for later). I think the difference is not that big in practice as this case should not happen often (if at all). If we change it then probably we should make sure that we make it consistent everywhere (in particular in original |
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
@nalimilan In the process of analysis of @mbauman suggestions I have identified an aliasing bug we had in broadcasting assignment. |
@nalimilan - this PR should be good for another look at. I have implemented fast handling of columns and alias detection in all cases (I hope). Additional test case suggestions would be appreciated as the code, although not very long, was non-trivial for me. |
I have rebased this PR against master and added even more tests. |
Benchmarks Simple broadcasting
Assignment broadcasting (here the performance of data frames drops):
Now the case where we are hit by anty-aliasing analysis most:
This one is slow:
because it does not use custom optimizations (we have a data frame on RHS, but not on LHS). OTOH this is faster than
as aliasing analysis in this case is cheap. So in summary: if you have a very large number of columns it is better to work with matrices than data frames (which probably could be expected). |
In data frame unaliasing code I can add a line:
which:
The general reason why unaliasing is slow is because in this code we have to do:
which is not type stable (so all subsequent checks require dynamic dispatch) and this is performed O( |
Sounds like a good idea. It shouldn't be very costly compared to the unalias checks we already perform. That will make |
After discussion with @nalimilan I have added some performance improvements in unaliasing. Here are the updated benchmarks. I report the relevant cases. Long data frame:
Wide data frame:
We can see that in the case of wide data frame (1000 columns) unaliasing costs ~25% of total time. The biggest penalty for this range is the fact that data frame is not type stable (when we process a small number of long columns we do not have this problem as most work is done behind a barrier function). |
@nalimilan Thank you for working on it. |
and @mbauman for an excellent suggestion of the column extraction approach. |
Design principles of data frame broadcasting:
|
@nalimilan I have done a final round of code check after your approval, which lead to some small changes (unfortunately this PR is hard for me to get it right in one run). They are collected in the last two commits (the first adds tests, the second changed documentation and implementation). Changes:
|
Co-Authored-By: Milan Bouchet-Valat <[email protected]>
This resolves #1806 and #1610.
@nalimilan - in particular have a look at how
CategoricalVector
is transformed in broadcasting using currentsimilar
rules. This is visible in the test starting withdf3 = coalesce.(df, nothing)
line@mbauman - could you have a look at this PR if you have time to spare (and also possibly at the whole other/broadcasting.jl file which now lays out full rules of broadcasting of data frames)