Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update the broadcasted getproperty when Julia 1.7 is out #2804

Closed
bkamins opened this issue Jun 24, 2021 · 8 comments · Fixed by #3022
Closed

Update the broadcasted getproperty when Julia 1.7 is out #2804

bkamins opened this issue Jun 24, 2021 · 8 comments · Fixed by #3022
Labels
ecosystem Issues in DataFrames.jl ecosystem
Milestone

Comments

@bkamins
Copy link
Member

bkamins commented Jun 24, 2021

Just to keep track of it. See #2655.

@bkamins bkamins added the ecosystem Issues in DataFrames.jl ecosystem label Jun 24, 2021
@bkamins bkamins added this to the patch milestone Jun 24, 2021
@bkamins
Copy link
Member Author

bkamins commented Oct 26, 2021

On Slack majority of people voted to:

make df.a .= something replace a present column under Julia 1.7 (it cannot be done under Julia 1.6) - this will essentially mean that we will make a slightly breaking change (towards a better functionality) and have a discrepancy here between Julia 1.6 and 1.7

I will make this change in 1.3 release after Julia 1.7 is out then.

However, as none of the key people (maintaining highly-coupled packages or representatives of the wider user community) seem to have voted there I am calling key people out here for opinion before I go this path (of course others are also welcome to chime in): @nalimilan, @oxinabox, @pdeffebach, @quinnj, @jkrumbiegel.

@pdeffebach
Copy link
Contributor

That question confused me, sorry.

df.a .= something should perform broadcasted setindex and should not allocate.

@bkamins
Copy link
Member Author

bkamins commented Oct 27, 2021

df.a .= something should perform broadcasted setindex! and should not allocate.

This is the current behavior. Are you sure you want the following:

julia> df = DataFrame(x=1:3)
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │     1
   2 │     2
   3 │     3

julia> df.x .= 'x'
3-element Vector{Int64}:
 120
 120
 120

julia> df # people typically expect that `:x` column contains 'x' Char values after the previous operation
3×1 DataFrame
 Row │ x
     │ Int64
─────┼───────
   1 │   120
   2 │   120
   3 │   120

julia> df.x .= 1.5 # people typically expect this operation to work
ERROR: InexactError: Int64(1.5)

The point of this proposal is to make .= consistent with setindex! when used with ! as row selector. The current behavior is that .= has the same behavior as when using : as row selector (and apart from the fact that it is inconsistent with other cases when broadcasting does the same as ! row selector this leads to the cases like the ones shown above).

@pdeffebach
Copy link
Contributor

Thank you for reminding me what the problem is.

I'm on the fence, but I really do think that if df.x returns the vanilla Julia vector, then operations with df.x should behave like other vanilla Julia vectors.

Of course, we already have one exception, which is when the column :x doesn't exist in the data frame.

I think overall, we should accomadate people's intuitions about what allocates, rather than intuition about what promotion rules are in other languages. I anticipate more people being frustrated by slowdowns when df.x .= 1 makes a copy than people who are confused by df.x .= 'a'. For this reason I think we should keep default behavior.

By keeping the default vector-like behavior, we don't have to explain our complicated rules as much and just point to the Base Julia docs. The consistency is nice.

That said, you know I imagine the audience of DataFrames.jl to be something like first year MPP students trying to understand data analysis for the first time. So I understand the considerations about type promotion.

@bkamins
Copy link
Member Author

bkamins commented Oct 27, 2021

frustrated by slowdowns when df.x .= 1 makes a copy

Here is the slowdown level:

julia> @benchmark fill(1, 10^6)
BenchmarkTools.Trial: 2548 samples with 1 evaluation.
 Range (min … max):  1.057 ms … 48.541 ms  ┊ GC (min … max):  0.00% … 97.15%
 Time  (median):     1.365 ms              ┊ GC (median):     0.00%
 Time  (mean ± σ):   1.956 ms ±  1.803 ms  ┊ GC (mean ± σ):  31.85% ± 27.19%

    █▆█▅▂                                        ▁▃▂   ▁▁▃▂▁ ▁
  ▇███████▆▄▅▃▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▃▃▃▁▁▁▃▁▆█████▇█████ █
  1.06 ms      Histogram: log(frequency) by time     5.23 ms <

 Memory estimate: 7.63 MiB, allocs estimate: 2.

vs.

julia> x = zeros(Int, 10^6);

julia> @benchmark $x .= 1
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  226.500 μs … 925.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     232.500 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   238.375 μs ±  23.487 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █ ▄
  ▅███▆▃▃█▅▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  226 μs           Histogram: frequency by time          322 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

so indeed this is slower, but note that the timings of such operations are so fast that I imagine no one would ever notice the slowdown in practice.

@nalimilan
Copy link
Member

It's difficult to discuss intuitions as they vary a lot depending on previous experience, context, etc. IMO the main advantage of this change would be to have df.x be always equivalent to df[!, :x]. Given how complex the rules are, it seems important to avoid having three different syntaxes (df.x, df[:, :x] and df[!, :x]).

The downside is of course that we introduce an inconsistency between df.x .= 1 and v = df.x; v .= 1.

Link to previous discussion (credits to @bkamins): #2655.

@bkamins
Copy link
Member Author

bkamins commented Oct 29, 2021

to have df.x be always equivalent to df[!, :x]

this is my major personal motivation

@bkamins
Copy link
Member Author

bkamins commented Nov 14, 2021

The plan is that we will change this behavior in DataFrames.jl 1.4. In 1.3 release an explicit warning message will be printed that the behavior will change in 1.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ecosystem Issues in DataFrames.jl ecosystem
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants