Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In-place broadcast assignment #3206

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

gustafsson
Copy link
Contributor

@gustafsson gustafsson commented Oct 25, 2022

df.col = scalar and df[:, col] = scalar are disallowed from DataFrames. My understanding is that this single rule is an unnecessary limitation that, combined with an ambition to make the API otherwise easy to use, has unfortunately cascaded into indexing and property rules that are unique (i.e df.col .= ... behave like nothing else) as well as deviating from the intended behaviour described in Base Julia:

If ... = scalar was implemented one could greatly simplify the indexing rules and avoid surprises for both beginners and advanced users alike. Improving consistency and ease-of-use. The changes in this PR are:

  1. Let = allocate new columns by copying the RHS. Scalars fill.
  2. Use .= for in-place assignment.
  3. @alias df.col = ... provides explicit aliasing. This is a new exported macro. Unintended aliasing of columns (where df.a === df.b) has been a common source of bugs (previously suggested to be called @nocopy before it was dropped).
  4. df[r, ...] = assignment into a subset of rows promote column types to store both old and new values.
  5. In-place assignment .= to something that doesn't exist should error. Incompatible eltypes also error with .=. I think that a user advanced enough to bother with in-place assignment care about precise behaviour. In a migration period these could still be allowed with a deprecation message "consider using = instead of .=" since DataFrames users are currently told to use .= when they expect to = to work.

These changes would make = behave like .= currently does, for d[!,..] and df.col. Unfortunately this is a breaking change for code that relies on either of these (albeit obscure) features.

  • Aliasing into a DataFrame. Replace df.col = v with @alias df.col = v if df.col === v is assumed true after the assignment.
  • broadcast assignment that changes the type of the LHS. Could be allowed with a deprectation message "Replace .= with =".
  • broadcast assignment to new columns. Could be allowed with a deprectation message "Replace .= with =".

In addition to my own code base I tested my changes on these repos:

  • Agents.jl - One unrelated unit test error with DataFrames v1.4 and an identical error with these changes.
  • DataFrameMacros.jl - No unit test errors.
  • DataFramesMeta.jl - A single unit test did df.col = v and later tested identicality with v. Patched with @alias df.col = v. The implementation outside of unit tests did not need to be patched.
  • StatsKit.jl - No unit test errors.

"Wouldn't ... = scalar just break another rule instead?"

Whereas .= is supposed to always be in-place the = may on the other hand convert and copy according to Julia docs.

  • I also interpret the intention of the rules as allowing an implementation to choose to always copy on = even if the LHS and RHS types are the same.
  • Converting a scalar to an AbstractVector in a context where the length is known is not breaking any rules in my interpretation. One could conceivably use a "SameValueVector" of sorts which would defer the fill until later (i.e until df[row, col] = ...)

Additional steps not currently included in this PR:

  1. A deprecation path where the old behaviour is still allowed if the eltype assigned with .= isn't identical: "consider using = instead of .="
  2. Support aliasing with @alias df[:, ...] = v. It would then be possible to remove df[!, ...] from the LHS without loss of functionality (as far as I can tell). That could imho lessen confusion about the multitude of overlapping indexing rules.
  3. Aliasing out of a DataFrame could use the same macro @alias v = df[:, ...]. This would make it possible to remove df[!, ...] from the RHS without loss of functionality.

By conforming to Base Julia the complete indexing rules would then be greatly simplified to just:

Complete indexing rules

A DataFrame can be indexed like a Matrix with df[...] (or @view df[...]) with only one exception:

  • = always replaces existing columns of a DataFrame with new copies (whereas it's in-place on a subset of a Matrix).

To work with columns of a data frame:

  • Columns can be referred to by name with a symbol or string.
  • = allow the creation of new columns by name.
  • df.col and df."col" provide struct-like access to a column where col is the name of a column.
  • Columns can also be referred to with selectors like Not(col), Between(firstcol, lastcol), Cols(...) or All().
  • Aliasing of whole columns: @alias df.col = x stores x instead of a copy of x so df.col === x. @alias is not applicable on the RHS where df.col by itself provides an alias.
  • Aliasing of whole columns: @alias df[:, ...] does not copy. Neither on the RHS nor the LHS.

Example

After this PR a DataFrame would behave like this:

julia> using DataFrames

julia> df = DataFrame(x = 1:3);

julia> # Assignment with setproperty! fills were applicable, for convenience as well as appeasing newcomers
       df.a = 1
1

julia> df
3×2 DataFrame
 Row │ x      a     
     │ Int64  Int64 
─────┼──────────────
   11      1
   22      1
   33      1

julia> # Assignment does not create an alias since unintented aliases are a common source of bugs
       df.b = df.a;

julia> df.b === df.a
false

julia> # Intentional aliases are fine though
       @alias df.b = df.a;

julia> df.b === df.a
true

julia> # Broadcasted assignment .= is in-place
       df.b .= 2;

julia> df.b === df.a
true

julia> # In-place assignment errors if `c` doesn't exists
       df.c .= 1
ERROR: ArgumentError: column name "c" not found in the data frame; existing most similar names are: "a", "b" and "x"

julia> # Assigning an incompatible type in-place fails
       df.b .= "3"
ERROR: MethodError: Cannot `convert` an object of type String to an object of type Int64

julia> # Assigning not in-place replaces and fills
       df.b = "3"
"3"

julia> df
3×3 DataFrame
 Row │ x      a      b      
     │ Int64  Int64  String 
─────┼──────────────────────
   11      2  3
   22      2  3
   33      2  3

julia> # Since aliasing is now explicit one may as well allow AbstractRange if that's what explicitly asked for
       @alias df.c = 1:3;

julia> df.c
1:3

julia> # Not explicitly aliasing an AbstractRange behaves as before
       df.c = 1:3;

julia> df.c
3-element Vector{Int64}:
 1
 2
 3

@gustafsson
Copy link
Contributor Author

This PR was inspired by #3200 - "Why isn't df.col .= v in-place?"

@gustafsson gustafsson force-pushed the in-place-broadcast-assignment branch from 8927b49 to 47e2cd5 Compare October 25, 2022 14:02
@gustafsson gustafsson force-pushed the in-place-broadcast-assignment branch from 47e2cd5 to d681f96 Compare October 25, 2022 14:08
@bkamins bkamins added decision feature breaking The proposed change is breaking. labels Oct 25, 2022
@bkamins bkamins added this to the 2.0 milestone Oct 25, 2022
@bkamins
Copy link
Member

bkamins commented Oct 25, 2022

Thank you for your thoughtful PR. As you have noted in #3200 this change is to be considered in 2.0 release as it introduces changes that are significant.

Let us now wait for the user/maintainers feedback on the proposal.

@gustafsson gustafsson marked this pull request as draft November 2, 2022 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking The proposed change is breaking. decision feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants