-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: implementation of new setindex! and broadcasting rules #1646
Conversation
docs/src/lib/indexing.md
Outdated
> `df[rows, col] = v` | ||
|
||
* The same rules as for `df[col] = v` but on selected rows and always with copying. | ||
* Empty data frames are not allowed and no column adding is possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe df[:, col] = v
and/or df[:, col] .= v
should allow empty data frames?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have df[col] = v
and df[col] .= v
for this case. It would render df[:, col] = v
inconsistent.
And here I exactly try to fix the inconsistency in the current design of DataFrames.jl. See the example how x
and y
are treated differently:
julia> x = [1,2,3]
3-element Array{Int64,1}:
1
2
3
julia> y = [1,2,3]
3-element Array{Int64,1}:
1
2
3
julia> df = DataFrame(x=x,y=y)
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
julia> df[:, :x] = [0,0,0]
3-element Array{Int64,1}:
0
0
0
julia> df[1:3, :y] = [0,0,0]
3-element Array{Int64,1}:
0
0
0
julia> df
3×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 0 │ 0 │
│ 2 │ 0 │ 0 │
│ 3 │ 0 │ 0 │
julia> x
3-element Array{Int64,1}:
1
2
3
julia> y
3-element Array{Int64,1}:
0
0
0
julia> df[:, :x] = 'a':'c'
'a':1:'c'
julia> df
3×2 DataFrame
│ Row │ x │ y │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 0 │
│ 2 │ 'b' │ 0 │
│ 3 │ 'c' │ 0 │
julia> df[1:3, :y] = 'a':'c'
'a':1:'c'
julia> df
3×2 DataFrame
│ Row │ x │ y │
│ │ Char │ Int64 │
├─────┼──────┼───────┤
│ 1 │ 'a' │ 97 │
│ 2 │ 'b' │ 98 │
│ 3 │ 'c' │ 99 │
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is problematic, but I don't see the relationship with empty data frames (it's been a long time....).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed rule is that if we use df[:, col]
or df[rows, col]
in general we write to an existing vector in the DataFrame
in-place. So - naturally it has to fail in an empty DataFrame
.
Or - from another angle - the proposed rule is that on LHS df[:, col]
is rewritten as df[axes(df, 1), col]
.
This approach is consistent with Base, where the manual states here:
This includes
Colon
(:
) to select all indices within the entire dimension
I would be OK to bend the rules in this case, if there were no other convenient way to assign a new column to a DataFrame
, but there is one and it is df[col] = v
.
But if you really think it is useful we can make an exception for an empty DataFrame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But if we have e.g.
df = DataFrame(x=[])
df[:, 1] = []
Then what's the problem? We've set all entries in df.x
to the values from the RHS -- it's just that there aren't any of them. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no problem 😄. Your df
is not empty, it has 0-rows, but it is a different story. An empty data frame is returned by the call DataFrame()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that this distinction of emptyness is very relevant in other cases.
This should work:
df = DataFrame()
df.a = [1,2,3]
while this should fail
df = DataFrame(x=[])
df.a = [1,2,3]
An interesting corner case is:
julia> df = DataFrame(x=[1,2])
2×1 DataFrame
│ Row │ x │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
julia> df.x = [1,2,3,4]
ERROR: ArgumentError: New columns must have the same length as old columns
which fails now and I thing it should still fail.
The short rule - an empty data frame DataFrame()
has an undefined number of rows (not 0
) while any non-empty data frame has a concrete number of rows.
Actually we could consider to make nrow(DataFrame())
return nothing
instead of 0
, this would be technically stricter, but I always thought it would not be very useful, so I have not proposed to add this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then maybe say "with zero columns" to avoid any possible ambiguity.
Actually we could consider to make
nrow(DataFrame())
returnnothing
instead of0
, this would be technically stricter, but I always thought it would not be very useful, so I have not proposed to add this change.
Or even missing
, since it's really unknown. ;-) But 0
is probably the most useful answer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK - I will replace empty with zero columns everywhere to disambiguate.
docs/src/lib/indexing.md
Outdated
(which means that passing `Integer` or `Bool` will fail for nonexistent columns, | ||
but adding new columns as `Symbol` is allowed) | ||
* then `df[col] = v[col]` is called for each column name `col` | ||
* if `v` is a vector of vectors or a tuple of vectors the same process is performed but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this rule. I do not know if we want to keep it (or keep it only for vectors and not for tuples)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds fine. Am I right that this doesn't introduce any ambiguity if you want a column which is a vector of vectors (since you would have to wrap it in a vector/tuple/data frame anyway since this is a multi-column assignment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - you would have to wrap it. Single column assignment is another rule.
Actually this is very delicate. My idea is that df[cols] = v
should ideally give the same result as df[cols] = DataFrame(v)
just without having to materialize the intermediate DataFrame
.
The same for df[rows, cols] = v
.
But now I see that:
julia> x = ([1,2,3], [1,2,3])
([1, 2, 3], [1, 2, 3])
julia> DataFrame(x)
ERROR: ArgumentError: 'Tuple{Array{Int64,1},Array{Int64,1}}' iterates 'Array{Int64,1}' values, which don't satisfy the Tables.jl Row-iterator interface
julia> x = [[1,2,3], [1,2,3]]
2-element Array{Array{Int64,1},1}:
[1, 2, 3]
[1, 2, 3]
julia> DataFrame(x)
3×2 DataFrame
│ Row │ x1 │ x2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
which is quite surprising (in general the Tables.jl & friends catch all DataFrame
constructor always catches me off guard 😞). So I think we should drop tuple here and leave only the vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't Tables.jl accept a tuple here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It accepts it, but assumes that the contents is row-wise not col-wise, see e.g.:
julia> DataFrame(((a=1,b=2),(a=3,b=4)))
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
That is why I say that this Tables.jl API is tricky when combined with what we already had in DataFrames.jl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Following your suggestion I have commented on the constructors in a separate issue DataFrame constructors #1599.
- Are you on latest versions of all packages from Tables.jl & friends? For me it works:
julia> DataFrame([(a=1,b=2),(a=3,b=4)])
2×2 DataFrame
│ Row │ a │ b │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │
│ 2 │ 3 │ 4 │
(and even if it did not work it would not be an error in DataFrames.jl 😀)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, actually I tested that on my PooledArrays branch which was behind master. It works now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the discussion in #1599 (comment) I changed my mind and would leave only vector of vectors here for now.
The resason is that in general it is not clear if you have a collection if it should be treated row-wise or col-wise. We can add other values of v
later if there is a need.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My suggestion was that the collection type should not matter to decide whether it's row- or column-wise: tuples and vectors should be treated the same. Two constructors would take tuples and vectors of vectors. Other collections would be assumed to be row-oriented and handled by Tables.jl.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK :). So I add tuples here and in the constructors thread.
docs/src/lib/indexing.md
Outdated
* `length(v)` must be equal to `length(cols)` | ||
* column names in `v` must be the same as selected by `cols` | ||
* an operation `df[row, col] = v[col]` for `col in cols` is performed | ||
* if `v` is a vector or a tuple: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this additional rule is related to this rule.
Regarding the specification in
This doesn't seem to be the case right now (maybe you know this?) even when there is no issue with empty DataFrames. Here's an example where
Note: Run in Julia 1.1.0 using DataFrames v. 0.17.1 |
This PR describes what should happen, not what happens right now. |
Co-Authored-By: bkamins <[email protected]>
Co-Authored-By: bkamins <[email protected]>
7e590b0
to
7fa4896
Compare
When implementing the What I mean is that assignment to
Additionally I am not sure how useful having @nalimilan, @oxinabox - do you have any thoughts on it (if it is not clear what I mean here please comment and I can expand on my thoughts). |
Just to give more perspective to it I am not convinced we want to accept any of the following:
and in general now I feel that allowing a data frame on the RHS of an assignment is problematic. What we could allow for (but I am not sure how much useful that would be so that is why I am hesitant to propose these - we can always add them later) are:
|
#1866 implements current rules |
@nalimilan Following #1645 I have written down the possible target rules, so that we can edit them.
I will add a specification for
SubDataFrame
andDataFrameRow
when we settleAbstractDataFrame
.CC @coreywoodfield