Make `getproperty(df, col)` return a full length view of the column #1844
Comments
There would still be one hole:
and then resize
Still - patching up things is not that bad :). Actually this would be consistent with
An alternative (and this is something we really should discuss here - as we will always have holes) is to create a list of methods in DataFrames.jl that actively check if all columns have the same length before performing their operation (essentially - all expensive methods).
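The "actively check" alternative could look roughly like the sketch below. It is only an illustration: `check_consistent_lengths` is a hypothetical helper, and `cols` is a plain vector of vectors standing in for a data frame's column storage, not the DataFrames.jl internals.

```julia
# Hypothetical helper, not part of DataFrames.jl.
# `cols` stands in for the internal list of column vectors.
function check_consistent_lengths(cols::AbstractVector{<:AbstractVector})
    isempty(cols) && return nothing
    n = length(first(cols))
    for (i, c) in enumerate(cols)
        # Fail loudly as soon as one column disagrees with the first one.
        length(c) == n || throw(DimensionMismatch(
            "column $i has length $(length(c)), expected $n"))
    end
    return nothing
end

cols = [[1, 2, 3], [4, 5, 6]]
check_consistent_lengths(cols)   # passes silently
push!(cols[2], 7)                # resize one column through a direct reference

# An expensive method would call the check first and refuse to proceed.
caught = try
    check_consistent_lengths(cols)
    false
catch err
    err isa DimensionMismatch
end
```

The check is linear in the number of columns, so it is cheap relative to the expensive operations it would guard.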
Those are 2 good points, let's open separate issues to discuss them. Opened:
Thanks for opening the issues (they will probably end up in one PR later, but it is easier to manage the discussion this way). Let us wait to see what other people think. Following up #1846 (comment) here it would be related to the fact that
Now naturally
Return as is:
Pros return as is:
Cons return as is:
Returning copy:
Pros returning copy:
Cons returning copy:
Return a view:
Pros return a view:
Cons return a view:
Thinking a bit more. More out there idea:
Neither of the last two points has to be done immediately, but they would not be breaking changes.
Having I'm hesitant about
So is the conclusion to return a view and add instructions to the users that they can later As a side note then Also maybe if we go this way in one shot we should deprecate
Yeah, let's do that. The nice thing with that change is that we could even revert it without breaking anything if needed. Though to replace (Then we may even remove
Changing return types is breaking.
Side note: when this change is implemented the whole codebase of DataFrames.jl should be reviewed as there are places that internally assume that
Given the recent discussion #1856 will be probably closed (unless some new comments are raised) and the following rules will be implemented:
Also note that this in particular means that Partially related change:
The deprecations of If we are OK with this I will open a new PR implementing this (given we have agreement on the four questions I asked above).
I still think having
Given #1846 (comment) the plan would be for In what scenarios do you think
and people assume it would update
That, but I'd be mainly concerned about the performance impact of copying on common operations like
Having
After #1866
Should we close this then? Or do we still want to do this for 1.0? I think I'd actually be ok w/ doing something like:
struct DataFrameColumn{T} <: AbstractVector{T}
    source::T
end
which basically supported normal AbstractArray interface,
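To make the wrapper idea concrete, here is a minimal sketch of what such a type could look like. The name `FixedLengthColumn` is hypothetical and the type is not part of DataFrames.jl or DataAPI.jl. Unlike the one-parameter sketch in the comment, it uses separate type parameters for the element type and the wrapped vector type, which is an assumption about what was intended.

```julia
# Hypothetical wrapper type; illustrative only.
struct FixedLengthColumn{T, V<:AbstractVector{T}} <: AbstractVector{T}
    source::V
    # Inner constructor so that T is inferred from the wrapped vector.
    FixedLengthColumn(source::V) where {T, V<:AbstractVector{T}} = new{T, V}(source)
end

# The minimal AbstractArray interface: size, getindex, setindex!.
Base.size(c::FixedLengthColumn) = size(c.source)
Base.IndexStyle(::Type{<:FixedLengthColumn}) = IndexLinear()
Base.getindex(c::FixedLengthColumn, i::Int) = c.source[i]
Base.setindex!(c::FixedLengthColumn, x, i::Int) = (c.source[i] = x; c)

# Escape hatch to the wrapped vector, mirroring `parent` on a SubArray.
Base.parent(c::FixedLengthColumn) = c.source

col = FixedLengthColumn([1, 2, 3])
col[2] = 20          # element updates write through to the source
total = sum(col)     # generic AbstractVector code works unchanged

# No resize! method is defined for the wrapper, so resizing is rejected.
resize_blocked = try
    resize!(col, 5)
    false
catch err
    err isa MethodError
end
```

Because `resize!` is simply not defined for the wrapper, size-mutating calls fail with a `MethodError` while indexing and reductions keep working.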
I want that because we can later add more functionality around it.
The major stopper for using one or the other approach was that this design is not compatible with CategoricalArrays.jl (at least currently). All methods using
Let us wait for @nalimilan to comment on this. Also, my personal view (but not a very strong one) is that when we have #1845 the problem with possible resizing of underlying vectors will be reduced, so this is not must-have functionality.
Also - if we go for some kind of wrapper I would love to have:
in DataAPI.jl so that we do not populate the "data ecosystem" with numerous types that do very similar things (also I think it would simplify their adoption as people would learn to use them in different contexts).
xref: https://discourse.julialang.org/t/release-announcements-for-dataframes-jl/18258/61 (as I think it summarizes why I slightly lean towards allowing access to an unwrapped vector using
I also think the current (new) design is OK. Wrapping the column vector in a wrapper would be overkill and confusing for users, for a limited gain. The
I think we have settled on the design that returns the column, not its view, so closing this issue (please reopen if you feel it should be discussed more).
Similar for the single arg `getindex` (`getindex(df, col::Symbol)`), and `eachcol`.
We are progressively knocking out ways of ending up with columns of different sizes.
Right now here are ways that you can't end up with inconsistent sizes:
- `setproperty` checks the size.
- `Vector` then resize it using another reference, as the `DataFrame` constructor copies.
- `@view` indexing returns `SubDataFrame`s which disallow size mutating operations, and whose column vectors are views anyway, so also disallow resizing operations (right?)
- `setproperty` (and 1 arg setindex) check the size of the column being added, you can't just insert one with incorrect size.
Right now the only way I can think of getting access to an inner column and then mutating its size,
is the use of `getproperty` or single argument `getindex` or `eachcol`, which return the actual underlying `Vector`.
And I was thinking: It would be great if we could just wrap those in some kind of `view`-like array wrapper that doesn't allow resizing in place, but allows all the other operations one would hope from a `Vector`, including in-place `setindex` of elements.
Turns out such a view-like wrapper does exist: the `SubArray`.
We can just return the equivalent of `@view df.col[:]`.
The overhead of creating a view is tiny, as is the overhead of working with a view (at least when the indexing is simple, which it is in this case).
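A small Base-only sketch of the hole and of the proposed fix follows. The NamedTuple `cols` is a toy stand-in for a table, not DataFrames.jl, and all variable names are illustrative assumptions.

```julia
# Toy stand-in for a table: a NamedTuple of equal-length column vectors.
# This is not DataFrames.jl; it only mimics "all columns share one length".
cols = (a = [1, 2, 3], b = [10.0, 20.0, 30.0])

# Handing out the raw vector lets a caller resize it behind the table's back.
raw = cols.a
push!(raw, 4)
inconsistent = length(cols.a) != length(cols.b)   # column lengths now differ
pop!(raw)                                         # undo, so the demo can continue

# Handing out a full-length view instead: same memory, element writes still work.
v = @view cols.a[:]
v[1] = 100                                        # writes through to cols.a
```

The view shares memory with the column, so element updates behave as before; only size-changing operations are lost.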
You can do all the things to a `SubArray` as long as you don't call `resize!` on it -- that is a `MethodError`.
If someone needs to actually access the raw array, then they can use `parent(df.col)`.
The only reason I can think that would really be needed is if a method has an overly strict set of type constraints; then accessing the `parent` would be an alternative to `collect` (with its own trade-offs).
I think this would knock off the last possible way to end up with a corrupt `DataFrame` via column related shenanigans.
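The `resize!` failure and the `parent` versus `collect` trade-off can be checked with plain Base vectors. This is a sketch on an ordinary `Vector`, assumed to stand in for a data frame column.

```julia
col = [1, 2, 3]
v = @view col[:]

# resize! has no method for SubArray, so the attempt throws a MethodError.
resize_blocked = try
    resize!(v, 10)
    false
catch err
    err isa MethodError
end

raw = parent(v)       # the very same vector as `col`, no copy made
copied = collect(v)   # a fresh Vector{Int}, independent of `col`
copied[1] = 99        # does not touch `col`
```

`parent` is free but reopens the resizing hole, while `collect` is safe but allocates and no longer aliases the column.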