Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

push! which promotes type #1716

Closed
pdeffebach opened this issue Feb 11, 2019 · 11 comments · Fixed by #2152
Closed

push! which promotes type #1716

pdeffebach opened this issue Feb 11, 2019 · 11 comments · Fixed by #2152
Labels
non-breaking The proposed change is not breaking
Milestone

Comments

@pdeffebach
Copy link
Contributor

Currently push!ing something to a DataFrame just calls push! on each individual vectors. This means that if a new row contains missing, it will throw an error.

Obviously the user can call allowmissing! on columns before they start using push!, but I wonder if we should have function that does this type promotion automatically.

Maybe we should deprecate push! and use only append! and vcat? Then have both functions work with NamedTuples, DataFrameRows etc.

@bkamins
Copy link
Member

bkamins commented Feb 11, 2019

Actually I like push! exactly because it does not do any magic 😄 - and it helps me catch bugs.
But we can have different functions with different behaviors.

Does append! do auto promotion (as I do not remember if it does)?

@nalimilan
Copy link
Member

append! doesn't promote either. I've already wondered whether that was a good idea. The main advantage I can find is that if you push/append repeatedly, growing existing vectors is more efficient that making a copy (since push! grows exponentially to avoid copying on each call).

Related to #1695 (maybe even a duplicate).

@pdeffebach
Copy link
Contributor Author

My impression is that append! is kind of like a performant vcat, and not promoting types is part of why it's performant.

Deprecating push! makes sense because I think that append!(v1::Vector, a::Any) covers all the behavior we would want from push!

The behavior of push! and append! for vectors differ for vectors of vectors.

t = Any[[1, 2], [3, 4]]
s = copy(t)

julia> push!(t, [5, 6])
3-element Array{Any,1}:
 [1, 2]
 [3, 4]
 [5, 6]

julia> append!(s, [5, 6])
4-element Array{Any,1}:
  [1, 2]
  [3, 4]
 5
 6

We don't have heterogenous element types in DataFrames (every element is a row) so there is not ambiguity about what to add.

@bkamins
Copy link
Member

bkamins commented Feb 11, 2019

Actually I would leave push! as is because it does exactly what the contract for push! in Base specifies.

But append! could be extended to allow passing it an iterable and it would try to push the elements of this iterable as rows to the DataFrame. We do not support this fully now, but it would be consistent with the contract for append! in Base.

@nalimilan
Copy link
Member

I agree we need to keep both append! and push! for consistency with Base, as they have very clear definitions. An example of a possible ambiguity between them would be, if we follow the Tables.jl approach, appending a vector of named tuples should add one row per tuple, while pushing it should add one row with named tuples as entries (not saying we need to support this, but that's a possibility).

@pdeffebach
Copy link
Contributor Author

What about adding a method for vcat to accept a named tuple or even just plain vector? I want a way to add a new row without worrying about errors due to missing.

My intuition is that for missing-heavy data manipulation, like I work with consistently and which I would bet is the largest audience of DataFrames, it's reasonable to push people towards a super flexible vcat where they don't have to worry about types.

@bkamins
Copy link
Member

bkamins commented Feb 11, 2019

while pushing it should add one row with named tuples as entries

This is exactly what we support now. And your example is general the reason why they need to stay different (we have the same duality in setindex! discussion in #1646)

appending a vector of named tuples should add one row per tuple

I am OK with this (again along the setindex! rule of thumb that if something could be converted to DataFrame using the constructor we could support it also without calling the constructor to get the same result)

@pdeffebach If I understand what @nalimilan said in #1695 correctly we could consider adding push!! and append!! methods that would perform autopromotion.

However, if it is only the case about missing then I think the reasonable thing is to say to people to use eltypes that allow missing (e.g. I guess this is the reason why CSV.jl does this by default) and learn allowmissing! in general.

@nalimilan
Copy link
Member

What about adding a method for vcat to accept a named tuple or even just plain vector? I want a way to add a new row without worrying about errors due to missing.

Yes, why not. A vector of named tuples would have to be interpreted as several rows I guess (in practice I don't think it matters a lot).

@pdeffebach If I understand what @nalimilan said in #1695 correctly we could consider adding push!! and append!! methods that would perform autopromotion.

Actually that's the opposite. :-)

@pdeffebach
Copy link
Contributor Author

Okay I will see if this works with my current open PR.

I wonder if all this stuff should be just pushed off to Tables.jl since it's about the relationship between arrays of named tuples and tables.

@bkamins
Copy link
Member

bkamins commented Feb 11, 2019

Actually that's the opposite. :-)

So is your proposal to make push! to do autopromotion (and create a new vector) and push!! to mutate the vector in place? Then I guess we should list out what would have changed (but let us move this discussion to #1695), because I find it counter-intuitive as push! and append! in base do not perform autopromotion.

@bkamins
Copy link
Member

bkamins commented Jul 26, 2019

This is what I propose:

  • leave append! and push! as is (as they conform to API specification form Base and the notion of row-orientation of data frames)
  • allow append! to take any Tables.jl conforming second argument (there is no ambiguity here)
  • allow vcat with first argument being a DataFrame to take any Tables.jl conforming second argument (similarly to append!)

If we are OK with this I can propose a PR doing this. Please let me know what you think.

@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 2.0 milestone Feb 12, 2020
@bkamins bkamins linked a pull request Mar 21, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
non-breaking The proposed change is not breaking
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants