Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Small no-op refactor before functionality change #2681

Closed
wants to merge 4 commits into from

Conversation

sa-
Copy link
Contributor

@sa- sa- commented Mar 26, 2021

In preparation for #2423 I'm making a no-op refactor.

I would feel comfortable going through the process of contributing with a small no-op change like this one before making any functionality changes.

ColumnIndex, MultiColumnIndex}...);
keepkeys::Bool, ungroup::Bool, copycols::Bool,
keeprows::Bool, renamecols::Bool)
if !ungroup && !keepkeys
throw(ArgumentError("keepkeys=false when ungroup=false is not allowed"))
end

cs_vec = []
for p in cs
processed_combine_ops = map(combine_operations) do p
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not want to use map here to avoid excessive specialization. Please leave the Vector{Any} as a container for processed_combine_ops.

@@ -65,7 +62,9 @@ function _combine_prepare(gd::GroupedDataFrame,

idx, valscat = _combine(gd, cs_norm, optional_transform, copycols, keeprows, renamecols)

!keepkeys && ungroup && return valscat
if !keepkeys && ungroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the previous code was idiomatic Julia style. Why do you want to change it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, good to know

@@ -40,6 +40,7 @@ julia = "1"

[extras]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
Chain = "8be319e6-bccf-4806-a6f7-6fae938471bc"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not see it used in this PR

@bkamins
Copy link
Member

bkamins commented Mar 26, 2021

Also looking by the CI errors you have changed the logic somewhere (I did not dig into it as I understand this is still a draft).

else
push!(cs_vec, p)
p
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line is incorrect as you mix vector and scalar values in it. Note the distinction between append! and push! in the original code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that.. from what I read the functionality looked similar. How do the functions work differently under the hood?

@bkamins
Copy link
Member

bkamins commented Mar 26, 2021

In summary: thank you very much for working on it. I think that the refactorizations that improve code readability are very useful and we should go for them.

However, I would recommend you to start with refactorizations that do not change the logic of the code (like here you tried to change for loop to map and this change has a significant impact on the code logic).

Of course I am not opposed to such changes in general in the future - if they would improve things, but they are much more tricky than just no-op refactorizations, so it is probably better to do them when you get more experience with the codebase.

@bkamins bkamins added this to the 1.x milestone Mar 26, 2021
@bkamins
Copy link
Member

bkamins commented Mar 26, 2021

Also - in general - especially to start with - it is best if you push PR that pass tests locally with:

]test DataFrames

In the future you will notice that this is not a strict rule as sometimes one pushes a PR that fails to discuss some design ideas, but for start I think it is useful to build a habit that commits are pushed to GitHub when they pass tests locally.

Copy link
Contributor Author

@sa- sa- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, I wasn't expecting anyone to look at this but I appreciate that you did anyway!

Some thoughts -

  • Makes sense that the refactoring would be a good idea after I'm comfortable with the codebase
  • I was hoping that my refactoring changes would not make any functional changes but that seems harder to do than I initially thought e.g. I didn't think changing for to map would make a difference but I've clearly broken something.

else
push!(cs_vec, p)
p
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that.. from what I read the functionality looked similar. How do the functions work differently under the hood?

@@ -65,7 +62,9 @@ function _combine_prepare(gd::GroupedDataFrame,

idx, valscat = _combine(gd, cs_norm, optional_transform, copycols, keeprows, renamecols)

!keepkeys && ungroup && return valscat
if !keepkeys && ungroup
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, good to know

@sa- sa- closed this Mar 27, 2021
@bkamins
Copy link
Member

bkamins commented Mar 27, 2021

I didn't think changing for to map would make a difference but I've clearly broken something.

Here is an example. Current code:

julia> cs = (nrow, [1,2], 3, [1 => 4,  2 => 5])
(DataFrames.nrow, [1, 2], 3, [1 => 4, 2 => 5])

julia> vec_cs = Any[]
Any[]

julia> for p in cs
           if p === nrow
               push!(vec_cs, nrow => :nrow)
           elseif p isa AbstractVecOrMat{<:Pair}
               append!(vec_cs, p)
           else
               push!(vec_cs, p)
           end
       end

julia> vec_cs
5-element Vector{Any}:
 DataFrames.nrow => :nrow
                  [1, 2]
                 3
               1 => 4
               2 => 5

your code:

julia> map(cs) do p
           if p === nrow
               nrow => :nrow
           else
               p
           end
       end
(DataFrames.nrow => :nrow, [1, 2], 3, [1 => 4, 2 => 5])

So you can see that:

  1. type of the return value changed
  2. structure of the collection changed

A more advanced comment was about specialization. Methods for Vector{Any} are precompiled in Julia Base, so they do not trigger specialization. If you use map a new Tuple is created. As Tuple has a type that depends on all its elements the whole function has to be recompiled on each call, while the original function had to be compiled only once (on the first call).

I wasn't expecting anyone to look at this but I appreciate that you did anyway!

I assumed you wanted to get some feedback to learn a bit about the design concepts behind DataFrames.jl internals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants