Improve performance of reading files with duplicate column names #955
Conversation
In the next commit, we'll be changing the code responsible for naming duplicate columns, and these tests should ensure that the behavior doesn't change.
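A minimal sketch of the kind of behavior-pinning tests meant here, assuming access to the internal `CSV.makeunique` helper; only the `[:a, :a]` result is confirmed later in this thread, and the second case is a hypothetical extension of the same pattern:

```julia
using Test, CSV

# Pin down the current uniquification scheme before changing the implementation.
# The first case is taken from the discussion below; the second is illustrative.
@test CSV.makeunique([:a, :a]) == [:a, :a_1]
@test CSV.makeunique([:a, :a, :b, :a]) == [:a, :a_1, :b, :a_2]
```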
Codecov Report
@@ Coverage Diff @@
## main #955 +/- ##
==========================================
+ Coverage 90.27% 90.28% +0.01%
==========================================
Files 9 9
Lines 2303 2306 +3
==========================================
+ Hits 2079 2082 +3
Misses 224 224
Continue to review full report at Codecov.
src/utils.jl
Outdated
@@ -349,17 +349,21 @@ function makeunique(names)
    set = Set(names)
    length(set) == length(names) && return Symbol[Symbol(x) for x in names]
    nms = Symbol[]
    nmsset = Set{eltype(names)}()
It seems to me, without having actually tried to code it out, that we could avoid needing this separate `nmsset` and just re-use the `set` from above? And then we would check `if nm in set` below, then just `while newnm in set`, then finally do `push!(set, nm)`. That way we avoid having two `Set`s that basically do the same thing?

I remember this code being a little subtle in a few cases though, so perhaps I'm forgetting an oddity that prevents us doing it that way.
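Read literally, that single-set variant would look roughly like the sketch below (hypothetical code with a made-up name, not the actual patch). Since `set` starts out containing every original name, even the first occurrence of a duplicated name tests positive and gets a suffix, which is what the next comment objects to:

```julia
# Hypothetical single-set variant of makeunique (illustrative only).
function makeunique_oneset(names)
    set = Set{Symbol}(Symbol(x) for x in names)
    length(set) == length(names) && return Symbol[Symbol(x) for x in names]
    nms = Symbol[]
    for x in names
        nm = Symbol(x)
        if nm in set                     # always true for an original name
            k = 1
            newnm = Symbol("$(nm)_$k")
            while newnm in set           # probe nm_1, nm_2, ... until free
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nm = newnm
        end
        push!(set, nm)                   # emitted names join the same set
        push!(nms, nm)
    end
    return nms
end

makeunique_oneset([:a, :a])  # [:a_1, :a_2]
```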
I was also thinking about that, but it would change how the columns are named in some cases, and that could break existing code relying on the uniquified names.

Consider the following call:

CSV.makeunique([:a, :a])

Without `nmsset`, as you propose, the first `:a` already collides with itself in `set`, so the result is:

2-element Vector{Symbol}:
 :a_1
 :a_2

while with `nmsset` it is:

2-element Vector{Symbol}:
 :a
 :a_1
It seems it will be possible to use a `suffixes` Dict instead of `nmsset`. The result will be slightly shorter code. I'm currently testing it and will push it soon.
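For reference, here is a rough reconstruction of that Dict-based idea, pieced together from the fragments quoted in this thread (the committed src/utils.jl code may differ in details). The Dict's keys double as the "already emitted" check, replacing the linear scans of the output vector that caused the cubic behavior, and its values cache where suffix probing should resume:

```julia
# Illustrative sketch of the Dict-based uniquification (not the exact patch).
function makeunique_dict(names)
    set = Set{Symbol}(Symbol(x) for x in names)
    length(set) == length(names) && return Symbol[Symbol(x) for x in names]
    nms = Symbol[]
    nextsuffix = Dict{Symbol,Int}()      # emitted name => next suffix to try
    for x in names
        nm = Symbol(x)
        if haskey(nextsuffix, nm)        # duplicate of an already emitted name
            k = nextsuffix[nm]
            newnm = Symbol("$(nm)_$k")
            while newnm in set || haskey(nextsuffix, newnm)
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nextsuffix[nm] = k + 1       # the next duplicate resumes probing here
            nm = newnm
        end
        nextsuffix[nm] = 1               # mark `nm` as emitted
        push!(nms, nm)
    end
    return nms
end

makeunique_dict([:a, :a])  # [:a, :a_1], same naming as before
```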
Force-pushed from d226d42 to 6a32f67
src/utils.jl
Outdated
for nm in names | ||
if nm in nms | ||
k = 1 | ||
if nm in keys(nextsuffix) |
Suggested change:
-            if nm in keys(nextsuffix)
+            if haskey(nextsuffix, nm)

Using `haskey` is more idiomatic.
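A tiny standalone illustration of the two spellings (not from the patch); both are constant-time lookups, the difference is readability:

```julia
d = Dict(:a => 1)

:a in keys(d)   # true, but reads like a set-membership test
haskey(d, :a)   # true, and states the intent directly
```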
src/utils.jl
Outdated
             newnm = Symbol("$(nm)_$k")
-            while newnm in set || newnm in nms
+            while newnm in set || newnm in keys(nextsuffix)
Suggested change:
-            while newnm in set || newnm in keys(nextsuffix)
+            while newnm in set || haskey(nextsuffix, newnm)
Thanks for the suggestion. Updated.
I need to load a file with 30k columns, 10k of which have the same name. Currently, this is practically impossible because `makeunique()`, which produces unique column names, has cubic complexity. This commit changes the algorithm to use a Dict to quickly look up the existence of columns and to cache the next numeric suffix used to uniquify column names. Care has been taken to ensure that columns are named the same way as before. To that end, additional tests were added in the previous commit.
Force-pushed from 6a32f67 to d12300e
Thanks @wentasah!
I need to load a file with 30k columns, 10k of which have the same name. Currently, this is practically impossible because `makeunique()`, which produces unique column names, has cubic complexity. This changes the algorithm to use a Set and a Dict to quickly look up the existence of columns and to cache the last numeric suffix used to uniquify column names.

Care has been taken to ensure that columns are named the same way as before. To that end, additional tests were added in the first commit. `makeunique` was updated in the second commit to make it easy to verify that the tests pass both before and after the change.
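As a rough way to exercise the scale described above (a sketch: it relies on the unexported `CSV.makeunique` helper, and the sizes are only illustrative):

```julia
using CSV

# ~30k column names, ~10k of which share the name :dup.
names = vcat([Symbol("col", i) for i in 1:20_000], fill(:dup, 10_000))

@time uniq = CSV.makeunique(names)             # fast with the Dict-based version
@assert length(unique(uniq)) == length(uniq)   # every resulting name is unique
```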