
Improve performance of reading files with duplicate column names #955

Merged: 2 commits into JuliaData:main from wentasah:dup-col-names-perf on Dec 24, 2021

Conversation

wentasah (Contributor)

I need to load a file with 30k columns, 10k of which have the same name. Currently, this is practically impossible because makeunique(), which produces unique column names, has cubic complexity.

This changes the algorithm to use a Set and a Dict to quickly look up the existence of columns and to cache the last numeric suffix used to uniquify column names.

Care has been taken to ensure that columns are named the same way as before. To that end, additional tests were added in the first commit, and makeunique was updated in the second, making it easy to verify that the tests pass both before and after the change.
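
The new approach, roughly, looks like the following sketch (a minimal, hypothetical reimplementation assuming the names are already Symbols; the actual code in src/utils.jl handles more input types and differs in details):

# nmsset gives O(1) lookups of names already emitted; nextsuffix remembers the
# next numeric suffix to try for each base name, so uniquifying thousands of
# duplicates no longer rescans suffixes from 1 every time.
function makeunique_sketch(names::Vector{Symbol})
    set = Set(names)                   # every name that appears in the input
    length(set) == length(names) && return copy(names)
    nms = Symbol[]
    nmsset = Set{Symbol}()             # names already emitted
    nextsuffix = Dict{Symbol, Int}()   # base name => next suffix to try
    for nm in names
        if nm in nmsset
            k = get(nextsuffix, nm, 1)
            newnm = Symbol("$(nm)_$k")
            # skip suffixes colliding with an input name or an emitted name
            while newnm in set || newnm in nmsset
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nextsuffix[nm] = k + 1     # resume here for the next duplicate of nm
            nm = newnm
        end
        push!(nms, nm)
        push!(nmsset, nm)
    end
    return nms
end

makeunique_sketch([:a, :a, :a_1])      # => [:a, :a_2, :a_1]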

In the next commit, we'll be changing the code responsible for naming
duplicate columns and these tests should ensure that the behavior
doesn't change.
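
For illustration, tests in this spirit might look like the following (hypothetical examples, not the exact tests added in the PR; the expected values follow the behavior discussed below):

using Test, CSV

# First occurrence keeps its name; later duplicates get the first free numeric
# suffix, skipping suffixes that collide with other input names.
@test CSV.makeunique([:a, :a]) == [:a, :a_1]
@test CSV.makeunique([:a, :a, :a_1]) == [:a, :a_2, :a_1]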
codecov bot commented Dec 22, 2021

Codecov Report

Merging #955 (d12300e) into main (d25992a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #955      +/-   ##
==========================================
+ Coverage   90.27%   90.28%   +0.01%     
==========================================
  Files           9        9              
  Lines        2303     2306       +3     
==========================================
+ Hits         2079     2082       +3     
  Misses        224      224              
Impacted Files Coverage Δ
src/utils.jl 86.41% <100.00%> (+0.11%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d25992a...d12300e.

src/utils.jl Outdated
@@ -349,17 +349,21 @@ function makeunique(names)
set = Set(names)
length(set) == length(names) && return Symbol[Symbol(x) for x in names]
nms = Symbol[]
nmsset = Set{eltype(names)}()
Member

It seems to me, without having actually tried to code it out, that we could avoid needing this separate nmsset and just re-use the set from above? And then we would check if nm in set below, then just while newnm in set, then finally do push!(set, nm). That way we avoid having two Sets that basically do the same thing?

I remember this code being a little subtle in a few cases though, so perhaps I'm forgetting an oddity that prevents us doing it that way.

Contributor Author

I was also thinking about that, but it would change how the columns are named in some cases, which could break existing code relying on the uniquified names.

Consider the following code:

CSV.makeunique([:a, :a])

Without nmsset, as you propose, the result is:

2-element Vector{Symbol}:
 :a_1
 :a_2

while with nmsset it is:

2-element Vector{Symbol}:
 :a
 :a_1

Contributor Author

It seems it will be possible to use the suffixes Dict instead of nmsset, which will make the code slightly shorter. I'm currently testing it and will push it soon.
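
(The idea, illustrated on its own and not as the PR's exact code: a Dict's keys already act like a Set, so one container can both answer "have we emitted this name?" and cache the next suffix to try.)

nextsuffix = Dict{Symbol, Int}()
nextsuffix[:a] = 1           # record :a as seen; next suffix to try is 1
haskey(nextsuffix, :a)       # true  -- membership test, no separate Set needed
get(nextsuffix, :b, 1)       # 1     -- unseen names fall back to the default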

src/utils.jl Outdated
for nm in names
if nm in nms
k = 1
if nm in keys(nextsuffix)
Member

Suggested change
if nm in keys(nextsuffix)
if haskey(nextsuffix, nm)

using haskey is more idiomatic

src/utils.jl Outdated
newnm = Symbol("$(nm)_$k")
while newnm in set || newnm in nms
while newnm in set || newnm in keys(nextsuffix)
Member

Suggested change
while newnm in set || newnm in keys(nextsuffix)
while newnm in set || haskey(nextsuffix, newnm)

Contributor Author

Thanks for the suggestion. Updated.

I need to load a file with 30k columns, 10k of which have the same
name. Currently, this is practically impossible because makeunique(),
which produces unique column names, has cubic complexity.

This commit changes the algorithm to use a Dict to quickly look up the
existence of columns and to cache the next numeric suffix used to
uniquify column names.

Care has been taken to ensure that columns are named the same way as
before. To that end, additional tests were added in the previous
commit.
@quinnj quinnj merged commit 482a187 into JuliaData:main Dec 24, 2021

quinnj commented Dec 24, 2021

Thanks @wentasah!

wentasah deleted the dup-col-names-perf branch on January 27, 2022.