
Improve performance of reading files with duplicate column names #955

Merged: 2 commits into JuliaData:main from wentasah:dup-col-names-perf on Dec 24, 2021

Conversation

wentasah (Contributor)

I need to load a file with 30k columns, 10k of which have the same name. Currently, this is practically impossible because makeunique(), which produces unique column names, has cubic complexity.

This changes the algorithm to use a Set and a Dict to quickly look up the existence of columns and to cache the last numeric suffix used to uniquify column names.

Care has been taken to ensure that columns are named the same way as before. To that end, additional tests were added in the first commit, and makeunique was updated in the second, making it easy to verify that the tests pass both before and after the change.
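
The new approach, roughly, looks like the following sketch (a minimal, hypothetical reimplementation assuming the names are already Symbols; the actual code in src/utils.jl handles more input types and differs in details):

# nmsset gives O(1) lookups of names already emitted; nextsuffix remembers the
# next numeric suffix to try for each base name, so uniquifying thousands of
# duplicates no longer rescans suffixes from 1 every time.
function makeunique_sketch(names::Vector{Symbol})
    set = Set(names)                   # every name that appears in the input
    length(set) == length(names) && return copy(names)
    nms = Symbol[]
    nmsset = Set{Symbol}()             # names already emitted
    nextsuffix = Dict{Symbol, Int}()   # base name => next suffix to try
    for nm in names
        if nm in nmsset
            k = get(nextsuffix, nm, 1)
            newnm = Symbol("$(nm)_$k")
            # skip suffixes colliding with an input name or an emitted name
            while newnm in set || newnm in nmsset
                k += 1
                newnm = Symbol("$(nm)_$k")
            end
            nextsuffix[nm] = k + 1     # resume here for the next duplicate of nm
            nm = newnm
        end
        push!(nms, nm)
        push!(nmsset, nm)
    end
    return nms
end

makeunique_sketch([:a, :a, :a_1])      # => [:a, :a_2, :a_1]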

In the next commit, we'll be changing the code responsible for naming
duplicate columns and these tests should ensure that the behavior
doesn't change.
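
For illustration, tests in this spirit might look like the following (hypothetical examples, not the exact tests added in the PR; the expected values follow the behavior discussed below):

using Test, CSV

# First occurrence keeps its name; later duplicates get the first free numeric
# suffix, skipping suffixes that collide with other input names.
@test CSV.makeunique([:a, :a]) == [:a, :a_1]
@test CSV.makeunique([:a, :a, :a_1]) == [:a, :a_2, :a_1]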
codecov bot commented Dec 22, 2021

Codecov Report

Merging #955 (d12300e) into main (d25992a) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##             main     #955      +/-   ##
==========================================
+ Coverage   90.27%   90.28%   +0.01%     
==========================================
  Files           9        9              
  Lines        2303     2306       +3     
==========================================
+ Hits         2079     2082       +3     
  Misses        224      224              
Impacted Files Coverage Δ
src/utils.jl 86.41% <100.00%> (+0.11%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d25992a...d12300e.

src/utils.jl Outdated
@@ -349,17 +349,21 @@ function makeunique(names)
set = Set(names)
length(set) == length(names) && return Symbol[Symbol(x) for x in names]
nms = Symbol[]
nmsset = Set{eltype(names)}()
Member

It seems to me, without having actually tried to code it out, that we could avoid needing this separate nmsset and just re-use the set from above? And then we would check if nm in set below, then just while newnm in set, then finally do push!(set, nm). That way we avoid having two Sets that basically do the same thing?

I remember this code being a little subtle in a few cases though, so perhaps I'm forgetting an oddity that prevents us doing it that way.

Contributor Author

I was also thinking about that, but it would change how the columns are named in some cases, which could break existing code relying on the uniquified names.

Consider the following code:

CSV.makeunique([:a, :a])

Without nmsset, as you propose, the result is:

2-element Vector{Symbol}:
 :a_1
 :a_2

while with nmsset it is:

2-element Vector{Symbol}:
 :a
 :a_1

Contributor Author

It seems it will be possible to use the suffixes Dict instead of nmsset, which will make the code slightly shorter. I'm currently testing it and will push it soon.
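
(The idea, illustrated on its own and not as the PR's exact code: a Dict's keys already act like a Set, so one container can both answer "have we emitted this name?" and cache the next suffix to try.)

nextsuffix = Dict{Symbol, Int}()
nextsuffix[:a] = 1           # record :a as seen; next suffix to try is 1
haskey(nextsuffix, :a)       # true  -- membership test, no separate Set needed
get(nextsuffix, :b, 1)       # 1     -- unseen names fall back to the default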

src/utils.jl Outdated
for nm in names
if nm in nms
k = 1
if nm in keys(nextsuffix)
Member

Suggested change
if nm in keys(nextsuffix)
if haskey(nextsuffix, nm)

using haskey is more idiomatic

src/utils.jl Outdated
newnm = Symbol("$(nm)_$k")
while newnm in set || newnm in nms
while newnm in set || newnm in keys(nextsuffix)
Member

Suggested change
while newnm in set || newnm in keys(nextsuffix)
while newnm in set || haskey(nextsuffix, newnm)

Contributor Author

Thanks for the suggestion. Updated.

I need to load a file with 30k columns, 10k of which have the same
name. Currently, this is practically impossible because makeunique(),
which produces unique column names, has cubic complexity.

This commit changes the algorithm to use a Dict to quickly look up the
existence of columns and to cache the next numeric suffix used to
uniquify column names.

Care has been taken to ensure that columns are named the same way as
before. To that end, additional tests were added in the previous
commit.
@quinnj quinnj merged commit 482a187 into JuliaData:main Dec 24, 2021

quinnj commented Dec 24, 2021

Thanks @wentasah!

wentasah deleted the dup-col-names-perf branch on January 27, 2022.