Extreme slowness for wide CSV files #275
Yes, there's a magical threshold of 500. The new |
Does it require any other dependent packages?
|
@youngjaewoo, no, on CSV.jl master, there is entirely new parsing machinery that lives under the

pkg> add Tables#master
pkg> add CSV#master
pkg> add DataFrames#mdavezac-tables_integration

Then doing
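A minimal sketch of what trying the new machinery looks like, going by the usage shown later in this thread (the file name here is just a placeholder):

using CSV, DataFrames

# Read the file with the new CSV.File machinery and materialize it as a DataFrame.
# "yourfile.csv" is a placeholder path, not a file from this thread.
df = DataFrame(CSV.File("yourfile.csv"))
|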
@quinnj Awesome! Just tried testing it out, but I can't load
This has been a bottleneck in my genetics work pipeline, and I am very excited about getting this fixed. |
Hmmmm, I just bumped both Project.toml versions; can you try again? In the worst case, you might have to |
@quinnj Nice, it works now. I tested one of my work files, and here's the result.
I ran the CSV.read counterpart as a comparison, but it doesn't seem to run (or hasn't finished yet). Not too important. In Julia 0.6, my experience is that the same table is imported in 15+ minutes if
So... I also had a problem with
|
The new machinery is awesome!
|
Well... compared with Pandas, it's not as awesome anymore: 0.03 seconds vs. 2.13 seconds. What could be the bottleneck?
|
CSV.jl isn't very well optimized for small-row scenarios like this. I think if you compared a larger # of rows, you'd find them more comparable. There's probably some low-hanging fruit here for performance, but so far, most of the effort has been spent on a huge # of rows (millions & millions), followed most recently by a large # of columns. Also, at least for the moment, there will always be a bit of compilation overhead on the "first run"; in this latest round of refactoring, I've tried to make that as small as possible, but especially for a large # of columns, that overhead will still be a factor on the first run.

julia> gencsv(1000, 1)
"random_1000_1.csv"
julia> @time df = DataFrame(CSV.File("random_1000_1.csv"))
1.669940 seconds (4.21 M allocations: 137.956 MiB, 2.53% gc time)
1×1000 DataFrame. Omitted printing of 986 columns
│ Row │ col1 │ col2 │ col3 │ col4 │ col5 │ col6 │ col7 │ col8 │ col9 │ col10 │ col11 │ col12 │ col13 │ col14 │
├─────┼──────────┼────────┼───────────┼─────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 0.837208 │ 0.4249 │ 0.0213853 │ 0.70883 │ 0.0335719 │ 0.110028 │ 0.567079 │ 0.469167 │ 0.501812 │ 0.654153 │ 0.773796 │ 0.47041 │ 0.321471 │ 0.359337 │
julia> @time df = DataFrame(CSV.File("random_1000_1.csv"))
0.009046 seconds (46.96 k allocations: 1.873 MiB)
1×1000 DataFrame. Omitted printing of 986 columns
│ Row │ col1 │ col2 │ col3 │ col4 │ col5 │ col6 │ col7 │ col8 │ col9 │ col10 │ col11 │ col12 │ col13 │ col14 │
├─────┼──────────┼────────┼───────────┼─────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1 │ 0.837208 │ 0.4249 │ 0.0213853 │ 0.70883 │ 0.0335719 │ 0.110028 │ 0.567079 │ 0.469167 │ 0.501812 │ 0.654153 │ 0.773796 │ 0.47041 │ 0.321471 │ 0.359337 │
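(The gencsv helper used in the session above isn't shown until further down in the thread; a presumed full-precision version of it, mirroring the rounded definition given later, would look like this.)

using CSV, DataFrames

# Presumed helper: writes an m-row, n-column CSV of random Float64s.
# Mirrors the gencsv definition shown later in the thread, minus the rounding.
function gencsv(n, m)
    df = DataFrame([rand(m) for _ ∈ 1:n], Symbol.(["col$i" for i ∈ 1:n]))
    CSV.write("random_$(n)_$(m).csv", df)
end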
|
Oh... your "first run" comment surprised me. I thought the first-run cost only applied the first time the code runs, but it appears there is also a hit the first time a given file is read? Is it generating custom code for each file?
Setting the first-run issue aside, it gets better with more rows, but there is still a 5x difference compared to pandas for 1,000 rows.

julia> gencsv(1000, 1000)
"random_1000_1000.csv"
julia> @benchmark DataFrame(CSV.File("random_1000_1000.csv")) seconds=30
BenchmarkTools.Trial:
memory estimate: 241.43 MiB
allocs estimate: 14146767
--------------
minimum time: 2.442 s (12.39% GC)
median time: 2.518 s (12.59% GC)
mean time: 2.520 s (13.37% GC)
maximum time: 2.604 s (13.91% GC)
--------------
samples: 12
evals/sample: 1

Python:

In [3]: t1 = time.time(); df = pd.read_csv("random_1000_1000.csv"); time.time() - t1
Out[3]: 0.5107619762420654 |
Yes, as I mentioned, right now the code is optimized for large-row datasets (i.e. millions of rows), where I believe it may be faster than any other parser. There are probably some things we could do to improve these other cases, but at least in my personal uses it's not as high a priority. Yes, we compile specialized code for each file after detecting its shape/types.
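To illustrate why that implies a one-time compilation cost per new file layout (this is not CSV.jl's actual internals, just a minimal sketch of how Julia specializes a method on a schema type):

# Hypothetical illustration only, not CSV.jl's internals. A row parser
# specialized on the detected column schema: the first call with a new schema
# type compiles a schema-specific method; later calls with the same schema
# reuse the already-compiled code.
parserow(::Type{T}, fields) where {T<:Tuple} =
    ntuple(i -> parse(fieldtype(T, i), fields[i]), fieldcount(T))

parserow(Tuple{Int,Float64,Float64}, ("1", "2.5", "3.75"))  # first call with this schema: pays the compile cost
parserow(Tuple{Int,Float64,Float64}, ("4", "5.5", "6.75"))  # same schema: no recompilation
|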
I understand. My general use cases don't go that far, though. Even with a 50,000-row x 1,000-column (1 GiB) file, there's still a large gap (96s vs. 25s) compared to Pandas. IMHO, there is a sweet spot here: general users will enjoy Julia more when it's snappier. I'm closing this issue since you already have awesome new machinery in the pipeline. Thanks again for the help and insights above. |
In the last case, we're actually hitting a Float parsing performance issue when floats have full precision. You can see the difference by modifying the gencsv function to round the values:

julia> function gencsv(n, m)
           df = DataFrame([round.(rand(m), digits=7) for _ ∈ 1:n], Symbol.(["col$i" for i ∈ 1:n]))
           CSV.write("random_$(n)_$(m).csv", df)
       end
gencsv (generic function with 1 method)

julia> gencsv(1000, 50000)
"random_1000_50000.csv"

julia> @time df = DataFrame(CSV.File("random_1000_50000.csv"));
10.684627 seconds (16.91 M allocations: 689.887 MiB, 1.54% gc time)
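A quick way to see the float-precision effect in isolation (a sketch assuming BenchmarkTools is available; Base's parse is used as a stand-in here, so the gap won't exactly match CSV.jl's own float parsing code):

using BenchmarkTools

full  = "0.8372078311343245"   # full-precision float text
short = "0.8372078"            # the same value rounded to 7 digits

@btime parse(Float64, $full)
@btime parse(Float64, $short)
|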
Now I know your computer is faster than mine :-) Just FYI, here are my test results. Note that the file size is cut in half when using 7 digits, but performance is at least 5x better.
|
I need to read a CSV file with 1,000 columns (mostly Float64's) and 1,000 rows, but the read function never finishes. I replicated the problem with only 1 row and various numbers of columns. Please see below. It seems that performance goes down quickly after passing 500 columns. Am I hitting some kind of magical threshold?
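For reference, a rough sketch of this kind of reproduction (the timewide helper name and file names are made up here, and CSV.read is called without a sink argument, as was possible in the CSV.jl versions discussed in this thread):

using CSV, DataFrames

# Write a 1-row CSV with ncols random Float64 columns and time how long
# CSV.read takes on it.
function timewide(ncols)
    df = DataFrame([rand(1) for _ ∈ 1:ncols], Symbol.(["col$i" for i ∈ 1:ncols]))
    file = "wide_$(ncols).csv"
    CSV.write(file, df)
    @elapsed CSV.read(file)
end

for ncols in (100, 250, 500, 750, 1000)
    println(ncols, " columns: ", timewide(ncols), " s")
end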
My config: