Extreme slowness for wide CSV files #275

Closed

tk3369 opened this issue Sep 13, 2018 · 15 comments

@tk3369 (Contributor) commented Sep 13, 2018

I need to read a CSV file with 1,000 columns (mostly Float64s) and 1,000 rows, but the read function never finishes. I replicated the problem with only 1 row and various numbers of columns. Please see below.

It seems that performance degrades quickly past 500 columns. Am I hitting some kind of magical threshold?

julia> function gencsv(n, m)
         df = DataFrame([rand(m) for _ ∈ 1:n], Symbol.(["col$i" for i ∈ 1:n]))
         CSV.write("random_$(n)_$(m).csv", df)
       end
gencsv (generic function with 1 method)

julia> r = 50:50:1000
50:50:1000

julia> for i in r
         gencsv(i, 1)
         println(Dates.now(), " done with $i x 1")
       end
2018-09-12T22:39:58.212 done with 50 x 1
2018-09-12T22:39:58.218 done with 100 x 1
2018-09-12T22:39:58.229 done with 150 x 1
2018-09-12T22:39:58.245 done with 200 x 1
2018-09-12T22:39:58.274 done with 250 x 1
2018-09-12T22:39:58.307 done with 300 x 1
2018-09-12T22:39:58.359 done with 350 x 1
2018-09-12T22:39:58.424 done with 400 x 1
2018-09-12T22:39:58.508 done with 450 x 1
2018-09-12T22:39:58.607 done with 500 x 1
2018-09-12T22:40:02.123 done with 550 x 1
2018-09-12T22:40:06.033 done with 600 x 1
2018-09-12T22:40:10.412 done with 650 x 1
2018-09-12T22:40:15.801 done with 700 x 1
2018-09-12T22:40:22.833 done with 750 x 1
2018-09-12T22:40:30.637 done with 800 x 1
2018-09-12T22:40:39.5 done with 850 x 1
2018-09-12T22:40:48.624 done with 900 x 1
2018-09-12T22:40:57.842 done with 950 x 1
2018-09-12T22:41:06.999 done with 1000 x 1

julia> for i in r
         println(now(), " reading $i x 1 file")
         @time CSV.read("random_$(i)_1.csv")
       end
2018-09-12T22:41:48.011 reading 50 x 1 file
  0.029750 seconds (5.01 k allocations: 297.500 KiB)
2018-09-12T22:41:48.042 reading 100 x 1 file
  0.118615 seconds (10.06 k allocations: 638.078 KiB)
2018-09-12T22:41:48.161 reading 150 x 1 file
  0.286931 seconds (15.12 k allocations: 1.020 MiB)
2018-09-12T22:41:48.448 reading 200 x 1 file
  0.497838 seconds (20.41 k allocations: 1.540 MiB)
2018-09-12T22:41:48.946 reading 250 x 1 file
  0.777382 seconds (25.46 k allocations: 2.049 MiB)
2018-09-12T22:41:49.724 reading 300 x 1 file
  1.081642 seconds (30.52 k allocations: 2.630 MiB)
2018-09-12T22:41:50.806 reading 350 x 1 file
  1.467806 seconds (35.56 k allocations: 3.251 MiB)
2018-09-12T22:41:52.274 reading 400 x 1 file
  1.907915 seconds (40.62 k allocations: 3.930 MiB)
2018-09-12T22:41:54.182 reading 450 x 1 file
  2.444327 seconds (45.66 k allocations: 4.668 MiB)
2018-09-12T22:41:56.627 reading 500 x 1 file
  3.229724 seconds (50.72 k allocations: 5.462 MiB)
2018-09-12T22:41:59.857 reading 550 x 1 file
 42.911497 seconds (5.19 M allocations: 284.770 MiB, 0.89% gc time)
2018-09-12T22:42:42.771 reading 600 x 1 file
 55.315812 seconds (5.59 M allocations: 309.474 MiB, 0.63% gc time)
2018-09-12T22:43:38.09 reading 650 x 1 file
 56.074652 seconds (6.00 M allocations: 334.785 MiB, 2.04% gc time)
2018-09-12T22:44:34.168 reading 700 x 1 file
 59.868686 seconds (6.40 M allocations: 360.505 MiB, 0.58% gc time)
2018-09-12T22:45:34.038 reading 750 x 1 file
 67.366178 seconds (6.80 M allocations: 386.072 MiB, 0.55% gc time)
2018-09-12T22:46:41.407 reading 800 x 1 file
 78.462792 seconds (7.21 M allocations: 412.113 MiB, 0.53% gc time)
2018-09-12T22:47:59.873 reading 850 x 1 file
 88.311945 seconds (7.62 M allocations: 439.732 MiB, 0.69% gc time)
2018-09-12T22:49:28.189 reading 900 x 1 file
119.831256 seconds (8.02 M allocations: 467.226 MiB, 0.51% gc time)
2018-09-12T22:51:28.026 reading 950 x 1 file
133.664229 seconds (8.43 M allocations: 495.060 MiB, 0.78% gc time)
2018-09-12T22:53:41.69 reading 1000 x 1 file
155.248115 seconds (8.83 M allocations: 523.604 MiB, 0.72% gc time)

My config:

(v1.0) pkg> st
    Status `~/.julia/environments/v1.0/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.1
  [336ed68f] CSV v0.3.1
  [a93c6f00] DataFrames v0.13.1

julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 4

@quinnj (Member) commented Sep 13, 2018

Yes, there's a magical threshold of 500. The new CSV.File on master, though, should be quite a bit smarter about this, particularly with repeated column types. It'd be interesting to try your exercise again, but doing CSV.File(file) |> DataFrame instead of CSV.read(file).

@tk3369 (Contributor, Author) commented Sep 13, 2018

Does it require any other dependent packages?

julia> for i in r
         println(now(), " reading $i x 1 file")
         @time CSV.File("random_$(i)_1.csv") |> DataFrame
       end
2018-09-12T23:32:41.993 reading 50 x 1 file
ERROR: MethodError: no method matching DataFrame(::CSV.File{NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8, :col9, :col10, :col11, :col12, :col13, :col14, :col15, :col16, :col17, :col18, :col19, :col20, :col21, :col22, :col23, :col24, :col25, :col26, :col27, :col28, :col29, :col30, :col31, :col32, :col33, :col34, :col35, :col36, :col37, :col38, :col39, :col40, :col41, :col42, :col43, :col44, :col45, :col46, :col47, :col48, :col49, :col50),NTuple{50,Union{Missing, Float64}}},false,Base.GenericIOBuffer{Array{UInt8,1}},Parsers.Delimited{false,Parsers.Quoted{Parsers.Strip{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}},Parsers.Trie{0x0a,true,missing,8,Tuple{}},Parsers.Trie{0x0d,true,missing,8,Tuple{Parsers.Trie{0x0a,true,missing,8,Tuple{}}}}}}},NamedTuple{(),Tuple{}}})
Closest candidates are:
  DataFrame(::Any, ::DataStreams.Data.Schema, ::Type{S}, ::Bool; reference) where S at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/abstractdataframe/io.jl:295
  DataFrame(::Array{Any,1}, ::DataFrames.Index) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:87
  DataFrame(; kwargs...) at /Users/tomkwong/.julia/packages/DataFrames/utxEh/src/dataframe/dataframe.jl:142
  ...
Stacktrace:
 [1] |>(::CSV.File{NamedTuple{(:col1, :col2, :col3, :col4, :col5, :col6, :col7, :col8, :col9, :col10, :col11, :col12, :col13, :col14, :col15, :col16, :col17, :col18, :col19, :col20, :col21, :col22, :col23, :col24, :col25, :col26, :col27, :col28, :col29, :col30, :col31, :col32, :col33, :col34, :col35, :col36, :col37, :col38, :col39, :col40, :col41, :col42, :col43, :col44, :col45, :col46, :col47, :col48, :col49, :col50),NTuple{50,Union{Missing, Float64}}},false,Base.GenericIOBuffer{Array{UInt8,1}},Parsers.Delimited{false,Parsers.Quoted{Parsers.Strip{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}},Parsers.Trie{0x0a,true,missing,8,Tuple{}},Parsers.Trie{0x0d,true,missing,8,Tuple{Parsers.Trie{0x0a,true,missing,8,Tuple{}}}}}}},NamedTuple{(),Tuple{}}}, ::Type) at ./operators.jl:813
 [2] top-level scope at ./util.jl:156

(v1.0) pkg> st
    Status `~/.julia/environments/v1.0/Project.toml`
  [6e4b80f9] BenchmarkTools v0.4.1
  [336ed68f] CSV v0.2.5 #master (https://github.com/JuliaData/CSV.jl.git)
  [5d742f6a] CSVFiles v0.9.1
  [a93c6f00] DataFrames v0.13.1
  [31c24e10] Distributions v0.16.4
  [587475ba] Flux v0.6.7
  [033835bb] JLD2 v0.1.2
  [4076af6c] JuMP v0.18.2
  [50d2b5c4] Lazy v0.13.1
  [cc2ba9b6] MLDataUtils v0.4.0
  [429524aa] Optim v0.17.1
  [91a5bcdd] Plots v0.20.2
  [ce6b1742] RDatasets v0.4.0
  [ee283ea6] Rebugger v0.1.4
  [295af30f] Revise v0.7.10
  [60ddc479] StatPlots v0.8.1
  [2913bbd2] StatsBase v0.25.0
  [b8865327] UnicodePlots v0.3.1

@youngjaewoo commented Sep 13, 2018

@tk3369 Perhaps @quinnj is referring to the CSVFiles package. It is a bit faster in my experience.

https://github.com/queryverse/CSVFiles.jl
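
For reference, a minimal sketch of reading via CSVFiles (assuming the CSVFiles.jl v0.9 listed in the environment above; load is the FileIO-style entry point it provides):

using CSVFiles, DataFrames

# load the CSV as an iterable table, then materialize it into a DataFrame
df = load("random_1000_1.csv") |> DataFrame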

@quinnj (Member) commented Sep 13, 2018

@youngjaewoo, no, on CSV.jl master, there is entirely new parsing machinery that lives under the CSV.File constructor. It takes essentially the same arguments as CSV.read, but returns a CSV.File object that iterates rows (and implements the new Tables.jl interface). The DataFrames integration with Tables.jl is still in a PR at this point, so in order to try it out, you can do:

pkg> add Tables#master
pkg> add CSV#master
pkg> add DataFrames#mdavezac-tables_integration

Then doing CSV.File(file) |> DataFrame should work. Sorry for not clarifying this before.
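
Roughly, the access patterns would look like the sketch below (an illustration only; the column names come from the gencsv files earlier in this thread, and row iteration as named tuples is inferred from the type parameters shown in the error above):

using CSV, DataFrames

f = CSV.File("random_1000_1.csv")     # a row-iterable CSV.File object

for row in f                          # rows behave like named tuples,
    println(row.col1)                 # so columns are accessible by name
end

df = CSV.File("random_1000_1.csv") |> DataFrame   # or materialize a DataFrame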

@youngjaewoo commented Sep 13, 2018

@quinnj Awesome! I just tried testing it out, but I can't load the CSV.File function. It may be due to this?

(v0.7) pkg> add Tables#master
  Updating git-repo `https://github.com/JuliaData/Tables.jl.git`
 Resolving package versions...
ERROR: Unsatisfiable requirements detected for package Tables [bd369af6]:
 Tables [bd369af6] log:
 ├─possible versions are: 0.1.0 or uninstalled
 ├─restricted to versions * by CSV [336ed68f], leaving only versions 0.1.0
 │ └─CSV [336ed68f] log:
 │   ├─possible versions are: 0.2.5 or uninstalled
 │   └─CSV [336ed68f] is fixed to version 0.2.5
 ├─Tables [bd369af6] is fixed to version 0.1.0
 └─restricted to versions 0.1.4 by an explicit requirement — no versions left

This has been a bottleneck in my genetics work pipeline, and I am very excited about getting this fixed.

@quinnj (Member) commented Sep 13, 2018

Hmmmm, I just bumped both Project.toml versions; can you try again? In the worst case, you might have to pkg> rm CSV first, then try again.
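
In full, the worst-case sequence would look something like this (a sketch, using the same branches as above):

pkg> rm CSV
pkg> add Tables#master
pkg> add CSV#master
pkg> add DataFrames#mdavezac-tables_integration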

@youngjaewoo commented Sep 13, 2018

@quinnj Nice, it works now. I tested one of my work files and here's the result.

julia> @time dd = CSV.File("$mo0_wkdir/test.csv") |> DataFrame
 40.767477 seconds (196.39 M allocations: 4.194 GiB, 5.24% gc time)
243×6725 DataFrame. Omitted printing of 6718 columns

I ran the CSV.read counterpart as a comparison, but it doesn't seem to finish (or hasn't finished yet). Not too important. In Julia 0.6, my experience is that the same table takes 15+ minutes to import with CSV.read.

julia> @time dd2 = CSV.read("$mo0_wkdir/test.csv") |> DataFrame
┌ Warning: CSV.read(file) will return a CSV.File object in the future; to return a DataFrame, use `df = CSV.read(file) |> DataFrame`
│   caller = ip:0x0
└ @ Core :-1

So... I also had a problem with CSV.write taking a long time to write the same file. Any recommendations? At the moment, I'm writing files with the writedlm function after converting the DataFrame into an Array.

julia> @time dd |> CSV.write("test2.csv")
242.650145 seconds (188.31 M allocations: 4.418 GiB, 0.57% gc time)
"test2.csv"

@tk3369 (Contributor, Author) commented Sep 13, 2018

The new machinery is awesome!

julia> for i in r
         println(now(), " reading $i x 1 file")
         @time CSV.File("random_$(i)_1.csv") |> DataFrame
       end
2018-09-13T14:30:20.839 reading 50 x 1 file
  2.678035 seconds (13.53 M allocations: 595.456 MiB, 14.48% gc time)
2018-09-13T14:30:23.662 reading 100 x 1 file
  0.378636 seconds (757.09 k allocations: 35.770 MiB, 2.11% gc time)
2018-09-13T14:30:24.041 reading 150 x 1 file
  0.479409 seconds (809.72 k allocations: 37.995 MiB, 2.98% gc time)
2018-09-13T14:30:24.52 reading 200 x 1 file
  0.537810 seconds (868.96 k allocations: 40.498 MiB, 2.55% gc time)
2018-09-13T14:30:25.059 reading 250 x 1 file
  0.607706 seconds (932.02 k allocations: 43.253 MiB, 2.44% gc time)
2018-09-13T14:30:25.667 reading 300 x 1 file
  0.666309 seconds (1.00 M allocations: 45.997 MiB, 2.23% gc time)
2018-09-13T14:30:26.334 reading 350 x 1 file
  0.727092 seconds (1.08 M allocations: 49.032 MiB, 1.83% gc time)
2018-09-13T14:30:27.061 reading 400 x 1 file
  0.797276 seconds (1.16 M allocations: 52.293 MiB, 2.46% gc time)
2018-09-13T14:30:27.859 reading 450 x 1 file
  0.867660 seconds (1.24 M allocations: 55.745 MiB, 1.60% gc time)
2018-09-13T14:30:28.727 reading 500 x 1 file
  0.961551 seconds (1.33 M allocations: 59.364 MiB, 2.36% gc time)
2018-09-13T14:30:29.689 reading 550 x 1 file
  1.058806 seconds (1.51 M allocations: 64.489 MiB, 2.36% gc time)
2018-09-13T14:30:30.748 reading 600 x 1 file
  1.128946 seconds (1.71 M allocations: 70.212 MiB, 2.31% gc time)
2018-09-13T14:30:31.877 reading 650 x 1 file
  1.246766 seconds (1.93 M allocations: 76.215 MiB, 2.83% gc time)
2018-09-13T14:30:33.124 reading 700 x 1 file
  1.350643 seconds (2.17 M allocations: 82.561 MiB, 1.93% gc time)
2018-09-13T14:30:34.475 reading 750 x 1 file
  1.436988 seconds (2.44 M allocations: 89.427 MiB, 2.62% gc time)
2018-09-13T14:30:35.913 reading 800 x 1 file
  1.560247 seconds (2.72 M allocations: 96.634 MiB, 2.79% gc time)
2018-09-13T14:30:37.473 reading 850 x 1 file
  1.676558 seconds (3.02 M allocations: 104.539 MiB, 2.46% gc time)
2018-09-13T14:30:39.15 reading 900 x 1 file
  1.799242 seconds (3.35 M allocations: 112.569 MiB, 2.26% gc time)
2018-09-13T14:30:40.95 reading 950 x 1 file
  1.946015 seconds (3.69 M allocations: 120.977 MiB, 2.17% gc time)
2018-09-13T14:30:42.896 reading 1000 x 1 file
  2.135481 seconds (4.05 M allocations: 129.892 MiB, 3.07% gc time)

@tk3369 (Contributor, Author) commented Sep 15, 2018

Well... comparing with Pandas, it's not as awesome anymore: 0.03 seconds vs. 2.13 seconds. What could be the bottleneck?

In [19]: t1 = time.time(); df = pd.read_csv("random_1000_1.csv"); time.time() - t1
Out[19]: 0.03284716606140137

@quinnj (Member) commented Sep 15, 2018

CSV.jl isn't very well optimized for small-row scenarios like this. I think if you compared a larger # of rows, you'd find them more comparable. There's probably some low-hanging fruit here for performance, but so far, most of the effort has been spent on huge # of rows (millions & millions), followed most recently by large # of columns. Also, at least for the moment, there will always be a bit of compilation overhead for the "first run"; in this latest round of refactoring, I've tried to make that as small as possible, but especially for a large # of columns, that overhead will still be a factor on the first run.

julia> gencsv(1000, 1)
"random_1000_1.csv"

julia> @time df = DataFrame(CSV.File("random_1000_1.csv"))
  1.669940 seconds (4.21 M allocations: 137.956 MiB, 2.53% gc time)
1×1000 DataFrame. Omitted printing of 986 columns
│ Row │ col1     │ col2   │ col3      │ col4    │ col5      │ col6     │ col7     │ col8     │ col9     │ col10    │ col11    │ col12   │ col13    │ col14    │
├─────┼──────────┼────────┼───────────┼─────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 0.837208 │ 0.4249 │ 0.0213853 │ 0.70883 │ 0.0335719 │ 0.110028 │ 0.567079 │ 0.469167 │ 0.501812 │ 0.654153 │ 0.773796 │ 0.47041 │ 0.321471 │ 0.359337 │

julia> @time df = DataFrame(CSV.File("random_1000_1.csv"))
  0.009046 seconds (46.96 k allocations: 1.873 MiB)
1×1000 DataFrame. Omitted printing of 986 columns
│ Row │ col1     │ col2   │ col3      │ col4    │ col5      │ col6     │ col7     │ col8     │ col9     │ col10    │ col11    │ col12   │ col13    │ col14    │
├─────┼──────────┼────────┼───────────┼─────────┼───────────┼──────────┼──────────┼──────────┼──────────┼──────────┼──────────┼─────────┼──────────┼──────────┤
│ 1   │ 0.837208 │ 0.4249 │ 0.0213853 │ 0.70883 │ 0.0335719 │ 0.110028 │ 0.567079 │ 0.469167 │ 0.501812 │ 0.654153 │ 0.773796 │ 0.47041 │ 0.321471 │ 0.359337 │

@tk3369 (Contributor, Author) commented Sep 15, 2018

Oh... your "first run" comment surprised me. I thought the first-run cost only applied the first time the code runs, but it appears there is also a hit the first time each file is read? Is it generating custom code for each file?

julia> foo1(r) = for i in r
         file = "random_$(i)_1.csv"
         println(now(), " $i columns x 1 row:")
         @time CSV.File(file) |> DataFrame
         @time CSV.File(file) |> DataFrame
         @time CSV.File(file) |> DataFrame
       end

julia> foo1(200:200:1000)
2018-09-15T09:39:39.183 200 columns x 1 row:
  0.892967 seconds (1.15 M allocations: 54.833 MiB, 9.05% gc time)
  0.001869 seconds (9.68 k allocations: 383.766 KiB)
  0.001727 seconds (9.45 k allocations: 379.875 KiB)
2018-09-15T09:39:40.223 400 columns x 1 row:
  0.802431 seconds (1.16 M allocations: 52.273 MiB, 1.56% gc time)
  0.003141 seconds (18.88 k allocations: 712.797 KiB)
  0.002921 seconds (18.32 k allocations: 701.672 KiB)
2018-09-15T09:39:41.032 600 columns x 1 row:
  1.129694 seconds (1.71 M allocations: 70.203 MiB, 1.55% gc time)
  0.004722 seconds (27.82 k allocations: 1.157 MiB)
  0.004378 seconds (27.16 k allocations: 1.146 MiB)
2018-09-15T09:39:42.171 800 columns x 1 row:
  1.559849 seconds (2.72 M allocations: 96.589 MiB, 2.18% gc time)
  0.005739 seconds (37.15 k allocations: 1.488 MiB)
  0.005630 seconds (36.25 k allocations: 1.472 MiB)
2018-09-15T09:39:43.743 1000 columns x 1 row:
  2.082767 seconds (4.05 M allocations: 129.980 MiB, 1.82% gc time)
  0.007395 seconds (46.80 k allocations: 1.826 MiB)
  0.007601 seconds (45.61 k allocations: 1.805 MiB)

Setting the first-run issue aside, it gets better with more rows, but there is still a 5x difference compared to Pandas for 1,000 rows.

julia> gencsv(1000, 1000)
"random_1000_1000.csv"

julia> @benchmark DataFrame(CSV.File("random_1000_1000.csv")) seconds=30
BenchmarkTools.Trial: 
  memory estimate:  241.43 MiB
  allocs estimate:  14146767
  --------------
  minimum time:     2.442 s (12.39% GC)
  median time:      2.518 s (12.59% GC)
  mean time:        2.520 s (13.37% GC)
  maximum time:     2.604 s (13.91% GC)
  --------------
  samples:          12
  evals/sample:     1

Python:

In [3]: t1 = time.time(); df = pd.read_csv("random_1000_1000.csv"); time.time() - t1
Out[3]: 0.5107619762420654

@quinnj (Member) commented Sep 15, 2018

Yes, like I mentioned, right now the code is optimized for large-row datasets (i.e. millions of rows), where I believe it may be faster than any other parser. There are probably some things we could do to improve these other cases, but at least for my personal uses, it's not as high a priority.

Yes, we compile specialized code for each file after detecting its shape/types.
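
If the schema is known up front, passing the column types explicitly may cut down the per-file detection work (a sketch only, assuming the types keyword accepted by CSV.File in current releases):

using CSV, DataFrames

coltypes = fill(Float64, 1000)   # all 1,000 columns in the generated files are Float64

df = CSV.File("random_1000_1.csv"; types=coltypes) |> DataFrame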

@tk3369 (Contributor, Author) commented Sep 15, 2018

I understand. My general use cases don't go that far, though. Even with a 50,000-row x 1,000-column (1 GiB) file, there's still a large gap (96s vs. 25s) compared to Pandas. IMHO, there is a sweet spot where general users will enjoy Julia more once it's snappier.

I'm closing this issue as you already have awesome new machinery in the pipeline. Thanks again for the help and insights above.

tk3369 closed this as completed Sep 15, 2018
@quinnj (Member) commented Sep 15, 2018

In the last case, we're actually hitting a float-parsing performance issue when floats have full precision. You can see the difference by modifying gencsv to round to just 7 digits of precision:

julia> function gencsv(n, m)
         df = DataFrame([round.(rand(m), digits=7) for _ ∈ 1:n], Symbol.(["col$i" for i ∈ 1:n]))
         CSV.write("random_$(n)_$(m).csv", df)
       end
gencsv (generic function with 1 method)

julia> gencsv(1000, 50000)
"random_1000_50000.csv"

julia> @time df = DataFrame(CSV.File("random_1000_50000.csv"));
 10.684627 seconds (16.91 M allocations: 689.887 MiB, 1.54% gc time)

@tk3369 (Contributor, Author) commented Sep 15, 2018

Now I know your computer is faster than mine :-)

Just FYI, here are my test results. Note that the file size is roughly cut in half when using 7 digits, but performance is at least 5x better.

shell> ls -l random_1000_50000.csv random_1000_50000_digits7.csv
-rw-r--r--  1 tomkwong  staff  963495304 Sep 15 10:56 random_1000_50000.csv
-rw-r--r--  1 tomkwong  staff  494444408 Sep 15 12:39 random_1000_50000_digits7.csv

julia> @time CSV.File("random_1000_50000_digits7.csv") |> DataFrame;
 35.093596 seconds (21.03 M allocations: 823.909 MiB, 3.71% gc time)

julia> @time CSV.File("random_1000_50000.csv") |> DataFrame;
190.210782 seconds (362.28 M allocations: 6.263 GiB, 33.12% gc time)

julia> @time CSV.File("random_1000_50000_digits7.csv") |> DataFrame;
 36.249126 seconds (16.91 M allocations: 689.842 MiB, 4.14% gc time)

julia> @time CSV.File("random_1000_50000.csv") |> DataFrame;
207.585322 seconds (362.29 M allocations: 6.263 GiB, 29.11% gc time)
