reduce gc load in readdlm #10465

tanmaykm · 2015-03-10T07:43:22Z

Using direct ccall wherever possible instead of creating SubString for every column during parsing.
This gets readdlm performance closer to that with gc disabled.
ref: #10428

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 206.317901076 seconds (17021 MB allocated, 45.34% gc time in 12 pauses with 9 full sweep)

julia> a=nothing; @time gc()
elapsed time: 75.903634913 seconds (96 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)

julia> # with gc disabled
       gc_disable()
true

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 116.272807009 seconds (17021 MB allocated)

julia> gc_enable(); a=nothing; @time gc();
elapsed time: 76.882703601 seconds (80 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)


jiahao · 2015-03-10T12:35:16Z

99% reduction in execution time woot

ViralBShah · 2015-03-10T12:40:44Z

How far are we from pandas now?

Is this the kind of thing our GC should be able to handle better, or something our compiler could handle better by automatically inserting free statements?
Cc: @carnaval

jiahao · 2015-03-10T12:45:17Z

This change would bring us to 2.2x slower than pandas and 4.3x slower than R's data.table.

I'm guessing that there's more garbage reduction to be had.

jiahao · 2015-03-10T13:03:23Z

I have to say though that the timings can't be directly compared with the numbers I posted in the earlier issue. The current numbers are with the dimensions prespecified, which in my testing cut the execution time by 40%. Possibly the margin is smaller here.

Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

Nonetheless, big speedup. Thanks @tanmaykm!

nalimilan · 2015-03-10T13:22:26Z

> Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

I understand this was mostly a joke, but there could be an argument for counting the number of rows by going over the whole file once (in pure Julia code, of course) and assuming that the number of columns is that of the first line. If that's more efficient in practice, it could even be the default.
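A minimal sketch of that one-pass idea, assuming a recent Julia 1.x and a file already loaded as bytes. The name `guess_dims` is hypothetical and is not part of `readdlm`; it only counts newlines and takes the column count from the first line, so quoted fields containing delimiters or newlines would throw it off.

```julia
# Hypothetical helper (not part of readdlm): estimate (rows, columns)
# from raw bytes by counting newlines and first-line delimiters.
function guess_dims(data::Vector{UInt8}, delim::UInt8=UInt8('\t'))
    isempty(data) && return (0, 0)
    nl = UInt8('\n')
    # Columns: number of delimiters on the first line, plus one.
    first_end = something(findfirst(==(nl), data), length(data) + 1)
    ncols = count(==(delim), view(data, 1:first_end-1)) + 1
    # Rows: newline count, plus one if the last line has no trailing newline.
    nrows = count(==(nl), data) + (data[end] == nl ? 0 : 1)
    return (nrows, ncols)
end
```

The result could then be passed as the `dims` argument used in the timings at the top of this thread, for example `readdlm(path, '\t', dims=guess_dims(read(path)))`, with a fallback if the guess turns out to be wrong.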

jiahao · 2015-03-10T15:13:49Z

shell>  cat > try.tsv #Has Evil Unicode and NUL character at the end of the first line
3   .14159  is  π  でしょう    
2   .71828  is  e   maybe   junk
^D
julia> readdlm("try.tsv")
2x6 Array{Any,2}:
 3.0  0.14159  "is"  "π"  "でしょう"   ""    
 2.0  0.71828  "is"  "e"  "maybe"  "junk"

👍

jiahao · 2015-03-10T15:20:13Z

shell> cat > try.tsv
1Doe    a   deer    ,   a       female deer
2Ray    a   drop    of  golden  sun
3Mi a   name    I   call    myself
^D
julia> readdlm("try.tsv")
3x7 Array{Any,2}:
 "1Doe"  "a"  "deer"  ","   "a"       "female"  "deer"
 "2Ray"  "a"  "drop"  "of"  "golden"  "sun"     ""    
 "3Mi"   "a"  "name"  "I"   "call"    "myself"  ""

shell> cat > try.tsv #Has NUL character in the last field
5.uper  cali    fragi   listic  expe    a   lidocious
^D
julia> readdlm("try.tsv")
1x7 Array{Any,2}:
 "5.uper"  "cali"  "fragi"  "listic"  "expe"  "a"  "lidoc\0ious"

👍

jiahao · 2015-03-10T15:30:15Z

So a NUL character by itself in its own field is not parsed into "\0", but rather "". A NUL character is correctly read into a string if it occurs partway through the content. (See the π input.) I think that's the only edge case I've found so far.

shell> cat > try.tsv #Has NUL character and invisible space U+200B
3.141592αβ
^D
julia> readdlm("try.tsv", '\0')
1x2 Array{Any,2}:
 3.141  "592α\u200bβ"

👍

JeffBezanson · 2015-03-10T15:36:20Z

Thanks @tanmaykm , this is much needed!

It would be good to add some simple wrappers for substrtod and memcmp, to make the code less repetitive and more readable.
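As a rough illustration of what such a wrapper might look like, here is a sketch for the `memcmp` case only, assuming a recent Julia 1.x. The name `bytes_equal` and its argument order are hypothetical and not taken from this PR; only the libc `memcmp` call itself is real. A `substrtod` wrapper would follow the same pattern, but it relies on an internal runtime function whose signature is not shown in this thread, so it is left out.

```julia
# Hypothetical wrapper: compare `len` bytes of `buf`, starting after a
# zero-based `offset`, against the byte literal `lit`, without allocating
# a SubString for the field.
function bytes_equal(buf::Vector{UInt8}, offset::Int, len::Int, lit::Vector{UInt8})
    len == length(lit) || return false
    len == 0 && return true
    checkbounds(buf, offset+1:offset+len)   # keep the raw pointer access safe
    # GC.@preserve keeps both arrays alive while raw pointers are in use.
    r = GC.@preserve buf lit ccall(:memcmp, Cint,
                                   (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
                                   pointer(buf, offset + 1), pointer(lit), len)
    return r == 0
end
```

With a helper like this, a check such as "is this field the literal `true`?" becomes a single call on the parse buffer instead of a repeated inline `ccall`.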

We should absolutely try very hard to guess the result size. Counting lines is very cheap. When we guess right, there will be a big improvement. If we guess wrong, it won't really be worse than it is now.

Hopefully some GC knobs can be tuned to better handle the case of a rapidly growing live heap.

hayd · 2015-03-10T17:54:10Z

Wes' blog on memory efficient/fast read_csv in pandas suggests some csv files to benchmark (a lot of work has gone into pandas + R + data.table read_csv since then).

tanmaykm · 2015-03-10T18:16:51Z

Yes. Thanks for the tips. I think I'll try something on these lines:

- guess the number of columns and some of their attributes using the first few rows' worth of data
- guess the number of rows by making one pass over the file if the file size is within acceptable limits, else through extrapolation
- if the guess turns out wrong, reallocate, copy over the already processed data, continue
- process file in manageable chunks
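The guess-then-reallocate part of this plan can be sketched roughly as follows, assuming a recent Julia 1.x and all-numeric rows. Both function names are hypothetical and the doubling policy is an assumption, not something decided in this thread.

```julia
# Hypothetical: extrapolate the total row count from a sample of the file.
extrapolate_rows(filesize::Int, sample_bytes::Int, sample_rows::Int) =
    cld(filesize * sample_rows, sample_bytes)

# Hypothetical: fill a preallocated matrix from an iterator of parsed rows,
# growing it when the row guess turns out to be too small.
function fill_rows(rows, ncols::Int, nrows_guess::Int)
    data = Matrix{Float64}(undef, max(nrows_guess, 1), ncols)
    nrows = 0
    for row in rows
        if nrows == size(data, 1)
            # Guess was wrong: reallocate at double the size and copy over
            # the rows processed so far, then continue.
            bigger = Matrix{Float64}(undef, 2 * size(data, 1), ncols)
            bigger[1:nrows, :] = data
            data = bigger
        end
        nrows += 1
        data[nrows, :] .= row
    end
    return data[1:nrows, :]   # trim any unused rows if the guess was too large
end
```

When the guess is right, no reallocation or copy happens inside the loop; when it is wrong, the cost is a few extra copies rather than a second pass over the file.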

jiahao · 2015-03-11T15:09:06Z

Let's merge this for now so that we can start eliminating the second pass through the data.


quinnj · 2015-03-24T13:26:14Z

Can this be backported? It'd be great to have these speedups in 0.3.

tkelman · 2015-03-25T09:35:38Z

There's some 0.4-specific syntax here. I can't tell whether anything else here would otherwise be 0.4-specific (though the surrounding code may be much different due to other intervening PRs?), but I would definitely want to redo the performance comparison for any potential backport, considering the GC is completely different.

Ken-B · 2015-04-04T13:44:14Z

@tanmaykm You can read more about some details on R DataTable's fread with a lot of references here:
http://www.inside-r.org/packages/cran/data.table/docs/fread

Eg. "The first 5 rows, middle 5 rows and last 5 rows are then read to determine column types" and "There are no buffers used in fread's C code at all."

pao · 2015-04-04T16:30:15Z

@Ken-B your comment looks clipped.

Ken-B · 2015-04-05T18:53:08Z

@pao I was actually finished after the quote, but thanks anyway for the poke. I was just trying to add some inspiration to get closer to R DataTable's excellent fread function.

reduce gc load in readdlm

d3758a1

Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing. ref: JuliaLang#10428

ViralBShah added the performance Must go faster label Mar 10, 2015

jiahao added a commit that referenced this pull request Mar 11, 2015

Merge pull request #10465 from tanmaykm/readcsvopt

adb9095

reduce gc load in readdlm

jiahao merged commit adb9095 into JuliaLang:master Mar 11, 2015
