
reduce gc load in readdlm #10465

Merged · 1 commit merged on Mar 11, 2015
Conversation

tanmaykm (Member)

Using a direct ccall wherever possible instead of creating a SubString for every column during parsing.
This gets readdlm performance closer to what it is with GC disabled.
ref: #10428

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 206.317901076 seconds (17021 MB allocated, 45.34% gc time in 12 pauses with 9 full sweep)

julia> a=nothing; @time gc()
elapsed time: 75.903634913 seconds (96 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)

julia> # with gc disabled
       gc_disable()
true

julia> @time a = readdlm("try7nonuls", '\t', dims=(10^7,46));
elapsed time: 116.272807009 seconds (17021 MB allocated)

julia> gc_enable(); a=nothing; @time gc();
elapsed time: 76.882703601 seconds (80 bytes allocated, 100.00% gc time in 2 pauses with 1 full sweep)
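The core of the change can be sketched like this (a simplified illustration, not the actual datafmt.jl code; the helper name `parsefield` and the exact `jl_substrtod` signature shown are assumptions):

```julia
# Simplified sketch of the idea. Before, every numeric field allocated a
# SubString that immediately became garbage:
#     v = float64(SubString(sbuff, startpos, endpos))
# After, the field is parsed straight out of the byte buffer via ccall,
# allocating nothing per field. The jl_substrtod signature here is
# approximate, shown for illustration only.
function parsefield(sbuff::Vector{UInt8}, startpos::Int, endpos::Int)
    out = Ref{Float64}(0.0)
    ok = ccall(:jl_substrtod, Cint,
               (Ptr{UInt8}, Csize_t, Csize_t, Ref{Float64}),
               sbuff, startpos - 1, endpos - startpos + 1, out) == 0
    return ok ? out[] : nothing   # nothing signals "not a number"
end
```

Avoiding one short-lived allocation per field is what shrinks the GC load, since a 10^7 x 46 file otherwise creates hundreds of millions of SubStrings.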

@ViralBShah added the performance ("Must go faster") label on Mar 10, 2015
@jiahao (Member) commented Mar 10, 2015

99% reduction in execution time woot

@ViralBShah (Member)

How far are we from pandas now?

Is this the kind of thing our GC should be able to handle better, or that our compiler should handle with automatic insertion of free statements?
Cc: @carnaval

@jiahao (Member) commented Mar 10, 2015

This change would bring us to 2.2x slower than pandas and 4.3x slower than R's data.table.

I'm guessing that there's more garbage reduction to be had.

@jiahao (Member) commented Mar 10, 2015

I have to say, though, that these timings can't be directly compared with the numbers I posted in the earlier issue. The current numbers are with the dimensions prespecified, which in my testing cut the execution time by 40%. Possibly the margin is smaller here.

Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

Nonetheless, big speedup. Thanks @tanmaykm!

@nalimilan (Member)

Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;)

I understand this was mostly a joke, but there could be an argument to trigger counting the number of rows by going over the whole file once (in pure Julia code, of course), and assuming that the number of columns is that of the first line. If that's more efficient in practice, it could even be the default.
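A one-pass count in pure Julia might look like this (a minimal sketch; `guessdims` is a hypothetical name, and it assumes the first line is representative of the column count):

```julia
# Minimal sketch (hypothetical helper): one cheap pass to guess the
# dimensions, assuming newline-terminated rows and that row 1 contains
# all the columns.
function guessdims(path::AbstractString, delim::Char)
    nrows = 0
    ncols = 0
    open(path) do io
        for (i, line) in enumerate(eachline(io))
            i == 1 && (ncols = count(==(delim), line) + 1)
            nrows += 1
        end
    end
    return nrows, ncols
end
```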

@jiahao (Member) commented Mar 10, 2015

shell>  cat > try.tsv #Has Evil Unicode and NUL character at the end of the first line
3   .14159  is  π  でしょう    
2   .71828  is  e   maybe   junk
^D
julia> readdlm("try.tsv")
2x6 Array{Any,2}:
 3.0  0.14159  "is"  "π"  "でしょう"   ""    
 2.0  0.71828  "is"  "e"  "maybe"  "junk"

👍

@jiahao (Member) commented Mar 10, 2015

shell> cat > try.tsv
1Doe    a   deer    ,   a       female deer
2Ray    a   drop    of  golden  sun
3Mi a   name    I   call    myself
^D
julia> readdlm("try.tsv")
3x7 Array{Any,2}:
 "1Doe"  "a"  "deer"  ","   "a"       "female"  "deer"
 "2Ray"  "a"  "drop"  "of"  "golden"  "sun"     ""    
 "3Mi"   "a"  "name"  "I"   "call"    "myself"  ""    
shell> cat try.tsv #Has NUL character in the last field
5.uper  cali    fragi   listic  expe    a   lidocious
^D
julia> readdlm("try.tsv")
1x7 Array{Any,2}:
 "5.uper"  "cali"  "fragi"  "listic"  "expe"  "a"  "lidoc\0ious"

👍

@jiahao (Member) commented Mar 10, 2015

So a NUL character by itself in its own field is not parsed into "\0", but rather "". A NUL character is correctly read into a string if it occurs partway through the content. (See the π input.) I think that's the only edge case I've found so far.

shell> cat > try.tsv #Has NUL character and invisible space U+200B
3.141592α​β
^D
julia> readdlm("try.tsv", '\0')
1x2 Array{Any,2}:
 3.141  "592α\u200bβ"

👍

@JeffBezanson (Member)

Thanks @tanmaykm, this is much needed!

It would be good to add some simple wrappers for substrtod and memcmp, to make the code less repetitive and more readable.

We should absolutely try very hard to guess the result size. Counting lines is very cheap. When we guess right, there will be a big improvement. If we guess wrong, it won't really be worse than it is now.

Hopefully some GC knobs can be tuned to better handle the case of a rapidly growing live heap.
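The wrappers suggested above might look something like this (illustrative names, not the code that eventually landed):

```julia
# Sketch of thin wrappers so call sites stay readable (names are
# illustrative). memcmp_bytes compares raw bytes without allocating:
memcmp_bytes(a::Vector{UInt8}, b::Vector{UInt8}, len::Integer) =
    ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t), a, b, len)

# e.g. testing a field against a literal without materializing a
# SubString for it (GC.@preserve keeps the arrays rooted while raw
# pointers are in flight):
field_is(buf::Vector{UInt8}, pos::Int, lit::Vector{UInt8}) =
    GC.@preserve buf lit begin
        pos + length(lit) - 1 <= length(buf) &&
            ccall(:memcmp, Cint, (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
                  pointer(buf, pos), pointer(lit), length(lit)) == 0
    end
```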

@hayd (Member) commented Mar 10, 2015

Wes's blog post on memory-efficient/fast read_csv in pandas suggests some CSV files to benchmark (a lot of work has gone into the pandas, R, and data.table CSV readers since then).

@tanmaykm (Member, Author)

Yes. Thanks for the tips. I think I'll try something along these lines:

  • guess the number of columns and some of their attributes using the first few rows worth of data
  • guess the number of rows by making one pass over the file if the file size is within acceptable limits, else through extrapolation
  • if the guess turns out wrong, reallocate, copy over the already processed data, continue
  • process file in manageable chunks
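The third step above (recovering from a wrong guess) could be sketched as follows (`growrows` is a hypothetical helper, not code from this PR):

```julia
# Hypothetical sketch of the reallocate-and-copy step: if the row guess
# was too low, allocate a larger result, copy the rows parsed so far,
# and continue filling.
function growrows(cells::Matrix, newrows::Int)
    grown = similar(cells, newrows, size(cells, 2))
    grown[1:size(cells, 1), :] = cells
    return grown
end
```

Doubling `newrows` on each miss keeps the total copying cost amortized linear in the file size.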

@jiahao (Member) commented Mar 11, 2015

Let's merge this for now so that we can start eliminating the second pass through the data.

jiahao added a commit that referenced this pull request Mar 11, 2015
@jiahao jiahao merged commit adb9095 into JuliaLang:master Mar 11, 2015
@quinnj (Member) commented Mar 24, 2015

Can this be backported? It'd be great to have these speedups in 0.3.

@tkelman (Contributor) commented Mar 25, 2015

There's some 0.4-specific syntax here. I can't tell whether anything else would be 0.4-specific (the surrounding code may differ substantially due to other intervening PRs), but we'd definitely want to redo the performance comparison for any potential backport, considering the GC is completely different.

@Ken-B (Contributor) commented Apr 4, 2015

@tanmaykm You can read more details about R data.table's fread, with a lot of references, here:
http://www.inside-r.org/packages/cran/data.table/docs/fread

E.g., "The first 5 rows, middle 5 rows and last 5 rows are then read to determine column types" and "There are no buffers used in fread's C code at all."
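That sampling idea could be sketched as follows (`sniffcol` is a hypothetical name; real type sniffing handles more types and falls back on mid-file surprises):

```julia
# Hypothetical sketch of fread-style type sniffing: look only at sampled
# rows (e.g. the first, middle, and last 5) and pick the narrowest type
# that parses every sampled value in the column.
function sniffcol(samples::Vector{String})
    all(s -> tryparse(Int, s) !== nothing, samples) && return Int
    all(s -> tryparse(Float64, s) !== nothing, samples) && return Float64
    return String
end
```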

@pao (Member) commented Apr 4, 2015

@Ken-B your comment looks clipped.

@Ken-B (Contributor) commented Apr 5, 2015

@pao I was actually finished after the quote, but thanks anyway for the poke. I was just trying to add some inspiration for getting closer to R data.table's excellent fread function.
