reduce gc load in readdlm #10465
Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing. ref: JuliaLang#10428
99% reduction in execution time woot |
How far are we from pandas now? Is this the kind of thing our GC should be able to handle better, or that our compiler should handle better with automatic insertion of free statements? |
This change would bring us to 2.2x slower than pandas and 4.3x slower than R's data.table. I'm guessing that there's more garbage reduction to be had. |
I have to say though that the timings can't be directly compared with the numbers I posted in the earlier issue. The current numbers are with the dimensions prespecified, which in my testing cut the execution time by 40%. Possibly the margin is smaller here. Maybe we should have a "big data" mode that determines the data dimensions by shelling out to wc and head | cut, since that takes less than 2 seconds for files of this size ;) Nonetheless, big speedup. Thanks @tanmaykm! |
I understand this was mostly a joke, but there could be an argument for counting the number of rows by going over the whole file once (in pure Julia code, of course) and assuming that the number of columns is that of the first line. If that's more efficient in practice, it could even be the default. |
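For illustration, the pre-scan idea above could be sketched in pure Julia roughly as follows. This is only a sketch under stated assumptions: `guess_dims` is a hypothetical name (it is not part of Base or of this PR), it assumes a single-byte delimiter and `\n` line endings, it ignores quoting and comments, and it is written against current Julia (1.x) syntax rather than 0.4.

```julia
# Hypothetical helper (not part of Base or this PR): estimate (rows, cols)
# of a delimited file in a single pass over its bytes.
# Assumptions: single-byte delimiter, '\n' line endings, no quoted fields.
function guess_dims(path::AbstractString; delim::Char = '\t', eol::Char = '\n')
    d, e = UInt8(delim), UInt8(eol)
    buf = Vector{UInt8}(undef, 65536)   # reused read buffer, no per-line allocation
    nrows = 0
    ndelims = 0                          # delimiters seen on the first line only
    in_first_line = true
    seen_any = false
    last_byte = e
    io = open(path, "r")
    try
        while !eof(io)
            n = readbytes!(io, buf)      # reads up to length(buf) bytes
            n == 0 && break
            seen_any = true
            @inbounds for i in 1:n
                b = buf[i]
                if b == e
                    nrows += 1
                    in_first_line = false
                elseif in_first_line && b == d
                    ndelims += 1
                end
            end
            last_byte = buf[n]
        end
    finally
        close(io)
    end
    seen_any || return (0, 0)
    last_byte == e || (nrows += 1)       # last line without a trailing newline
    return (nrows, ndelims + 1)
end
```

The resulting tuple could then be handed to `readdlm` through its `dims` option. Whether the extra pass pays for itself would have to be measured on real files.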
shell> cat > try.tsv #Has Evil Unicode and NUL character at the end of the first line
3 .14159 is π でしょう
2 .71828 is e maybe junk
^D
julia> readdlm("try.tsv")
2x6 Array{Any,2}:
3.0 0.14159 "is" "π" "でしょう" ""
2.0 0.71828 "is" "e" "maybe" "junk"
👍 |
shell> cat > try.tsv
1Doe a deer , a female deer
2Ray a drop of golden sun
3Mi a name I call myself
^D
julia> readdlm("try.tsv")
3x7 Array{Any,2}:
"1Doe" "a" "deer" "," "a" "female" "deer"
"2Ray" "a" "drop" "of" "golden" "sun" ""
"3Mi" "a" "name" "I" "call" "myself" ""
shell> cat try.tsv #Has NUL character in the last field
5.uper cali fragi listic expe a lidocious
^D
julia> readdlm("try.tsv")
1x7 Array{Any,2}:
"5.uper" "cali" "fragi" "listic" "expe" "a" "lidoc\0ious"
👍 |
So a NUL character by itself in its own field is not parsed into
👍 |
Thanks @tanmaykm, this is much needed! It would be good to add some simple wrappers for `substrtod` and `memcmp`, to make the code less repetitive and more readable. We should absolutely try very hard to guess the result size. Counting lines is very cheap. When we guess right, there will be a big improvement. If we guess wrong, it won't really be worse than it is now. Hopefully some GC knobs can be tuned to better handle the case of a rapidly growing live heap. |
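As a rough illustration of the kind of wrapper suggested here, a thin `memcmp` helper could look like the sketch below. `bytes_equal` is a hypothetical name, not the one used in the PR, and the code is written against Julia 1.x (`GC.@preserve`, `Vector{UInt8}`), so it is not what a 0.4 version would look like. It compares a field inside a byte buffer against a literal without creating a `SubString`.

```julia
# Hypothetical wrapper (name and signature are assumptions, not from the PR):
# checks whether the `len` bytes of `buf` starting after `offset` (0-based)
# are identical to the bytes of `lit`, using libc's memcmp directly.
function bytes_equal(buf::Vector{UInt8}, offset::Integer, len::Integer, lit::String)
    len == sizeof(lit) || return false                      # lengths must match
    (offset >= 0 && offset + len <= length(buf)) || return false  # bounds check
    len == 0 && return true
    # Keep both objects alive while raw pointers into them are in use.
    r = GC.@preserve buf lit ccall(:memcmp, Cint,
                                   (Ptr{UInt8}, Ptr{UInt8}, Csize_t),
                                   pointer(buf, offset + 1), pointer(lit), len)
    return r == 0
end

row = Vector{UInt8}("3\ttrue\tfalse")
bytes_equal(row, 2, 4, "true")    # field 2 occupies bytes 3 to 6
```

Keeping the `ccall` in one small function like this means the parsing loop only deals with offsets and lengths, which is the point of avoiding per-column string objects.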
Wes's blog post on memory-efficient, fast read_csv in pandas suggests some CSV files to benchmark (a lot of work has gone into CSV reading in pandas, R, and data.table since then). |
Yes. Thanks for the tips. I think I'll try something along these lines:
|
Let's merge this for now so that we can start eliminating the second pass through the data. |
Can this be backported? It'd be great to have these speedups in 0.3. |
There's some 0.4-specific syntax here. I can't tell whether anything else here would otherwise be 0.4-specific (though the surrounding code may be much different due to other intervening PRs?), but I would definitely want to redo the performance comparison for any potential backport, considering the GC is completely different. |
@tanmaykm You can read more details on R data.table's fread, with a lot of references, here: e.g. "The first 5 rows, middle 5 rows and last 5 rows are then read to determine column types" and "There are no buffers used in fread's C code at all." |
@Ken-B your comment looks clipped. |
@pao I was actually finished after the quote, but thanks anyway for the poke. I was just trying to add some inspiration to get closer to R DataTable's excellent fread function. |
Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing. This gets `readdlm` performance closer to that with gc disabled. ref: #10428