Poor performance of readdlm on master #10428
Comments
There's no doubt that Pandas is doing much better than we are. That said, I'd be interested to see how CSVReaders compares. For very large data sets, it seems to do much better.
Actually, I take that back. CSVReaders is slower: it just uses 50% as much memory, but takes noticeably longer to run.
One of the things I suspect is problematic for the
I tried the following code to get a sense of things:
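A rough sketch of the kind of trivial reader being described, not the original snippet, just an illustration in current Julia syntax that assumes a clean, purely numeric, tab-separated file:

```julia
# Minimal illustrative TSV reader: split each line on tabs and parse every
# field as Float64. No quoting, comments, headers, or type inference.
function naive_tsv_read(filename)
    rows = Vector{Vector{Float64}}()
    for line in eachline(filename)
        push!(rows, [parse(Float64, f) for f in split(line, '\t')])
    end
    return rows
end

A = naive_tsv_read("testfile.tsv")   # file name assumed from later comments
```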
I compared this to `readdlm`.
"Only" a 10% speedup, but fairly shocking given how trivial my code is.
Are you shocked because it's a 10% speedup or because it's only a 10% speedup?
Because there's any speedup at all. A 500-line implementation ought to be much faster.
FWIW, I've done a bunch of profiling while working on CSVReaders. The places where you get killed are often surprising. For example, I got a huge speedup by hardcoding the sentinel values that you check for missing values and Booleans. Having those sentinel values stored in arrays that get looped through was roughly 50% slower than hardcoding them. I'm not sure I agree with the heuristic that a larger implementation should be much faster. For example, a lot of the code in
All that said, I would be ecstatic if we managed to figure out how to write a fast CSV parser in pure Julia, so I'm really happy to have you thinking about this.
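For concreteness, a micro-example of the two styles of sentinel check being contrasted (hypothetical names, not CSVReaders' actual code):

```julia
# Sentinels stored in an array and looped over: flexible, but every field
# check walks the array and does generic comparisons.
const NA_SENTINELS = ["NA", "NULL", ""]
isna_loop(field::AbstractString) = any(s -> field == s, NA_SENTINELS)

# Hardcoded sentinels: the comparisons are spelled out directly, which the
# compiler can handle much more cheaply.
isna_hardcoded(field::AbstractString) =
    field == "NA" || field == "NULL" || field == ""
```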
Part of the key seems to be separating CSV, which is a very complex format, from simple TSV. If you happen to have a dead-simple TSV file, we should be able to read it super fast.
While that's clearly true in the abstract, I think it's important that the pandas benchmark is a reader for the full CSV format.
With 20 more minutes of work, I can beat readdlm in this case by a full factor of 2 (https://gist.github.com/JeffBezanson/83fb1d79deefa2316067). I guess my point is that if you're going to have a large implementation, it might as well include special cases like this. Of course full CSV is what matters, but it's even worse to be slow on simple numeric TSV files.
Much of the complexity of `readdlm` is due to the following:
Some of these can be disabled with appropriate flags, but they do have some residual impact. Also, the approach of reading/mapping the whole file is probably not giving enough benefits anymore because of speedups in other parts of Julia; it is now proving to be a bottleneck with large files. A way to handle files in chunks will probably help here (see the sketch below). It would be good to have a simple and fast method while being able to plug in one or more routines to handle the complexity, but without a large impact on performance.
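A minimal sketch of what chunk-wise handling could look like, with hypothetical helper names rather than readdlm's actual code; it assumes newline-terminated records and hands each complete line to a caller-supplied function:

```julia
# Read a delimited file in fixed-size chunks instead of mapping the whole
# file, carrying any trailing partial line over to the next chunk.
function process_in_chunks(f, filename; chunksize = 16 * 1024 * 1024)
    open(filename) do io
        leftover = UInt8[]
        while !eof(io)
            bytes = vcat(leftover, read(io, chunksize))
            lastnl = findlast(==(UInt8('\n')), bytes)
            if lastnl === nothing
                leftover = bytes                  # no complete line in this chunk yet
            else
                for line in split(String(bytes[1:lastnl]), '\n'; keepempty = false)
                    f(line)                       # process one complete record
                end
                leftover = bytes[lastnl+1:end]
            end
        end
        isempty(leftover) || f(String(leftover))  # trailing record without a newline
    end
end

# Example: count fields per row without ever holding the whole file in memory.
process_in_chunks("testfile.tsv") do line
    length(split(line, '\t'))
end
```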
I think the approach of parsing the entire file before determining output dimensions is having a large impact. With that disabled (by specifying `dims`), the timings improve. But I was not able to read a 40GB file because it ran into memory/gc issues.
@tanmaykm here are the timings for the original example when the dimensions are pre-specified:

```julia
julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.017998925 seconds (3 MB allocated)
elapsed time: 0.309120058 seconds (41 MB allocated, 41.21% gc time in 2 pauses with 1 full sweep)
elapsed time: 4.848010965 seconds (423 MB allocated, 57.23% gc time in 18 pauses with 2 full sweep)
elapsed time: 184.121671697 seconds (4234 MB allocated, 88.44% gc time in 154 pauses with 16 full sweep)
elapsed time: 15579.129143684 seconds (42359 MB allocated, 98.69% gc time in 1512 pauses with 154 full sweep)
```

With gc disabled:

```julia
julia> for i=3:7; @time A=readdlm("testfile.tsv", '\t', dims=(10^i,46)); end
elapsed time: 0.019123992 seconds (3 MB allocated)
elapsed time: 0.199007951 seconds (41 MB allocated)
elapsed time: 2.031139609 seconds (423 MB allocated)
elapsed time: 19.807757219 seconds (4234 MB allocated)
elapsed time: 175.472120788 seconds (42359 MB allocated)
```

It does look like pre-specifying the dimensions helps cut execution time almost by half. So much the better, since the dimensions can be determined from UNIX shell commands first for a fraction of the cost in Julia ;-)
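As a concrete illustration of that last remark, a rough sketch (assuming the same testfile.tsv, a tab delimiter, and the same number of fields on every row) of computing the dimensions first and passing them to readdlm:

```julia
# Count rows and columns cheaply before parsing, then pass the known size to
# readdlm so it can skip its own sizing pass.
# (In current Julia, readdlm lives in the DelimitedFiles stdlib.)
nrows = countlines("testfile.tsv")
ncols = length(split(readline("testfile.tsv"), '\t'))
A = readdlm("testfile.tsv", '\t', dims = (nrows, ncols))
```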
Updated OP with timings for R's `read.table`.
@jiahao If you're looking for a fast function to benchmark, data.table's `fread` is a good reference point.
@jiahao I'll try and profile the code to figure out memory allocation hotspots. Also, the data is being read as an `Any` array, since several of the columns are non-numeric. In this particular case, cleaning/converting these to numeric values/factors, or separating them to be read from a different file, may help.
Seconding @nalimilan's suggestion of `fread`.
I'll have to re-devise the test then. This test file contains NUL characters, which breaks `fread`.
Updated OP with timings for `fread`. If anyone was interested, the NULs had to be stripped from the original file first.
Looks like the
Using direct `ccall` wherever possible instead of creating `SubString` for every column during parsing. ref: JuliaLang#10428
Here is how the timings look now. Unfortunately the quadratic scaling behavior still persists for n = 10^7 rows, at >200x slower than R's `fread`.
@carnaval can we turn a knob in the GC to make it back off faster?
We can; the current heuristics are nowhere near perfect, but changing this requires at least a bit of regression testing w.r.t. memory usage, which is a pain.
Manually disabling GC is not acceptable. Surely the jump from 11 to 1946 pauses in the last table row can be avoided. I'm even ok with 39% GC time, just not 96%.
Ok, I still haven't come around to doing this, sorry. The first-order improvement would probably be to introduce a custom StringArray which stores every String linearly in an IOBuffer and the offsets as ranges in a separate array. You would avoid most gc/allocation traffic, have much better cache behavior, etc. Mutation becomes a problem though, and pushing this to maximum performance would probably make you implement a specialized "depth-1" compacting gc for strings. I'm almost certain that doing it naively at first would already be a huge improvement compared to allocating each string separately. It's also easier for the gc because you only materialize the individual strings on access. Since a lot of the columns seem to be strings, I think it would be very worthwhile to try it.
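A very rough sketch of that idea (a naive version in current Julia syntax; the names and details are my own, not code from this thread): all string bytes live in one buffer, per-element offsets live in a separate array, and a String is only materialized on access.

```julia
# Strings packed back-to-back in a single byte buffer, with one byte range
# per element; getindex materializes the individual string on demand.
struct PackedStrings
    data::Vector{UInt8}
    offsets::Vector{UnitRange{Int}}
end

function PackedStrings(strs)
    data = UInt8[]
    offsets = UnitRange{Int}[]
    for s in strs
        start = length(data) + 1
        append!(data, codeunits(s))
        push!(offsets, start:length(data))
    end
    return PackedStrings(data, offsets)
end

Base.length(a::PackedStrings) = length(a.offsets)
Base.getindex(a::PackedStrings, i::Integer) = String(a.data[a.offsets[i]])

cols = PackedStrings(["GOOG", "AAPL", "MSFT"])   # made-up values
cols[2]                                          # "AAPL", materialized on access
```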
Bump. Does something have to be done on the GC front here, or on reorganizing `readdlm`?
Some more GC-related weirdness. I have a simple test: I create files containing n comma-separated integers, use readall to read it all in as a single string, split it with "," as separator, print the length of the read-in string, and exit, using the unix time command to get timing data. Here are the timings for Julia vs Python 2.7 on my machine (Ubuntu 14.04, 32 GB RAM):
The Python code:

```python
#just read contents of a file so I can time it
import string

def read_my_file(file_name):
    f = open(file_name)
    s = f.read()
    a = string.split(s,",")
    print(len(s))
    f.close()

read_my_file("sixhundredmillionints.txt")
```

The Julia code with GC on:

```julia
#just read a file so I can time it
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)
```

The Julia code without GC:

```julia
# gc off to avoid gc flakiness
gc_disable();
f = open("sixhundredmillionints.txt")
s = readall(f)
a = split(s,",")
print(length(s));
close(f)
```
Probably the same thing that is solved on ob/gctune.
So (sorry, newbie here, excuse the dumb question) are you saying the GC is fixed now and can handle large inputs without flailing? If so, any idea when that branch will get merged into main?
Well, it is not merged yet, but at least on this branch it should not do what I think it's doing in your example, which is: believing it is doing well when it's not, and because of that not widening the collection interval. There is no "general fix" for this, only improvements, although this one is much needed because otherwise, in cases like yours, the heuristics are actively making things worse.
Something I've been working on: trying to load up to a 100 million row, 24 column dataset consisting of random integers, floats and strings (limits: no other datatypes, no nulls, only comma as separator, doesn't infer schema) into a datastructure which is essentially this:

```julia
ColumnType = Union(Array{Int64,1},Array{Float64,1},Array{AbstractString,1})
#this should probably be a list, but array works for now
DataFrameType = Array{ColumnType,1}
```

i.e. a dataframe is an array of columns, each column being a Union of Int Array, Float Array, String Array (this is not an original idea, this is how the Pandas dataframe works, modulo the types).

I'm getting the following loading times on my 4 core 32 GB Ubuntu 14.04 machine. All times in seconds.
The code is here: https://github.com/RaviMohan/loaddataframe. The Python code generates the dataset; the Julia code loads the data from the file into the above datastructure. If the code is basically correct and not doing anything stupid (errors are always possible, all fixes appreciated in advance :-) ), it seems that without gc, plain Julia comes within a constant factor of R's fread, and handily beats R's read.table. Of course R's routines do much more, like inferring column datatypes, handling nulls, handling multiple formats etc. Still it seems that fast loading of large csv files is possible, once the gc issues are fixed and/or enough memory is available. I don't know how to write idiomatic Julia, so the code might be too imperative etc. Any style pointers (and of course bug reports/fixes) greatly appreciated.
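As a purely hypothetical illustration of the column-wise layout described above (written with current Union{...} syntax; the values are made up):

```julia
# A "dataframe" as a vector of concretely typed columns.
const ColumnType = Union{Vector{Int64}, Vector{Float64}, Vector{AbstractString}}
const DataFrameType = Vector{ColumnType}

df = DataFrameType()
push!(df, Int64[1, 2, 3])                  # an integer column
push!(df, Float64[1.5, 2.5, 3.5])          # a float column
push!(df, AbstractString["a", "b", "c"])   # a string column

df[2][3]   # column 2, row 3 => 3.5
```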
@RaviMohan sounds like progress! How does it do on the benchmark file from the OP? Perhaps you and @quinnj could discuss how to further combine efforts between https://github.com/quinnj/CSV.jl and https://github.com/RaviMohan/loaddataframe. @quinnj has also implemented his own CSV reader and he showed me some tests also showing good performance. (Also a quick note: the `Union(...)` syntax has been replaced by `Union{...}`.)
@jiahao I haven't yet tested on the test data set, primarily because I haven't implemented Date/DateTime parsing, and I don't yet have it on my local machine. But the code is very generic and you should see similar speedups (not to mention that a 100 million row csv actually loads completely, unlike in your original tests, memory being available - you'll need a lot of memory. On my 32 GB machine, 100 million rows with 24 columns - and very short strings - completely overwhelms the RAM, and it doesn't work with gc on at all.)

My focus with this was on getting the underlying datastructure correct, avoiding Array{Any, ...} etc. This is the primary learning which comes out of this effort: we can avoid Array{Any} and still load 100 million lines of data (but schema inference is harder, see below), and of course the Array-of-Columns design vs a 2D Array as the underlying datastructure. Also, I have a hardcoded comma-as-separator, which is probably not very conducive to the original data set. This is primarily the last in many iterations of code explicitly trying to see if we can come close to/beat R's reading speed, and is very hacky and not really production quality code imo. (I must say I am very impressed that fairly straightforward Julia code can come close to R's fread. That is some very impressive language design/compiler work.)

The next important things to do are to handle nulls, and to generate the schema by examining/sampling the dataset (vs hardcoding it as of now). I'm not quite sure (yet) how to do this, since it seems to involve the possibility of the ColumnType changing dynamically. A column can have the first 10 million values look like Ints and then have the rest be floating values, say. If we are not using Anys, then the column's type has to change when you encounter the first float. Of course this is not a problem for dynamic languages. To handle nulls I tried to use Union(Array{Nullable{Int64},1}, ...) etc., but the memory usage seems to blow up pretty badly on this and I can't figure out why. Something to pursue further.

I don't think @quinnj's CSV reader has much to learn from this on the parsing side, as I am just using string split with '\n' and ',' respectively :-P. I don't deal with any of the complexities of CSV parsing, which can get pretty hairy and full of gnarly details. That said, of course I'd be glad to help @quinnj in any way I can. String splitting also interacts weirdly with the gc when memory gets tight. I probably need to investigate and file a separate issue (though it just might be an instance of #6103).

Another thing I'm looking at in the coming weeks is to see if Spark's "Dataframe-as-API + multiple conforming components + a generic query compiler" pattern can be adapted to our needs. If so, users can use the API without really knowing or caring about implementation details, and multiple sources of data can be trivially added. We'll have simpler datastores than RDDs (of course!) but the basic architectural pattern might be valuable to decouple data sources and implementations. @ViralBShah was mentioning something called "JuliaDB"; as and when it takes concrete form, there might be commonalities/common design problems etc.

Thanks for the syntax pointer. Appreciated!
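A hedged sketch of the column-promotion problem described above (hypothetical helper, current Julia syntax): parse a column as Int64 until a non-integer field appears, then widen everything parsed so far to Float64 and continue.

```julia
# Parse a vector of string fields into the narrowest numeric column type,
# widening from Int64 to Float64 the first time an integer parse fails.
function parse_numeric_column(fields)
    col = Int64[]
    for (i, f) in enumerate(fields)
        v = tryparse(Int64, f)
        if v === nothing
            fcol = convert(Vector{Float64}, col)   # promote what we have so far
            for g in fields[i:end]
                push!(fcol, parse(Float64, g))
            end
            return fcol
        end
        push!(col, v)
    end
    return col
end

parse_numeric_column(["1", "2", "3"])         # Vector{Int64}
parse_numeric_column(["1", "2", "3.5", "4"])  # Vector{Float64}
```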
Take the number of pointers transgressing the generational frontier into account when deciding whether to do a young or an old gen collection. Helps #10428. Didn't run the bench suite yet.
With all the recent successes of CSV.jl, I think it's fair to close this issue.
We should remove the base readdlm if we're going to tell people to use something different.
As I recollect, we probably should leave `readdlm` as it is.
Julia should have been built to use a database transparently for the user and be able to do any operation streaming from it (including mixed-effects regression or Bayesian statistics).
@skanskan There's no question of whether "Julia should have been built" that way or not.
This line of comments does not belong on the issue tracker. Please ask these things on julia-users.
Here is a simple benchmark for reading in real-world data in TSV (tab-separated values) format. The data are financial time series from a proprietary source with 47 fields. The smaller samples are constructed by taking the first `n` rows of the original file, which has 100,654,985 rows and is 21.879 GiB in size. Timings reported in seconds. Timings under 60s are best of 5. Timings over 60s were run once.
Columns benchmarked for each sample size: `wc -l`, `grep -c`, data.table `fread`*, R `read.table`, `pandas.read_csv`, `readdlm`, `DataFrames.readtable`, `timeit pandas.read_csv`, `gc_disable`d `readdlm`, and `gc_disable`d `DataFrames.readtable`.
*timings obtained from files with NUL characters stripped.
Note that the pandas timings obtained with `timeit` have the Python garbage collector disabled, and so are fair comparisons with the `gc_disable`d numbers in Julia. The first pandas column is timed with simple `time.time()` wrappers and is analogous to Julia `@time` reports.

It's quite clear from the timings that garbage collection causes the scaling to become superlinear, with the result that `readdlm` and `readtable` become unusable for large data, even on a machine large enough to read all the data into memory. However, the baseline performance of pandas is still far superior to what we have to offer in Julia, even with garbage collection turned off. This is perhaps unsurprising, since the main pandas parser is written in Cython and not pure Python.

We could probably do much better if `readdlm` generated less garbage.

Actual commands run:
Julia `@time` info
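The command listings themselves are not preserved here; purely as a hypothetical sketch (made-up file name; `readtable` and `gc_disable` per the API of that era), the Julia side of such a benchmark looks roughly like:

```julia
using DataFrames

# Base readdlm (time and allocation reported by @time)
@time A = readdlm("data.tsv", '\t')

# DataFrames.readtable
@time df = readtable("data.tsv", separator = '\t')

# gc_disabled variant
gc_disable()
@time A = readdlm("data.tsv", '\t')
gc_enable()
```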