
readdlm() performance problems #10384

Closed
theran opened this issue Mar 3, 2015 · 11 comments

Labels
performance Must go faster

Comments

@theran

theran commented Mar 3, 2015

I have a 42.5 GB CSV file with 10 columns of doubles. When I try readdlm(filename, ';'), Julia eventually allocates over 150 GB and hasn't finished reading after 45 minutes or so. For comparison, on this machine, reading the same data from a binary representation takes around 3 minutes.
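For reference, the binary-read baseline might look like this minimal sketch (in current Julia syntax), assuming the file is a raw column-major dump of Float64 values; the function name and row count are illustrative, not from the issue:

```julia
# Hypothetical binary-read baseline: assumes the file is a raw dump of
# Float64 values stored column-major, with a known number of rows.
function read_binary(filename::AbstractString, nrows::Int, ncols::Int=10)
    data = Array{Float64}(undef, nrows, ncols)
    open(filename, "r") do io
        read!(io, data)  # fills `data` directly from the raw bytes
    end
    return data
end
```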

@nalimilan
Member

The comparison with the binary representation isn't really fair. Have you tried reading the CSV data using another program instead? That will give a better idea of the improvements that can be achieved in the Julia implementation. (FWIW, if you're looking for a good competitor, the R data.table package contains a fread() function which claims to be very fast.)

Other than that, have you tried specifying the types of the columns? The dims argument? Or setting use_mmap=true if you're on Windows? BTW, what version of Julia are you using?
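For concreteness, the suggested options might look like this sketch (current syntax; `nrows` is a placeholder for the actual row count, and in recent Julia readdlm lives in the DelimitedFiles stdlib rather than Base):

```julia
using DelimitedFiles  # in Julia 0.3, readdlm was part of Base

nrows = 1_000_000  # placeholder: substitute the file's real row count
# The element type and dims let readdlm allocate the typed result up front;
# use_mmap controls whether the file is memory-mapped while parsing.
data = readdlm("data.csv", ';', Float64; dims=(nrows, 10), use_mmap=true)
```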

@theran
Author

theran commented Mar 3, 2015

I converted to binary using scanf() from the C standard library in about 20 minutes, including output. You're right that comparing against no parsing at all isn't fair, but there's nothing else easy to compare to, given that this is a pretty basic use case. The only point was that it wasn't getting bogged down because of thrashing. If I get a chance, I can try to play around with it in the future.

This is Julia "Version 0.3.5 (2015-01-08 22:33 UTC)" on Linux (from uname -a: "Linux fn01 2.6.32-504.3.3.el6.x86_64 #1 SMP Tue Dec 16 14:29:22 CST 2014 x86_64 x86_64 x86_64 GNU/Linux").

@johnmyleswhite
Member

The real question is whether the parser attempts to load the whole thing into memory. With a file that large, you either need a machine with a huge amount of RAM or you need to do incremental parsing.
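An incremental approach could look something like this sketch (hypothetical function; the ';' delimiter and 10 columns come from the issue, the per-row callback is illustrative). It streams the file line by line and never holds more than one parsed row:

```julia
# Hypothetical streaming parser: reads one line at a time instead of
# materializing the whole file as a matrix.
function stream_rows(f, filename::AbstractString; ncols::Int=10, delim=';')
    row = Vector{Float64}(undef, ncols)
    open(filename) do io
        for line in eachline(io)
            fields = split(line, delim)
            for j in 1:ncols
                row[j] = parse(Float64, fields[j])
            end
            f(row)  # hand the parsed row to the caller
        end
    end
end

# Example: accumulate per-column sums without ever holding the full matrix.
sums = zeros(10)
stream_rows(row -> (sums .+= row), "data.csv")
```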

@JeffBezanson
Member

My guess is the new GC in 0.4 will help a lot with this.

@ViralBShah
Member

Cc: @tanmaykm

@ViralBShah
Member

@theran How much RAM does your computer have?

@JeffBezanson
Member

I tried a 900 MB file, and readdlm is about the same in 0.3 and 0.4. However, writedlm seems to be significantly slower in 0.4. Not good!
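A rough way to reproduce that comparison (a sketch in current syntax; the matrix dimensions and file name are illustrative, chosen to produce a file of roughly comparable scale):

```julia
# Hypothetical reproduction: write a large Float64 matrix with writedlm,
# then read it back with readdlm, timing each path.
using DelimitedFiles

A = rand(5_000_000, 10)                       # dimensions are illustrative
@time writedlm("bench.csv", A, ';')           # time the write path
@time B = readdlm("bench.csv", ';', Float64)  # time the read path
```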

@JeffBezanson
Member

Ah, I think much of the writedlm regression is due to #8972.

@tanmaykm
Member

tanmaykm commented Mar 4, 2015

I think specifying the dims argument will help greatly in this case, since readdlm can then populate the result in a single pass. It will also lower the memory requirement.

Also, since the column type is known, passing the type to readdlm may help it bail out early if the file contains corrupted data. Otherwise, readdlm falls back to parsing the file into an Any array, which is much slower.
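A sketch of putting both suggestions together when the row count isn't known in advance (countlines makes one cheap pass first; the file name is illustrative, and this assumes no header line):

```julia
using DelimitedFiles

# One cheap pass to count rows, then a single typed, pre-sized parse.
nrows = countlines("data.csv")
data = readdlm("data.csv", ';', Float64; dims=(nrows, 10))
```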

@JeffBezanson JeffBezanson added the performance Must go faster label Mar 7, 2015
@tkelman
Contributor

tkelman commented Mar 7, 2015

We don't really need both this and #10428 open. The latter has more numbers and is specific to master, so let's close this one.

@tkelman tkelman closed this as completed Mar 7, 2015
@theran
Author

theran commented Mar 7, 2015

@ViralBShah I think 1 TB, but it might be 2 TB. (This is a fat node on my department's cluster.)
