
readdlm() performance problems #10384

Closed
theran opened this issue Mar 3, 2015 · 11 comments

Labels
performance Must go faster

Comments

@theran

theran commented Mar 3, 2015

I have a 42.5 GB CSV file with 10 columns of doubles. When I try readdlm(filename, ';'), Julia eventually allocates over 150 GB and hasn't finished reading after 45 minutes or so. For comparison, on this machine, reading the same data from a binary representation takes around 3 minutes.
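For reference, the binary-read baseline might look like this minimal sketch (in current Julia syntax), assuming the file is a raw column-major dump of Float64 values; the function name and row count are illustrative, not from the issue:

```julia
# Hypothetical binary-read baseline: assumes the file is a raw dump of
# Float64 values stored column-major, with a known number of rows.
function read_binary(filename::AbstractString, nrows::Int, ncols::Int=10)
    data = Array{Float64}(undef, nrows, ncols)
    open(filename, "r") do io
        read!(io, data)  # fills `data` directly from the raw bytes
    end
    return data
end
```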

@nalimilan
Member

The comparison with the binary representation isn't really fair. Have you tried reading the CSV data using another program instead? That will give a better idea of the improvements that can be achieved in the Julia implementation. (FWIW, if you're looking for a good competitor, the R data.table package contains a fread() function which claims to be very fast.)

Other than that, have you tried specifying the types of the columns? The dims argument? Or setting use_mmap=true if you're on Windows? BTW, what version of Julia are you using?
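For concreteness, the suggested options might look like this sketch (current syntax; `nrows` is a placeholder for the actual row count, and in recent Julia readdlm lives in the DelimitedFiles stdlib rather than Base):

```julia
using DelimitedFiles  # in Julia 0.3, readdlm was part of Base

nrows = 1_000_000  # placeholder: substitute the file's real row count
# The element type and dims let readdlm allocate the typed result up front;
# use_mmap controls whether the file is memory-mapped while parsing.
data = readdlm("data.csv", ';', Float64; dims=(nrows, 10), use_mmap=true)
```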

@theran
Author

theran commented Mar 3, 2015

I converted to binary using scanf() from the C standard library in about 20 minutes, including output. You're right that comparing against no parsing at all isn't fair, but there's nothing else easy to compare to, given that this is a pretty basic use case. The only point was that it wasn't getting bogged down because of thrashing. If I get a chance, I can try to play around with it in the future.

This is Julia "Version 0.3.5 (2015-01-08 22:33 UTC)" on Linux (from uname -a: "Linux fn01 2.6.32-504.3.3.el6.x86_64 #1 SMP Tue Dec 16 14:29:22 CST 2014 x86_64 x86_64 x86_64 GNU/Linux").

@johnmyleswhite
Member

The real question is whether the parser attempts to load the whole thing into memory. With a file that large, you either need a machine with a huge amount of RAM or you need to do incremental parsing.
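An incremental approach could look something like this sketch (hypothetical function; the ';' delimiter and 10 columns come from the issue, the per-row callback is illustrative). It streams the file line by line and never holds more than one parsed row:

```julia
# Hypothetical streaming parser: reads one line at a time instead of
# materializing the whole file as a matrix.
function stream_rows(f, filename::AbstractString; ncols::Int=10, delim=';')
    row = Vector{Float64}(undef, ncols)
    open(filename) do io
        for line in eachline(io)
            fields = split(line, delim)
            for j in 1:ncols
                row[j] = parse(Float64, fields[j])
            end
            f(row)  # hand the parsed row to the caller
        end
    end
end

# Example: accumulate per-column sums without ever holding the full matrix.
sums = zeros(10)
stream_rows(row -> (sums .+= row), "data.csv")
```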

@JeffBezanson
Member

My guess is the new GC in 0.4 will help a lot with this.

@ViralBShah
Member

Cc: @tanmaykm

@ViralBShah
Member

@theran How much RAM does your computer have?

@JeffBezanson
Member

I tried a 900 MB file, and readdlm is about the same in 0.3 and 0.4. However, writedlm seems to be significantly slower in 0.4. Not good!
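A rough way to reproduce that comparison (a sketch in current syntax; the matrix dimensions and file name are illustrative, chosen to produce a file of roughly comparable scale):

```julia
# Hypothetical reproduction: write a large Float64 matrix with writedlm,
# then read it back with readdlm, timing each path.
using DelimitedFiles

A = rand(5_000_000, 10)                       # dimensions are illustrative
@time writedlm("bench.csv", A, ';')           # time the write path
@time B = readdlm("bench.csv", ';', Float64)  # time the read path
```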

@JeffBezanson
Member

Ah, I think much of the writedlm regression is due to #8972.

@tanmaykm
Member

tanmaykm commented Mar 4, 2015

I think specifying the dims argument will help greatly in this case, since readdlm can then populate the result in a single pass. It will also lower the memory requirement.

Also, since the column type is known, passing the type to readdlm may help it bail out early if the file contains corrupted data. Otherwise, readdlm falls back to parsing the file into an Any array, which is much slower.
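A sketch of putting both suggestions together when the row count isn't known in advance (countlines makes one cheap pass first; the file name is illustrative, and this assumes no header line):

```julia
using DelimitedFiles

# One cheap pass to count rows, then a single typed, pre-sized parse.
nrows = countlines("data.csv")
data = readdlm("data.csv", ';', Float64; dims=(nrows, 10))
```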

@JeffBezanson JeffBezanson added the performance Must go faster label Mar 7, 2015
@tkelman
Contributor

tkelman commented Mar 7, 2015

We don't really need both this and #10428 open. The latter has more numbers and is specific to master, so let's close this one.

@tkelman tkelman closed this as completed Mar 7, 2015
@theran
Author

theran commented Mar 7, 2015

@ViralBShah I think 1 TB, but it might be 2 TB. (This is a fat node on my department's cluster.)
