faster readcsv. fixes #3350 #3429

tanmaykm · 2013-06-17T21:30:48Z

Changes to readdlm for speed improvements. Instead of calling readline and split on each line, the new method slurps the whole file into memory, scans it linearly for delimiter offsets, and uses SubString wherever possible.
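
For readers following along, a minimal sketch of the approach described above (slurp, scan once for delimiters, keep SubString views into the buffer) might look like this in current Julia. It is not the code from this patch: `scan_fields` is a hypothetical name, it ignores quoting, and it makes no attempt to convert fields to numbers.

```julia
# Hypothetical helper, not the implementation in this PR.
# Scans the buffer once and returns each row as a vector of SubStrings,
# so no field text is copied out of the original buffer.
function scan_fields(buf::String; delim::Char=',', eol::Char='\n')
    rows = Vector{SubString{String}}[]
    row = SubString{String}[]
    start = firstindex(buf)
    for (i, c) in pairs(buf)
        if c == delim || c == eol
            # Field spans start .. the character before the delimiter
            # (an empty range yields an empty SubString).
            push!(row, SubString(buf, start, prevind(buf, i)))
            start = nextind(buf, i)
            if c == eol
                push!(rows, row)
                row = SubString{String}[]
            end
        end
    end
    # Flush the last row when the buffer does not end with a newline.
    if start <= lastindex(buf) || !isempty(row)
        push!(row, SubString(buf, start, lastindex(buf)))
        push!(rows, row)
    end
    return rows
end

# Stand-in data; for a real file one would slurp it with read("vList.csv", String).
buf = "1,166,Govindarajanagar\n2,166,Kemparaju\n"
rows = scan_fields(buf)
```

Because every field is a view into `buf`, the only large allocation is the buffer itself plus the per-row vectors of offsets.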

With this patch:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2039.r3771bf24.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit 3771bf244d 2013-06-17 22:15:40*
|__/                   |  x86_64-apple-darwin12.4.0

julia> @time s = readcsv("vList.csv")
elapsed time: 16.183337526 seconds
1608716x9 Any Array:
   1.0  166.0  "Govindarajanagar  "  "RDU3444981"  "Mahesh B"                                            "Basavaraju"                 " ..."                     22.0   "M"
   2.0  166.0  "Govindarajanagar "    "RDU3524188"  "Kemparaju"                                           "Chikkanachari"              " ."                       34.0   "M"

Prior to this patch:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2030.rbf13c7aa
 _/ |\__'_|_|_|\__'_|  |  Commit bf13c7aaf5 2013-06-16 04:52:40
|__/                   |  x86_64-apple-darwin12.4.0

julia> @time s = readcsv("vList.csv")
elapsed time: 151.918892449 seconds
1608716x9 Any Array:
   1.0  166.0  "Govindarajanagar  "  "RDU3444981"  "Mahesh B"                                            "Basavaraju"                 " ..."                     22.0   "M"
   2.0  166.0  "Govindarajanagar "    "RDU3524188"  "Kemparaju"                                           "Chikkanachari"              " ."                       34.0   "M"

JeffBezanson · 2013-06-17T22:00:02Z

That is so cool!

StefanKarpinski · 2013-06-17T23:47:55Z

This is cool. My immediate gut reaction was "you can't just slurp the whole file!" but then I realized that in this case you're going to read the whole thing anyway, so you might as well read it all at once. I wonder if we'd get even better performance by mmapping the file.

JeffBezanson · 2013-06-18T04:42:06Z

Amazing it is only 7 lines longer than the old version.

ViralBShah · 2013-06-18T04:55:57Z

Memory usage when slurping the whole file is also roughly the same. This is a huge improvement for working with data.

ViralBShah · 2013-06-18T04:59:28Z

I doubt mmapping will help since the read part is a small fraction of the total time. Should be easy to try it out.

StefanKarpinski · 2013-06-18T05:03:41Z

Yeah, that's a fair point, but it might shave off a little time and make the code easier since mmap just gives you a byte array from a file with basically no work.

tanmaykm · 2013-06-18T07:10:35Z

mmap may make a noticeable difference when processing huge files, as there would be no duplication of buffers.

Continuing the discussion from #3350, would any of the following also be useful in readdlm?

1. handling header row
2. per column data type
3. handling quoted columns

While 1 and 2 can be built over what readdlm returns, 3 needs to be done while parsing the buffer.
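
As a rough illustration of why items 1 and 2 (header row and per-column types) can sit on top of the returned array, here is a sketch in current Julia. The `cells` matrix is a stand-in for the Any array that readcsv returns, and the header names are made up for the example.

```julia
# Stand-in for the Any matrix returned by readcsv; the first row plays
# the role of a header (the column names here are hypothetical).
cells = Any["id" "name" "age";
            1.0  "Mahesh B"  22.0;
            2.0  "Kemparaju" 34.0]

# 1. Header handling: peel off the first row.
header = cells[1, :]
body = cells[2:end, :]

# 2. Per-column types: broadcasting identity narrows each Any column
#    to the tightest element type its values share.
columns = [identity.(body[:, j]) for j in 1:size(body, 2)]
```

Quoted columns are different because a delimiter inside quotes has to be skipped while the buffer is being scanned, before any fields exist.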

ViralBShah · 2013-06-18T07:14:55Z

I think it would be useful to be able to filter out headers and comments, and have per column types. I am not sure if quoted columns should belong in Base or if they should be in DataFrames.

Also, I feel that this fast readdlm should be refactored so that DataFrames and other packages can use its intermediate work (the array of substrings) to process csv files and produce different data structures, handle values that need to be represented by NA, etc.

timholy · 2013-06-18T11:10:13Z

Quoting StefanKarpinski: "it might shave off a little time and make the code easier since mmap just gives you a byte array from a file with basically no work"

And perhaps allow you to more efficiently import csv files that are larger than memory. Since ASCII is usually several times larger than the binary version, you could get yourself into a situation where you have enough RAM to handle the data, but not enough to slurp the whole ASCII file.

tanmaykm · 2013-06-18T17:17:26Z

#3442 will introduce a variant of readdlm that can accept memory mapped byte arrays.

Can be used as:

f = open("vList.csv", "r")
# Map the file as a byte array; the dims tuple gives the number of bytes to map.
csv = readcsv(mmap_array(Uint8, (109504806,), f))

faster readcsv. fixes JuliaLang#3350

261e350

JeffBezanson added a commit that referenced this pull request Jun 18, 2013

Merge pull request #3429 from tanmaykm/readcsv

b44da64

faster readcsv. fixes #3350

JeffBezanson merged commit b44da64 into JuliaLang:master Jun 18, 2013

quinnj mentioned this pull request Jun 18, 2013

readdlm should be able to ignore header/comments #536

Closed

tanmaykm mentioned this pull request Jun 18, 2013

readdlm variant to accept memory mapped byte arrays. #3442

Merged
