faster readcsv. fixes #3350 #3429

Merged 1 commit on Jun 18, 2013


tanmaykm

Changes to readdlm for speed improvements. Instead of calling readline and split on each line, the new method slurps the whole file into memory, parses it linearly for delimiter offsets, and uses SubString wherever possible.
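To illustrate the approach described above, here is a minimal sketch written for current Julia (1.x), not the code merged in this PR. The helper name `split_fields` is hypothetical, and the sketch assumes a single-character delimiter, `\n` line endings, and no quoted fields.

```julia
# Sketch only: `split_fields` is a hypothetical helper, not the merged readdlm code.
# Assumes '\n' line endings and no quoted fields.
function split_fields(text::String, delim::Char=',')
    rows = Vector{Vector{SubString{String}}}()
    row = SubString{String}[]
    start = firstindex(text)          # index where the current field begins
    for (i, c) in pairs(text)         # one linear pass over the whole buffer
        if c == delim || c == '\n'
            # SubString points into `text` instead of copying the field
            push!(row, SubString(text, start, prevind(text, i)))
            start = nextind(text, i)
            if c == '\n'
                push!(rows, row)
                row = SubString{String}[]
            end
        end
    end
    # Final line without a trailing newline (possibly ending in an empty field)
    if start <= lastindex(text) || !isempty(row)
        push!(row, SubString(text, start, lastindex(text)))
        push!(rows, row)
    end
    return rows
end
```

It could be called as `split_fields(read("vList.csv", String))`, which reads the file once and then only records offsets into that single buffer.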

With this patch:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2039.r3771bf24.dirty
 _/ |\__'_|_|_|\__'_|  |  Commit 3771bf244d 2013-06-17 22:15:40*
|__/                   |  x86_64-apple-darwin12.4.0

julia> @time s = readcsv("vList.csv")
elapsed time: 16.183337526 seconds
1608716x9 Any Array:
   1.0  166.0  "Govindarajanagar  "  "RDU3444981"  "Mahesh B"                                            "Basavaraju"                 " ..."                     22.0   "M"
   2.0  166.0  "Govindarajanagar "    "RDU3524188"  "Kemparaju"                                           "Chikkanachari"              " ."                       34.0   "M"

Prior to this patch:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.2.0-2030.rbf13c7aa
 _/ |\__'_|_|_|\__'_|  |  Commit bf13c7aaf5 2013-06-16 04:52:40
|__/                   |  x86_64-apple-darwin12.4.0

julia> @time s = readcsv("vList.csv")
elapsed time: 151.918892449 seconds
1608716x9 Any Array:
   1.0  166.0  "Govindarajanagar  "  "RDU3444981"  "Mahesh B"                                            "Basavaraju"                 " ..."                     22.0   "M"
   2.0  166.0  "Govindarajanagar "    "RDU3524188"  "Kemparaju"                                           "Chikkanachari"              " ."                       34.0   "M"

@JeffBezanson

That is so cool!

@StefanKarpinski

This is cool. My immediate gut reaction was "you can't just slurp the whole file!" but then I realized that in this case you're going to read the whole thing anyway, so you might as well read it all at once. I wonder if we get even better performance by mmapping the file.

JeffBezanson added a commit that referenced this pull request Jun 18, 2013
@JeffBezanson JeffBezanson merged commit b44da64 into JuliaLang:master Jun 18, 2013
@JeffBezanson

Amazing it is only 7 lines longer than the old version.

@ViralBShah

The memory usage slurping the whole file is also roughly the same. This is a huge improvement for working with data.

@ViralBShah

I doubt mmapping will help since the read part is a small fraction of the total time. Should be easy to try it out.

@StefanKarpinski

Yeah, that's a fair point, but it might shave off a little time and make the code easier since mmap just gives you a byte array from a file with basically no work.

@tanmaykm
Member Author

Memory mapping may make a noticeable difference when processing huge files, since there would be no duplication of buffers.

Continuing the discussion from #3350, would any of the following also be useful in readdlm?

  1. handling header row
  2. per column data type
  3. handling quoted columns

While 1 and 2 can be built over what readdlm returns, 3 needs to be done while parsing the buffer.
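As a rough sketch of how items 1 and 2 might be layered on top of a readdlm-style result, the following assumes current Julia (1.x) and a `Matrix{Any}` whose first row is the header. The helper name `split_header_and_type` is hypothetical and is not part of Base.

```julia
# Sketch only: `split_header_and_type` is a hypothetical helper, not a Base function.
# Assumes `data` is an Any matrix (as readdlm returns for mixed input)
# and that its first row holds the column names.
function split_header_and_type(data::Matrix{Any})
    header = string.(data[1, :])      # item 1: peel off the header row
    body = data[2:end, :]
    # item 2: identity.(col) rebuilds each column with the narrowest
    # element type that fits its values (e.g. Float64 or String)
    columns = [identity.(body[:, j]) for j in 1:size(body, 2)]
    return header, columns
end
```

Quoted columns (item 3) are deliberately left out here, since they change where field boundaries fall and so have to be handled during parsing.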

@ViralBShah

I think it would be useful to be able to filter out headers and comments, and have per column types. I am not sure if quoted columns should belong in Base or if they should be in DataFrames.

Also, I feel that this fast readdlm should be refactored so that DataFrames and other packages can use its intermediate work (the array of substrings) to process csv files and produce different data structures, handle values that need to be represented by NA, etc.

@timholy

timholy commented Jun 18, 2013

> it might shave off a little time and make the code easier since mmap just gives you a byte array from a file with basically no work

And perhaps allow you to more efficiently import csv files that are larger than memory. Since ASCII is usually several times larger than the binary version, you could get yourself into a situation where you have enough RAM to handle the data, but not enough to slurp the whole ASCII file.

@tanmaykm
Member Author

#3442 will introduce a variant of readdlm that can accept memory mapped byte arrays.

Can be used as:

f = open("vList.csv", "r")
csv = readcsv(mmap_array(Uint8, (109504806,), f))
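The snippet above uses the 2013 API (`mmap_array`, `Uint8`) and a hard-coded byte count. For readers on current Julia (1.x), a comparable sketch uses the `Mmap` standard library and takes the length from `filesize`. The helper name `count_rows_mmap` is hypothetical, and it only counts newline-terminated rows to show that the mapped bytes can be scanned without first reading the file into a separate buffer.

```julia
using Mmap

# Sketch only: `count_rows_mmap` is a hypothetical helper, not part of readdlm.
function count_rows_mmap(path::AbstractString)
    open(path, "r") do io
        # Map the file as a byte vector; the length comes from the file itself
        bytes = Mmap.mmap(io, Vector{UInt8}, (filesize(io),))
        # Scan the mapped bytes in place for newline terminators
        return count(==(UInt8('\n')), bytes)
    end
end
```

For example, `count_rows_mmap("vList.csv")` would return the number of newline characters in the file.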
