faster readcsv. fixes #3350 #3429
That is so cool!
This is cool. My immediate gut reaction was "you can't just slurp the whole file!" but then I realized that in this case you're going to read the whole thing anyway, so you might as well read it all at once. I wonder if we get even better performance by mmapping the file.
faster readcsv. fixes #3350
Amazing that it is only 7 lines longer than the old version.
The memory usage when slurping the whole file is also roughly the same. This is a huge improvement for working with data.
I doubt mmapping will help since the read part is a small fraction of the total time. Should be easy to try it out.
Yeah, that's a fair point, but it might shave off a little time and make the code easier, since mmap just gives you a byte array from a file with basically no work.
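For reference, a minimal sketch of what "a byte array from a file" looks like with mmap in current Julia, assuming the `Mmap` standard library. The helper name `count_lines_mmap` is hypothetical and is not part of this patch; it only shows that the mapped file can be scanned like any other `Vector{UInt8}`.

```julia
using Mmap

# Hypothetical helper (not from this patch): count newline bytes in a file
# by scanning a memory-mapped view of it instead of reading it into a buffer.
function count_lines_mmap(path::AbstractString)
    bytes = Mmap.mmap(path)                 # Vector{UInt8} backed by the file
    return count(b -> b == UInt8('\n'), bytes)
end
```

Whether this is faster than a plain read depends on how much of the total time is spent in I/O, which is the open question in the comments above.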
mmaps may make noticeable difference while processing huge files as there would be no duplication of buffers. Continuing the discussion from #3350, would any of the following also be useful in
While 1 and 2 can be built over what |
I think it would be useful to be able to filter out headers and comments, and have per column types. I am not sure if quoted columns should belong in Base or if they should be in DataFrames. Also, I feel that this fast |
And perhaps allow you to more efficiently import csv files that are larger than memory. Since ASCII is usually several times larger than the binary version, you could get yourself into a situation where you have enough RAM to handle the data, but not enough to slurp the whole ASCII file.
#3442 will introduce a variant of Can be used as:
Changes to `readdlm` for speed improvements. Instead of using `readline` and `split`, the new method slurps the whole file into memory, parses it linearly for delimiter offsets, and uses `SubString` wherever possible.
With this patch:
Prior to this patch:
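To illustrate the slurp-and-scan approach from the description, here is a minimal sketch in current Julia. It is not the code from this patch: the function name `slurp_fields` and the `delim` keyword are hypothetical, and it assumes `\n` line endings and no quoted fields.

```julia
# Hypothetical helper (not the patch itself): read the whole file once,
# scan it linearly for delimiters, and return fields as SubStrings so that
# no per-field copies of the text are made.
function slurp_fields(path::AbstractString; delim::Char=',')
    text = read(path, String)               # slurp the entire file in one call
    rows = Vector{Vector{SubString{String}}}()
    row = SubString{String}[]
    start = firstindex(text)                # index where the current field begins
    for i in eachindex(text)
        c = text[i]
        if c == delim || c == '\n'
            # Field spans start..(index before the delimiter); may be empty.
            push!(row, SubString(text, start, prevind(text, i)))
            start = nextind(text, i)
            if c == '\n'
                push!(rows, row)
                row = SubString{String}[]
            end
        end
    end
    # Last field when the file does not end with a newline.
    if start <= lastindex(text) || !isempty(row)
        push!(row, SubString(text, start, lastindex(text)))
    end
    isempty(row) || push!(rows, row)
    return rows
end
```

Each field can then be converted on demand, for example with `parse(Float64, field)`, which accepts a `SubString` directly.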