DataFrame read_table performance? #273
Let me experiment with this.
Thanks John. The file is a mishmash of GUIDs, sparsely populated columns with missing data (flags, basically), User-Agent strings, 20-30 character strings, timestamps, ints... all data types, really. 80k rows by 100 cols.
Huh. Does it work if you truncate the file to the first 1,000 rows and try loading that?
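A quick way to produce such a truncated file (a minimal sketch in Julia; the function name and filenames are my own, not from the thread):

```julia
# Copy the header line plus the first `n` data rows of `src` into `dest`,
# so a smaller version of the file can be tried with readtable.
function truncate_csv(src::AbstractString, dest::AbstractString, n::Integer)
    open(src) do input
        open(dest, "w") do output
            for (i, line) in enumerate(eachline(input))
                i > n + 1 && break   # 1 header line + n data rows
                println(output, line)
            end
        end
    end
end
```

For example, `truncate_csv("all.csv", "first1k.csv", 1000)` would write a 1,001-line file; `head -n 1001 all.csv > first1k.csv` does the same from a shell.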
I don't know what the source of this is (I suspect GC), but I have a simple test case that demonstrates the problem:
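The test case itself wasn't captured in this archive. As a stand-in, here is a hedged sketch (the function name and exact column mix are guesses, based only on the file description earlier in the thread) that generates a comparable mixed-type CSV one could time readtable against:

```julia
# Write a CSV whose columns cycle through the kinds of data described above:
# sparsely populated flags (mostly empty), GUID-like hex IDs, integers,
# and short strings.
function write_mixed_csv(path::AbstractString, nrows::Integer, ncols::Integer)
    open(path, "w") do io
        println(io, join(["col$j" for j in 1:ncols], ","))
        for _ in 1:nrows
            fields = map(1:ncols) do j
                r = j % 4
                r == 0 ? string(rand(UInt32), base = 16) :   # GUID-like hex ID
                r == 1 ? (rand() < 0.9 ? "" : "1") :         # sparse flag column
                r == 2 ? string(rand(1:1_000_000)) :         # integer column
                         "value-$(rand(1:100))"              # short string
            end
            println(io, join(fields, ","))
        end
    end
end
```

Calling `write_mixed_csv("bench.csv", 80_000, 100)` would approximate the 80k-row by 100-column shape reported above.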
I've been putting off speedups for ...
For context on why this is so bizarre: reading 5,000 rows and 100 columns takes 5 seconds. That's too long in its own right, but it's also far more than 16x faster than the 80,000-row case, even though it reads only 16x fewer rows.
It's a whole other level of bizarre, John. The master file (the one I reported the error for) is a .csv created from an Excel spreadsheet, and R reads it just fine. Because I'm garbage at Unix, I used write.csv from R to create the new files: 1k, 5k... 50k... all. The "all.csv" file written from R gets read into Julia using about 1GB of memory and a fair bit of time (slow, but it completes in a few minutes). Superficially comparing the files, the "Excel" version of the .csv is 49.6MB and the R version is 57MB. So R does something to clean up the file (fills in NA's?), and after that it works in Julia.
Strange. The CSV standard is absurdly permissive, so I'm not surprised that Excel does something wonky that we don't parse correctly. But our performance is too slow to be acceptable any longer. I'll make a major push this week. |
Also see readcsv performance in JuliaLang/julia#3350 |
@randyzwitch, any chance you could share your files for testing? I'm making some progress on this now and have managed to produce some code that does as well as R on some basic cases. (Unfortunately I'm turning off GC to do it.) |
@johnmyleswhite I can provide some anonymized data, no problem. Do you have an FTP? If not, I can provide S3 bucket, just let me know a way to contact you so I can send the details. |
I don't think I have any convenient FTP server you can upload to. Something on S3 with info sent to [email protected] would be great. |
I have a large file in my home dir on julia.mit.edu that used to make readtable barf. |
@ViralBShah: Your file fails on the following line
That line is a real mess. It's got 2 spaces at the start of a field, quotation marks mid-field, and then another space inside a field. I'm tempted to say it's just an invalid CSV that needs cleaning before use. Note that R 2.15.2 segfaults on your file:
The old ...
I'll prepare a cleaned up version of this file. |
Closed by 664f792 |
Hey guys -
Just getting around to trying Julia (v0.2-pre). I tried to read a 50MB CSV file on a 2013 quad-core MBP with 16GB of RAM using the following code:
Starting Julia and loading DataFrames uses < 100MB of RAM. Once I hit the read_table command, RAM usage shoots up to 3.4GB and 100% CPU (single core), and after 25 minutes, the csv file still hasn't loaded.
Is there something I'm missing in terms of the read_table function, a known bug, something else?