File IO slow #8826
Try reading the file in one go, and then iterating over each line in memory.
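A minimal sketch of what that could look like with the 0.3-era API used elsewhere in this thread (count_words_in_memory is just an illustrative name, not anything from the repo):

```julia
# Sketch: slurp the whole file into one string, then do all the splitting and
# counting purely in memory instead of streaming with eachline.
function count_words_in_memory(fn)
    text   = open(readall, fn)             # read the entire file in one go
    words  = split(text)                   # whitespace-split, all in memory
    counts = Dict{eltype(words),Int}()
    for word in words
        counts[word] = get(counts, word, 0) + 1
    end
    counts
end
```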
This should work and be fast – we've seen similar issues with our I/O performance, especially line-oriented reading. This should be fixed – I'm currently working on it – well, on string performance at the moment, but general I/O comes after that.
@alexalemi, if the data is ok to make public and you wouldn't mind posting it on S3 or Dropbox or something like that, this could make a good benchmark to measure progress of I/O and string improvements. If you don't want to make the data public but wouldn't mind sending it to me, that would be ok too.
We could also just grab every Shakespeare play ever or something, and make that the dataset if we can't use his. ;)
Or do word count on all of Wikipedia.
I was expecting more than just 5 MB. Amazing.
All of Gutenberg combined is pretty small. :)
I was actually doing my test runs on a publicly available dataset, the text8 (text8.zip) set from Matt Mahoney, which is itself a processed dump of part of Wikipedia (the first 10^8 bytes; more about the data). Eventually I would like to work on a much larger text corpus, where reading the whole thing into memory would be painful, if not impossible. Anyway, doing a
Reading all the data into memory at once is not an official recommendation. Obviously, streaming over a text file should work just fine, and the fact that it's not as fast as it should be is a performance issue.
Yes, sorry, I should have expounded and said that I was merely trying to nail down for certain where the performance degradation was coming from; if the slowdown persists when the file is loaded into memory, it might have more to do with how we're iterating over the lines, e.g. string manipulation rather than disk I/O. In any case, what you have written definitely should work.
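One rough way to nail that down is to time the two phases separately, raw read versus in-memory line pass (sketch only; tic/toq are the 0.3-era timing helpers and time_phases is a made-up name):

```julia
# Sketch: separate the cost of pulling bytes off disk from the cost of
# iterating over lines in memory. If the second phase dominates, the problem
# is string handling, not disk I/O.
function time_phases(fn)
    tic()
    text = open(readall, fn)               # phase 1: raw file read
    read_time = toq()

    tic()
    nlines = 0
    for line in split(text, '\n')          # phase 2: line iteration, no disk involved
        nlines += 1
    end
    iter_time = toq()

    println("read: $(read_time)s   line pass: $(iter_time)s   ($nlines lines)")
end
```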
For what it's worth, I've put the code I used for testing up in a git repository, in the rough style of the micro benchmarks in /test/perf/micro, if it helps.
I get a linecount=1 when running the code below with fn="text8". It looks like this file contains a giant list of terms with no line breaks to iterate over. Having said that, I certainly agree that text IO should be faster. Looks like we could make EachLine immutable and get a speed bump from that.
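Purely to illustrate that idea, here is a hedged sketch of what an immutable line iterator could look like under the 0.3-era start/done/next protocol (LineIterator and each_line are names invented here; this is not Base's actual EachLine definition):

```julia
# Sketch: an immutable line iterator. It never mutates its own fields, so it
# can be a plain immutable wrapper around the stream.
immutable LineIterator{T<:IO}
    stream::T
end

Base.start(::LineIterator) = nothing
Base.done(itr::LineIterator, state) = eof(itr.stream)
Base.next(itr::LineIterator, state) = (readline(itr.stream), nothing)

each_line(io::IO) = LineIterator(io)

# Usage has the same shape as eachline(f):
# open("text8") do f
#     for line in each_line(f)
#         # ...
#     end
# end
```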
I ran the code on the first 20th of your file. It looks like getting and setting on
A couple notes:
@catawbasam True, I added
I did similar benchmarks recently (in Python, Julia, C++, C, Haskell, and Rust; @alexalemi, maybe I can pull-request them into your repo?), and also wondered why Julia was slower. It seems that if you remove the update of the counter in the main loop, so that the code just iterates over lines from a file, Julia is still 3 times slower than my Python version.
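For anyone who wants to reproduce that stripped-down case, something along these lines (sketch; lines_only is just an illustrative name):

```julia
# Sketch: iterate over lines with eachline and do nothing but count them,
# isolating the cost of line-oriented reading itself.
function lines_only(fn)
    n = 0
    open(fn, "r") do f
        for line in eachline(f)
            n += 1
        end
    end
    n
end

@time lines_only("text8")      # compare against the equivalent Python loop
```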
@remusao Be my guest.
Rewriting

```julia
function mysplit(s::ASCIIString)
    a = ASCIIString[]
    i = 1
    was_space = true
    for j = 1:length(s)
        is_space = isspace(s[j])
        if is_space
            was_space || push!(a, s[i:j-1])
        else
            was_space && (i = j)
        end
        was_space = is_space
    end
    was_space || push!(a, s[i:end])
    a
end

function main(fn)
    counts = Dict{ASCIIString,Int}()
    open(fn, "r") do f
        for line in eachline(f)
            for word in mysplit(line)
                counts[word] = get(counts, word, 0) + 1
            end
        end
    end
    cut = 10
    guys = collect(counts)
    sort!(guys, by=(x)->x[2], rev=true)
    for x in guys
        if x[2] > cut
            println(x[1], '\t', x[2])
        end
    end
end

isinteractive() || main(ARGS[1])
```
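For a quick sense of the difference, a rough micro-benchmark sketch (illustrative only) comparing the hand-rolled mysplit above with Base's split on a single line:

```julia
# Sketch: Base split vs. the hand-rolled mysplit above on the same input.
line = "the quick brown fox jumps over the lazy dog"
@time for i = 1:10^5; split(line);   end   # Base split
@time for i = 1:10^5; mysplit(line); end   # mysplit from the snippet above
```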
Cuts the #8826 vocab.jl benchmark in half.
I took a closer look at why rewriting
That is pretty awesome analysis and would make for a nice blog post, I think.
+1. I felt the need for a similar function when writing code to compute frequency tables. An additional version
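For reference, a minimal sketch of that kind of generic frequency-table helper (freqtable is an illustrative name; StatsBase.jl's countmap covers similar ground):

```julia
# Sketch: count occurrences of each distinct element in any iterable.
function freqtable(itr)
    counts = Dict{eltype(itr),Int}()
    for x in itr
        counts[x] = get(counts, x, 0) + 1
    end
    counts
end

freqtable(split("to be or not to be"))   # => "to"=>2, "be"=>2, "or"=>1, "not"=>1
```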
@kmsquire Great!
Hey all, I am currently a junior at the University of Notre Dame, and as a final project we are looking to work on some of the performance bugs that have been reported in Julia. I am on a team of four; we have all done a lot of work with C, C++, Python, etc., and we are currently taking a fairly advanced class in data structures. I wanted to post here and see if we could be of help / if you had any ideas. We were thinking about trying to solve this issue, but it seems like you have a lot of good ideas going around and are making progress. Let us know what we can do to help!
@rtick The way to contribute is to jump in and create a pull request. I am sure you have seen the other issues tagged with performance. Another good project is the ARM port, or the distributed array infrastructure in Julia, which needs a lot more work. Feel free to write to me at
- don't call utf8sizeof() twice when writing a Char
- faster sizeof(::SubString{ASCIIString})

About 6% faster.
Cuts the JuliaLang#8826 vocab.jl benchmark in half.
About 1.5x faster. Helps with JuliaLang#8826.
I just tested this on my machine (quad-core S7, 32 GB RAM, Ubuntu 14.04) and got the following timings:

---Testing C
1.56user 0.02system 0:01.59elapsed 99%CPU (0avgtext+0avgdata 33512maxresident)k
---Testing Python 2
---Testing Julia

Julia's speed is about 1.6 times Python's, which should probably be better, but the almost 8x slowdown seems to be fixed.
We really should be as fast as C! I agree that this is good enough to close this issue. Perhaps the thing to do here is to figure out where we are losing and file separate issues. |
I'm trying to count how often each word occurs in a file, and I'm getting roughly 10x slower speeds for Julia compared to C, Python, or Go.
On a 96 MB text file, it takes Julia 54.35 seconds on my machine, versus 2.69 s for equivalent C code, 7.35 s for Python, and 5.98 s for Go.
Profiling the code, it seems most of the time is spent reading the file. I've seen other mentions of slow IO with regard to reading CSV files (e.g. #3350 or this group discussion) where some specific fixes were worked out, but those discussions are a year old and I was wondering whether there are hopes of speeding up I/O more generally.
I've thrown together a git repository of the test code.
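For the profiling mentioned above, a quick sketch with Julia's built-in sampling profiler (file name and entry point are taken from the snippets in this thread; adjust to taste):

```julia
# Sketch: run the word-count benchmark under the sampling profiler to see how
# much time goes to I/O versus string handling.
# (On newer Julia versions, `using Profile` is needed first.)
include("vocab.jl")          # assumes it defines main(fn) as in the snippet above
@profile main("text8")       # collect samples while the benchmark runs
Profile.print()              # print the call tree with sample counts
```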