Reading very large text files #19141
I think an issue is not the best place to ask questions. You can ask at the julia-users group or Stack Overflow. BTW, if you'd like to load sequences from a FASTA file, Bio.jl has a FASTA parser. |
I get ~0.8 seconds in python and ~5 seconds in julia, so it seems like there is a performance issue to fix here. I'm not sure why you're seeing 30x slower; what version of julia are you using? |
Thanks for the reopen. This is julia 0.5.0 via Ubuntu PPA (staticfloat). Did you get 5 secs with exactly the julia code I posted? I get 16-24 secs but generally about 20 secs each time I try. Consistently about 600-700ms with the python code I posted. |
Answering @bicycle1885 above. Using `Bio.Seq`, it takes 19 seconds to run, similar to my code. |
Your Julia seems to be much slower than mine (note that the first call includes JIT compiling time):
Also note that what your Python code does is just reading and concatenating strings, while Bio.jl checks validity and encodes the data. As a reference, I ran your Python code on my computer (Python 3.4.1):
and your Julia code (Julia 0.5.0):
|
I get
Note the vastly bigger number of allocations for the Bio.jl code compared to what you see. But for my readfasta function there is no major difference in allocations. This is with a fresh update of julia to the latest in the PPA (still 0.5.0), and after a Pkg.update(). Should I be trying a Julia install from github directly? |
I installed julia 0.5.1-pre+4 (2016-10-28 19:39 UTC) from github. Pretty much the same timings as above. Baffled why @JeffBezanson sees faster-but-still-slow performance while @bicycle1885 sees performance comparable to Python. Can it be something OS-related? This is Ubuntu Xenial, Intel i7-4770 CPU @ 3.40GHz, 32GB RAM. ZFS filesystem, but I tried reading from an ext4 filesystem and the results are the same.

On my laptop (i5-5200U CPU @ 2.20 GHz, 4GB RAM, Ubuntu Xenial, ZFS) both of these are much faster: the readfasta function takes 6-16 seconds (very variable, and later calls aren't always faster) and the FASTAReader call takes about 5-6 seconds. I don't understand the difference, since it's a slower machine with less RAM running the same OS. Even so, Python 2.7.12 takes 700ms on the function I posted.

If I translate my Julia function to Python (read a list of lines, truncate them, join them) it takes about 10 seconds, not too different from Julia. But if I translate my Python function to Julia it takes a very, very long time, apparently because concatenating strings is slow in Julia. |
This is very mysterious: after a reboot (previous uptime had been about 2 months) the exact same readfasta function as above takes 1.8 seconds, 10x faster than before. The Bio.jl function takes 1.8 seconds too, again 10x faster. The python function takes 0.6 seconds, almost unchanged from earlier. So I am wondering if it is an OS issue, but if so, why was it affecting julia and not python (or, as far as I can tell, any other software that read/wrote large files, such as bedtools)? |
With gc_enable(false) the readfasta time goes down from avg 1.6-1.8 seconds to 1.1-1.2 seconds. If performance slows in the future to the extent I saw previously, I will give it a try and report back. |
However, right now with gc_enable(true) I am getting 25% gc time so the speedup I see without gc is reasonable. In the extremely slow numbers I posted earlier, it was 4% gc time. |
Well, in that case it's because that code in julia expresses an O(n^2) algorithm. |
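To make the O(n^2) point concrete, here is an editor's sketch (names `concat_naive` and `concat_buffered` are illustrative, not from the thread): each `*=` copies the whole accumulated string, so n appends cost quadratic work in total, while an `IOBuffer` appends in amortized constant time and is converted to a `String` once at the end.

```julia
# Quadratic: each *= allocates a new string and copies everything
# accumulated so far, so n appends copy O(n^2) bytes in total.
function concat_naive(lines)
    seq = ""
    for line in lines
        seq *= line
    end
    return seq
end

# Linear: an IOBuffer grows its internal byte vector geometrically,
# so appends are amortized O(1); one String is built at the end.
function concat_buffered(lines)
    io = IOBuffer()
    for line in lines
        print(io, line)
    end
    return String(take!(io))
end
```

Both return the same string; only the asymptotic cost differs, which matches the quadratic slowdown described above.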
If you are using laptops, be sure to be consistent about the processor's power mode. In my case it took me months to realise that Ubuntu wasn't switching back to full-power mode when the power was plugged back in after a hibernate/resume sequence (and that was a 10-20x performance difference). |
Hi. Is Julia using C strings internally to concatenate the data? This looks a lot like using C strings to concatenate a large amount of text. Since those strings are zero-terminated, they behave like a singly linked list: concatenation with strcat is O(n). Doing that in a loop (concatenating pieces one by one) makes it effectively O(n^2), which would explain the gradual degradation in performance. The total work is 1+2+3+...+n = n(n+1)/2 iterations. I met this issue in C and Delphi in the past. One workaround was to use a temporary local buffer (on the C stack) to accumulate, say, up to 100 characters and then concatenate that back to the main string. A much better solution is to keep two pointers (start and end of string) and use pointer arithmetic, which makes insertion at the end O(1). |
Julia does not use C strings. But the algorithm is still O(n^2) with Julia's string implementation, which is what @JeffBezanson wrote above. The solution is fairly simple – use the `sprint` function:

```julia
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line[1:end-1])
        end
    end
end
```

When this issue was opened, the default behavior of `eachline` was to keep the trailing newline, hence the `line[1:end-1]` above; with the current default, which strips it, the function becomes:

```julia
function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line)
        end
    end
end
```

The 0.6 version of Julia also included a much faster `readline`. |
I found similar problems with the performance of functions that use `eachline`.

**Example**

**Input**

I've used this large file for the example: https://raw.githubusercontent.com/diegozea/mitos-benchmarks/master/data/PF00089.fasta

I do not perform string concatenation as in the previous example; I'm only counting characters.

**Julia 0.6.4**

```julia
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, chomp=false)
            c += length(line)
        end
        c
    end
end
```

**Julia 0.7.0-beta2.81**

```julia
function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, keep=true)
            c += length(line)
        end
        c
    end
end
```

**Python 2.7.12 & Python 3.5.2**

```python
def countchars(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += len(line)
        return c
```

**Timing** (after compilation in Julia)

**Julia 0.6.4**

```julia
julia> @time countchars("PF00089.fasta")
  0.220063 seconds (1.63 M allocations: 99.055 MiB, 4.79% gc time)
32301307
```

**Julia 0.7.0-beta2.81**

```julia
julia> @time countchars("PF00089.fasta")
  0.421870 seconds (1.63 M allocations: 57.543 MiB, 0.99% gc time)
32301307
```

**Python 2.7.12**

```python
In [3]: %time countchars("PF00089.fasta")
CPU times: user 92.2 ms, sys: 12.3 ms, total: 104 ms
Wall time: 103 ms
Out[3]: 32301307
```

**Python 3.5.2**

```python
In [3]: %time countchars("PF00089.fasta")
CPU times: user 154 ms, sys: 32.7 ms, total: 187 ms
Wall time: 174 ms
Out[3]: 32301307
```

**Type instability of `readline`**
|
The `readline` method in question is at lines 845 to 846 in cbb6433. But:

```julia
julia> code_warntype(readline, Tuple{IO})
Body::String
429 1 ─      goto 2 if not false
    2 ┄ %2 = Base.getfield(%%s, :ios)::Array{UInt8,1}
    │        Base.sle_int(0, 167772160)
    │        Base.ifelse(true, 10, 0)
    │   %5 = π (false, Bool)
    └──      goto 3 if not %5
    3 ┄ %7 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(2))))::Ptr{UInt8}
    │   %8 = Base.bitcast(Ptr{Nothing}, %7)::Ptr{Nothing}
    │   %9 = :($(Expr(:foreigncall, :(:jl_readuntil), Ref{String}, svec(Ptr{Nothing}, UInt8, UInt8, UInt8), :(:ccall), 4, Core.SSAValue(8), 0x0a, 0x01, 0x02, 0x02, 0x01, 0x0a, Core.SSAValue(2))))::String
    └──      goto 4
    4 ─      return %9

Body::Any
369 1 ─ %1 = Base.:(#readline#276)(Base.nothing, false, %%#self#, %%s)::Any
    └──      return %1
```

The second method, which cannot be inferred here, is this one: lines 368 to 382 in cbb6433.

The problem here seems to be that two of the `readuntil` methods (`readuntil(io::Base.AbstractPipe, arg::UInt8; kw...) in Base at io.jl:232` and `readuntil(this::Base.LibuvStream, c::UInt8; keep) in Base at stream.jl:769`) are inferred as `Any`, and there are too many methods of `String(::Any)` for inference. Possible solutions (not mutually exclusive):
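One workaround in the spirit of the annotation-based suggestions can be sketched at the call site; this is an editor's illustration of the general technique, not the actual fix, and `countchars_asserted` is a hypothetical name:

```julia
# Workaround sketch: a type assertion tells inference the result is a
# String, so the rest of the loop compiles to concrete code even when
# readline's own return type is inferred as Any.
function countchars_asserted(filename)
    open(filename, "r") do fh
        c = 0
        while !eof(fh)
            line = readline(fh, keep=true)::String  # assertion restores inference
            c += length(line)
        end
        c
    end
end
```

The assertion is checked at run time, so it is safe: if `readline` ever returned something other than a `String`, this would throw rather than silently miscompute.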
|
Can confirm that putting
and improves performance: Without:
With:
|
Something seems broken in codegen, since even if we annotate the return type, |
With #28253 (Julia 0.7.0-beta2.126):

```julia
julia> Profile.@profile countchars("PF00089.fasta")
32301307

julia> Profile.print()
2   ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
139 ./task.jl:262; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
 139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:119; macro expansion
  139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:87; eval_user_input(::Any, ::REPL.REPLBackend)
   139 ./boot.jl:319; eval(::Module, ::Any)
    139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/Profile/src/Profile.jl:27; top-level scope
     139 ./REPL[1]:2; countchars
      139 ./iostream.jl:367; open
       139 ./iostream.jl:369; #open#304(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Main, Symbol("##3#4"...
        139 ./REPL[1]:5; (::getfield(Main, Symbol("##3#4")))(::IOStream)
         4 ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
          4 ./iostream.jl:193; eof
           3 ./pointer.jl:66; unsafe_convert
            3 ./pointer.jl:65; unsafe_convert
         80 ./io.jl:897; iterate(::Base.EachLine{IOStream}, ::Nothing)
          1 ./boot.jl:321; kwfunc(::Any)
          73 ./none:0; #readline
           73 ./iostream.jl:433; #readline#306
         53 ./strings/string.jl:269; length(::String)
          1 ./int.jl:428; length
          13 ./pointer.jl:0; length
          1 ./strings/string.jl:273; length
           1 ./strings/string.jl:88; codeunit
            1 ./gcutils.jl:87; macro expansion
             1 ./pointer.jl:105; unsafe_load
              1 ./pointer.jl:105; unsafe_load
          10 ./strings/string.jl:276; length
           10 ./int.jl:53; +
          18 ./strings/string.jl:277; length
          10 ./strings/string.jl:278; length
           10 ./strings/string.jl:88; codeunit
            10 ./gcutils.jl:87; macro expansion
             10 ./pointer.jl:105; unsafe_load
              10 ./pointer.jl:105; unsafe_load
          1 ./strings/string.jl:276; length(::String)
```

Counting lines instead of chars, to avoid calling `length`:

**Julia 0.7.0-beta2.126**

```julia
julia> function countlines(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, keep=true)
                   c += 1
               end
               c
           end
       end

julia> @time countlines("PF00089.fasta")
  0.059442 seconds (1.09 M allocations: 49.240 MiB, 6.86% gc time)
544104
```

**Python**

```python
def countlines(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += 1
        return c
```

**Python 2.7.12**

```python
In [2]: %time countlines("PF00089.fasta")
CPU times: user 56.8 ms, sys: 16.5 ms, total: 73.3 ms
Wall time: 72.7 ms
Out[2]: 544104
```

**Python 3.5.2**

```python
In [5]: %time countlines("PF00089.fasta")
CPU times: user 123 ms, sys: 4.26 ms, total: 128 ms
Wall time: 126 ms
Out[5]: 544104
```

So, we are really iterating lines in a file faster than Python :) |
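As a further data point, line counting can be done without allocating a `String` per line at all, by scanning raw bytes for newlines. A hedged editor's sketch (counts `0x0a` bytes, so it assumes Unix line endings and that a final line without a trailing newline should not be counted):

```julia
# Count newline bytes in fixed-size chunks, allocating no per-line strings.
function countlines_bytes(filename)
    open(filename, "r") do fh
        c = 0
        buf = Vector{UInt8}(undef, 1 << 16)   # 64 KiB read buffer
        while !eof(fh)
            n = readbytes!(fh, buf)           # bytes actually read this pass
            c += count(b -> b == 0x0a, view(buf, 1:n))
        end
        c
    end
end
```

This sidesteps both `readline` and `length` entirely, so it can serve as a floor against which the line-iteration benchmarks above are compared.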
And using a bit more idiomatic code:
I don't think there is anything specific that warrants keeping this issue open. Feel free to reopen if you disagree. |
Apologies if this is covered previously, but I am unable to find an answer after extensive reading. I did find one reference, discussed below.
Let's say I am reading the human genome chromosome 1, approx 0.25GB unzipped. The zipped file is available here
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz
In this file the first line is a "header" which I don't care about, and the rest is the sequence. I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters. A simple-minded Python function that does this is as follows. On my machine it takes less than 700ms to run.
A direct translation to Julia takes -- well, it doesn't seem to finish in any reasonable time, but it takes 11 seconds on a file 1/200 the size and the time seems to increase more than linearly with file size.
I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds -- still about 30 times slower than python.
A comment on that post suggested using IOBuffer, but I am not clear on how to do this (joining strings after dropping the last newline characters) more efficiently than what is being done above. Any help would be very welcome.
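For what it's worth, the `IOBuffer` idea asked about above might look like the following editor's sketch (not code from the thread; `read_sequence` is a hypothetical name, and `sizehint` preallocates roughly the file size so the buffer does not have to regrow):

```julia
function read_sequence(filename)
    open(filename) do f
        readline(f)                  # throw away the header line
        io = IOBuffer(sizehint = filesize(filename))
        for line in eachline(f)
            print(io, chomp(line))   # chomp is a no-op if the newline is already gone
        end
        String(take!(io))
    end
end
```

The key difference from the naive translation is that nothing is concatenated until the end: all lines are appended to one growable byte buffer and converted to a `String` once.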