Reading very large text files #19141

rsidd · 2016-10-28T07:30:05Z

Apologies if this is covered previously, but I am unable to find an answer after extensive reading. I did find one reference, discussed below.

Let's say I am reading the human genome chromosome 1, approx 0.25GB unzipped. The zipped file is available here
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz

In this file the first line is a "header" which I don't care about, and the rest is the sequence. I want the output to be one long string representing the sequence, concatenating all lines after throwing away the first line and removing trailing newline characters. A simple-minded Python function that does this is as follows. On my machine it takes less than 700ms to run.

def readfasta(filename):
    f = open(filename)
    l0 = f.readline()
    l0 = f.readline()[0:-1]
    for l in f:
        l0 += l[0:-1]
    return l0

A direct translation to Julia takes -- well, it doesn't seem to finish in any reasonable time, but it takes 11 seconds on a file 1/200 the size and the time seems to increase more than linearly with file size.

I found this post apparently addressing the matter: https://groups.google.com/forum/#!topic/julia-dev/UDllYRfm64w
The OP said reading the file into an array of strings helped. I tried that, and the following runs in 20 seconds -- still about 30 times slower than python.

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    strarr = map(x::String->x[1:end-1], readlines(f))
    return join(strarr,"")
end

A comment on that post suggested using IOBuffer, but I am not clear on how to do this (joining strings after dropping the last newline characters) more efficiently than what is being done above. Any help would be very welcome.

The text was updated successfully, but these errors were encountered:

bicycle1885 · 2016-10-28T07:42:20Z

I think issue is not the best place to ask questions. You can ask at the julia-users group or stackoverflow. BTW, if you'd like to load sequences in a FASTA file, Bio.jl has a FASTA parser.

JeffBezanson · 2016-10-29T06:13:53Z

I get ~0.8 seconds in python and ~5 seconds in julia, so it seems like there is a performance issue to fix here. I'm not sure why you're seeing 30x slower; what version of julia are you using?

rsidd · 2016-10-29T08:39:39Z

Thanks for the reopen. This is julia 0.5.0 via Ubuntu PPA (staticfloat).

Did you get 5 secs with exactly the julia code I posted? I get 16-24 secs but generally about 20 secs each time I try. Consistently about 600-700ms with the python code I posted.

rsidd · 2016-10-29T09:01:13Z

Answering @bicycle1885 above. Using Bio.seq,

@time for record in open(FASTAReader,"chr1.fa")
           print(length(record.seq))
end

takes 19 seconds to run, similar to my code.

bicycle1885 · 2016-10-29T09:17:35Z

Your Julia seems to be much slower than mine (note that the first call includes JIT compiling time):

julia> @time open(collect, FASTAReader, "chr1.fa");
  2.541710 seconds (1.34 M allocations: 548.084 MB, 9.27% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
  1.478964 seconds (81 allocations: 494.553 MB, 15.05% gc time)

Also note that what your Python code does is just reading and concatenating strings while Bio.jl does checking validity and encoding data into BioSequence objects.

As a reference, I ran your Python code on my computer (Python 3.4.1):

In [5]: %time s = readfasta("chr1.fa")
CPU times: user 2.18 s, sys: 199 ms, total: 2.38 s
Wall time: 2.41 s

and your Julia code (Julia 0.5.0):

julia> @time readfasta("chr1.fa");
  4.506265 seconds (19.94 M allocations: 1.942 GB, 23.07% gc time)

rsidd · 2016-10-29T12:31:05Z

I get

julia> @time open(collect, FASTAReader, "chr1.fa");
 20.262316 seconds (25.87 M allocations: 690.482 MB, 0.84% gc time)

julia> @time open(collect, FASTAReader, "chr1.fa");
 21.800200 seconds (24.87 M allocations: 874.410 MB, 1.03% gc time)

julia> @time readfasta("chr1.fa");
 22.814977 seconds (19.94 M allocations: 1.978 GB, 4.46% gc time)

Note the vastly bigger number of allocations for the Bio.jl code compared to what you see. But for my readfasta function there is no major difference in allocations. This is with a fresh update of julia to the latest in the PPA (still 0.5.0), and after a Pkg.update(). Should I be trying a Julia install from github directly?

rsidd · 2016-10-31T06:38:40Z

I installed julia 0.5.1-pre+4 (2016-10-28 19:39 UTC) from github. Pretty much the same timings as above. Baffled why @JeffBezanson sees faster-but-still-slow performance while @bicycle1885 sees performance comparable to python. Can it be something OS-related? This is Ubuntu Xenial, Intel i7-4770 CPU @ 3.40GHz, 32GB RAM. ZFS filesystem, but I tried reading from an ext4 filesystem and the results are the same.

On my laptop (i5-5200U CPU @2.20 GHz, 4GB RAM, Ubuntu Xenial, ZFS) both of these are much faster: the readfasta function takes 6-16 seconds (very variable, and later calls aren't always faster) and the FASTAReader call takes about 5-6 seconds. I don't understand the difference, since it's a slower machine with less RAM running the same OS. Even so, python 2.7.12 takes 700ms on the function I posted.

If I translate my julia function to python (read a list of lines, truncate them, join them) it takes about 10 seconds, not too different from julia. But if I translate my python function to julia it takes a very very long time apparently because concatenating strings is slow in julia.

rsidd · 2016-11-01T05:02:57Z

This is very mysterious: after a reboot (previous uptime had been about 2 months) the exact same readfasta function as above takes 1.8 seconds, 10x faster than before. The Bio.jl function takes 1.8 seconds too, again 10x faster. The python function takes 0.6 seconds, almost unchanged from earlier.

So I am wondering if it is an OS issue, but if so, why was it affecting julia and not python (or, as far as I can tell, any other software that read/wrote large files, such as bedtools)?

jiahao · 2016-11-01T12:21:54Z

@rsidd what happens when you time the result after running gc_enable(false)? I wonder if the performance problem you are seeing has been visited before in #10428

rsidd · 2016-11-02T00:21:58Z

With gc_enable(false) the readfasta time goes down from avg 1.6-1.8 seconds to 1.1-1.2 seconds. If performance slows in the future to the extent I saw previously, I will give it a try and report back.

rsidd · 2016-11-02T00:26:00Z

However, right now with gc_enable(true) I am getting 25% gc time so the speedup I see without gc is reasonable. In the extremely slow numbers I posted earlier, it was 4% gc time.

JeffBezanson · 2016-11-02T00:32:46Z

it takes a very very long time apparently because concatenating strings is slow in julia

Well, in that case it's because that code in julia expresses an O(n^2) algorithm.

sylvaticus · 2017-07-12T15:08:21Z

If you are using laptops, be sure to be consistent in the power mode of the processor.. in my case I took months to realise that ubuntu wasn't updating to full power mode when power was plugged back, but only after a hibernation/resume sequence (and that's was a 10-20x performance difference).

oceanborn2 · 2017-12-17T11:07:52Z

Hi. Is julia using C strings internally to concatenate the data ?

It looks a lot like the use of C strings for concatenation of a large amount of text. Since those string are zero terminated, they behave a single linked list and concatenation is O(n) when using strcat.

Then doing that in a while loop (concatenating characters one by one) is making it effectively O(n^2) which would explain the gradual degradation in performance.

Actually, it is the sum of 1+2+3+4+5+... = sum(1..n) = n*(n+1)/2 iterations.

I met this issue in C and Delphi in the past. A workaround solution was to use a temporary (on the C stack) local variable as a buffer to concatenate say up to 100 characters and then to concatenate that back to the main stream.

Another much better solution is to use two pointers (start/end of string) and pointer arithmetic for insertion at least and that yields O(1).

StefanKarpinski · 2017-12-17T21:43:37Z

Julia does not use C strings. But the algorithm is still O(n^2) with Julia's string implementation, which is what @JeffBezanson wrote above. The solution is fairly simple – use the sprint function and write to the io object it provides, e.g.

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line[1:end-1])
        end
    end
end

When this issue was opened, the default behavior of eachline and readline was to provide lines including the newline, but that's since been changed to automatically remove the newline unless you pass the chomp=false keyword argument, so the new version of this code would simply be:

function readfasta(filename::String)
    f = open(filename)
    l0 = readline(f)
    return sprint() do io
        for line in eachline(f)
            print(io, line)
        end
    end
end

The 0.6 version of Julia also included a much faster String implementation, so it would be great to get some feedback from the OP and others in this thread about whether they still see a performance issue or if this has been addressed now and can be closed.

diegozea · 2018-07-23T14:52:12Z

I found similar problems with the performance of functions that use eachline to iterate over the lines of a large file using Julia 0.6.4 and 0.7.0-beta2.81. Python is a lot faster than Julia in the following toy example. I think that part of the problem is that the eachline function is type unstable.

Example

Input

I've used this large file for the example: https://raw.githubusercontent.com/diegozea/mitos-benchmarks/master/data/PF00089.fasta

I do not perform string concatenation as in the previous example. I'm only using length to count the number of characters in each line (in order to do something inside the loop).

Julia 0.6.4

function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, chomp=false)
            c += length(line)
        end
        c
    end
end

Julia 0.7.0-beta2.81

function countchars(filename)
    open(filename, "r") do fh
        c = 0
        for line in eachline(fh, keep=true)
            c += length(line)
        end
        c
    end
end

Python 2.7.12 & Python 3.5.2

def countchars(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += len(line)
        return c

Timing

(after compilation in Julia)

Julia 0.6.4

julia> @time countchars("PF00089.fasta")
  0.220063 seconds (1.63 M allocations: 99.055 MiB, 4.79% gc time)
32301307

Julia 0.7.0-beta2.81

julia> @time countchars("PF00089.fasta")
  0.421870 seconds (1.63 M allocations: 57.543 MiB, 0.99% gc time)
32301307

Python 2.7.12

In [3]: %time countchars("PF00089.fasta")
CPU times: user 92.2 ms, sys: 12.3 ms, total: 104 ms
Wall time: 103 ms
Out[3]: 32301307

Python 3.5.2

In [3]: %time countchars("PF00089.fasta")
CPU times: user 154 ms, sys: 32.7 ms, total: 187 ms
Wall time: 174 ms
Out[3]: 32301307

Type instability of `eachline`

Using Traceur and code_warntype we can see that Julia 0.6.4 can't infer the returned type of eachline:

julia> @code_warntype countchars("PF00089.fasta")
Variables:
  #self# <optimized out>
  filename::String
  #5::##5#6

Body:
  begin 
      #5::##5#6 = $(Expr(:new, :(Main.##5#6)))
      SSAValue(2) = "r"
      return $(Expr(:invoke, MethodInstance for open(::##5#6, ::String, ::String), :(Main.open), :(#5), :(filename), SSAValue(2)))
  end::Any

julia> @trace countchars("PF00089.fasta")
(Core.kwfunc)(::Base.#eachline) at boot.jl:237
  returns Any
(Core.kwfunc)(::Type{EachLine}) at boot.jl:237
  returns Any
(Type)(::Array{Any,1}, ::Type{EachLine}, ::IOStream) at :0
  #temp# is assigned as Any at line 0
  ondone is assigned as Function at line 0
  ondone is assigned as Base.##208#211 at line 0
  dynamic dispatch to ((Core.getfield)($(QuoteNode(Core.Box(Base.#call#206))), :contents))(ondone, chomp, , stream) at line 0
  returns Any
(Base.#eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(done)(::EachLine, ::Void) at io.jl:574
  dynamic dispatch to (Base.eof)((Core.getfield)(itr, :stream)) at line 574
(Core.kwfunc)(::Base.#readline) at boot.jl:237
  returns Any
(Base.#readline)(::Array{Any,1}, ::Base.#readline, ::IOStream) at :0
  #temp# is assigned as Any at line 0
(next)(::EachLine, ::Void) at io.jl:578
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)(itr, :chomp)))), Base.readline, (Core.getfield)(itr, :stream)) at line 578
  returns Tuple{Any,Void}
(#5)(::IOStream) at REPL[6]:3
  line is assigned as Any at line 4
  c is assigned as Int64 at line 3
  c is assigned as Any at line 5
  dynamic dispatch to (Base.eof)((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 4
  dynamic dispatch to ($(QuoteNode(Base.#readline)))($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), :((Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :chomp)))), Base.readline, (Core.getfield)($(Expr(:invoke, MethodInstance for (::Base.#kw##eachline)(::Array{Any,1}, ::Base.#eachline, ::IOStream), :($(QuoteNode(Base.#eachline))), :($(Expr(:invoke, MethodInstance for vector_any(::Any, ::Vararg{Any,N} where N), :(Base.vector_any), :(:chomp), false))), :(Main.eachline), :fh)), :stream)) at line 576
  dynamic dispatch to (Main.length)(line) at line 5
  dynamic dispatch to c + (Main.length)(line) at line 5
  returns Any

Julia's profiler also shows a lot of time being expended in readline and vector_any:

julia> function countchars(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, chomp=false)
                   c += length(line)
               end
               c
           end
       end
countchars (generic function with 1 method)

julia> Profile.print()
313 ./event.jl:73; (::Base.REPL.##1#2{Base.REPL.REPLBackend})()
 313 ./REPL.jl:97; macro expansion
  313 ./REPL.jl:66; eval_user_input(::Any, ::Base.REPL.REPLBackend)
   313 ./boot.jl:235; eval(::Module, ::Any)
    313 ./<missing>:?; anonymous
     313 ./profile.jl:23; macro expansion
      313 ./iostream.jl:152; open(::##1#2, ::String, ::String)
       212 ./REPL[1]:4; (::##1#2)(::IOStream)
        86 ./<missing>:0; (::Base.#kw##readline)(::Array{Any,1}, ::Base.#readline, ::IOStream)
         1  ./iostream.jl:0; #readline#216(::Bool, ::Function, ::IOStream)
         61 ./iostream.jl:234; #readline#216(::Bool, ::Function, ::IOStream)
        1  ./essentials.jl:0; vector_any(::Any, ::Vararg{Any,N} where N)
        72 ./essentials.jl:331; vector_any(::Any, ::Vararg{Any,N} where N)
        5  ./essentials.jl:333; vector_any(::Any, ::Vararg{Any,N} where N)
        1  ./essentials.jl:335; vector_any(::Any, ::Vararg{Any,N} where N)
        19 ./io.jl:574; done
         3 ./iostream.jl:74; eof(::IOStream)
       100 ./REPL[1]:5; (::##1#2)(::IOStream)
        6  ./strings/string.jl:164; length(::String)
        33 ./strings/string.jl:165; length(::String)

Using code_warntype in Julia 0.7.0-beta2.81 gives a similar output, so maybe it's the same problem in both versions of Julia:

julia> @code_warntype countchars("PF00089.fasta")
Body::Any
2 1 ─ %1 = invoke Base.:(#open#304)(:($(QuoteNode(Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}())))::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, open::Function, getfield(Main, Symbol("##7#8"))()::getfield(Main, Symbol("##7#8")), %%filename::String, "r"::Vararg{String,N} where N)::Any
  └──      return %1                                                                                                                                                                                                                │

I hope this information helps to solve the performance issue,

Best regards,

martinholters · 2018-07-24T07:27:23Z

The stream filed of the EachLine iterator has type IO:

julia/base/io.jl

Lines 845 to 846 in cbb6433

    
           struct EachLine 
        
               stream::IO

But:

julia> code_warntype(readline, Tuple{IO})
Body::String
429 1 ─      goto 2 if not false                                                                      │╻       #readline#304
    2 ┄ %2 = Base.getfield(%%s, :ios)::Array{UInt8,1}                                                 ││╻       getproperty
    │        Base.sle_int(0, 167772160)                                                               │││╻╷╷╷╷   convert
    │        Base.ifelse(true, 10, 0)                                                                 ││││╻       Type
    │   %5 = π (false, Bool)                                                                          ││      
    └──      goto 3 if not %5                                                                         ││      
    3 ┄ %7 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(2))))::Ptr{UInt8}nvert
    │   %8 = Base.bitcast(Ptr{Nothing}, %7)::Ptr{Nothing}                                             │││╻       convert
    │   %9 = :($(Expr(:foreigncall, :(:jl_readuntil), Ref{String}, svec(Ptr{Nothing}, UInt8, UInt8, UInt8), :(:ccall), 4, Core.SSAValue(8), 0x0a, 0x01, 0x02, 0x02, 0x01, 0x0a, Core.SSAValue(2))))::String
    └──      goto 4                                                                                   ││      
    4 ─      return %9                                                                                │       
Body::Any
369 1 ─ %1 = Base.:(#readline#276)(Base.nothing, false, %%#self#, %%s)::Any                                                   │
    └──      return %1

The second method that cannot be inferred here is this one:

julia/base/io.jl

Lines 368 to 382 in cbb6433

    
           function readline(s::IO=stdin; chomp=nothing, keep::Bool=false) 
        
               if chomp !== nothing 
        
                   keep = !chomp 
        
                   depwarn("The `chomp=$chomp` argument to `readline` is deprecated in favor of `keep=$keep`.", :readline) 
        
               end 
        
               line = readuntil(s, 0x0a, keep=true) 
        
               i = length(line) 
        
               if keep || i == 0 || line[i] != 0x0a 
        
                   return String(line) 
        
               elseif i < 2 || line[i-1] != 0x0d 
        
                   return String(resize!(line,i-1)) 
        
               else 
        
                   return String(resize!(line,i-2)) 
        
               end 
        
           end

The problem here seems to be that two of the readuntil methods (readuntil(io::Base.AbstractPipe, arg::UInt8; kw...) in Base at io.jl:232 and readuntil(this::Base.LibuvStream, c::UInt8; keep) in Base at stream.jl:769) are inferred as Any, and there are too many methods for String(::Any) for inference.
Possible solutions (not mutually exclusive):

Provide a type parameter to EachLine for the type of stream.
Add ::String type assertions to the String(...) calls.
Dig into the problematic readuntil methods and make them inferable (assuming they are actually type-stable).

KristofferC · 2018-07-24T08:03:42Z

Can confirm that putting IO as a type parameter on EachLine solves inferrability issues:

Body::Nothing
2 1 ── %1  = invoke Main.open(%%filename::String, "r"::String)::IOStream   │
4 │          invoke Core.kwfunc(Main.eachline::Any)                        │
  │    %3  = Base.slt_int(0, 1)::Bool                                      │╻╷╷╷╷   #eachline
  └───       goto 3 if not %3                                              ││┃│││    isempty
  2 ──       goto 4                                                        │││┃││     iterate
  3 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)                   ││││┃│      iterate
  └───       unreachable                                                   │││││┃       iterate
  4 ──       goto 5                                                        │││││
  5 ──       goto 6                                                        │││╻       iterate
  6 ──       goto 7                                                        │││
  7 ──       invoke Core.kwfunc(Base.EachLine::Any)                        ││╻       #eachline#373
  │    %12 = Base.slt_int(0, 1)::Bool                                      │││╻╷╷╷╷   Type
  └───       goto 9 if not %12                                             ││││┃│││    isempty
  8 ──       goto 10                                                       │││││┃││     iterate
  9 ──       invoke Base.getindex(()::Tuple{}, 1::Int64)                   ││││││┃│      iterate
  └───       unreachable                                                   │││││││┃       iterate
  10 ─       goto 11                                                       │││││││
  11 ─       goto 12                                                       │││││╻       iterate
  12 ─       goto 13                                                       │││││
  13 ─ %20 = new(Base.EachLine{IOStream}, %1, getfield(Base, Symbol("##370#372"))(), true)::Base.EachLine{IOStream}
  └───       goto 14                                                       ││││
  14 ─       goto 15                                                       │││
  15 ─       goto 16                                                       ││
  16 ─ %24 = :(Base.iterate)::Core.Compiler.Const(iterate, false)          │
  │    %25 = invoke %24(%20::Base.EachLine{IOStream}, Base.nothing::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %26 = Core.:(===)(%25, nothing)::Bool                               │
  │    %27 = Base.not_int(%26)::Bool                                       │
  └───       goto 19 if not %27                                            │
  17 ┄ %29 = φ (16 => 0, 18 => %35)::Int64                                 │
  │    %30 = φ (16 => %25, 18 => %36)::Union{Nothing, Tuple{String,Nothing}}
  │    %31 = π (%30, Tuple{String,Nothing})                                │
  │    %32 = Core.getfield(%31, 1)::String                                 │
  │    %33 = Core.getfield(%31, 2)::Core.Compiler.Const(nothing, false)    │
5 │    %34 = invoke Main.length(%32::String)::Int64                        │
  │    %35 = Base.add_int(%29, %34)::Int64                                 │╻       +
  │    %36 = invoke Base.iterate(%20::Base.EachLine{IOStream}, %33::Nothing)::Union{Nothing, Tuple{String,Nothing}}
  │    %37 = Core.:(===)(%36, nothing)::Bool                               │
  │    %38 = Base.not_int(%37)::Bool                                       │
  └───       goto 19 if not %38                                            │
  18 ─       goto 17                                                       │
8 19 ┄ %41 = Base.getfield(%1, :ios)::Array{UInt8,1}                       │╻╷      close
  │    %42 = :($(Expr(:foreigncall, :(:jl_array_ptr), Ptr{UInt8}, svec(Any), :(:ccall), 1, Core.SSAValue(41))))::Ptr{UInt8}
  │    %43 = Base.bitcast(Ptr{Nothing}, %42)::Ptr{Nothing}                 │││╻       convert
  │    %44 = :($(Expr(:foreigncall, :(:ios_close), Nothing, svec(Ptr{Nothing}), :(:ccall), 1, Core.SSAValue(43), Core.SSAValue(41))))::Nothing
  └───       return %44

and improves performance:

Without:

  0.282460 seconds (1.64 M allocations: 57.981 MiB, 1.74% gc time)

With:

0.064163 seconds (1.09 M allocations: 49.241 MiB, 5.91% gc time)

vtjnash · 2018-07-24T18:30:54Z

Something seems broken in codegen, since even if we annotate the return type, @profile seems to indicate it's still spending a large percentage of its time at runtime trying to figure out the return type of (Core.tuple)(%48, Base.nothing)::Tuple{String,Nothing} (from code_typed(iterate, (Base.EachLine, Nothing)))

diegozea · 2018-07-26T20:21:32Z

With #28253 (Julia 0.7.0-beta2.126) countchars("PF00089.fasta") takes 0.108174 seconds in my computer instead of 0.405204 seconds in Julia 0.7.0-beta2.8. So, now Julia is as fast as Python 2.7.12 (0.100 s) and faster than Python 3.5.2 (0.168 s).
I think the reason why we are comparable with Python 2.7.12 in this benchmark is that we are measuring the time spent in lenght(::String) (https://discourse.julialang.org/t/performance-of-length-string/12672):

julia> Profile.@profile countchars("PF00089.fasta")
32301307

julia> Profile.print()
2   ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
139 ./task.jl:262; (::getfield(REPL, Symbol("##28#29")){REPL.REPLBackend})()
 139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:119; macro expansion
  139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/REPL/src/REPL.jl:87; eval_user_input(::Any, ::REPL.REPLBackend)
   139 ./boot.jl:319; eval(::Module, ::Any)
    139 /home/zea/bin/Julia/usr/share/julia/stdlib/v0.7/Profile/src/Profile.jl:27; top-level scope
     139 ./REPL[1]:2; countchars
      139 ./iostream.jl:367; open
       139 ./iostream.jl:369; #open#304(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::getfield(Main, Symbol("##3#4"...
        139 ./REPL[1]:5; (::getfield(Main, Symbol("##3#4")))(::IOStream)
         4  ./io.jl:896; iterate(::Base.EachLine{IOStream}, ::Nothing)
          4 ./iostream.jl:193; eof
           3 ./pointer.jl:66; unsafe_convert
            3 ./pointer.jl:65; unsafe_convert
         80 ./io.jl:897; iterate(::Base.EachLine{IOStream}, ::Nothing)
          1  ./boot.jl:321; kwfunc(::Any)
          73 ./none:0; #readline
           73 ./iostream.jl:433; #readline#306
         53 ./strings/string.jl:269; length(::String)
          1  ./int.jl:428; length
          13 ./pointer.jl:0; length
          1  ./strings/string.jl:273; length
           1 ./strings/string.jl:88; codeunit
            1 ./gcutils.jl:87; macro expansion
             1 ./pointer.jl:105; unsafe_load
              1 ./pointer.jl:105; unsafe_load
          10 ./strings/string.jl:276; length
           10 ./int.jl:53; +
          18 ./strings/string.jl:277; length
          10 ./strings/string.jl:278; length
           10 ./strings/string.jl:88; codeunit
            10 ./gcutils.jl:87; macro expansion
             10 ./pointer.jl:105; unsafe_load
              10 ./pointer.jl:105; unsafe_load
         1  ./strings/string.jl:276; length(::String)

Counting lines instead of chars to not call length gives:

Julia 0.7.0-beta2.126

julia> function countlines(filename)
           open(filename, "r") do fh
               c = 0
               for line in eachline(fh, keep=true)
                   c += 1
               end
               c
           end
       end

julia> @time countlines("PF00089.fasta")
  0.059442 seconds (1.09 M allocations: 49.240 MiB, 6.86% gc time)
544104

Python

def countlines(filename):
    with open(filename, "r") as fh:
        c = 0
        for line in fh:
            c += 1
        return c

Python 2.7.12

In [2]: %time countlines("PF00089.fasta")
CPU times: user 56.8 ms, sys: 16.5 ms, total: 73.3 ms
Wall time: 72.7 ms
Out[2]: 544104

Python 3.5.2

In [5]: %time countlines("PF00089.fasta")
CPU times: user 123 ms, sys: 4.26 ms, total: 128 ms
Wall time: 126 ms
Out[5]: 544104

So, we are really iterating lines in a file faster than Python :)

KristofferC · 2018-10-07T19:32:50Z

In [2]: %time s = readfasta("chr1.fa")
CPU times: user 2.05 s, sys: 235 ms, total: 2.29 s
Wall time: 2.29 s

julia> @time readfasta("chr1.fa");
  2.090155 seconds (14.96 M allocations: 1.269 GiB, 19.81% gc time)

And using a but more idiomatic code

julia> function readfasta2(filename::String)
           io = IOBuffer()
           f = open(filename)
           l0 = readline(f)
           for line in eachline(f)
               print(io, chomp(line))
           end
           return String(take!(io))
        end

julia> @time readfasta2("chr1.fa");
  1.012338 seconds (14.96 M allocations: 865.145 MiB)

I don't think there is anything specific that warrants this issue to be kept open? Feel free to reopen if you disagree.

KristofferC closed this as completed Oct 28, 2016

JeffBezanson added performance Must go faster io Involving the I/O subsystem: libuv, read, write, etc. labels Oct 29, 2016

JeffBezanson reopened this Oct 29, 2016

jiahao added the needs more info Clarification or a reproducible example is required label Nov 1, 2016

JeffBezanson mentioned this issue Dec 21, 2016

faster String allocation #19449

Merged

loleg mentioned this issue Apr 17, 2018

Streaming large data reads frictionlessdata/TableSchema.jl#21

Open

KristofferC mentioned this issue Jul 24, 2018

use a type parameter for stream in EachLine, improves performance #28253

Merged

KristofferC closed this as completed Oct 7, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reading very large text files #19141

Reading very large text files #19141

rsidd commented Oct 28, 2016

bicycle1885 commented Oct 28, 2016

JeffBezanson commented Oct 29, 2016

rsidd commented Oct 29, 2016

rsidd commented Oct 29, 2016

bicycle1885 commented Oct 29, 2016 •

edited

Loading

rsidd commented Oct 29, 2016

rsidd commented Oct 31, 2016

rsidd commented Nov 1, 2016

jiahao commented Nov 1, 2016

rsidd commented Nov 2, 2016

rsidd commented Nov 2, 2016

JeffBezanson commented Nov 2, 2016

sylvaticus commented Jul 12, 2017

oceanborn2 commented Dec 17, 2017

StefanKarpinski commented Dec 17, 2017 •

edited

Loading

diegozea commented Jul 23, 2018 •

edited

Loading

martinholters commented Jul 24, 2018

KristofferC commented Jul 24, 2018 •

edited

Loading

vtjnash commented Jul 24, 2018

diegozea commented Jul 26, 2018 •

edited

Loading

KristofferC commented Oct 7, 2018

Reading very large text files #19141

Reading very large text files #19141

Comments

rsidd commented Oct 28, 2016

bicycle1885 commented Oct 28, 2016

JeffBezanson commented Oct 29, 2016

rsidd commented Oct 29, 2016

rsidd commented Oct 29, 2016

bicycle1885 commented Oct 29, 2016 • edited Loading

rsidd commented Oct 29, 2016

rsidd commented Oct 31, 2016

rsidd commented Nov 1, 2016

jiahao commented Nov 1, 2016

rsidd commented Nov 2, 2016

rsidd commented Nov 2, 2016

JeffBezanson commented Nov 2, 2016

sylvaticus commented Jul 12, 2017

oceanborn2 commented Dec 17, 2017

StefanKarpinski commented Dec 17, 2017 • edited Loading

diegozea commented Jul 23, 2018 • edited Loading

Example

Input

Julia 0.6.4

Julia 0.7.0-beta2.81

Python 2.7.12 & Python 3.5.2

Timing

Julia 0.6.4

Julia 0.7.0-beta2.81

Python 2.7.12

Python 3.5.2

Type instability of eachline

martinholters commented Jul 24, 2018

KristofferC commented Jul 24, 2018 • edited Loading

vtjnash commented Jul 24, 2018

diegozea commented Jul 26, 2018 • edited Loading

Julia 0.7.0-beta2.126

Python

Python 2.7.12

Python 3.5.2

KristofferC commented Oct 7, 2018

bicycle1885 commented Oct 29, 2016 •

edited

Loading

StefanKarpinski commented Dec 17, 2017 •

edited

Loading

diegozea commented Jul 23, 2018 •

edited

Loading

Type instability of `eachline`

KristofferC commented Jul 24, 2018 •

edited

Loading

diegozea commented Jul 26, 2018 •

edited

Loading