Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write(io, ::Int) allocates #39041

Open
Seelengrab opened this issue Dec 30, 2020 · 23 comments
Open

write(io, ::Int) allocates #39041

Seelengrab opened this issue Dec 30, 2020 · 23 comments
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster

Comments

@Seelengrab
Copy link
Contributor

Seelengrab commented Dec 30, 2020

julia> p, io = mktemp()
("/tmp/jl_ry8Ly0", IOStream(<fd 21>))

julia> @time write(io, 1234)
  0.000006 seconds (1 allocation: 16 bytes)
8

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER =
  JULIA_NUM_THREADS = 4

Called in a tight loop, this quickly racks up allocations, overwhelming GC. This seems to be caused by the Int getting wrapped in a Ref further down the call chain, forcing it to be moved to the heap because of a @no_inline even further down the call chain.

Feels like primitve types in write shouldn't allocate.

@Seelengrab Seelengrab changed the title write(io, 1234) allocates write(io, ::Int) allocates Dec 30, 2020
@kwdye
Copy link

kwdye commented Dec 30, 2020

To give a little more color to the issue, here's an example of the GC becoming "overwhelmed"

function write_iobs(iobs, data)
    Threads.@threads for ii in 1:length(iobs)
        for jj in 1:length(data)
            write(iobs[ii], data[jj])
        end
    end
end

let
    num_ints = 10000000
    num_threads = 128

    data = rand(Int, num_ints)
    iobs = map(x->IOBuffer(Vector{UInt8}(undef, sizeof(Int) * num_ints), write=true), 1:num_threads)

    @time write_iobs(iobs, data)

    iobs
end

for me gives output:

 33.249133 seconds (1.28 G allocations: 19.078 GiB, 94.82% gc time)

Normally, when there are few threads, these tiny allocations are handled fine and you may not even notice. When there's a lot of them, however, it seems like the normal logic in the GC for handling tiny short lived allocations breaks. I think it may be related to this issue:

a8eaa22

@kwdye
Copy link

kwdye commented Dec 30, 2020

Any IO object that has its own internal buffer perhaps should be encouraged to implement something like this:

function write(io::IOBuffer, v::T) where {T <: Union{Int16,UInt16,Int32,UInt32,Int64,UInt64,Int128,UInt128,Float16,Float32,Float64}}
    re = reinterpret(T, io.data)
    re[1 + div((io.ptr - 1), sizeof(T))] = v
    io.ptr += sizeof(T)
    io.size += sizeof(T)
    return sizeof(T)
end

and it won't allocate. Really this would work for any immutable. TranscodingStreams.jl in particular would benefit from this.

Edit: Modulo the alignment issues when writing different length types!

@mgkuhn
Copy link
Contributor

mgkuhn commented Jan 4, 2021

Could the current allocating implementation

@noinline unsafe_write(s::IO, p::Ref{T}, n::Integer) where {T} =
    unsafe_write(s, unsafe_convert(Ref{T}, p)::Ptr, n) # mark noinline to ensure ref is gc-rooted somewhere (by the caller)
unsafe_write(s::IO, p::Ptr, n::Integer) = unsafe_write(s, convert(Ptr{UInt8}, p), convert(UInt, n))
write(s::IO, x::Ref{T}) where {T} = unsafe_write(s, x, Core.sizeof(T))
function write(s::IO, x::Union{Int16,UInt16,Int32,UInt32,Int64,UInt64,
                               Int128,UInt128,Float16,Float32,Float64})
    return write(s, Ref(x))
end

in base/io.jl perhaps be replaced with something along the lines of

[...]
function write(io::IO, v::T) where {T}
    if isbitstype(T)
        p = unsafe_convert(Ptr{UInt8}, v)
        return unsafe_write(io, p, Core.sizeof(T))
    else
        error("`write` is not supported on non-isbits type ", string(T))
    end
end

Then it wouldn't be necessary any more to wrap “user-defined plain-data types without write methods” into a Ref, as is currently suggested by the docstring of write.

(Although, unsafe_convert wasn't really meant for that.)

@mgkuhn mgkuhn added the io Involving the I/O subsystem: libuv, read, write, etc. label Jan 4, 2021
@Seelengrab
Copy link
Contributor Author

Sounds good to me. Maybe we should add

User-defined non-isbits types should define their own write methods.

to the error message as well.

Maybe an ArgumentError would also be better than plain error?

@simeonschaub
Copy link
Member

It's not possible to just convert a bitstype to a pointer like this, you need to put it in some kind of mutable container to get a stable memory address. Maybe we can reuse the same Ref, but I am not sure how difficult that would be to make thread safe. I also wouldn't be surprised if our GC was already optimized for small allocations like this, so it might be good to benchmark the benefit of reusing the Ref first.

@Seelengrab
Copy link
Contributor Author

Hmm.. I get the reasoning for general isbits types, but primitive types really shouldn't have to be wrapped in a Ref in the first place, right? Aren't they alwas stored inline?

@simeonschaub
Copy link
Member

Aren't they alwas stored inline?

Not always, but most of the time. That's exactly the problem, since that means it's not possible to get a stable pointer to them, but the way pipes work, the OS requires a pointer to the data. I don't think there's really any way around that requirement.

@kwdye
Copy link

kwdye commented Jan 4, 2021

Is there a way to perhaps create something that can iterate the bytes of a bitstype? Its easy to do with an Int or Float etc using bitshifts, so I guess this could be recursively done on simple structs as well. The issue would be handling holes and other alignment issues.

@simeonschaub
Copy link
Member

You still need a pointer one way or another

@kwdye
Copy link

kwdye commented Jan 4, 2021

What I mean is you only need the value of an Int to iterate all 8 of its bytes, you just shift and mask and cast the value.

@simeonschaub
Copy link
Member

Sure, but unsafe_write is not just like a generic Julia function since it interacts with the OS, so it needs a pointer not an arbitrary iterator.

@kwdye
Copy link

kwdye commented Jan 4, 2021

But many of the cases where unsafe_write is being called, it doesnt end up writing to a system pipe, e.g., the initial example here. All its doing is writing some bytes to a buffer in Julia space, right?

@simeonschaub
Copy link
Member

Yes, IOBuffer could potentially be special-cased here

@kwdye
Copy link

kwdye commented Jan 4, 2021

Could streams be given a trait that indicates that they have their own internal buffer, and dispatch unsafe_write on that? This would apply to IOBuffer and TranscodingStreams and probably many others.

@JeffBezanson JeffBezanson added the performance Must go faster label Jan 4, 2021
@mgkuhn
Copy link
Contributor

mgkuhn commented Jan 6, 2021

I suspect the real solution is that the compiler should recognize (possibly supported by a user hint?) that the result of the Ref(x) operation in

write(s::IO, x::Ref{T}) where {T} = unsafe_write(s, x, Core.sizeof(T))
function write(s::IO, x::Union{Int16,UInt16,Int32,UInt32,Int64,UInt64,
                               Int128,UInt128,Float16,Float32,Float64})
    return write(s, Ref(x))
end

is not required beyond the duration of the function call in which it happens (if the write(s, Ref(x)) call is inlined), and should therefore allocate a RAM address for storing x at compile time on the stack, where it then gets automatically freed implicitly when the function returns, and heap GC never gets involved.

I don't really understand under what conditions Julia currently allocates such addresses on the heap or on the stack, and how one can test what it does.

See also https://discourse.julialang.org/t/how-to-know-if-object-memory-resides-on-stack-or-heap/4927/12

@melonedo
Copy link
Contributor

melonedo commented Feb 19, 2021

As per definition, Ref is a pointer to an object managed by the garbage collector. But the garbage collector only manages heap memory, so a valid Ref must point to some region of the heap. Currently, Base has two types of reference, mutable RefValue and immutable RefArray (not exported). Obviously, creating a RefValue will allocate regardless of the wrapped type.
So converting an object to its reference will allocate. How about creating a Ptr directly? Since Int is an intrinsic type, it will probably be passed to the conversion function in a register, and there is no "pointer" at all. In C, the & operator allows you to take the risk and get the address of stack-allocated memory (local variables), which is dangerous (for e.g. return the address of a local variable).

Two approaches possible when a Ptr is indeed needed:

  • The user pre-allocate a Ref object and call write(::IO, ::Ref{Int}), and avoid calling write(::IO, ::Int). I am not sure but Base.RefArray also seems to work.
  • In the future, Julia provide some way to get the address of stack-allocated memory, in the form of a Ptr object, but the managed object's lifetime is only bound to some local scope. This is common practice in C like languages, gcc has __builtin_alloca, and llvm has alloca. Julia has such facility, but it was intentionally disabled here by introducing a function barrier with @noinline

For using the first approach, write_iobs2 only allocates 752 bytes regardless of iobs and data.

function write_iobs2(iobs, data)
    Threads.@threads for ii in 1:length(iobs)
        for jj in 1:length(data)
            write(iobs[ii], Base.RefArray(data, jj))
        end
    end
end

@Seelengrab
Copy link
Contributor Author

@aviatesk can we use the new effect system to help out here, to elide the allocation when it's not necessary?

@aviatesk
Copy link
Member

aviatesk commented Jun 15, 2022

I don't think so. I'm wondering if the @noinline annotation on unsafe_write is really necessary. Otherwise I think we can SROA the allocation of Ref object.

@noinline unsafe_write(s::IO, p::Ref{T}, n::Integer) where {T} =

EDIT: it's clarified by @melonedo here.

@melonedo
Copy link
Contributor

The heap allocation is necessary for context switch, and the @noinline is intentional to avoid stack allocation. So unless the underlying writing mechanism changes, some allocation is necessary. You can see more in the comments of my PR

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Jun 15, 2022

Is it conceivable to have a kind of persistent Ref for these kinds of calls per task? I don't mind having a heap address to read from, I just think it's really bad to create a new one for each call to write.

@KristofferC
Copy link
Member

A Is it conceivable to have a kind of persistent Ref for these kinds of calls per task?

Ref #41415

@Seelengrab
Copy link
Contributor Author

Maybe I've missed it, but that only seems to deal with print and show? Conceptually it's similar of course, but if

Most IO is done from a different task

anyway, can't that task have that persistent buffer instead of having it in each task? Maybe I'm misunderstanding what @vtjnash meant by that though, I'm not really deeply familiar with how julia does IO internally. I know that a lot of stuff goes through libuv (does this have its own IO task? How does that work, is there some documentation I could read up on for this?), while other stuff (file IO, as I understand it?) goes through ios.c.

@Seelengrab
Copy link
Contributor Author

Seems like I'm cursed to run into this again - encountered the fallback method while trying to make my UART subtype IO, due to an ambiguity on write(::UART, ::Base.BitInteger) with this method.. Solutions are either me defining the exact same method for all bitinteger types directly (ugly, as I can only write 5-9 bits at a time to this particular UART anyway), Base doing that, or Base not providing a fallback using Ref in the first place :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants