
Managing array memory #104

Open
Tokazama opened this issue Jan 7, 2021 · 9 comments

@Tokazama
Member

Tokazama commented Jan 7, 2021

The formal interface for working with an array's memory is currently very limited. We mostly just have ways of accessing preallocated memory (e.g., unsafe_wrap and unsafe_pointer_to_objref). The rest of the time we need to rely on methods like push!/pop!/append!/resize! to mutate a Vector down the line. Likewise, if we want to allocate memory for a new array we typically need to wrap a new instance of Array.

A couple of reasons why this would be good to have here:

  1. Performance. The necessity to ensure that memory is safely managed also means that each call to something like resize! has to ensure that we don't create a situation where we go out of bounds. This probably has only minimal overhead in most cases, but if we are doing something complicated like merging and/or sorting two vectors then we may be calling resize! a lot. I recall this sort of thing coming up a lot when I was trying to work with graph algorithms last year that do a lot of insertion and deletion.
  2. Necessary for creating some interfaces. One of the things that makes the indexing interface so nice is that there are formally defined places for new types to insert and propagate information (e.g., bounds checking). It also makes it easier to optimize indexing. In contrast, Vector directly uses methods that allocate and write to memory (e.g., Base._growend! and Base.arrayset!) while abstract types typically hope that resize! works and then use setindex!.

From where I'm standing it seems like this would be great to have here. The reason this is an issue instead of a PR is mainly because I'm unsure if the implementation for this would be too general or involved for this package. We could always define several methods here like unsafe_grow_end! and unsafe_shrink_end! (see the sketch below), but these aren't super helpful without implementations that can interact with pointers and references.
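
As a rough illustration (a minimal sketch, not anything defined in this package yet), the Vector fallbacks could simply forward to the Base internals, and other buffer types would opt in by adding their own methods:

# sketch only: _growend!/_deleteend! are internal Base functions and subject to change
unsafe_grow_end!(x::Vector, n::Integer) = Base._growend!(x, n)      # append n uninitialized elements
unsafe_shrink_end!(x::Vector, n::Integer) = Base._deleteend!(x, n)  # drop the last n elements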

@chriselrod, do you think growing/shrinking/allocating memory in an efficient way would require a bunch of LLVM magic or could we do it pretty simply here?

@chriselrod
Collaborator

I'm not sure what exactly you have in mind.
When it comes to allocating memory, I've almost always used Julia APIs instead of going the llvm route (and the llvm route didn't work out, because alloca frees as soon as the function returns -- i.e., it's local to your llvmcall function, and cannot safely be used outside of it).
It's also of course important to consider the GC. We don't want to leak memory or do something that could trigger a crash.

If we really want to go the manual memory management route, I'd look at Libc, which provides realloc to resize previously malloc'd or calloc'd memory.
So you could feasibly define an array type with a finalizer that calls Libc.free on the pointer, and that'd be mostly safe, except that you'd possibly* run into the issue the GPU ecosystem is facing in that they have to manually call GC.gc(), since it doesn't recognize just how much memory each GPU Array is consuming.

  • Does the GC look at available RAM for memory pressure (and hence miss the GPU), use allocated bytes (and hence the malloc'd arrays on the CPU would suffer the same problem), or some combination thereof?

If you have specific use cases in mind, then making your own allocator would be more efficient than malloc (you could Mmap.mmap a huge array, and then your allocator would dole out and free that memory, with you writing code to do that in a way optimized for your specific program [use mmap for this because it doesn't consume RAM by itself; only the memory you actually use from it will get paged in by your OS]). I think custom allocators should get their own package, though.
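
To sketch the idea (a toy bump allocator, not a real package; BumpArena and alloc! are made-up names, and alignment and freeing are ignored for simplicity):

using Mmap

mutable struct BumpArena
    buf::Vector{UInt8}  # backed by an anonymous mmap, so untouched pages cost nothing
    offset::Int
end
BumpArena(nbytes::Integer) = BumpArena(Mmap.mmap(Vector{UInt8}, nbytes), 0)

function alloc!(arena::BumpArena, ::Type{T}, n::Integer) where {T}
    nbytes = sizeof(T) * n
    arena.offset + nbytes <= length(arena.buf) || error("arena exhausted")
    p = Ptr{T}(pointer(arena.buf, arena.offset + 1))  # hand out the next chunk
    arena.offset += nbytes
    return p
end

A real allocator would also track and reuse freed chunks, but the point is that all of that policy lives in ordinary Julia code tuned to the program.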

Would unsafe_grow_end! just call Base._growend!, but be an exported API function that other array types can then provide new methods for?

@Tokazama
Member Author

Tokazama commented Jan 8, 2021

I think JuliaLang/julia#24909 nicely illustrates the problem I'd like to be able to solve. I guess the long term solution would be having a memory buffer type like:

# a mutable stop so the range itself can be grown in place
mutable struct MutableInt <: Integer
    x::Int
end

function unsafe_grow_end!(x::OptionallyStaticUnitRange{Int,MutableInt}, n::UInt)
    x.stop.x += n
end

struct CPUBuffer{T,D<:Ref{T},B,E}
    data::D
    indices::OptionallyStaticRange{B,E}
    flag::UInt16
end

function unsafe_grow_end!(x::CPUBuffer{T,D,B,E}, n::UInt) where {T,D,B,E<:MutableInt}
    @assert x.flag == 0  # flag that x owns data (not shared)
    unsafe_grow_end!(x.indices, n)
    # reallocate to the new length; realloc wants a byte count and returns a pointer
    # that would need to be stored back into data
    Libc.realloc(x.data, sizeof(T) * length(x.indices))
    return nothing
end

But I'm assuming getting basic garbage collection working for this would be a nightmare, so I was hoping to meet halfway for now with something like unsafe_grow_end!(x, n) = Base._growend!(x, n). Not sure if that's really possible in a semi-safe or effective way though.

@chriselrod
Collaborator

Interesting, I hadn't seen that issue before. I assume you saw the short term solution: https://github.com/tpapp/PushVectors.jl
I'm curious if that'd help LoopVectorization's compile time performance (it uses a lot of push! while compiling). It's tricky of course, because the extra compilation of PushVectors.jl would need to pay for itself through the limited amount of faster push!-ing.
Something to look into.
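
The core trick there is roughly the following (a simplified sketch, not PushVectors.jl's actual code): keep an over-allocated parent Vector plus a manual length, so most push!es are a plain store rather than a resizing ccall.

mutable struct PushVec{T} <: AbstractVector{T}
    parent::Vector{T}  # over-allocated storage
    len::Int           # number of elements actually in use
end
PushVec{T}(sizehint::Integer = 4) where {T} = PushVec{T}(Vector{T}(undef, sizehint), 0)

Base.size(v::PushVec) = (v.len,)
Base.getindex(v::PushVec, i::Int) = v.parent[i]  # sketch: no check against len

function Base.push!(v::PushVec, x)
    v.len += 1
    if v.len > length(v.parent)
        resize!(v.parent, 2 * v.len)  # the occasional real resize amortizes away
    end
    @inbounds v.parent[v.len] = x
    return v
end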

From the issue, it sounds like the really long term solution for faster pushing is to move more of the Array implementation into Julia so we no longer need to ccall to do it.

Re garbage collection, making CPUBuffer a mutable struct would make the garbage collector track it, and then you just have to define a finalizer, like in oxinabox's comment.
It would have the GPUArrays problem of the GC not realizing how heavy the arrays actually are, though.

Still, it'd be interesting to benchmark it vs base to see just how slow malloc and realloc are.
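
For concreteness, the kind of thing that benchmark would compare against Base might look like this (a hypothetical MallocVector, assuming an isbits element type and skipping bounds checks):

mutable struct MallocVector{T} <: AbstractVector{T}
    ptr::Ptr{T}
    len::Int
    function MallocVector{T}(len::Integer) where {T}
        ptr = Ptr{T}(Libc.malloc(sizeof(T) * len))
        v = new{T}(ptr, len)
        # free the malloc'd memory when the GC collects the wrapper
        finalizer(v) do v
            Libc.free(v.ptr)
        end
        return v
    end
end

Base.size(v::MallocVector) = (v.len,)
Base.getindex(v::MallocVector, i::Int) = unsafe_load(v.ptr, i)
Base.setindex!(v::MallocVector, x, i::Int) = unsafe_store!(v.ptr, x, i)

function unsafe_grow_end!(v::MallocVector{T}, n::Integer) where {T}
    v.len += n
    v.ptr = Ptr{T}(Libc.realloc(v.ptr, sizeof(T) * v.len))  # may move the allocation
    return v
end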

@Tokazama
Member Author

> I'm curious if that'd help LoopVectorization's compile time performance

TBH, I hadn't given code gen a thought. However, push! loops certainly are a common pattern in code generating functions and it would be awesome to improve compile time across the board. I think we'll continue to see new ways to improve performance if this sort of interface is available to developers.

> Re garbage collection, making CPUBuffer a mutable struct would make the garbage collector track it, and then you just have to define a finalizer, like in oxinabox's comment.
> It would have the GPUArrays problem of the GC not realizing how heavy the arrays actually are, though.
>
> Still, it'd be interesting to benchmark it vs base to see just how slow malloc and realloc are.

According to vtjnash malloc is very slow. You might also be interested in this comment. It appears that even the C implementation has some places with known room for improvement.

I'm starting to think we need a better way of talking to the GC and device memory before we can really solve this problem, because the information necessary to make the related methods possible is usually the same until we get down to the hardware. How nice would it be if we could just throw an AMDPointer in the data field for the buffer, since everything else is device dependent once you know where the memory is and the details of who owns the buffer at run time?

If developers have a way to easily interact with this, I'm sure we'll continue to see new places where we can improve performance in addition to push!.

I know that's sort of hand wavy, but it's difficult to put together concretely. That sort of organic growth and optimization is what makes Julia great.

There are a lot of things like this that come up where it should be possible to optimize something a bit more, but we either come up with hacks like PushVectors.jl or just assume it would be a micro-optimization not worth the amount of effort required.

@Tokazama
Member Author

One of the biggest reasons I want this is b/c it would make it possible for AxisIndices.jl to just provide per-dimension traits to a memory buffer with something like:

struct NDArray{T,N,B,Axes<:Tuple{Vararg{Any,N}}} <: AbstractArray{T,N}
    buffer::B  # any linear memory buffer
    axes::Axes # per-dimension axis types (offset, static, named, ...)
end

This makes it so that NDArray can be dynamic, fixed, or statically sized depending on the buffer and type of axes.
It also makes it so we can have offset, centered, padded, and/or named axes without creating a bunch of other array types.
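
A minimal sketch of how the rest of the AbstractArray interface could forward to those two fields (assuming 1-based axes and a linearly indexed buffer; none of this exists yet):

Base.axes(A::NDArray) = A.axes
Base.size(A::NDArray) = map(length, A.axes)
Base.getindex(A::NDArray{T,N}, I::Vararg{Int,N}) where {T,N} =
    A.buffer[LinearIndices(A.axes)[I...]]
Base.setindex!(A::NDArray{T,N}, v, I::Vararg{Int,N}) where {T,N} =
    A.buffer[LinearIndices(A.axes)[I...]] = v

Swapping the buffer (Vector, MVector, something malloc'd, a GPU buffer) and the axis types would then change the sizing behavior and traits without touching the wrapper itself.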

@Tokazama
Member Author

Tokazama commented Apr 5, 2021

@chriselrod. Does the type info provide any more information on how you generate code for vectorization beyond its size (apart from what vload needs to know for each element)? For example, if you were to perform sum on a buffer of 10 8-bit elements and knew the device info. Or do you need to know about the bit mask?

@chriselrod
Collaborator

Which type info? Just the element type, or everything, including size and strides?
This is what I get on an M1 Mac:

julia> using LoopVectorization, StaticArrays

julia> x = @MVector rand(10);

julia> function sumavx(A)
           s = zero(eltype(A))
           @avx for i ∈ eachindex(A)
               s += A[i]
           end
           s
       end
sumavx (generic function with 1 method)

julia> @cn sumavx(x)
        .section        __TEXT,__text,regular,pure_instructions
        ldp     q0, q1, [x0]
        ldp     q2, q3, [x0, #32]
        ldr     q4, [x0, #64]
        fadd    v0.2d, v0.2d, v2.2d
        fadd    v1.2d, v1.2d, v3.2d
        fadd    v0.2d, v0.2d, v4.2d
        fadd    v0.2d, v1.2d, v0.2d
        faddp   d0, v0.2d
        ret

The asm is completely unrolled. The ARM chip has 128 bit (2 x double) NEON registers.
The ldps each load 2 vectors, and the ldr loads the 5th. Then it fadds the 5 vectors together, before faddping them into a scalar.
Or the not nearly as nice looking AVX512 version:

        .text
        mov     al, 3
        kmovd   k1, eax
        vmovupd zmm0 {k1} {z}, zmmword ptr [rdi + 64]
        vmovapd xmm0, xmm0
        vaddpd  zmm0, zmm0, zmmword ptr [rdi]
        vextractf64x4   ymm1, zmm0, 1
        vaddpd  zmm0, zmm0, zmm1
        vextractf128    xmm1, ymm0, 1
        vaddpd  xmm0, xmm0, xmm1
        vpermilpd       xmm1, xmm0, 1           # xmm1 = xmm0[1,0]
        vaddsd  xmm0, xmm0, xmm1
        vzeroupper
        ret
        nop     dword ptr [rax]

This would probably be faster if LoopVectorization used smaller vectors. Perhaps that's an optimization I should implement.
But instead, it is using 8 x double registers, which is awkward for 10 elements.
The first thing it does is mov a literal 3 to al, and then kmovd that 3 from eax into the k1 register (al and eax are just 8-bit and 32-bit views of the same a register), and finally it uses k1 as a mask for the load from rdi + 64 (i.e., 64 bytes, or 8 doubles, past the start of the array). 3 in binary is of course:

julia> bitstring(0x03)
"00000011"

meaning the first two lanes are on and the remaining vector lanes are off. That is, load only the first two elements of that vector, i.e. just elements 9 and 10.

So in both cases it used the exact size (in the latter case to create a bit mask).
Is that what you're asking?

It also needs to know the stride between elements.
I just noticed a bug in ArrayInterface.strides:

julia> A = @MMatrix rand(2,10);

julia> a = view(A,1,:);

julia> strides(a)
(2,)

julia> ArrayInterface.strides(a)
(static(1),)

Because of this, sumavx(a) gets the wrong answer.

@Tokazama
Member Author

Tokazama commented Apr 5, 2021

I was just thinking of a linear set of elements. I'm trying to address this issue by creating something like LinearIndices or CartesianIndices, but one that actually communicates the transformations an index goes through between nested arrays. Once the indices are flattened out with stride info and offsets incorporated, we could connect a lot of dots here.

For example, if we didn't need the physical buffer (just size info) to transform indices to an optimized state, then we could do stuff like this...

function sum(x)
    b, p = buffer_pointer(x)  # buffer (kept alive for the GC) and its pointer
    inds = to_index(x, :)
    @gc_preserve b out = sum_pointer(p, inds)
    return out
end
function sum_pointer(p, inds)
    out = zero(eltype(p))
    @inbounds for i in inds
        out += p[i]
    end
    return out
end
function unsafe_get_collection(A, inds::AbstractRange)
    axis = to_axis(A, inds)

    bsrc, psrc = buffer_pointer(A)
    bdst, pdst = allocate_memory(A, axis)

    @gc_preserve bsrc bdst copyto!(pdst, indices(axis), psrc, to_index(A, inds))

    return initialize(bdst, axis)
end

Then we could take advantage of some of the stride info here for transforming the indices, and punt a bunch of the memory stuff off to the packages providing the buffer and pointer types, which would define their own methods for copyto! and sum_pointer.

Of course, all of this is moot if it would make what you're doing impossible.

@Tokazama
Member Author

@chriselrod, correct me if I'm wrong here, but it seems like the path forward is to have ArrayInterface depend on ManualMemory.jl (which I anticipate remaining a very lightweight dependency).
