Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: add SubVector (fast 1d slices) #3224

Closed
wants to merge 1 commit into from
Closed

RFC: add SubVector (fast 1d slices) #3224

wants to merge 1 commit into from

Conversation

timholy
Copy link
Member

@timholy timholy commented May 28, 2013

Here's a baby step towards making Arrays just pointers + size & stride information. With it I'm getting ~2x - 2.5x speedups over mapslices in a recent application that looks a lot like this:

    notd = setdiff([1:nd], d)
    indexes[d] = 1:size(A, d)
    for c in Counter(szA[notd])
        indexes[notd] = c
        s = SubVector(Aout, indexes...)
        filt!(b, a, s)
    end

Counter comes from Grid.jl.

@ViralBShah
Copy link
Member

i have been wanting this for a while.

@StefanKarpinski
Copy link
Member

Very nice. Here's some relevant work I did recently, showing that any linear transformation of coordinates can be expressed as a linear transformation of their linear index, integer-divided by various products of dimensions:

# (dimensions, coordinates) => linear index
function lin(d::Vector{Int}, c::Vector{Int})
    @assert length(d) == length(c)
    l, p = 0, 1
    for i=1:length(d)
        @assert 0 <= c[i] < d[i]
        l += c[i]*p
        p *= d[i]
    end
    return l
end

# dot product in terms of linear index and dimensions
function lindot{T}(x::Vector{T}, d::Vector{Int}, l::Int)
    @assert length(x) == length(d)
    t, p = x[1]*l, d[1]
    for i = 2:length(x)
        t += (x[i]-x[i-1]*d[i-1])*div(l,p)
        p *= d[i]
    end
    return t
end

# test that lindot actually gives the dot product
for _ in 1:100000
    d = rand(1:10, rand(1:10))
    c = [ rand(0:v-1) for v=d ]
    x = rand(-10:10, length(d))
    l = lin(d,c)
    @assert lindot(x,d,l) == dot(x,c)
end

Note that in real code, the lindot loop code would ideally be unrolled for each number of dimensions, and in the 2d case, in particular would be just a few fast integer operations. (@JeffBezanson: This would be a great application of staged functions.)

Since array views are all expressible as linear transformations of input coordinates to memory offsets, this means that all array views can be linearly indexed fairly efficiently. It also means that integer/range slices of views can also always be represented as views since both operations are linear.

We should take a look at SubArrays in light of this, make them a bit more general while making linear indexing efficient, and consider making more widespread use of them. I think we might want to make all integer/range slicing operations produce views by default.

@JeffBezanson
Copy link
Member

This implementation is not safe --- a SubVector does not reference the original Array, so it can be GC'd out from under you. Hopefully we can make SubArray as fast as this.

@timholy
Copy link
Member Author

timholy commented May 28, 2013

Yes, you have to hold the array separately. I initially implemented a version that had a parent::Array{T} field, but for that version the immutable appeared to get constructed (rather than optimized away). Hence it was slow. I didn't know how to fix that.

An alternative I've used in the past is to use types (rather than immutables) and implement a reslice function. Because that won't have to allocate any memory, it can be reasonably fast. There may be some code doing that in Images.jl already.

If there's no good solution to the construction problem---or if this is simply too specialized on one dimension to end up in base---I can close this. It's already in Images.jl and can stay there quite happily :-). But I figured it was worth bringing up for discussion.

@JeffBezanson
Copy link
Member

I can probably improve the compiler to handle rooting of immutables more efficiently. We can't have the unsafe version in Base; manually keeping a reference to one object as long as another exists is not a workable API. It's equivalent to manual memory management. Actually much harder to do than keeping an object alive over a function call, as in many ccalls.

@ViralBShah
Copy link
Member

@JeffBezanson Can we improve the GC rooting here, or perhaps do special rooting of subarrays? At least where contiguous blocks of an array are extracted through getindex, it would be nice to return a subarray.

@ViralBShah
Copy link
Member

Some discussion of views in here, and @lindahua also is using similar ideas in NumericFunctors.

https://groups.google.com/forum/?fromgroups=#!topic/julia-dev/MIooix5yi3k

@ViralBShah
Copy link
Member

Also, @tanmaykm is doing similar things in ChainedVectors.

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

Even if we don't use immutables (for which the gc changes might take some heavy lifting, I imagine), @StefanKarpinski, it does seem like SubArray could be redesigned along the lines you propose. I also agree that a SubArray of a SubArray should be massaged to have the same Array parent with different slicing.

I'm not certain I fully understand the details of your lindot, but certainly the general principle seems great.

The key in views is to avoid allocations (which was the point of SubVector). For many iterative applications (e.g., performing an operation on each column), "reslicing" might be a pretty good solution: if you can reuse an existing SubArray and just change the view, you've accomplished the same thing. However, for reslicing our current SubArrays, there's one big hitch: indexes is a tuple, not a vector, and currently I don't see a way from a typing perspective to have it be anything else. So I think the SubArray type would have to be redesigned from the ground up. It would be a little ugly, but rather than using a tuple of RangeIndex, one could probably use 4 vectors (ugh): 3 to encode the start, step, and len elements of each dimension's Range, and (likely) an extra to encode whether any singleton dimension is subbed or sliceed. (Perhaps you might be able to get around the last one, I'm not sure.)

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

Of course, there's a problem with reslice: it doesn't play nicely with convenient syntax like A[2:2:end,4:7] unless the "reuse memory where applicable" problem is also solved. It does work for people who are willing to do these things manually, but I think that's not the long-term vision (and I agree with that perspective).

So probably the best is to shoot for the immutable version, because it makes everything else easy.

@lindahua
Copy link
Contributor

I wrote a benchmark script on Gist to compare the performances of different approaches:

Here is what I get:

Benchmark with long columns
with_ref!:  slow-down = 11.1970x     # use a[:,i]
with_sub!:  slow-down = 114.4806x    # use sub(a, 1:m, i)
with_sub1!: slow-down = 1.1974x      # use pointer_to_array
with_sub2!: slow-down = 2.5286x      # use immutable sub-vector

Benchmark with short columns
with_ref!:  slow-down = 261.4301x
with_sub!:  slow-down = 1256.6697x
with_sub1!: slow-down = 38.8170x
with_sub2!: slow-down = 19.0293x

We can observe:

  • The current version of sub, which creates SubArray is way too slow -- the overhead and indexing cost just dominate.
  • Using a[:, i] which makes copies also kills performance
  • Both immutable sub-vector and using pointer_to_array to create a shared-memory view work much better. That said, both lead to significant overhead when the columns are short.
  • Overall, the performance of creating a shared-memory array using pointer is better -- after all, you simply get the performance of builtin array. The limitation is that this applies only to contiguous blocks.

So, what about doing this?

  • For cases that always result in contiguous block, we simply return a shared-view memory array as like what I did in NumericFunctors.jl
  • For other cases, we use SubArray -- of course, this type needs to be reworked.

@lindahua lindahua mentioned this pull request Jun 22, 2013
@lindahua
Copy link
Contributor

Please move the discussion to #3496.

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

Since this point is SubVector-specific...did you profile the immutable subvector? If it was not getting constructed (which is how we want it to behave), there should be no overhead whatsoever with this approach. It's only when the compiler decides it actually has to build the SubVector that something needs to be allocated.

In contrast, the pointer_to_array approach does require an allocation (creation of a new array, with a pointer to shared data). So in principle, this has to be slower than the SubVector approach. (Unless I screwed up my implementation somehow, which is entirely possible.)

That said, it all depends on the state of the Julia compiler, at the time the tests are run.

@lindahua
Copy link
Contributor

@timholy I tried the one you proposed in the PR. It yields performance similar to my immutable subvector.

I posted the performance results in the table above. There you can see about 2x slow-down using this sub-vector as opposed to using the Julia array.

I suspect that the compiler (at current version) does not get the getindex methods inlined and it also does not elide the construction of the immutable sub-vector.

@lindahua
Copy link
Contributor

Having said that, I do believe this is the way to go. In #3496, I am considering extending the approach to create efficient immutable subarrays of various kind. The difference is that those types maintain the parent array instead of just the pointer, which would preclude the parent from being gc'ed during the life cycle of the subvector.

I think the 2x slow down of calling getindex of the sub-vector/sub-array will be addressed through improvement on the compiler part.

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

In my application, at the time, I was able to see that it was elided. As you say, there may have been compiler changes since then.

There's a tension between two approaches: "doing it in the way that should in principle be best, if the memory-safety issues can be resolved" (immutables) and "doing it in the way that works today" (like your #3496). For the record, I do lean towards the latter, but the performance results have to be put in a context where we are all aware of the more fundamental considerations.

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

I see we're both agreeing with each other while typing our responses!

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

Until the memory-safety issues are resolved, we can't go with the immutable version, unfortunately.

@timholy
Copy link
Member Author

timholy commented Jun 22, 2013

Ah, now I see you put the whole parent in. That does solve it, perhaps at the cost of performance. But much more sustainable.

@lindahua
Copy link
Contributor

In #3496

what I proposed is as below

immutable SubVec{A<:Abstract}
    parent::A
    dim::Int
    offset::Int
end

The only difference is that I maintain a parent array instead of the pointer, so there should be no problem of memory safety. I do see both solutions similar -- neither is in principle better than the other. Anyway, let's move the discussion to #3496.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants