avoid jl_arrayunset in dicts with bitstypes; add some more @inbounds #30113
Conversation
Seems like a good optimization. Also, doesn't jl_arrayunset just write a NULL pointer value in the non-bits-type case? We don't need to call a C function to do that.
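For illustration, here is a minimal Julia sketch of the idea under discussion, assuming a hypothetical helper name (this is not the PR's actual diff): for bits-type elements there is no GC reference to clear, so the C call can be skipped.

```julia
# Hypothetical helper, not the PR's diff: skip jl_arrayunset for bits types,
# since their slots hold no GC references that would need clearing.
function unset_slot!(a::Vector{T}, i::Integer) where {T}
    if !isbitstype(T)
        # Boxed elements hold GC references; let the runtime clear the slot.
        ccall(:jl_arrayunset, Cvoid, (Any, UInt), a, UInt(i - 1))  # 0-based index
    end
    return a
end
```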
Afaik yes. In my own code, I would just do the write directly. But then, with the planned TBAA on arrays, I am not sure whether this would still be allowed in the future. Also, I'm not entirely sure about surprising assumptions about the object being rooted. So I left it in, for the sake of maintenance. @Keno? Also, in cases where we have a union, we might need to deal with selector bytes?
Cool. Then we could do a fast pure Julia version. That would be a separate PR, though (it would need to go through all callsites of jl_arrayunset).
You can also just do this in codegen. That's a much less breaking change. You can even get better TBAA info that way.
By this you mean introducing a new intrinsic? I don't think I'm sufficiently proficient with codegen to do that quickly. It appears that the only julialang callsite of jl_arrayunset is in dict.jl. We could put a faster second implementation into Base.
That one could be upgraded into an intrinsic or a better version later (one that correctly propagates aliasing info instead of being potentially suboptimal). But seeing that
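As a hedged sketch of what such a pure-Julia fallback could look like (the name `unsafe_arrayunset!` is hypothetical, and the TBAA/rooting caveats above apply):

```julia
# Hypothetical pure-Julia variant: write the NULL pointer slot directly
# instead of calling into C. Bits-union layouts (selector bytes) are
# deliberately not handled here; see the caveat above.
function unsafe_arrayunset!(a::Vector{T}, i::Integer) where {T}
    isbitstype(T) && return a          # nothing to clear for bits types
    T isa Union && throw(ArgumentError("bits-union layouts not handled"))
    @boundscheck checkbounds(a, i)
    GC.@preserve a begin
        p = Ptr{Ptr{Cvoid}}(pointer(a))  # boxed elements are stored as pointers
        unsafe_store!(p, C_NULL, i)      # unsafe_store! takes a 1-based index
    end
    return a
end
```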
No. I mean https://github.com/JuliaLang/julia/pull/20890/files. You can just follow the template.
More specifically, an untested patch:

diff --git a/src/ccall.cpp b/src/ccall.cpp
index 0a9db2ca1d..4a6568e501 100644
--- a/src/ccall.cpp
+++ b/src/ccall.cpp
@@ -1764,6 +1764,29 @@ static jl_cgval_t emit_ccall(jl_codectx_t &ctx, jl_value_t **args, size_t nargs)
}
}
}
+ else if (is_libjulia_func(jl_arrayunset) &&
+ argv[1].typ == (jl_value_t*)jl_ulong_type) {
+ assert(!isVa && !llvmcall && nargt == 2 && !addressOf.at(0) && !addressOf.at(1));
+ jl_value_t *aryex = ccallarg(0);
+ const jl_cgval_t &aryv = argv[0];
+ const jl_cgval_t &idxv = argv[1];
+ jl_datatype_t *arydt = (jl_datatype_t*)jl_unwrap_unionall(aryv.typ);
+ if (jl_is_array_type(arydt)) {
+ jl_value_t *ety = jl_tparam0(arydt);
+ if (jl_array_store_unboxed(ety)) {
+ JL_GC_POP();
+ return ghostValue(jl_void_type);
+ }
+ else if (!jl_has_free_typevars(ety)) {
+ Value *idx = emit_unbox(ctx, T_size, idxv, (jl_value_t*)jl_ulong_type);
+ Value *arrayptr = emit_bitcast(ctx, emit_arrayptr(ctx, aryv, aryex), T_ppjlvalue);
+ Value *slot_addr = ctx.builder.CreateGEP(arrayptr, idx);
+ tbaa_decorate(tbaa_arraybuf, ctx.builder.CreateStore(V_null, slot_addr));
+ JL_GC_POP();
+ return ghostValue(jl_void_type);
+ }
+ }
+ }
else if (is_libjulia_func(jl_string_ptr)) {
assert(lrt == T_size);
assert(!isVa && !llvmcall && nargt == 1 && !addressOf.at(0));

edit: the previous patch changed the wrong version of the copy.
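For reference, a sketch of the call shape this special case would intercept; the form matches how Base invokes the runtime today (note the 0-based index):

```julia
a = Vector{Any}(undef, 4)
a[1] = "x"
# The pattern emit_ccall would recognize and lower to a direct store:
ccall(:jl_arrayunset, Cvoid, (Any, UInt), a, UInt(0))  # clears slot 1
```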
So if I look at this, we would lose bounds checking with the codegen variant. That would be a minor change, just like the current PR. As a second question, this mechanism could probably be used to inline the fast path of array growth as well. For that, I would need to generate code that pessimistically checks whether the very fast path is applicable, and otherwise calls into the C function.
Yes. You can include the check if you want. Check
Yes for implementing; unclear for the speed-up, and it'll be slightly harder to implement since you have more branches to check even for the fast case. Unfortunately the branches in the fast path aren't inferable (if the data is shared). You can always give up whenever you don't like the input types (bits unions, for example), so you can start simple if needed.
Last time I checked the assembly code for
Looking at the current version of the code, I wouldn't be surprised if the bits-union code adds a lot of overhead. There are clearly a lot of inferable branches in the fast path. In this case, I think you can just try something like this:

diff --git a/src/array.c b/src/array.c
index e058d185e4..228d5294f1 100644
--- a/src/array.c
+++ b/src/array.c
@@ -702,7 +702,7 @@ static size_t limit_overallocation(jl_array_t *a, size_t alen, size_t newlen, si
}
STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
- size_t n)
+ size_t n, int maybe_bitunion)
{
// designed to handle the case of growing and shrinking at both ends
if (__unlikely(a->flags.isshared)) {
@@ -722,7 +722,7 @@ STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
char *newdata;
char *typetagdata;
char *newtypetagdata;
- int isbitsunion = jl_array_isbitsunion(a);
+ int isbitsunion = maybe_bitunion && jl_array_isbitsunion(a);
if (isbitsunion) typetagdata = jl_array_typetagdata(a);
if (a->offset >= inc) {
// already have enough space in a->offset

And pass in a constant `maybe_bitunion` from each caller.
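A hypothetical Julia analogue of the `maybe_bitunion` trick, to show why passing the flag as a compile-time constant lets each specialization fold the branch away (names here are made up, and `prepend!` stands in for the real growth logic):

```julia
handle_typetags!(a) = a   # stand-in for the selector-byte bookkeeping

@inline function grow_at_beg!(a::Vector, inc::Integer,
                              ::Val{maybe_bitunion}) where {maybe_bitunion}
    if maybe_bitunion && eltype(a) isa Union
        handle_typetags!(a)   # dead code in every Val(false) specialization
    end
    prepend!(a, Vector{eltype(a)}(undef, inc))  # stand-in for the real growth
    return a
end

grow_at_beg!([1, 2, 3], 2, Val(false))  # bits-union branch compiled away
```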
Sure, but the function call overhead is killing us. In principle, we should need two well-predicted branches (flag and capacity) and two increments. I measure ~10 cycles per fast path.
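A toy model of that ideal fast path, using a hypothetical `Buf` type rather than Julia's actual Array internals:

```julia
mutable struct Buf{T}
    data::Vector{T}
    len::Int
    shared::Bool
end

@inline function fastpush!(b::Buf{T}, x::T) where {T}
    if !b.shared && b.len < length(b.data)  # the two well-predicted branches
        b.len += 1                          # increment the length
        @inbounds b.data[b.len] = x         # store the element
        return b
    end
    error("slow path: grow or unshare")     # stand-in for the out-of-line C call
end

b = Buf(Vector{Int}(undef, 8), 0, false)
fastpush!(b, 42)
```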
From my perspective this is ready to merge. I think that the Travis CI failure is spurious; is there a way to trigger a new attempt? Further improvements, i.e. inlining, can come in a separate PR.
Would have been good to squash.