avoid jl_arrayunset in dicts with bitstypes; add some more @inbounds #30113
Conversation
Seems like a good optimization. Also, doesn't jl_arrayunset just write a NULL pointer value in the non-bits-type case? We don't need to call a C function to do that.
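For illustration, here is a minimal Julia sketch of the idea under discussion, assuming a hypothetical helper name (this is not the PR's actual diff): for bits-type elements there is no GC reference to clear, so the C call can be skipped.

```julia
# Hypothetical helper, not the PR's diff: skip jl_arrayunset for bits types,
# since their slots hold no GC references that would need clearing.
function unset_slot!(a::Vector{T}, i::Integer) where {T}
    if !isbitstype(T)
        # Boxed elements hold GC references; let the runtime clear the slot.
        ccall(:jl_arrayunset, Cvoid, (Any, UInt), a, UInt(i - 1))  # 0-based index
    end
    return a
end
```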
Afaik yes. In my own code, I would just do the write directly. But then, with the planned TBAA on arrays, I am not sure whether this would still be allowed in the future. Also, I'm not entirely sure about surprising assumptions about the object being rooted. So I left it in, for the sake of maintenance. @Keno? Also, in cases where we have a union, we might need to deal with selector bytes?
Cool. Then we could do a fast pure Julia version. That would be a separate PR, though (it would need to go through all callsites of jl_arrayunset).
You can also just do this in codegen. That's a much less breaking change. You can even get better TBAA info that way.
By this you mean introducing a new intrinsic? I don't think I'm sufficiently proficient with codegen to do that quickly. It appears that the only julialang callsite of jl_arrayunset is in dict.jl. We could put a faster second implementation into Base.
That one could be upgraded into an intrinsic or a better version later (one that correctly propagates aliasing info instead of being potentially suboptimal). But seeing that
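As a hedged sketch of what such a pure-Julia fallback could look like (the name `unsafe_arrayunset!` is hypothetical, and the TBAA/rooting caveats above apply):

```julia
# Hypothetical pure-Julia variant: write the NULL pointer slot directly
# instead of calling into C. Bits-union layouts (selector bytes) are
# deliberately not handled here; see the caveat above.
function unsafe_arrayunset!(a::Vector{T}, i::Integer) where {T}
    isbitstype(T) && return a          # nothing to clear for bits types
    T isa Union && throw(ArgumentError("bits-union layouts not handled"))
    @boundscheck checkbounds(a, i)
    GC.@preserve a begin
        p = Ptr{Ptr{Cvoid}}(pointer(a))  # boxed elements are stored as pointers
        unsafe_store!(p, C_NULL, i)      # unsafe_store! takes a 1-based index
    end
    return a
end
```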
No. I mean https://github.com/JuliaLang/julia/pull/20890/files. You can just follow the template.
More specifically, an untested patch:

diff --git a/src/ccall.cpp b/src/ccall.cpp
index 0a9db2ca1d..4a6568e501 100644
--- a/src/ccall.cpp
+++ b/src/ccall.cpp
@@ -1764,6 +1764,29 @@ static jl_cgval_t emit_ccall(jl_codectx_t &ctx, jl_value_t **args, size_t nargs)
}
}
}
+ else if (is_libjulia_func(jl_arrayunset) &&
+ argv[1].typ == (jl_value_t*)jl_ulong_type) {
+ assert(!isVa && !llvmcall && nargt == 2 && !addressOf.at(0) && !addressOf.at(1));
+ jl_value_t *aryex = ccallarg(0);
+ const jl_cgval_t &aryv = argv[0];
+ const jl_cgval_t &idxv = argv[1];
+ jl_datatype_t *arydt = (jl_datatype_t*)jl_unwrap_unionall(aryv.typ);
+ if (jl_is_array_type(arydt)) {
+ jl_value_t *ety = jl_tparam0(arydt);
+ if (jl_array_store_unboxed(ety)) {
+ JL_GC_POP();
+ return ghostValue(jl_void_type);
+ }
+ else if (!jl_has_free_typevars(ety)) {
+ Value *idx = emit_unbox(ctx, T_size, idxv, (jl_value_t*)jl_ulong_type);
+ Value *arrayptr = emit_bitcast(ctx, emit_arrayptr(ctx, aryv, aryex), T_ppjlvalue);
+ Value *slot_addr = ctx.builder.CreateGEP(arrayptr, idx);
+ tbaa_decorate(tbaa_arraybuf, ctx.builder.CreateStore(V_null, slot_addr));
+ JL_GC_POP();
+ return ghostValue(jl_void_type);
+ }
+ }
+ }
else if (is_libjulia_func(jl_string_ptr)) {
assert(lrt == T_size);
assert(!isVa && !llvmcall && nargt == 1 && !addressOf.at(0));

edit: the previous patch changed the wrong version of the copy.
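For reference, a sketch of the call shape this special case would intercept; the form matches how Base invokes the runtime today (note the 0-based index):

```julia
a = Vector{Any}(undef, 4)
a[1] = "x"
# The pattern emit_ccall would recognize and lower to a direct store:
ccall(:jl_arrayunset, Cvoid, (Any, UInt), a, UInt(0))  # clears slot 1
```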
So if I look at this, we would lose bounds checking with the codegen variant. That would be a minor change, just like the current PR. As a second question, this mechanism could probably be used to inline the fast path of array growth as well. For that, I would need to generate code that pessimistically checks whether the very fast path is applicable, and otherwise calls into the C function.
Yes. You can include the check if you want. Check
Yes for implementing; unclear for the speed-up, and it'll be slightly harder to implement since you have more branches to check even for the fast case. Unfortunately the branches in the fast path aren't inferable (if the data is shared). You can always give up whenever you don't like the input types (bits unions, for example), so you can start simple if needed.
Last time I checked the assembly code for
Looking at the current version of the code, I wouldn't be surprised if the bits-union code adds a lot of overhead. There are clearly a lot of inferable branches in the fast path. In this case, I think you can just try something like this:

diff --git a/src/array.c b/src/array.c
index e058d185e4..228d5294f1 100644
--- a/src/array.c
+++ b/src/array.c
@@ -702,7 +702,7 @@ static size_t limit_overallocation(jl_array_t *a, size_t alen, size_t newlen, si
}
STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
- size_t n)
+ size_t n, int maybe_bitunion)
{
// designed to handle the case of growing and shrinking at both ends
if (__unlikely(a->flags.isshared)) {
@@ -722,7 +722,7 @@ STATIC_INLINE void jl_array_grow_at_beg(jl_array_t *a, size_t idx, size_t inc,
char *newdata;
char *typetagdata;
char *newtypetagdata;
- int isbitsunion = jl_array_isbitsunion(a);
+ int isbitsunion = maybe_bitunion && jl_array_isbitsunion(a);
if (isbitsunion) typetagdata = jl_array_typetagdata(a);
if (a->offset >= inc) {
// already have enough space in a->offset

And pass in a constant `maybe_bitunion` from each caller.
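A hypothetical Julia analogue of the `maybe_bitunion` trick, to show why passing the flag as a compile-time constant lets each specialization fold the branch away (names here are made up, and `prepend!` stands in for the real growth logic):

```julia
handle_typetags!(a) = a   # stand-in for the selector-byte bookkeeping

@inline function grow_at_beg!(a::Vector, inc::Integer,
                              ::Val{maybe_bitunion}) where {maybe_bitunion}
    if maybe_bitunion && eltype(a) isa Union
        handle_typetags!(a)   # dead code in every Val(false) specialization
    end
    prepend!(a, Vector{eltype(a)}(undef, inc))  # stand-in for the real growth
    return a
end

grow_at_beg!([1, 2, 3], 2, Val(false))  # bits-union branch compiled away
```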
Sure, but the function call overhead is killing us. In principle, we should need two well-predicted branches (flag and capacity) and two increments. I measure ~10 cycles per fast path.
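A toy model of that ideal fast path, using a hypothetical `Buf` type rather than Julia's actual Array internals:

```julia
mutable struct Buf{T}
    data::Vector{T}
    len::Int
    shared::Bool
end

@inline function fastpush!(b::Buf{T}, x::T) where {T}
    if !b.shared && b.len < length(b.data)  # the two well-predicted branches
        b.len += 1                          # increment the length
        @inbounds b.data[b.len] = x         # store the element
        return b
    end
    error("slow path: grow or unshare")     # stand-in for the out-of-line C call
end

b = Buf(Vector{Int}(undef, 8), 0, false)
fastpush!(b, 42)
```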
From my perspective this is ready to merge. I think that the Travis CI failure is spurious; is there a way to trigger a new attempt? Further improvements, i.e. inlining, can come in a separate PR.
Would have been good to squash.