Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize TLS access in sysimg. #14168

Merged
merged 1 commit into from
Nov 29, 2015
Merged

Optimize TLS access in sysimg. #14168

merged 1 commit into from
Nov 29, 2015

Conversation

yuyichao
Copy link
Contributor

Instead of letting the linker resolve jl_get_ptls_states to the wrapper function in libjulia.so, use our own symbol table (jl_sysimg_gvars) to resolve the address to the actual getter function at initialization time.

This saves one load and one jump for each tls address access.

Using the same benchmark I used in the PR that introduce jl_get_ptls_states. Only the gcframe (sysimg) and gcframe ones for threading on (shown as sysimg and JIT respectively) are included since nothing else is changed by this PR.

master threading this pr threading
sysimg 3.26 2.88
JIT 2.93 2.93

It's a little funny to see that the sysimg version is now consistently faster than the JIT version (by ~ 0.22 clock cycle...).

Comparing the assembly of the JIT version with function address inlined

   0x00007ffff7e11122 <+18>:    movq   $0x4,-0x30(%rbp)
   0x00007ffff7e1112a <+26>:    movabs $0x4018b0,%rax
   0x00007ffff7e11134 <+36>:    callq  *%rax

to the sysimg version with the function address loaded from the data section

   0x00007ffff1f111c2 <+18>:    mov    0x1847237(%rip),%rax        # 0x7ffff3758400 <jl_get_ptls_states.ptr>
   0x00007ffff1f111c9 <+25>:    movq   $0x4,-0x30(%rbp)
   0x00007ffff1f111d1 <+33>:    callq  *%rax

Maybe it's the instruction order and/or the difference in code size that matter?

@yuyichao
Copy link
Contributor Author

c.c. @vtjnash

@yuyichao yuyichao added performance Must go faster multithreading Base.Threads and related functionality labels Nov 27, 2015
@vtjnash
Copy link
Member

vtjnash commented Nov 27, 2015

lgtm

(appveyor failure looks unrelated due to triggering this assert https://github.com/JuliaLang/julia/blob/yyc/tls-opt/src/codegen.cpp#L948)

@yuyichao
Copy link
Contributor Author

appveyor failure looks unrelated

Yeah, it happened a few times recently and I've just restarted the build.

@yuyichao
Copy link
Contributor Author

I'm finally able to test what exactly can LLVM optimize when we are accessing tls variables in a control flow and it seems that LLVM can't hoist a const function call as good as I hoped.

The functions I tested with are

@eval f(b) = if b
    nothing
else
    $(Expr(:the_exception))
end

@eval g(n) = for i in 1:n
    $(Expr(:the_exception))
end

On the current master, llvm emit a call to jl_get_ptls_states in every loop for g. I'm now putting both the load of the function pointer and the call instruction to get the pointer to the tls struct at the beginning of the function and this is not a problem anymore. It seems that LLVM is also smart enough in this case to move the call to jl_get_ptls_states into the branch that actually need it. Maybe this is a better way to emit it anyway? (Since LLVM is probably pretty good at this kind of optimization given it's probably similar to the analysis mem2reg needs).

Clang doesn't have this problem with otherwise identical looking code and metadata (readonly, nounwind function call followed by a volatile load from the pointer returned) so maybe we are missing some passes for that? Probably not important anyway since it seems that we can work around this problem by improving codegen....

@yuyichao
Copy link
Contributor Author

Fixed none-MCJIT compilation. Local test passed with threading enabled on both LLVM 3.3 and 3.7.

* Instead of letting the linker resolve `jl_get_ptls_states` to the wrapper
  function in `libjulia.so`, use our own symbol table (`jl_sysimg_gvars`) to
  resolve the address to the actual getter function at initialization time.

* Emit call to `jl_get_ptls_states` at the beginning of the function.
  For some reason, LLVM couldn't hoist the function call out of the loop
  but is smart enough to move it down to a branch if it is not needed in
  other branches, or move it closer to where the result is actually needed.
@yuyichao
Copy link
Contributor Author

Added some comment since the ways we emit TLS references are a little complicated now....

@vtjnash Mind having a look again at this version? (especially about moving the call to jl_get_ptls_states to the beginning of the funciton.)

@vtjnash
Copy link
Member

vtjnash commented Nov 28, 2015

lgtm

yuyichao added a commit that referenced this pull request Nov 29, 2015
Optimize TLS access in sysimg.
@yuyichao yuyichao merged commit 0dd82f1 into master Nov 29, 2015
@yuyichao yuyichao deleted the yyc/tls-opt branch November 29, 2015 00:44
@tkelman
Copy link
Contributor

tkelman commented Nov 29, 2015

should we be changing the -ftls-model compiler flag at some point?

@yuyichao
Copy link
Contributor Author

From my understanding, it shouldn't make a difference anymore.

We currently provide two TLS variables, one in the julia-src (libjulia.so) and one in julia-ui (julia executable). They are both function scope static variables so accessing them outside a source or binary file (which AFAIK is another important factor that affect the tls model) is not relavant anymore.

The julia-src one is the fallback for embedding and it has to use a dynamic model. IIUC, global dynamic and local dynamic shouldn't make a difference anymore since there's only one variable (we are effectively doing local dynamic manually...).

The julia-ui one should use a static model and that's the whole point of having it. I'm not sure about other compilers (i.e. MSVC ;-p) but at least the GCC doc says that the compiler is free to choose a more effecient model than the one specified on the command line. I've verified locally that GCC will basically choose initial execution model for this variable no matter what is specified on the command line (edit: GCC 5.2 on Linux). It would be nice to check other compilers on other platform that we support and tweak the command line argument accordingly to make sure this is always the case.

@tkelman
Copy link
Contributor

tkelman commented Nov 29, 2015

yeah msvc ignores it and gives a warning which is one reason I'm asking. It might have an equivalent which is spelled differently, but it sounds like there are far fewer supported TLS models on windows so maybe not.

@yuyichao
Copy link
Contributor Author

Also note that the choice of tls model we should use as I mentioned above should be the default one anyway so we should be able to just remove it (the command line argument).

@vtjnash
Copy link
Member

vtjnash commented Nov 29, 2015

windows only really supports IE, so msvc is probably just going to complain that the the flag is meaningless / impossible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants