minimal pass at getting cfunction to always run in the newest world #20167
Packages can already be trivially fixed by adding |
Getting |
Most packages won't hit those corner cases, so #19784 should be enough, and it seems useful to have as a convenience function even if a different underlying implementation is used. In both the short term (before unmanaged threads are supported) and the long term, this causes an inconsistency in which code runs when a function is called on a managed vs. unmanaged thread, and also a race with only managed threads.
I'm still against adding In both the short and long term this makes it solidly deterministic and simple for what code gets run. The only inconsistency is with what accidentally went into my original PR, where it incorrectly captures world age and pretends that it is a first class value.
How is it consistent in the short term? It turns into capture behavior on an unmanaged thread.
Only if it notices that loading pTLS would cause a segfault. It's invalid anyways, and not the first hack to keep it working.
That's exactly where the inconsistency is.
Although that's exactly what this does.
And it doesn't actually turn into capture, it just picks a random newer age that it knows has already been inferred. This is actually more legal and consistent than capture, due to the constraint that incurring a race condition in function definition must inherently imply a potential for it to use an older age than the actual newest (e.g. when the capture happens is permitted to be non-deterministic). In the future, we will tighten this further to permit external synchronization, but did I mention that this mode of execution is invalid?
True. I'm aware that it can be used to emulate it, but it's intentionally a different model.
What's "it"?
The world age, or more precisely the set of available methods, is already user-visible, so capturing exactly that when
So it's an implicit way to change world which I think is bad in general since the user shouldn't do it by accident for most cases and it also requires dynamic dispatches to a case where the performance is generally believed to be predictably good (essentially the |
With the new foreigncall work, it is also now possible to write backedge-free ccalls. I don't consider it a priority to rewrite threadcall right now, but just pointing out that codegen does have support now to emit a true dependency-free ccall thunk.
Yes (confirmed by PkgEval), but that doesn't discount the importance of packages that need this (PyCall, Gtk, LibExpat are the big ones I know of)
Yes, both must be visible, for inference as well as for a decent error message. But that's not the same as being able to capture it. You can inspect back to any world age and compute the set of methods at that point. Confusingly, though, this historical view may not actually represent a past state of the system (usually due to compaction or incremental compilation), so I don't recommend attempting too much with this. No, I must disagree about needing world changes to be explicit. While I expect the user will eventually need to be aware of how they can change, nothing else in the current implementation requires explicit changes, while there are many implicit changes. In the manual, I only highlighted a couple as the most likely examples to be encountered.
Just to reiterate one of the primary design points: the world counter is present to compute correctness. It is not there to stop you from shooting yourself in the foot by monkey patching something wrong. Thus, it gets captured by inference (and therefore also staged functions), and must get captured by Tasks (to make them inferrable & multi-threaded). It doesn't need to be captured by anything else. So that's it. Nothing else should be capturing them.
And it allows optimizations, which this is undoing.
nah, it just allows different optimizations (some of which haven't been written yet, but are required for closing other issues). keeping the current behavior, otoh, requires destroying the basic assumption underlying the-efficiency-of-compilation-in-the-presence-of-the-world-computation that storing a closure over a particular world value is impossible. I don't want to have to keep chasing down the correctness issues to maintain that.
3accd0f
to
bfd9c7a
Compare
we are not really striving for accuracy or speed here (on the method invalidation path), just validity
#endif

#if defined(USE_ORCJIT) || defined(USE_MCJIT)
// make the alias name is valid for the current session
jl_ExecutionEngine->addGlobalMapping(GA, (void*)(uintptr_t)Addr);
// make the alias name is valid for the current session
you're just changing the indentation here, but is this supposed to say "make sure the alias name is valid" or "make the alias name valid" ?
i think in this case that either would be valid corrections
Is it a fair summary of what this does to say it's equivalent to automating some of the eval-wrapping for you in the generated C callback function to fall back to dynamic dispatch when there would otherwise be world age issues?
Yes. With a plan of later optimizing the dynamic fallback.
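To make the summary above concrete, here is a minimal sketch of a callback thunk that calls a statically resolved target while the world is unchanged and otherwise falls back to a dynamic lookup in the newest world. This is a simplified model, not Julia's runtime: `Runtime`, `MethodEntry`, `CFunctionThunk`, `make_thunk`, `old_impl` and `new_impl` are all hypothetical names invented for illustration, and the "dispatch" is reduced to picking the most recent definition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model; none of these names are Julia runtime APIs.
using callback_t = int (*)(int);

struct MethodEntry {
    std::size_t min_world;  // first world in which this definition is visible
    callback_t fptr;
};

struct Runtime {
    std::size_t world_counter = 1;         // bumped on every (re)definition
    std::vector<MethodEntry> definitions;  // ordered by min_world

    void define(callback_t f) { definitions.push_back({++world_counter, f}); }

    // Slow path standing in for dynamic dispatch: newest definition wins.
    // Assumes at least one definition exists.
    callback_t lookup_latest() const { return definitions.back().fptr; }
};

struct CFunctionThunk {
    const Runtime *rt;
    std::size_t compiled_world;  // world the fast path was resolved against
    callback_t direct;           // statically resolved target for that world

    int operator()(int x) const {
        if (rt->world_counter == compiled_world)
            return direct(x);           // fast path: nothing was redefined
        return rt->lookup_latest()(x);  // fallback: dispatch in the newest world
    }
};

inline CFunctionThunk make_thunk(const Runtime &rt) {
    return {&rt, rt.world_counter, rt.lookup_latest()};
}

inline int old_impl(int x) { return x + 1; }
inline int new_impl(int x) { return x * 10; }

inline void usage_example() {
    Runtime rt;
    rt.define(old_impl);
    CFunctionThunk thunk = make_thunk(rt);
    assert(thunk(4) == 5);   // world unchanged: direct call
    rt.define(new_impl);     // redefinition bumps the world counter
    assert(thunk(4) == 40);  // same thunk now reaches the new definition
}
```

The point of the sketch is only that the C caller keeps holding the same pointer across a redefinition; how the fallback is later optimized is a separate question discussed further down the thread.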
I think it's reasonable to say that the function pointer will be used much more often than
This is only the case for the naive stopgap implementation.
What correctness issue?
I'm not sure what you're arguing in the first paragraph. You seem to be simply agreeing that they both require some amount of lazy lookup and later optimization over the naive version. And as you point out, the optimization and behavior of cfunction here should fairly closely match that required by an
How would you infer |
The optimizations that can be implemented with this approach aren't specific to this and are not an argument for doing this.
Only emit a cache and don't limit world range at all?
Inferring it should be fine? The |
With this PR, the optimization of runtime age is simply a static relocation to point to the method for the current world whenever it changes (very rare). With the proposed current PR, it can't be statically relocated, since the runtime world isn't known during deserialization / static compilation; it can only be discovered at runtime. How do you optimize that?
It can be done when the |
Where does the design goal that Duplicating the work into |
It's not the design goal. It's just a different way of saying the total cost of constructing a
Can be skipped since it can know if the callee doesn't need it.
It's calling a method in a known world so no need to update pointer?
Sure. I think most C libraries use C callbacks that do not pass structs around, so it's mostly register
Never needed. Not now since it's not implemented yet and not in the future unless the callee needs to allocate.
The cache doesn't need a trampoline. Using a trampoline in the slow path will be another optimization.
yep, that optimization is the same in either case.
yes, my point is that it's never going to be zero-cost, so it's just a choice between "load from jl_world_counter" vs. "load from trampoline". And one of those is a trampoline...
the callee is certainly allowed to allocate
What do you call a runtime thunk if you aren't going to call it a trampoline?
In that sense sure. I'm just saying that for
Not entirely. The
Loading and updating the world counter isn't my concern at all. The difference is that if it is to be executed in the latest world, the
Right. The point though is that this is always done in the callee and is not part of the |
You have a function pointer that captures a runtime value. Whether you do this via LLVM intrinsics or codegen, it's the same feature.
No, just like the rest of the system, it can assume that if the world counter invalidated a back-edge, it also updated the pointers to them. It only has to check and update the world counter if codegen reports that there is a dynamic user of it (i.e. a dynamic dispatch).
I don't follow. I'm proposing that the cfunction cache be world-independent so that it doesn't do a lookup, just a pointer load at a fixed location.
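A rough sketch of what a world-independent cache could look like, under the assumption that each cfunction owns one slot at a fixed address and that the (rare) invalidation path is responsible for repointing it. `CFunctionSlot`, `repoint`, `old_impl` and `new_impl` are hypothetical names; the acquire/release ordering is one reasonable choice for the sketch, not something stated in the thread.

```cpp
#include <atomic>
#include <cassert>

using callback_t = int (*)(int);

// Hypothetical layout: one slot per cfunction, living at a fixed address.
struct CFunctionSlot {
    std::atomic<callback_t> target;

    explicit CFunctionSlot(callback_t f) : target(f) {}

    // Call side: no method lookup and no world comparison,
    // just a pointer load from a known location followed by the call.
    int call(int x) const {
        return target.load(std::memory_order_acquire)(x);
    }

    // Invalidation side (rare): point the slot at the replacement code.
    void repoint(callback_t f) {
        target.store(f, std::memory_order_release);
    }
};

inline int old_impl(int x) { return x + 1; }
inline int new_impl(int x) { return x * 10; }

inline void usage_example() {
    CFunctionSlot slot(old_impl);
    assert(slot.call(4) == 5);
    slot.repoint(new_impl);  // done by whoever invalidates the old method
    assert(slot.call(4) == 40);
}
```

Compared with a per-call world check, all of the cost moves to the writer: the hot path is a single load, and correctness depends on every invalidation remembering to update the slot.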
At the very least, we'll need to verify that a managed context exists for the thread. The |
I'm talking about the dispatch within the function returned by
Same as above, I'm talking about the lookup needed in the function returned by
Only if the |
JIT patching is probably the cheapest if we need to really optimize. It's a rare occurrence, so shouldn't happen very often. Most likely though, I would just go with a simple PLT.
Yes, but it's important to look at which world it's dependent on, since it's actually highly asymmetric. Except for return-type and inlining, it's cheaper and easier to depend on the latest world than to depend on the runtime world. I'm considering seeing if I can reflect this better in the layout of TypeMap, since I realize that currently the API looks like it would be symmetric, while in fact it is computed as a half-open interval.
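One way to read the asymmetry described above, sketched as a toy table for a single function with successive definitions. Each entry is valid from `min_world` onward and keeps an open upper bound until something replaces it, so a latest-world query only tests for the open end, while a query at an arbitrary runtime world has to compare both bounds. `Table`, `Entry`, `OPEN`, `latest` and `at` are hypothetical names and this is not TypeMap's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sentinel for "not yet invalidated" (hypothetical representation).
constexpr std::size_t OPEN = std::numeric_limits<std::size_t>::max();

struct Entry {
    int id;
    std::size_t min_world;  // first world where the entry is valid
    std::size_t max_world;  // OPEN until a later definition closes it
};

struct Table {
    std::size_t world = 1;
    std::vector<Entry> entries;

    void add(int id) {
        ++world;
        // Close every still-open entry just before the new world begins.
        for (auto &e : entries)
            if (e.max_world == OPEN) e.max_world = world - 1;
        entries.push_back({id, world, OPEN});
    }

    // Latest-world query: only the open upper bound matters.
    const Entry *latest() const {
        for (auto it = entries.rbegin(); it != entries.rend(); ++it)
            if (it->max_world == OPEN) return &*it;
        return nullptr;
    }

    // Runtime-world query: both bounds must be checked.
    const Entry *at(std::size_t w) const {
        for (const auto &e : entries)
            if (e.min_world <= w && w <= e.max_world) return &e;
        return nullptr;
    }
};

inline void usage_example() {
    Table t;
    t.add(1);  // valid from world 2
    t.add(2);  // valid from world 3; closes the first entry at world 2
    assert(t.latest()->id == 2);
    assert(t.at(2)->id == 1);
    assert(t.at(1) == nullptr);  // nothing was defined yet in world 1
}
```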
Agreed, this is in essence the same code-gen computation of whether the world-age state was needed. Once the cfunction thunk has picked a world (regardless of mechanism), it's effectively indistinguishable how we then execute that code, since it is at that point that all choices converge into running compiled code for a fixed world age.
... except for initializing a TLS context and seamlessly handling the request?
OK, I think this should be good enough to clear my performance concern, and it sounds reasonably easy to implement. I assume you still need to load the latest world to be switched to (if needed by the function) in the function loaded from the PLT. It seems quite tricky to fix the race caused by
without atomic ops in the callback/PLT (if that is not possible, an acquire load followed by a branch on the slot load again can probably work; ~10 cycles net overhead on aarch64, free on x86). I'm still not entirely convinced that always seeing the latest world in cfunction is the right semantics (partially because of the race), but that's a much smaller concern.
I agree that the race there is quite tricky. We likely don't handle that correctly anywhere right now, since we lack the lock needed to ensure the invalidation finishes before the world counter is globally incremented. I'm figuring the PLT can load the world counter first, then the slot (ensuring this ordering is free on x86, not necessarily true elsewhere), and then select Another option for avoiding that conditional move might be to do a two-stage update: first inserting a guard entry (a small assembly thunk which just waits for the world lock to be released, then restarts), then doing all the world invalidation computations and updating the counter, then putting in the real entry, which allows the assembly thunk to resume.
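The "counter first, then slot" ordering can be sketched with C++ atomics. The assumptions here are mine, not the thread's: a single writer (standing in for the missing world lock) stores the new slot value and then increments the counter with release semantics, and the reader loads the counter with acquire semantics before loading the slot. Under those assumptions a reader that observes world N sees a slot at least as new as the one published for N. `PltEntry`, `Snapshot`, `publish` and `read` are hypothetical names, and the sketch does not model the selection step or the guard-entry variant.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using callback_t = int (*)(int);

struct Snapshot {
    std::size_t world;
    callback_t target;
};

// Hypothetical PLT-style entry; assumes one writer at a time.
struct PltEntry {
    std::atomic<std::size_t> world_counter{1};
    std::atomic<callback_t> slot;

    explicit PltEntry(callback_t f) : slot(f) {}

    // Writer: update the slot first, then publish the new world.
    void publish(callback_t f) {
        slot.store(f, std::memory_order_relaxed);
        world_counter.fetch_add(1, std::memory_order_release);
    }

    // Reader: world counter first (acquire), then the slot.
    Snapshot read() const {
        std::size_t w = world_counter.load(std::memory_order_acquire);
        callback_t f = slot.load(std::memory_order_relaxed);
        return {w, f};
    }
};

inline int old_impl(int x) { return x + 1; }
inline int new_impl(int x) { return x * 10; }

inline void usage_example() {
    PltEntry entry(old_impl);
    Snapshot before = entry.read();
    assert(before.world == 1 && before.target(4) == 5);
    entry.publish(new_impl);
    Snapshot after = entry.read();
    assert(after.world == 2 && after.target(4) == 40);
}
```

On x86 the acquire load compiles to a plain load, which matches the remark that the ordering is free there; on weaker architectures it costs a barrier or a load-acquire instruction.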
We are not really striving for accuracy or speed here (on the method invalidation path), just validity. That's also why there are no tests added here. I'm just trying to get some packages working again as quickly as possible, then I will need to go back and make this much better.