
SmallRyeThreadContext's close can use custom cleanup strategies #424

Open
franz1981 opened this issue Jun 16, 2023 · 19 comments

franz1981 commented Jun 16, 2023

I've spotted this in a few reactive benchmarks using Quarkus without disabling context propagation (fairly common for users who keep the defaults, I believe): from what I can see there (and the code confirms it), SmallRyeThreadContext's close calls ThreadLocal::remove in order to prevent leaks.

Sadly, ThreadLocal::remove has a huge performance impact due to
https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L570

which ends up calling Reference::clear via

https://github.com/openjdk/jdk19/blob/967a28c3d85fdde6d5eb48aa0edd8f7597772469/src/java.base/share/classes/java/lang/ThreadLocal.java#L368

as can be seen in this flamegraph (in violet):
[flamegraph image]

and, at a deeper level:
[flamegraph image]

which shows the interaction with the runtime needed to coordinate an asynchronous cleanup (it's a weak ref :"/ sob).

Given that this cost is paid on the I/O thread, it can be pretty severe and is well worth saving.

In order to improve this, it would be great to be able to configure SmallRyeThreadContext with a lambda (once, or more than once) that lets an external framework run some check and guide the close operation: it could pick the much cheaper set(null) (which doesn't remove the entry) or the full-fat remove as it is now. In the Quarkus use case, I think that if such a lambda can check whether the execution context is an I/O event-loop thread (not just on a Vert.x ctx!), there is no need to remove the thread-local entry at all, because I/O threads stay alive for the whole application lifetime.
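A minimal sketch of what such a hook could look like (the ThreadLocalCleanup interface, the clearContext method and the instanceof check are made up for illustration; none of this is existing SmallRye API, and checking for a Netty FastThreadLocalThread is only a rough stand-in for "am I on an I/O event-loop thread"):

import io.netty.util.concurrent.FastThreadLocalThread;

public class CleanupStrategyExample {

    // Hypothetical SPI: decide, per thread, how the captured context is cleared on close.
    @FunctionalInterface
    interface ThreadLocalCleanup {
        /** Return true to fully expunge the entry (remove), false for the cheaper set(null). */
        boolean shouldRemove(Thread current);
    }

    // A Quarkus-flavoured strategy: on long-lived event-loop threads the map entry can stay,
    // because those threads (and the class loader they reference) live as long as the app.
    static final ThreadLocalCleanup QUARKUS_CLEANUP =
            thread -> !(thread instanceof FastThreadLocalThread);

    static final ThreadLocal<Object> CONTEXT_HOLDER = new ThreadLocal<>();

    // What a close path could do with such a hook.
    static void clearContext(ThreadLocalCleanup cleanup) {
        if (cleanup.shouldRemove(Thread.currentThread())) {
            CONTEXT_HOLDER.remove();   // safe default: expunge the map entry
        } else {
            CONTEXT_HOLDER.set(null);  // fast path: null the value, keep the entry
        }
    }
}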

@franz1981

Another (additional) option could be to let users provide their own ThreadLocal-like implementation (through some public API); Quarkus could even decide to mix strategies:

  • checking whether we are running on I/O threads
  • replacing the thread locals with Netty's FastThreadLocal (which would speed up the usual thread-local ops, e.g. set/get, A LOT)


FroMage commented Jun 16, 2023

I already have support for custom thread locals, but nobody's ever dared to use them.

So, why don't we always call set(null) instead of remove() then if it's much cheaper?


franz1981 commented Jun 16, 2023

Hi @FroMage, spring-cloud/spring-cloud-sleuth#27 (comment) explains in deep detail what happens if remove isn't used.

in short:

  • the ThreadLocal instance puts into the Thread's ThreadLocalMap an entry that weakly references itself
  • if the context-propagation ThreadLocal (which is a GC root in the class loader it belongs to) is no longer around, that weak reference can drag the whole class loader along for longer than necessary (it's a weak ref!)

In Quarkus we don't care about that (please correct me if I'm wrong), because our class loader will be around for the whole application's life, meaning we have no reason to disconnect its lifetime from that of the Thread referencing it (transitively, via the ThreadLocal).

The same reasoning can still hold for non-I/O threads too, to be honest (so my previous statement seems inaccurate).

The sole difference between I/O threads and blocking ones is that the former last for the whole application duration as well (like the class loader, and likely the context-propagation usage too), hence the only impact is the presence of such a ThreadLocal map entry, i.e. a negligible memory-footprint concern (a few bytes).

What's dangerous, as mentioned in the issue I've referenced, is when a class loader's lifetime isn't the same as that of the application it belongs to, which can happen on WildFly, for example.
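For illustration, this is the difference in what each call leaves behind in the current thread's ThreadLocalMap (plain JDK behaviour, nothing SmallRye-specific; capturedContext is just a placeholder value):

static void illustrate() {
    ThreadLocal<Object> ctx = new ThreadLocal<>();
    Object capturedContext = new Object(); // placeholder for the propagated context
    ctx.set(capturedContext);

    // Cheap path: the ThreadLocalMap entry stays, only its value slot is nulled.
    // The entry keeps weakly referencing ctx, so (as described above) it follows the
    // weak-reference GC rules and can drag the owning class loader around a bit longer
    // than strictly necessary: a "temporary" leak, not a permanent one.
    ctx.set(null);

    // Full path: the entry is expunged and its weak Reference is cleared,
    // which is the Reference::clear cost visible in the flamegraphs above.
    ctx.remove();
}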


FroMage commented Jun 16, 2023

OK, so it could be an option. But the problem is that most ThreadContextProvider implementations are third-party plugins, so they'd also have to be updated to stop clearing the ThreadLocal.

Frankly I think this all should go via the new "Storage" API, but I don't have the time to drive this :(


franz1981 commented Jun 16, 2023

Why not use a system property to decide between the two behaviours?
That would let advanced users who know their own class loader communicate their (static, constant) preference without affecting the default behaviour (using remove).
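To make the suggestion concrete, a minimal sketch (the property name and the surrounding class are made up for illustration):

public final class ContextCleanup {

    // Hypothetical flag; defaults to the current, safe behaviour (remove).
    private static final boolean USE_REMOVE =
            Boolean.parseBoolean(System.getProperty("smallrye.context.cleanup.remove", "true"));

    private ContextCleanup() {
    }

    static void clear(ThreadLocal<?> threadLocal) {
        if (USE_REMOVE) {
            threadLocal.remove();     // default: no risk of pinning a short-lived class loader
        } else {
            threadLocal.set(null);    // opt-in fast path for app-lifetime class loaders
        }
    }
}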


geoand commented Jun 19, 2023

In Quarkus we don't care about that (please correct me if I'm wrong), because our class loader will be around for the whole application's life, meaning we have no reason to disconnect its lifetime from that of the Thread referencing it (transitively, via the ThreadLocal).

Actually, this is not the case in dev mode: there, user code is loaded via a ClassLoader that is dropped when the application is reloaded.
The same is true for tests, actually.

@franz1981

That means that in dev/test "mode" we have to disable such a "feature" and keep using TL::remove, @geoand?

user code is loaded via a ClassLoader that is dropped when the application is reloaded

Beware: it just means that the weak reference held in the thread's ThreadLocalMap would still follow the weak-reference GC rules and lifecycle, i.e. these are just "temporary" leaks: they will stay around longer than necessary and will require both more heap and more GC activity (likely full GCs) before being cleaned up.

Please let me know if you see other options to achieve this.


geoand commented Jun 19, 2023

That means that in dev/test "mode" we have to disable such a "feature" and keep using TL::remove, @geoand?

Yes


franz1981 commented Jun 21, 2023

One note @FroMage: it seems that Netty's FastThreadLocal doesn't make use of weak refs on the remove path; I'm evaluating whether it's "good enough" for our use case.

@geoand too -> meaning we could just use the "custom thread local" feature of this repo in Quarkus and plug in fast thread locals from there, in dev/test/prod alike (they should just work).
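A minimal sketch of the FastThreadLocal route, assuming Netty is already on the classpath (it is, via Vert.x); the class and placeholder value are illustrative:

import io.netty.util.concurrent.FastThreadLocal;

public class FastThreadLocalExample {

    // On a FastThreadLocalThread, values live in a plain array indexed by the
    // FastThreadLocal's slot id, so set/get/remove are simple array accesses with
    // no weak references involved; on other threads Netty falls back to a regular
    // JDK ThreadLocal that holds the per-thread array.
    private static final FastThreadLocal<Object> CONTEXT = new FastThreadLocal<>();

    public static void main(String[] args) {
        CONTEXT.set("captured context");   // placeholder value
        Object restored = CONTEXT.get();
        CONTEXT.remove();                  // cheap slot reset, no Reference::clear
        System.out.println(restored);
    }
}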


geoand commented Jun 21, 2023

This confuses me, don't we already use FastThreadLocal as a result of using Vert.x?


franz1981 commented Jun 21, 2023

@geoand not for this specific one, as far as I can tell.

I've run a micro-benchmark to understand the other implications of NOT using fast thread locals for remove, using the benchmark at https://github.com/franz1981/java-puzzles/blob/main/src/main/java/red/hat/puzzles/concurrent/ThreadLocalScalability.java
with -DthreadLocals=1 on JDK 19.

-t 1

Benchmark                                     Mode  Cnt          Score        Error  Units
ThreadLocalScalability.setGetNullout         thrpt   10  109481209.044 ± 464386.775  ops/s
ThreadLocalScalability.setGetRemove          thrpt   10   14307019.589 ±  49020.979  ops/s

-prof perfnorm of the remove case:

Benchmark                                                   Mode  Cnt         Score        Error      Units
ThreadLocalScalability.setGetRemove                        thrpt   10  14063430.793 ± 265518.974      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt    2         0.443               clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt    2         2.255               insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt    2         0.553                    #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt    2         0.005                    #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt    2         0.031                    #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt    2         0.020                    #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt    2         0.059                    #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt    2       184.114                    #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt    2         0.009                    #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt    2       415.255                    #/op

-t 4

Benchmark                                     Mode  Cnt          Score          Error  Units
ThreadLocalScalability.setGetNullout         thrpt   10  432425222.667 ±  7785949.059  ops/s
ThreadLocalScalability.setGetRemove          thrpt   10   38913911.834 ± 24150978.392  ops/s

-prof perfnorm of the remove case:

Benchmark                                                   Mode  Cnt         Score          Error      Units
ThreadLocalScalability.setGetRemove                        thrpt   10  38707328.433 ± 28134808.681      ops/s
ThreadLocalScalability.setGetRemove:CPI                    thrpt    2         0.808                 clks/insn
ThreadLocalScalability.setGetRemove:IPC                    thrpt    2         1.558                 insns/clk
ThreadLocalScalability.setGetRemove:L1-dcache-load-misses  thrpt    2         1.603                      #/op
ThreadLocalScalability.setGetRemove:LLC-load-misses        thrpt    2         0.002                      #/op
ThreadLocalScalability.setGetRemove:LLC-loads              thrpt    2         0.734                      #/op
ThreadLocalScalability.setGetRemove:LLC-store-misses       thrpt    2         0.020                      #/op
ThreadLocalScalability.setGetRemove:LLC-stores             thrpt    2         0.655                      #/op
ThreadLocalScalability.setGetRemove:cycles                 thrpt    2       338.906                      #/op
ThreadLocalScalability.setGetRemove:dTLB-load-misses       thrpt    2         0.453                      #/op
ThreadLocalScalability.setGetRemove:instructions           thrpt    2       417.949                      #/op

What the numbers show is that setGetNullout has near-linear scalability (and perfnorm agrees, reporting the same IPC for one and four threads) while remove doesn't:
its IPC drops (from ~2.26 to ~1.56 insns/clk in the tables above) and there is increased activity in both L1 and LLC load misses, which is likely the cause of the slowdown.
This is likely due to the interaction with the reference queue; in short, remove is not just an order of magnitude slower, it can also cause scalability issues.

NOTE
In the benchmark I'm causing high contention over the same ThreadLocal instance; I'll try with many instances shortly to see if that changes anything. Also, G1GC seems to be the most affected GC here; ParallelGC shows a much smaller impact.

Going to check what happens with fast thread locals shortly.
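For reference, the two patterns being compared are roughly the following (a reconstruction for readability, not the linked benchmark verbatim):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark)
public class ThreadLocalClearPatterns {

    // One shared ThreadLocal instance, as in the contended case noted above.
    private static final ThreadLocal<Object> TL = new ThreadLocal<>();
    private final Object value = new Object();

    @Benchmark
    public Object setGetNullout() {
        TL.set(value);
        Object read = TL.get();
        TL.set(null);      // cheap: the ThreadLocalMap entry survives
        return read;
    }

    @Benchmark
    public Object setGetRemove() {
        TL.set(value);
        Object read = TL.get();
        TL.remove();       // expensive: expunges the entry and clears its weak Reference
        return read;
    }
}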


geoand commented Jun 21, 2023

That's a pretty big performance impact!

In Quarkus prod mode, AFAIR, we use VertxThread for the worker pool as well, so we should always be on Netty threads; thus it makes complete sense to try and use FastThreadLocal.


franz1981 commented Jun 21, 2023

OK @Sanne and @geoand, so the proper fixes are:

  • use FastThreadLocal(s) in Quarkus for this AND keep using remove: this gives somewhat lower performance but no scalability impact (still better perf than JDK ThreadLocal(s), and no weak refs!)
  • use a sys prop to control whether to use remove or not (set(null) is just faster)

The two things could happen separately, but ideally we want both.


geoand commented Jun 21, 2023

That sounds reasonable to me.


Sanne commented Jun 21, 2023

Awesome!
OK, we could implement this as two different flags, but I do wonder if we wouldn't be better off injecting a custom Quarkus implementation of some SPI that applies both tricks. (I understand such an SPI doesn't exist yet, so just wondering.)


geoand commented Jun 21, 2023

I do wonder if we wouldn't be better off injecting a custom Quarkus implementation of some SPI that applies both tricks. (I understand such an SPI doesn't exist yet, so just wondering.)

Yeah, I had the same thought, but I can't find where I wrote it :)


FroMage commented Jun 21, 2023

@franz1981 The original idea behind StorageDeclaration is for ThreadContextProvider to simply declare their TL storages to SR-CP (see https://github.com/smallrye/smallrye-context-propagation/blob/main/testsuite/extra/src/main/java/io/smallrye/context/test/MyThreadContextProvider.java) and then CP, instead of calling the providers, will simply call set on all the ThreadLocals and make context switching super fast.

Then, there's another thing designed to make thread locals super fast, which is that instead of handing out ThreadLocal instances to the libs, we collect all declared StorageDeclaration and create one field for each in custom threads, and override the Vert.x thread factory so that all Vert.x threads have extra fields. See https://github.com/smallrye/smallrye-context-propagation/blob/main/storage/src/main/java/io/smallrye/context/storage/spi/StorageManager.java

So the idea is that, for example, you have a lib like narayana, which needs a ThreadLocal for the current transaction (a sketch putting these pieces together follows the list):

  • Instead of doing ThreadLocal<Transaction> tl = new ThreadLocal<>() you do ThreadLocal<Transaction> tl = StorageManager.threadLocal(Declaration.class);
  • You declare static class Declaration implements StorageDeclaration<Transaction>{}
  • Quarkus collects all the known StorageDeclaration and creates a subclass of VertxThread: public class QuarkusVertxThread extends VertxThread { Transaction transaction; }
  • We register a ThreadFactory so that Vert.x creates only those threads
  • We create a ThreadLocal subtype:
class TransactionThreadLocal extends ThreadLocal<Transaction> {
    @Override
    public void set(Transaction tx) {
        Thread t = Thread.currentThread();
        if (t instanceof QuarkusVertxThread) {
            // fast path: a plain field write on the custom Vert.x thread
            ((QuarkusVertxThread) t).transaction = tx;
        } else {
            // foreign thread: fall back to regular ThreadLocal storage
            super.set(tx);
        }
    }
    // … etc (get()/remove() follow the same pattern)
}
  • And then context switching and thread locals become fast.
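Putting the library-side pieces together, roughly (import paths are assumed from the linked StorageManager location, Transaction stands in for the library's own type, and the holder class is illustrative):

import io.smallrye.context.storage.spi.StorageDeclaration;
import io.smallrye.context.storage.spi.StorageManager;

public class TransactionHolder {

    static class Transaction {} // stand-in for the real transaction type

    // The library declares "I need one per-thread slot of type Transaction"
    // instead of owning a ThreadLocal directly.
    public static class Declaration implements StorageDeclaration<Transaction> {}

    // SR-CP decides how the slot is backed: a dedicated field on the custom
    // Vert.x thread when available, or a plain JDK ThreadLocal on foreign threads.
    private static final ThreadLocal<Transaction> CURRENT =
            StorageManager.threadLocal(Declaration.class);

    public static void set(Transaction tx) { CURRENT.set(tx); }

    public static Transaction current() { return CURRENT.get(); }
}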

All this was already implemented years ago in branches, but nobody had time to test and benchmark it. I still don't have time, but I'd be happy to help you along :)

See:

Let me know if you're interested.


franz1981 commented Jun 21, 2023

Thanks @FroMage, I will go through your comment to check whether it can solve this (i.e. avoid remove and use the faster set(null) for us). I wasn't fully convinced by the noise in the micro-benchmark, and I have involved the JDK team to review what I believe I've noticed about remove.
Anyway, given that remove performs atomic operations and recreates the entry's weak reference each time, it's still worthwhile to move to a cheaper set(null) when possible (i.e. the prod class loader on Quarkus), regardless of which thread-local implementation we choose.

@franz1981

I will likely experiment with a patch as you have suggested in the next weeks; it is indeed the best solution perf-wise, although it requires a few code changes.

At the same time, Netty's FastThreadLocal is very similar but less invasive: unlike Java ThreadLocals, it performs a simple array lookup in the FastThreadLocalThread instance.
And, since it doesn't introduce any "additional" dependency (Netty is already there) but leverages an existing one, it doesn't involve many changes in Quarkus (a one-liner? Maybe 2-3 lines).
Performing a micro-benchmark to compare it against a single-field-lookup solution should be trivial; I will do it next week.

I will try to schedule a call next week so we can decide together, given the pros/cons of each solution.
