Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vertx pooled allocator should be the same as Netty #5168

Closed
franz1981 opened this issue Mar 26, 2024 · 20 comments
Closed

Vertx pooled allocator should be the same as Netty #5168

franz1981 opened this issue Mar 26, 2024 · 20 comments
Labels

Comments

@franz1981
Copy link
Contributor

franz1981 commented Mar 26, 2024

Currently, in vertx

public static final ByteBufAllocator POOLED_ALLOCATOR = new PooledByteBufAllocator(true);
is NOT reusing the Netty's PooledByteBufAllocator.DEFAULT, causing creation of more thread-local direct buffers and arenas, enlarging the RSS footprint of vertx application, for no reason.

What's the reason behind this choice @vietj?

The reason why it should be changed, is to "ease" the life of users and libraries which allocate Netty direct buffers using the Netty one and can end up allocating new arenas because of this.

If the aforementioned pool re-use the Netty one, clearly is getting some additional contention, but will save memory, which seems a reasonable trade-off.

@franz1981 franz1981 added the bug label Mar 26, 2024
@zekronium
Copy link
Contributor

zekronium commented Jul 30, 2024

With this as is, impossible to use new adaptive allocator too.

@franz1981 this might affect quarkus too

Context:
netty/netty#13976

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 30, 2024

The reality seems more complex, see

bootstrap.childOption(ChannelOption.ALLOCATOR, PartialPooledByteBufAllocator.INSTANCE);
} else {
bootstrap.childOption(ChannelOption.ALLOCATOR, PooledByteBufAllocator.DEFAULT);

in short; with SSL we uses a different allocator (here -> https://github.com/eclipse-vertx/vert.x/blob/master/vertx-core/src/main/java/io/vertx/core/buffer/impl/PartialPooledByteBufAllocator.java) which uses the mentioned custom vertx allocator at #5168 (comment), but without SSL, instead, we just use the default Netty allocator.

To enable the adaptive one we should "at least" move into using a different entry point to setup the Netty configured "default" one i.e. https://github.com/netty/netty/blob/4.1/buffer/src/main/java/io/netty/buffer/ByteBufAllocator.java#L24

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 30, 2024

@vietj @cescoffier this is something to fix on quarkus as well i think

This can affect quarkus as well if we don't reuse the allocator configured on the Netty instance, which could be different from the singleton at PooledByteBufAllocator.DEFAULT, causing to allocate both (and pay twice the RSS).

e.g https://github.com/quarkusio/quarkus/blob/261cc877718ef24dd681cb1f3bb1547208535fca/independent-projects/vertx-utils/src/main/java/io/quarkus/vertx/utils/AppendBuffer.java#L136

@vietj @geoand
In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator?
I cannot see any chain of calls to get there, TBH.

@geoand
Copy link
Contributor

geoand commented Jul 30, 2024

In the case above what's the "idiomatic" way for quarkus to obtain the configured allocator?
I cannot see any chain of calls to get there, TBH.

I am not sure I follow this... What exactly are you after here?

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 30, 2024

Yep, so, in short:

  1. based on the SSL configuration Vertx can use its own PartialPooledByteBufAllocator.INSTANCE or the Netty PooledByteBufAllocator.DEFAULT
  2. Quarkus AppendBuffer (and other code paths) just use the Netty one i.e. PooledByteBufAllocator.DEFAULT
  3. If SSL is configured, the RSS of off-heap arenas double, because the vertx and quarkus direct allocators are different

To solve this, we should obtain from vertx itself which configured allocator it uses, so we won't duplicate the RSS, makes sense @geoand ?

@geoand
Copy link
Contributor

geoand commented Jul 30, 2024

Sure, that makes sense, but it won't really involve any Quarkus APIs or anything - we just need the extensions code do the proper thing

@franz1981
Copy link
Contributor Author

IDK @geoand probably requires some change in the API till Vertx, to make sure we can query the configured allocator to vertx, or you have a better idea in mind?
I'm very open, IDK yet how to solve it, TBH

@geoand
Copy link
Contributor

geoand commented Jul 30, 2024

Sounds reasonable

@franz1981
Copy link
Contributor Author

I've tried a different less invasive approach to this: see #5262

@franz1981
Copy link
Contributor Author

@zekronium are you a quarkus or vertx user? Just to understand If I ask your feedback if you want to give adaptive a shot :)

@zekronium
Copy link
Contributor

zekronium commented Jul 30, 2024

@zekronium are you a quarkus or vertx user? Just to understand If I ask your feedback if you want to give adaptive a shot :)

98% Vert.x user with BoringSSL (tc-native) and JDK 21 (unsafe enabled with opens and useReflection=true) on top of that too. Also run virtual threads in some projects, which based of the known benefits of the adaptive pooler might be interesting to see if it helps

In quarkus currently OpenSSL/BoringSSL is not supported per se from what I've seen in the issues, right?

EDIT:
Scratch the Virtual Thread part, if I recall correctly, Vert.x even with VT's performs all the IO on the eventloops and returns you (Unsafe)HeapBuffers, which are unpooled afaik in all cases, which this new allocator does not cover. So I think because the virtual threads wont actually interact with the allocator, the benefit of potentially better performance with VT's wont be measurable in my case if I got that right

@franz1981
Copy link
Contributor Author

franz1981 commented Jul 30, 2024

Exactly @zekronium - probably the only benefit you could have is if you do things on your own using the vertx allocator. - on VT.
And clearly in Quarkus, because we use it for few things (e.g. jackson json encoding)

@zekronium
Copy link
Contributor

@franz1981 Is Quarkus jackson json encoding different to the Vert.x one?

@franz1981
Copy link
Contributor Author

Yes @zekronium and will get better with time, like the wine :)
see quarkusio/quarkus#41063

where we bytecode generate the serializers to save reflection ;)

@zekronium
Copy link
Contributor

Ah, reminds me of jsoniter-dynamic, they do similar things but its old and unmaintained now. Waiting for backport :)

@zekronium
Copy link
Contributor

@zekronium are you a quarkus or vertx user? Just to understand If I ask your feedback if you want to give adaptive a shot :)

I have benchmarked our client application. Which method would you suggest to primarily look at as a common hotpath for both adaptive and default that would clearly direct comparison

For apples to apples, which is newDirectBuffer, which I think is something well to look at since it universally covers IO with our settings:

image image

This test was ran as a simple PoC with quite light load for 2 minutes. With it it seems to be about ~3x improvement!

These parts seem to consume slightly more CPU time, but might be by design?

image image

@franz1981
Copy link
Contributor Author

To better understand @zekronium :

  1. what's the type of benchmark you run?
  2. the throughput/rps for the 2 tests (without adaptive/with adaptive) was?
  3. what's the profiler you've used to get such numbers in Vertx pooled allocator should be the same as Netty #5168 (comment)? What's the meaning of these numbers/which resource is?
  4. any figure of the RSS/actual direct memory used?

@zekronium
Copy link
Contributor

This was a simple “feeler” test I ran using our request client for two minutes. I simply attached intelij profiler on my m1 max mac.

It was a realistic load run, about 1k rps out, 1k connections. Rough usage was about 2gb with unsafe.

I will comeback with a proper benchmark on proper hardware, but I hope this info is enough for now

@franz1981
Copy link
Contributor Author

franz1981 commented Aug 1, 2024

About your finding on the cost of buffer operations, it is expected, because the adaptive buffers piggyback to the usual Netty ones, but still way too much. I will run some micro on Netty itself to see if we can improve there and verify it.

franz1981 added a commit to franz1981/vert.x that referenced this issue Aug 28, 2024
franz1981 added a commit to franz1981/vert.x that referenced this issue Aug 29, 2024
franz1981 added a commit to franz1981/vert.x that referenced this issue Aug 29, 2024
@franz1981
Copy link
Contributor Author

Closed via #5292

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants