
Optimize SimpleSubscriber for Netty #3583

Merged: 9 commits, Mar 11, 2024
Conversation

@kciesielski (Member) commented Mar 8, 2024:

Fixes #3548

This PR updates the SimpleSubscriber with the following improvements:

  1. If there is only one chunk of data (by default, for body length < 8192), read it from the buffer and return immediately, requiring only one array allocation.
  2. If there are multiple chunks, collect all Netty ByteBufs without rewriting them into arrays, then copy them into the final array in onComplete. Disclaimer: initially I wanted to cover this case with Netty's CompositeByteBuf, but it turned out to be a bad idea: it performs reallocations underneath to resize its internal representation, which makes overall performance significantly worse.
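The multi-chunk strategy above can be sketched roughly as follows. This is a hypothetical, simplified model: the real subscriber works with Netty ByteBufs and Reactive Streams callbacks, while plain `Array[Byte]` chunks stand in here, and the class name is illustrative.

```scala
import scala.collection.mutable.ListBuffer

// Simplified sketch: collect chunk references in onNext (no per-chunk copy),
// then allocate the final array once and copy everything in onComplete.
final class ChunkCollector {
  private val chunks = new ListBuffer[Array[Byte]]()
  private var totalLength = 0

  def onNext(chunk: Array[Byte]): Unit = {
    chunks += chunk             // keep only the reference, no rewriting into arrays
    totalLength += chunk.length
  }

  def onComplete(): Array[Byte] = {
    val finalArray = new Array[Byte](totalLength) // single allocation for the whole body
    var offset = 0
    chunks.foreach { c =>
      System.arraycopy(c, 0, finalArray, offset, c.length)
      offset += c.length
    }
    finalArray
  }
}
```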

Results:
PostBytes Simulation

|             | Before | After |
|-------------|--------|-------|
| CPU samples | 30.86% | 29.73% |
| throughput  | 44738  | 56955 |

Latency improvement:
(latency chart: latency-postbytes)

PostLongBytes Simulation

|             | Before | After |
|-------------|--------|-------|
| CPU samples | 48.8%  | 38.69% |
| throughput  | 363    | 454   |

Latency improvement:
(latency chart: latency-postlongbytes)

I haven't measured it, but there should also be noteworthy gains in memory allocations.

@kciesielski kciesielski marked this pull request as ready for review March 8, 2024 16:01
@kciesielski kciesielski requested a review from adamw March 8, 2024 16:01
```scala
val finalArray = ByteBufUtil.getBytes(byteBuf)
byteBuf.release()
resultBlockingQueue.add(Right(finalArray))
resultPromise.success(finalArray)
```
Member:
Shouldn't the subscriber enter some state where further arriving content would report an exception? It's possible that the incoming header is malformed.

Member Author:
I added a handler for this case: now if the result queue.offer call returns false, we follow up by canceling the subscription; as far as I checked, that's the recommended way to 'abort'.
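The abort path described here can be sketched as follows. Names are illustrative, not the actual tapir API; a bounded queue of capacity 1 stands in for the result queue, and a minimal Subscription trait stands in for the Reactive Streams one.

```scala
import java.util.concurrent.ArrayBlockingQueue

// Illustrative stand-in for org.reactivestreams.Subscription.
trait Subscription { def cancel(): Unit }

// Sketch: exactly one result is expected; if a second one arrives
// (e.g. because the incoming header was malformed), offer returns
// false and we cancel the subscription instead of blocking or
// silently dropping data.
final class SingleResultReceiver(subscription: Subscription) {
  private val result = new ArrayBlockingQueue[Either[Throwable, Array[Byte]]](1)

  def offerResult(r: Either[Throwable, Array[Byte]]): Unit =
    if (!result.offer(r)) {
      subscription.cancel() // recommended way to abort a misbehaving stream
    }

  def await(): Either[Throwable, Array[Byte]] = result.take()
}
```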

```scala
private val resultBlockingQueue = new LinkedBlockingQueue[Either[Throwable, Array[Byte]]]()
private val buffers = new ConcurrentLinkedQueue[ByteBuf]()
```
Member:
actually ... what are the concurrency guarantees for a subscriber - can multiple onNext be called concurrently? or maybe onNext + onError? I'm wondering if we (a) need a concurrent data structure here at all and (b) if concurrency is allowed, is the impl safe

Member Author:
The onNext, onError, and onComplete operations are guaranteed to be called sequentially without concurrency, so we are safe to replace the concurrent data structure with something simpler. I tried with a ListBuffer and it actually gave another noticeable boost to throughput and latency for PostLongBytes.

@adamw (Member) commented Mar 8, 2024:

Nice results! :)

```diff
 import scala.concurrent.{Future, Promise}

 private[netty] class SimpleSubscriber(contentLength: Option[Int]) extends PromisingSubscriber[Array[Byte], HttpContent] {
   private var subscription: Subscription = _
   private val resultPromise = Promise[Array[Byte]]()
   private var totalLength = 0
   private val resultBlockingQueue = new LinkedBlockingQueue[Either[Throwable, Array[Byte]]]()
-  private val buffers = new ConcurrentLinkedQueue[ByteBuf]()
+  private val buffers = new mutable.ListBuffer[ByteBuf]()
```
Member:
follow-up question (sorry ;) ) - onNext/onComplete/onError are guaranteed to be called from one thread, but is it going to be the same thread? that is, does buffers need to be volatile?

Member Author:
We don't have such a guarantee, and maybe that was why I had a ConcurrentLinkedQueue for byte arrays in the previous implementation, but I forgot :) This means I either fall back to it, or use a volatile ListBuffer.
I guess var totalLength is also unsafe and should be converted to an AtomicInteger, right?

Member Author:
Update: You can't have a volatile val, so I fell back to ConcurrentLinkedQueue. For totalLength I chose @volatile instead of AtomicInteger, because we use this variable sequentially: the only scenario is increasing it in onNext and reading it in onComplete.
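A minimal sketch of that choice (hypothetical class name): a single sequential writer increments in onNext, and a possibly different thread reads in onComplete, so @volatile visibility is enough and AtomicInteger's compare-and-set machinery is unnecessary.

```scala
// Sketch: @volatile guarantees the onComplete reader sees the writes
// made (sequentially, but possibly from another thread) in onNext.
final class LengthTracker {
  @volatile private var totalLength = 0

  def onNext(chunkSize: Int): Unit =
    totalLength += chunkSize // safe: onNext calls are never concurrent

  def total: Int = totalLength // read once, in onComplete
}
```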

Member:
But we can have a volatile var holding an immutable list - always fewer synchronisations.
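The suggested pattern could look like this (a sketch under assumed names, not the PR's final code; Vector stands in for "an immutable list" because its appends are cheap): each append publishes a fresh immutable collection through the volatile reference, so a reader on another thread can only ever see fully constructed list nodes.

```scala
// Sketch: volatile var + immutable collection. Writes replace the whole
// reference; the volatile write/read pair forms the memory barrier, and
// the elements behind it are immutable, so no further synchronisation
// of the collection's internals is needed.
final class ChunkLog {
  @volatile private var chunks: Vector[Array[Byte]] = Vector.empty

  def add(c: Array[Byte]): Unit =
    chunks = chunks :+ c // safe: subscriber callbacks are sequential

  def all: Vector[Array[Byte]] = chunks
}
```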

Member Author:
Ok, I replaced the ConcurrentLinkedQueue with a volatile var ListBuffer. In theory, ListBuffer should have slightly cheaper appends than a Vector, and we don't need Vector's fast random access, which comes at the additional cost of maintaining a more complex underlying structure.
I don't see any throughput improvement, but there's a slight improvement in latency.

Member:
I'm not sure this properly protects the value - ListBuffer is mutable, and the volatile only ensures there's a memory barrier before reading the buffers reference (not references inside the ListBuffer). But maybe since we have a memory barrier, everything will be synchronized correctly ... as you can't really access the inner references before first reading the buffers reference (which creates the barrier).

Anyway, I thought about a simpler design, using immutable data structures, where you don't have to think that much ;-) But maybe this one works as well :)

@kciesielski kciesielski merged commit 7301c06 into master Mar 11, 2024
28 checks passed
@kciesielski kciesielski deleted the perf-netty-subscriber branch March 11, 2024 10:01
Linked issue: Optimize SimpleSubscriber for tapir-netty