jdkhttp-server: Write multipart parts bigger than threshold to files #3227

jnatten · 2023-10-06T14:45:06Z

Like @adamw predicted in this #3132 (comment), it would be smart to write big parts to temporary files rather than keeping them in memory no matter what (we just ran into some heap OOMs on a small application which handles some uploading 😄)

This PR is an attempt at implementing this.
Not the greatest at this kind of stuff, so feel free to go hard on the feedback 😄

jnatten · 2023-10-06T14:49:11Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+    recursivelyParseState(ParseData(empty, List.empty, Default)).completedParts
+  }
+
+  private val TempFileThreshold = 52_428_800


Should this threshold be configurable somehow?
I got this number from http4s, but input is appreciated 😄

So we'll read 52MB of data before falling back to a file? Seems quite a lot ;) We could make this configurable via JdkHttpServerOptions, but I'm not sure if it's worth the effort. A reasonable default might do for now.

Doesn't seem to be too much work if i can pass it the same way createFile is passed through the JdkHttpRequestBody constructor?

Yeah, maybe let's try, then we can also quite easily add a test which would try to send large parts

jnatten · 2023-10-06T14:51:16Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+    @tailrec
+    def recursivelyParseBodyBytes(outputStream: PartStream, lastXBytes: Array[Byte], numReadBytes: Int): InputStream = {
+      val currentByte = is.read()
+      if (currentByte == -1) throw new IllegalArgumentException("Parsing multipart failed, ran out of bytes before finding boundary")


This isn't needed to pass the tests, but i guess some infinite loop could occur if some invalid body was passed. Is this an okay way to handle that?

Yes, exceptions here are fine. I'm not sure if IAE is the correct one here, though, it's not that an invalid argument was passed, but rather that the body is malformed. Maybe simply a RuntimeException?

What would be the resulting error of a RuntimeException?
Wouldn't it make most sense if the resulting error is a 400 or 422 if it's because of a malformed body?

Hm yeah currently any exception will end up returning a 500. Maybe let's create a separate issue to provide a built-in exception for signalling a specific result, in case it's not possible to communicate this in another way?

I'm not sure there is another way? I assume the http4s and others use the failure part of the effect, but the "effect" in jdkhttp-server doesn't really offer a way we could handle this as far as i could tell.

What we could do is add a RespondWithException, which

(a) would contain the status code and message to return to the user
(b) would be handled by the standard ExceptionInterceptor

then we could use it in situations as those
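
For illustration, a minimal sketch of what such an exception could look like. Everything here is hypothetical: `RespondWithException` does not exist in tapir at this point in the thread, and `toResponse` is only a stand-in for whatever the exception interceptor would do, not its real API.

```scala
// Hypothetical sketch: neither the exception nor the handler below is tapir API.
// (a) The exception carries the status code and message to return to the user.
final case class RespondWithException(statusCode: Int, message: String)
    extends RuntimeException(message)

object RespondWithExceptionSketch {
  // (b) Stand-in for an exception interceptor: maps a thrown exception
  // to a (status code, body) pair, falling back to 500 for anything else.
  def toResponse(t: Throwable): (Int, String) = t match {
    case RespondWithException(code, msg) => (code, msg)
    case _                               => (500, "Internal server error")
  }
}
```

With something like this, the multipart parser could throw `RespondWithException(400, "Malformed multipart body")` instead of a plain exception that ends up as a 500.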

jnatten · 2023-10-06T14:53:06Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

    Iterator
-      .continually(reader.readLine())


While this previous way felt more readable, i couldn't figure out a good way to "chunk" the data up when determining whether the threshold was passed or not when reading lines (what if the body doesn't have any newlines?), so i think the new way is better(?)

I think here a good-old for loop with two mutable vars for last two bytes + a ByteArrayOutputStream for accumulating the values would be much more performant

adamw · 2023-10-10T15:21:10Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+      new FileInputStream(f)
+  }
+
+  private def readUntilNewline(inputStream: InputStream): Array[Byte] =


I'm just starting to read through the code, but doesn't this mean that if we get a line which is over 52MB, we'll read it into memory anyway? What if there are no newlines in the part data for a long time?

The readUntilNewline and readStringUntilNewline functions are only called in places where there is expected to be a newline in a few bytes if the body is correctly formed.
When parsing headers, when
We could have a problem with a malformed body i suppose, I'll see if i can figure out a way to either throw an error in those cases or rework the parsing somehow 👍

Sounds good, maybe simply a hard limit - something reasonable that would cover the max boundary / header size?

Max boundary size is 70 characters (+ a few for delimiters) according to the rfc.
The headers don't really have a specced max size as far as i could tell, but i saw a few defaulting to 4kb and 8kb max sizes for normal http headers, so i don't think something like that for a max size sounds like a terrible idea.

Yeah 8KB sounds very reasonable
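
A rough sketch of such a hard limit, assuming lines end in CRLF and using the 8KB figure from this thread. The names `BoundedLineReader`, `readUntilCrlf` and `MaxLineBytes` are made up for the example and are not the PR's `readUntilNewline`.

```scala
import java.io.{ByteArrayOutputStream, InputStream}

object BoundedLineReader {
  // Assumed limit, following the 8KB figure discussed above.
  val MaxLineBytes: Int = 8 * 1024

  private val CR = 13
  private val LF = 10

  /** Reads bytes up to (and excluding) the next CRLF. Throws if the stream
    * ends, or more than `maxBytes` bytes are collected, before a CRLF is seen.
    */
  def readUntilCrlf(is: InputStream, maxBytes: Int = MaxLineBytes): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    var previous = -1 // last byte read, held back in case it is the CR of a CRLF
    var done = false
    while (!done) {
      val current = is.read()
      if (current == -1)
        throw new RuntimeException("Malformed multipart body: stream ended before CRLF")
      if (previous == CR && current == LF) done = true
      else {
        if (previous != -1) out.write(previous)
        if (out.size() > maxBytes)
          throw new RuntimeException(s"Malformed multipart body: line longer than $maxBytes bytes")
        previous = current
      }
    }
    out.toByteArray
  }
}
```

Because the reader fails once the limit is passed, a malformed body without newlines can no longer pull an arbitrarily large line into memory.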

adamw · 2023-10-11T10:15:13Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+      val partStream = outputStream.convertToFileIfThresholdMet(numReadBytes)
+      partStream.write(currentByte)
+
+      val updatedLastXBytes =


While this is quite elegant in code, it might not perform that well: for each byte read, we'll allocate a new byte array with the last boundary bytes. I think a better solution would be to keep a circular buffer with the boundary bytes - moving the index where we store the next byte, and comparing the contents from the subsequent index (modulo size).

Another problem might be comparing the boundary with the current boundary buffer for each byte - it might also be a performance penalty. So I think we could keep track of how many bytes matched so far, and only check if the next one matches? One problem here would be that in case of a mismatch, we would have to try restarting the process for bytes that have already been put in the buffer. Definitely non-trivial logic, and would need some good unit-test coverage.

Going a bit further, we might delay writing bytes to the output until we are sure they are not part of the boundary, avoiding the need of truncating the streams in the end.

That sounds like a good idea. I'll try to implement all of these suggestions. It'll probably take me some time, but you're obviously right about the performance benefits 😄

adamw · 2023-10-11T10:16:54Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+    def convertToFileIfThresholdMet(numReadBytes: Int): PartStream = this match {
+      case ByteStream(os) if numReadBytes >= TempFileThreshold =>
+        val newFile = createTempFile()
+        val fileOutputStream = new FileOutputStream(newFile)


maybe use a BufferedOutputStream to avoid going to the disk for each byte?

Sounds like a good idea as well 👍
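
A minimal sketch of the combined idea, assuming a sink that starts in memory and moves to a buffered file stream once a threshold is reached. `SpillingSink` and its methods are hypothetical and only loosely modelled on the `PartStream` / `convertToFileIfThresholdMet` snippet above.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, File, FileOutputStream, OutputStream}

// Hypothetical helper, not the PR's PartStream: collects bytes in memory and
// moves them to a temp file once `threshold` bytes have been written.
final class SpillingSink(threshold: Int, createTempFile: () => File) {
  private val memory = new ByteArrayOutputStream()
  private var current: OutputStream = memory
  private var file: Option[File] = None
  private var written: Long = 0L

  def write(b: Int): Unit = {
    if (file.isEmpty && written >= threshold) {
      val f = createTempFile()
      // Buffered, so single-byte writes do not hit the disk one by one.
      val out = new BufferedOutputStream(new FileOutputStream(f))
      memory.writeTo(out) // copy what was collected so far
      memory.reset()
      current = out
      file = Some(f)
    }
    current.write(b)
    written += 1
  }

  /** Flushes and closes the sink. Left holds in-memory bytes, Right the temp file. */
  def finish(): Either[Array[Byte], File] = {
    current.close()
    file.toRight(memory.toByteArray)
  }
}
```

Usage would be along the lines of `new SpillingSink(52428800, () => File.createTempFile("part", ".tmp"))`, with the threshold passed in from the server options.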

adamw · 2023-10-11T10:19:53Z

Nicely written - code was easy to follow :) I left some suggestions in the comments

jnatten · 2023-10-17T15:43:19Z

Just pushed some changes doing mostly everything we talked about, except the exception returning errors, but that could probably be a separate PR?

I'm not too sure about the factoring. Tried to keep the mutable parts to the ParseState class.
Also not sure where to put the tests you mentioned regarding the boundary parsing and the large part.

Also wondering some git stuff: Would you like me to keep pushing to this with commits like this, and for you to handle squash/merge however you like. Or would you like me to rebase the PR to a single commit?

jnatten · 2023-10-17T20:26:30Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/CircularBuffer.scala

+
+import scala.collection.IndexedSeqView
+
+class CircularBuffer(bufferSize: Int) {


Not super sure if this needed to be a class or if i should just do something inline. Or even if i should use the circular buffer for everything like i did.

jnatten · 2023-10-17T20:33:39Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/CircularBuffer.scala

+
+    val b1 =
+      if (readBytes >= bufferSize) underlying.view.slice(idx, bufferSize)
+      else underlying.view.slice(0, 0)


This feels a little weird.
But not sure if i love Array.empty[Byte].view either.
What do you think?

jnatten · 2023-10-30T11:00:06Z

Sorry to nag, but any more feedback on this @adamw ? 😄

adamw · 2023-10-30T13:34:24Z

@jnatten I'm very sorry, got swamped by a long queue of work, looking now :)

adamw · 2023-10-30T13:57:04Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+    }
+
+    private def lookForBoundary(currentByte: Int, boundary: Array[Byte]): Boolean = {
+      if (currentByte == boundary(numMatchedBoundaryChars)) {


this might be problematic if the boundary has some common substring with the body. E.g. let's say the boundary is AAB...

The body goes: AAAB..., so we have one A byte from the "proper" body, and then the boundary. We read A, A, advancing numMatchedBoundaryChars, but then we get a mismatch, which resets the counter.

So now we'll continue on reading AB..., and we'll miss the boundary.

Instead, we should have retreated our steps and checked if some other prefix didn't match. I think it's a well known problem in CS ... maybe https://en.wikipedia.org/wiki/Knuth–Morris–Pratt_algorithm ?

Either way, we'll need tests for this :)

That's true! Good catch 👍
I think I'll get KMP working and try to add some tests.
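
For reference, a small sketch of byte-at-a-time KMP matching that handles the `AAB` / `AAAB` case described above. The class and method names (`StreamingKmp`, `feed`) are illustrative only and are not the `KMPMatcher` that ended up in the PR.

```scala
// A sketch only: StreamingKmp and feed are illustrative names.
final class StreamingKmp(pattern: Array[Byte]) {
  require(pattern.nonEmpty, "pattern must not be empty")

  // failure(i) = length of the longest proper prefix of pattern(0..i)
  // that is also a suffix of it.
  private val failure: Array[Int] = {
    val table = new Array[Int](pattern.length)
    var k = 0
    var i = 1
    while (i < pattern.length) {
      while (k > 0 && pattern(i) != pattern(k)) k = table(k - 1)
      if (pattern(i) == pattern(k)) k += 1
      table(i) = k
      i += 1
    }
    table
  }

  // Number of pattern bytes matched by the most recent input bytes.
  private var matched = 0

  /** Feeds one byte; returns true when the pattern has just been fully matched. */
  def feed(b: Byte): Boolean = {
    // On a mismatch, fall back to the longest shorter prefix that still
    // matches, instead of resetting the counter to zero.
    while (matched > 0 && b != pattern(matched)) matched = failure(matched - 1)
    if (b == pattern(matched)) matched += 1
    if (matched == pattern.length) {
      matched = failure(matched - 1)
      true
    } else false
  }
}
```

Feeding `A, A, A, B` to a matcher for `AAB` falls back from two matched bytes to one on the third `A` rather than to zero, so the boundary is still found on the `B`.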

adamw · 2023-10-30T14:02:06Z

The best place to add a test would be here:

tapir/server/tests/src/main/scala/sttp/tapir/server/tests/ServerMultipartTests.scala

Line 22 in c608feb

class ServerMultipartTests[F[_], OPTIONS, ROUTE](

This will then be included in tests for all server interpreters, including this one:

tapir/server/jdkhttp-server/src/test/scala/sttp/tapir/server/jdkhttp/JdkHttpServerTest.scala

Line 17 in 33a1dbd

    
           new ServerBasicTests(createServerTest, interpreter, invulnerableToUnsanitizedHeaders = false).tests() ++

Alternatively, you can add a stand-alone test in JdkHttpServerTest

adamw · 2023-10-30T14:02:32Z

As for commits/squashes, I wouldn't be concerned about it. Please just add commits as is most convenient for you, I usually do normal merges :)

adamw · 2023-10-30T14:07:53Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+        stream = ByteStream()
+        bodySize = 0
+      } else if (numMatchedBoundaryChars == 0) {
+        circularBuffer.getBytes.foreach(byte => stream.underlying.write(byte.toInt))


hm are you really using CircularBuffer as a circular buffer, not only as a buffer? It seems you are buffering bytes only when there's some agreement on the boundary bytes, but as soon as there's a mismatch the bytes are copied to the sink, and the buffer is reset (by setting its currentIndex to -1). So it never wraps?

It seems you are correct 😄
I went through a few iterations, so i don't think we need the wrapping anymore. I'll remove it and rewrite with a normal arraybuffer.

jnatten · 2023-11-02T12:28:33Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+      val foundFinalBoundary = endMatcher.matchByte(currentByte.toByte)
+      val foundBoundary = bodyBoundaryMatcher.matchByte(currentByte.toByte) || foundFinalBoundary


I guess this could be done with another state that looks for -- / \r\n instead.
Not sure which one i prefer. I guess adding the state would be marginally more performant. What do you think?

I think that's good as-is :)

adamw · 2023-11-06T08:12:04Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/KMPMatcher.scala

+
+import scala.collection.mutable
+
+class KMPMatcher(delimiter: Array[Byte]) {


one final comment here, sorry :) maybe it would be worth adding some unit tests for KMPMatcher to check for some corner cases and "happy paths"? I know there's an additional multipart test, but there might be some data combinations which aren't covered. And I'm not that familiar with the algorithm to see that it's correct ;)

Don't be sorry to have requirements that make the code better 😄
I'll add some tests!

adamw · 2023-11-07T08:09:51Z

server/jdkhttp-server/src/main/scala/sttp/tapir/server/jdkhttp/internal/ParsedMultiPart.scala

+        completePart(bodyInputStream)
+        stream = ByteStream()
+        bodySize = 0
+      } else if (bodyBoundaryMatcher.getMatches == 0 && endMatcher.getMatches == 0) {


isn't it possible that we'll overflow the buffer, if we constantly get a match? e.g. the body is only-As and the boundary starts with an A?

You're right, that could make us drop bytes.
I'll try to avoid that 😄
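
One way to keep the pending buffer bounded, sketched under the assumption that a KMP-style matcher is used: after each byte, only the last `matched` bytes can still belong to the boundary, so everything older can be written out immediately. `DelayedWriteSketch` and `copyUntil` are hypothetical names, and this is not necessarily how the PR fixed it.

```scala
import java.io.ByteArrayOutputStream
import scala.collection.mutable

object DelayedWriteSketch {
  // Standard KMP failure table for the delimiter.
  private def failureTable(p: Array[Byte]): Array[Int] = {
    val table = new Array[Int](p.length)
    var k = 0
    var i = 1
    while (i < p.length) {
      while (k > 0 && p(i) != p(k)) k = table(k - 1)
      if (p(i) == p(k)) k += 1
      table(i) = k
      i += 1
    }
    table
  }

  /** Copies bytes from `input` to `out` until `delimiter` is seen, without
    * writing the delimiter itself. Returns true if the delimiter was found.
    * At most `delimiter.length` bytes are pending at any time.
    */
  def copyUntil(input: Iterator[Byte], delimiter: Array[Byte], out: ByteArrayOutputStream): Boolean = {
    require(delimiter.nonEmpty, "delimiter must not be empty")
    val failure = failureTable(delimiter)
    val pending = mutable.Queue.empty[Byte]
    var matched = 0
    var found = false
    while (!found && input.hasNext) {
      val b = input.next()
      while (matched > 0 && b != delimiter(matched)) matched = failure(matched - 1)
      if (b == delimiter(matched)) matched += 1
      pending.enqueue(b)
      // Only the last `matched` bytes can still be part of the delimiter;
      // older bytes are definitely body bytes and can be written now.
      while (pending.length > matched) out.write(pending.dequeue().toInt)
      if (matched == delimiter.length) found = true
    }
    // If the input ended without a delimiter, the pending bytes are body bytes.
    if (!found) pending.foreach(b => out.write(b.toInt))
    found
  }
}
```

With a body of only `A`s and a boundary starting with `A`, the queue never grows beyond the current match length, so no bytes are dropped.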

jnatten · 2023-11-08T12:43:46Z

server/jdkhttp-server/src/test/scala/sttp/tapir/server/jdkhttp/internal/KMPMatcherTest.scala

+  override def tests: Resource[IO, List[Test]] = Resource.eval(
+    IO.pure(
+      List(
+        Test("That matching over a set of bytes works and does not allow writing of any bytes if only matching") {
+          Future {


Is this the correct way to add individual tests that have "nothing" to do with the servers?
Felt a bit weird, but I'm not familiar enough to be sure.

the TestSuite is more aimed at writing server/client interpreter tests. For normal unit tests, you can simply use what ScalaTest offers, see e.g.

tapir/core/src/test/scala/sttp/tapir/CodecTest.scala

Line 21 in f4b4d6b

class CodecTest extends AnyFlatSpec with Matchers with Checkers with Inside {

Aha, makes sense. I'll fix 👍

This patch makes the multipart body parsing byte by byte to introduce writing big multipart parts to files to conserve memory.

This patch improves the parsing in `ParsedMultiPart` by:
- Introducing mutability to improve performance by avoiding allocating the `lastXBytes` array on every iteration.
- Seeking the boundary match at the character level rather than doing the entire `sameElements` comparison each iteration.
- Avoiding truncating the file/stream at the end by not writing bytes until we know that they are not part of the boundary.
- Making the temp file threshold for multipart configurable via `JdkHttpServerOptions`.

According to the spec the boundary should never be longer than 70 chars and that also avoids some potential problems while parsing a body.

adamw · 2023-11-09T17:29:15Z

Took some time, but this looks good - thanks for your work! :)


jnatten requested a review from adamw October 17, 2023 15:43


jnatten requested a review from adamw November 2, 2023 12:29


jnatten added 7 commits November 8, 2023 14:05

jdkhttp-server: Write multipart parts bigger than threshold to files

5619269

This patch makes the multipart body parsing byte by byte to introduce writing big multipart parts to files to conserve memory.

Add multipart test for parsing a body with partial boundary

7b8b690

Implement boundary matching with Knuth-Morris-Pratt

137196c

Fix ParsedMultiPart.buffer overflowing

4a475c6

Add tests for KMPMatcher

8946306

Add boundary max size for jdkhttp

f20c1ea

According to the spec the boundary should never be longer than 70 chars and that also avoids some potential problems while parsing a body.

adamw merged commit 185b893 into softwaremill:master Nov 9, 2023

