Added ability to buffered read huge strings in custom KSerializers #2012

fred01 · 2022-08-16T09:31:03Z

#1987: Added a method that allows large strings to be handled at the user level (for example, decoding Base64 binary data).

fred01 · 2022-08-16T09:41:48Z

Example usage - https://github.com/fred01/TestStreams/blob/master/src/main/kotlin/Main.kt

sandwwraith

Hi, I think we can try to fit this into the 1.5.0 release window. Can you please 1) rebase the PR onto the current dev branch and 2) update the API dumps?

formats/json/commonMain/src/kotlinx/serialization/json/internal/StreamingJsonDecoder.kt

fred01 · 2022-11-25T08:52:29Z

Do I need to update the API dumps if jvmApiCheck passes?

sandwwraith · 2022-11-25T12:31:52Z

If the check task succeeds, then updating the API dumps won't change them

sandwwraith · 2022-11-25T12:40:15Z

But :kotlinx-serialization-core:jvmApiCheck is not successful in your case

fred01 · 2022-11-25T19:14:43Z

You're right, sorry, my bad. Fixed

slavonnet · 2022-11-27T11:46:07Z

I think it would be much better to use an InputStream (decodeFromStream) and define a custom serializer for the Base64 element.

In the serializer you can use
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64InputStream.html (from Apache or Android core)

or

create a temporary, limited-size ByteArrayBuffer / BufferedInputStream, read the bytes in windows/chunks of the buffer size, decode them, and save the result to a byte array or a temp file (OutputStream). When finished, create the String from the byte array or file.

Don't use ...From/ToString, because it causes a full read and creates a large String object. You would need a 6 GB buffer three times over (for the full read, for the conversion, and for the object). When you read in stream mode, you can read byte by byte and reuse a single 64 KB temp buffer.

Also look at
https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b
That pull request has some ideas.
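
For readers who want to picture the windowed approach described in this comment, here is a minimal JVM-only sketch. It is not code from the PR: `decodeBase64Stream` is a hypothetical helper name, and it uses the JDK's `java.util.Base64` instead of the Apache class linked above. It shows one fixed-size buffer being reused while Base64 text is decoded from an input stream into an output stream.

```kotlin
import java.io.InputStream
import java.io.OutputStream
import java.util.Base64

// Hypothetical helper (not part of kotlinx.serialization): streams Base64 text
// from `encoded` into `decoded` through one reusable buffer, so the complete
// payload is never held in memory at once.
fun decodeBase64Stream(
    encoded: InputStream,
    decoded: OutputStream,
    bufferSize: Int = 64 * 1024
): Long {
    val buffer = ByteArray(bufferSize) // the single reusable window
    var total = 0L
    // Base64.Decoder.wrap decodes lazily as bytes are pulled from the stream.
    Base64.getDecoder().wrap(encoded).use { input ->
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            decoded.write(buffer, 0, read)
            total += read
        }
    }
    return total // number of decoded bytes written
}
```

The output stream can be a file stream, which covers the "temp file" variant mentioned above.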

sandwwraith · 2022-11-29T15:37:58Z

@fred01 :kotlinx-serialization-json:jvmApiCheck as well

fred01 · 2022-12-03T11:52:28Z

> I think it would be much better to use an InputStream (decodeFromStream) and define a custom serializer for the Base64 element.
>
> In the serializer you can use https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64InputStream.html (from Apache or Android core)
>
> or
>
> create a temporary, limited-size ByteArrayBuffer / BufferedInputStream, read the bytes in windows/chunks of the buffer size, decode them, and save the result to a byte array or a temp file (OutputStream). When finished, create the String from the byte array or file.
>
> Don't use ...From/ToString, because it causes a full read and creates a large String object. You would need a 6 GB buffer three times over (for the full read, for the conversion, and for the object). When you read in stream mode, you can read byte by byte and reuse a single 64 KB temp buffer.
>
> Also look at https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b That pull request has some ideas.

The current approach is agnostic to the decoding/encoding algorithm; it doesn't force you to use Base64. It's still possible to use Apache's Base64InputStream at the user level: just feed the chunks to it.
I call toString() only on the current buffer, which is at most 16 KB long.
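
To illustrate what "feeding chunks" to a decoder at the user level could look like, here is a small JVM-only sketch. `Base64ChunkSink` is a hypothetical class invented for this example, not something the PR ships, and it assumes the chunks may be cut at arbitrary positions, so an incomplete 4-character Base64 group is carried over to the next chunk.

```kotlin
import java.io.OutputStream
import java.util.Base64

// Hypothetical chunk consumer (not part of kotlinx.serialization): accepts
// Base64 text in arbitrary pieces and writes the decoded bytes to `sink`.
class Base64ChunkSink(private val sink: OutputStream) {
    private val decoder = Base64.getDecoder()
    private val carry = StringBuilder() // at most 3 leftover characters between chunks

    fun consume(chunk: String) {
        carry.append(chunk)
        // Base64 decodes in groups of 4 characters; keep the incomplete tail for later.
        val usable = carry.length - carry.length % 4
        if (usable > 0) {
            sink.write(decoder.decode(carry.substring(0, usable)))
            carry.delete(0, usable)
        }
    }

    fun finish() {
        // Whatever remains must be a valid final group without padding.
        if (carry.isNotEmpty()) {
            sink.write(decoder.decode(carry.toString()))
            carry.setLength(0)
        }
    }
}
```

In a custom serializer, `consume` would presumably be passed as the chunk callback of the ChunkedDecoder interface this PR adds, with `finish` called once decoding of the string returns. Only one chunk plus a few carried characters are alive at any time.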

sviridov-alexey · 2022-12-05T16:14:51Z

> Also look at https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b That pull request has some ideas.

Yes, I looked closely at this PR before starting this one. Got some inspiration from it.

sandwwraith

You likely need to rebase the PR onto the dev branch, as some optimization changes have been made to the lexer

core/commonMain/src/kotlinx/serialization/encoding/ChunkedDecoder.kt

formats/json-tests/jvmTest/src/kotlinx/serialization/json/JsonChunkedDecoderTest.kt

formats/json/commonMain/src/kotlinx/serialization/json/internal/StreamingJsonDecoder.kt

core/commonMain/src/kotlinx/serialization/encoding/ChunkedDecoder.kt

sandwwraith · 2022-12-09T14:48:56Z

formats/json/commonMain/src/kotlinx/serialization/json/internal/lexer/JsonLexer.kt

+        var char = source[currentPosition] // Avoid two range checks visible in the profiler
+        while (char != STRING) {
+            if (++currentPosition >= source.length) {
+                // end of chunk


It seems that you're not handling string escape sequences. Is that intentional?

fred01 · 2022-12-12

Yes and no. At first it was intentional, because I expected to consume a Base64 string, which can't contain a double quote. Later I switched to a generic approach, so it seems I should now handle double quotes as well. Will fix.

fred01 · 2023-01-01

Added support for escaping.

fred01 · 2023-01-02

Also, please note: to handle escaping properly, I had to move the actual decoding method from ReaderJsonLexer to AbstractJsonLexer. Otherwise I would have had to make a lot of AbstractJsonLexer's private methods non-private.

sandwwraith · 2023-01-16T15:22:10Z

Sorry, I don't quite understand. Can you elaborate on that please?

fred01 · 2023-01-22

Previously, I placed my method in ReaderJsonLexer, to use the highest possible level of the hierarchy. But to handle escaping properly I had to move the handling method to AbstractJsonLexer, going down one level. Is that OK?
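
The escaping concern in this thread can be made concrete with a standalone sketch. This is not the lexer code from the PR: `readJsonStringChunked` is a hypothetical function that assumes the opening quote has already been consumed, reads character by character for clarity rather than speed, and ignores lenient mode. It shows why a quote preceded by a backslash must not end the string, and how decoded text is handed out in bounded chunks.

```kotlin
import java.io.Reader

// Hypothetical stand-in for the logic discussed above; not the library's lexer.
// Reads the body of a JSON string from `reader` and passes it to `consumeChunk`
// in pieces of at most `chunkSize` characters, resolving escape sequences.
fun readJsonStringChunked(
    reader: Reader,
    chunkSize: Int = 16 * 1024,
    consumeChunk: (String) -> Unit
) {
    val chunk = StringBuilder()
    fun emit(c: Char) {
        chunk.append(c)
        if (chunk.length >= chunkSize) {
            consumeChunk(chunk.toString())
            chunk.setLength(0)
        }
    }
    fun next(): Char {
        val c = reader.read()
        require(c >= 0) { "Unexpected end of input inside a string" }
        return c.toChar()
    }
    scan@ while (true) {
        when (val c = next()) {
            '"' -> break@scan // an unescaped quote ends the string
            '\\' -> when (val e = next()) { // the escaped character never terminates
                '"', '\\', '/' -> emit(e)
                'n' -> emit('\n')
                't' -> emit('\t')
                'r' -> emit('\r')
                'b' -> emit('\b')
                'f' -> emit('\u000C')
                'u' -> emit(CharArray(4) { next() }.concatToString().toInt(16).toChar())
                else -> throw IllegalArgumentException("Invalid escape: \\$e")
            }
            else -> emit(c)
        }
    }
    if (chunk.isNotEmpty()) consumeChunk(chunk.toString())
}
```

Because escapes are resolved before characters are appended, an escape sequence can never be split across two emitted chunks in this sketch.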

formats/json/commonMain/src/kotlinx/serialization/json/internal/lexer/JsonLexer.kt

formats/json/commonMain/src/kotlinx/serialization/json/internal/StreamingJsonDecoder.kt

formats/json/commonMain/src/kotlinx/serialization/json/internal/lexer/StringJsonLexer.kt

core/commonMain/src/kotlinx/serialization/encoding/ChunkedDecoder.kt

sandwwraith · 2023-01-16T15:17:10Z

formats/json-tests/jvmTest/src/kotlinx/serialization/json/JsonChunkedDecoderTest.kt

+data class ClassWithLargeStringDataField(val largeStringField: LargeStringData)
+
+
+object LargeStringSerializer : KSerializer<LargeStringData> {


Given that the interface and implementation you've provided are located in commonMain, i.e. are multiplatform, they should be tested in commonTest too, at least the part where that is possible. IIRC Base64 is not MPP, so you can move LargeStringSerializer and its test to commonTest and leave LargeBase64StringSerializer here.

fred01 · 2023-01-22

Done. BTW, I have a simple native Kotlin Base64 implementation, inspired by Okio: https://android.googlesource.com/platform/external/okhttp/+/a2cab72aa5ff730ba2ae987b45398faafffeb505/okio/okio/src/main/java/okio/Base64.java (Apache license). To be honest, it's just converted from Java and slightly corrected. Is it worth using here, and is that allowed? In that case we could move this test completely to the common part.

sandwwraith · 2023-02-07

I'm not sure if it's worth it, as testing can be done without it.

formats/json-tests/jvmTest/src/kotlinx/serialization/json/JsonChunkedDecoderTest.kt

sandwwraith · 2023-01-16T15:23:28Z

formats/json/commonMain/src/kotlinx/serialization/json/internal/lexer/AbstractJsonLexer.kt

@@ -307,6 +306,54 @@ internal abstract class AbstractJsonLexer {
     */
    abstract fun consumeKeyString(): String

+    private fun insideString(isLenient:Boolean, char:Char):Boolean = if (isLenient) { charToTokenClass(char) == TC_OTHER } else { char != STRING }


formatting: spaces

sandwwraith · 2023-02-07T16:34:48Z

still formatting

formats/json/commonMain/src/kotlinx/serialization/json/internal/lexer/AbstractJsonLexer.kt

sandwwraith

I think it's OK after fixing minor comments

core/commonMain/src/kotlinx/serialization/encoding/ChunkedDecoder.kt

sandwwraith · 2023-02-07T16:35:47Z

Although please check the failing tests on non-JVM platforms

sandwwraith

Great job, thank you!

fred01 marked this pull request as ready for review August 16, 2022 10:01

qwwdfsad added the json label Oct 14, 2022

sandwwraith reviewed Nov 16, 2022

fred01 changed the base branch from master to dev November 24, 2022 22:35

fred01 added 5 commits November 25, 2022 08:25

Added stream-friendly version of decodeString

636db65

Added changes to API golden files

dc49171

Fixed review defects

aee9834

Ajust MR to master branch

6cf8e9d

Migrate to star imports

ff4b726

fred01 force-pushed the master branch from d111680 to ff4b726 Compare November 25, 2022 07:37

Migrate to star imports

39b1b8b

Dump new APIs

e8949cf

Dumped kotlinx-serialization-json.api

c71e8a6

sandwwraith requested changes Dec 9, 2022

fred01 added 2 commits January 2, 2023 00:52

Fixed review defects

a8a76c6

- Support escaping - Support lenient mode - KDoc fixes - Formatting

Fixed review defects

1581a12

- Added sample usage to method's KDoc

fred01 requested a review from sandwwraith January 5, 2023 07:17

sandwwraith reviewed Jan 16, 2023

sandwwraith requested changes Jan 16, 2023

fred01 and others added 2 commits January 22, 2023 15:53

Update formats/json/commonMain/src/kotlinx/serialization/json/interna…

a644025

…l/lexer/AbstractJsonLexer.kt Co-authored-by: Leonid Startsev <[email protected]>

Update core/commonMain/src/kotlinx/serialization/encoding/ChunkedDeco…

9afcfb6

…der.kt Co-authored-by: Leonid Startsev <[email protected]>

fred01 and others added 3 commits January 22, 2023 15:53

Update core/commonMain/src/kotlinx/serialization/encoding/ChunkedDeco…

bf17b67

…der.kt Co-authored-by: Leonid Startsev <[email protected]>

Update core/commonMain/src/kotlinx/serialization/encoding/ChunkedDeco…

b08b5fa

…der.kt Co-authored-by: Leonid Startsev <[email protected]>

Fixed review defects

e9998f6

- Slighly modified KDoc documention and example - Moved non-base64 part of test to json commonText - Avoid code duplication in plain string test

fred01 requested a review from sandwwraith January 22, 2023 20:52

sandwwraith reviewed Feb 7, 2023

Fixed review defects

e194bd0

- Slighly KDoc modifications - Fixed tests for non-JVM platforms

fred01 requested a review from sandwwraith February 9, 2023 22:25

Apply suggestions from code review

5d8d770

sandwwraith approved these changes Feb 16, 2023

sandwwraith merged commit 90113a9 into Kotlin:dev Feb 16, 2023

Bodo1981 mentioned this pull request Apr 4, 2023

Support ChunkedDecoder to read big xml data chunked pdvrieze/xmlutil#132

Open
