Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added ability to buffered read huge strings in custom KSerializers #2012

Merged
merged 17 commits into from
Feb 16, 2023

Conversation

fred01
Copy link
Contributor

@fred01 fred01 commented Aug 16, 2022

#1987 Added method which allow to perform large string handling (decode base64 binary for example) at user level

@fred01
Copy link
Contributor Author

fred01 commented Aug 16, 2022

@fred01 fred01 marked this pull request as ready for review August 16, 2022 10:01
@qwwdfsad qwwdfsad added the json label Oct 14, 2022
Copy link
Member

@sandwwraith sandwwraith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I think we can try to fit this into the 1.5.0 release window. Can you please 1) rebase PR on the actual dev and 2) update API dumps?

@fred01 fred01 changed the base branch from master to dev November 24, 2022 22:35
@fred01
Copy link
Contributor Author

fred01 commented Nov 25, 2022

Do I need update API dumps if jvmApiCheck is passed?

@sandwwraith
Copy link
Member

If check task is successful, then API dumps won't change on updating

@sandwwraith
Copy link
Member

But :kotlinx-serialization-core:jvmApiCheck is not successful in your case

@fred01
Copy link
Contributor Author

fred01 commented Nov 25, 2022

You right, sorry, my bad. Fixed

@slavonnet
Copy link

I think much better use InputStream (decodeFromStream) + define custom serializer for BASE64 element

In serializer you can use
https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64InputStream.html (from apache or android core)

or

create temporary limited size ByteArrayBuffer / BufferedInputStream read bytes windowed/chunked by buffer size , decode and save to bytearray or temp file (outputStream). After finish create String from ByteArray or file.

Dont use ...From/ToString because its cause AllRead() and create large String Object, You was need 6g buffer * 3 times (for full read, for convert and for object). Then you read in StreamMode, you can read by bytes and reuse one 64k temp buffer

Also look
https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b
This request have some idea parts

@sandwwraith
Copy link
Member

@fred01 :kotlinx-serialization-json:jvmApiCheck as well

@fred01
Copy link
Contributor Author

fred01 commented Dec 3, 2022

I think much better use InputStream (decodeFromStream) + define custom serializer for BASE64 element

In serializer you can use https://commons.apache.org/proper/commons-codec/apidocs/org/apache/commons/codec/binary/Base64InputStream.html (from apache or android core)

or

create temporary limited size ByteArrayBuffer / BufferedInputStream read bytes windowed/chunked by buffer size , decode and save to bytearray or temp file (outputStream). After finish create String from ByteArray or file.

Dont use ...From/ToString because its cause AllRead() and create large String Object, You was need 6g buffer * 3 times (for full read, for convert and for object). Then you read in StreamMode, you can read by bytes and reuse one 64k temp buffer

Also look https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b This request have some idea parts

  1. Current approach is decode/encode algorithm-agnostic, it not force you to use Base64. It's still possible to use Apache's Base64InputStream at user level, just feed chunks to it.

  2. I use toString() only for current buffer which has 16kb max length

@sviridov-alexey
Copy link

Also look https://github.com/Kotlin/kotlinx.serialization/pull/2101/files#diff-187a6f057732ac528be3c0acb7f4b5a83225c046d211361d4114eae76dee6b7b This request have some idea parts

Yes, I looked closely at this PR before starting this one. Get some inspiration from it.

Copy link
Member

@sandwwraith sandwwraith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You likely need to rebase PR on the dev branch, as there are some optimization changes happened to lexer

var char = source[currentPosition] // Avoid two range checks visible in the profiler
while (char != STRING) {
if (++currentPosition >= source.length) {
// end of chunk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that you're not handling string escape sequences. Is that intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no. For the first time, it was intentionally, because I suppose to consume base64 string, which can't contain double quote. But later, I prefer to generic approach, and seems, now I should handle double quotes as well. Will fix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added support for escaping

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please note - to properly handle escaping, I'm forced to move actual decoding method from ReaderJsonLexer to AbstractJsonLexer. In other case I would need un-private a lot amount of methods of AbstractJsonLexer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't quite understand. Can you elaborate on that please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, I placed my method in ReaderJsonLexer to use highest possible hierarchy level. But to properly handle escaping I forced to move handling method to AbstractJsonLexer, go down one level. Is it OK?

- Support escaping
- Support lenient mode
- KDoc fixes
- Formatting
- Added sample usage to method's KDoc
data class ClassWithLargeStringDataField(val largeStringField: LargeStringData)


object LargeStringSerializer : KSerializer<LargeStringData> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that interface and implementation you've provided are located in commonMain, i.e. multiplatform, they should be tested in commonTest, too — at least that part which is possible. IIRC Base64 is not MPP, so you can move LargeStringSerializer and its test to commonTest and leave LargeBase64StringSerializer here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. BTW, I have a simple native kotlin Base64 implementation, inspired by Okio https://android.googlesource.com/platform/external/okhttp/+/a2cab72aa5ff730ba2ae987b45398faafffeb505/okio/okio/src/main/java/okio/Base64.java (apache license) To be honest it's just converted from java and slightly corrected. Is it worth (allowed) to use it here? In that case we can move this test completely to common part

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's worth it, as testing can be done without it

var char = source[currentPosition] // Avoid two range checks visible in the profiler
while (char != STRING) {
if (++currentPosition >= source.length) {
// end of chunk
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't quite understand. Can you elaborate on that please?

@@ -307,6 +306,54 @@ internal abstract class AbstractJsonLexer {
*/
abstract fun consumeKeyString(): String

private fun insideString(isLenient:Boolean, char:Char):Boolean = if (isLenient) { charToTokenClass(char) == TC_OTHER } else { char != STRING }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

formatting: spaces

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still formatting

fred01 and others added 3 commits January 22, 2023 15:53
- Slighly modified KDoc documention and example
- Moved non-base64 part of test to json commonText
- Avoid code duplication in plain string test
Copy link
Member

@sandwwraith sandwwraith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK after fixing minor comments

@@ -307,6 +306,54 @@ internal abstract class AbstractJsonLexer {
*/
abstract fun consumeKeyString(): String

private fun insideString(isLenient:Boolean, char:Char):Boolean = if (isLenient) { charToTokenClass(char) == TC_OTHER } else { char != STRING }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still formatting

@sandwwraith
Copy link
Member

Although check failing tests on non-JVM platforms

- Slighly KDoc modifications
- Fixed tests for non-JVM platforms
Copy link
Member

@sandwwraith sandwwraith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants