Correctly serialize UTF-16 documents that are longer than libxml2's internal string buffer #2434

flavorjones · 2022-01-23T17:42:02Z

What problem is this PR intended to solve?

#752 reported a bug in serializing long UTF-16 documents.

The serialized document was corrupted because we did not consistently use the document encoding while collecting multiple libxml2 buffer flushes. As a result, an incorrect number of bytes was written, including some garbage, and spurious BOMs appeared in the middle of the serialized document.

This fix works by setting the external encoding on the StringIO object used to collect the serialized stream, and then using that encoding when constructing intermediate strings from libxml2's buffer.
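To illustrate the idea, here is a minimal pure-Ruby sketch, not the actual C code in this PR. It assumes the libxml2 buffer flushes can be modeled as raw binary chunks of an already-encoded UTF-16LE document; the helper name `collect_flushes` and the 64-byte chunk size are hypothetical and chosen only for the illustration.

```ruby
require "stringio"

# Hypothetical stand-in for the collection step: each element of `chunks`
# models one buffer flush as raw bytes (ASCII-8BIT), already in `encoding`.
def collect_flushes(chunks, encoding)
  io = StringIO.new
  io.set_encoding(encoding) # set the external encoding on the collector

  chunks.each do |raw_bytes|
    # Tag the intermediate string with the collector's encoding before
    # writing, so the bytes are appended as-is rather than reinterpreted.
    io.write(raw_bytes.dup.force_encoding(io.external_encoding))
  end

  io.string
end

# Build a "long" UTF-16LE document and split it into even-sized byte chunks
# so that no UTF-16 code unit is cut in half.
source  = "<root>" + ("\u00e9" * 1000) + "</root>"
encoded = source.encode(Encoding::UTF_16LE)
chunks  = encoded.bytes.each_slice(64).map { |slice| slice.pack("C*") }

result = collect_flushes(chunks, Encoding::UTF_16LE)
```

In this sketch, `result` carries the UTF-16LE encoding and holds the same bytes as `encoded`, so transcoding it back with `result.encode("UTF-8")` recovers `source`.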

Have you included adequate test coverage?

Yes!

Does this change affect the behavior of either the C or the Java implementations?

This fixes the CRuby/libxml2 behavior, which now matches the JRuby behavior.

UTF-16 documents that are long enough to trigger an intermediate libxml2 buffer flush are now serialized correctly. This change works by setting the external encoding on the StringIO object, and then using that encoding when constructing intermediate strings from libxml2's buffer.

flavorjones added 3 commits January 23, 2022 12:16

style: clean up noko_io_* functions ahead of bugfix

ea27f02

style: convert doc encoding tests to minispec

0af1c5b

and consolidate and rewrite a few tests

flavorjones mentioned this pull request Jan 23, 2022

Output of #to_xml munged beyond certain file size using UTF-16 declaration #752

Closed

flavorjones added this to the v1.14.0 milestone Jan 23, 2022

flavorjones added the topic/encoding label Jan 23, 2022

flavorjones merged commit 53c8293 into main Jan 23, 2022

flavorjones deleted the 752-long-utf16-documents branch January 23, 2022 20:21

flavorjones added a commit that referenced this pull request Feb 8, 2022

refactor: simplify fix from 2e260f5 / #2434

b5768c6

flavorjones mentioned this pull request Feb 8, 2022

refactor: simplify fix from 2e260f5 / #2434 #2447

Merged

flavorjones added a commit that referenced this pull request Feb 9, 2022

Merge pull request #2447 from sparklemotion/flavorjones-simplify-utf1…

a602399

…6-fix refactor: simplify fix from 2e260f5 / #2434

flavorjones mentioned this pull request Jan 21, 2023

fix: serialization with pseudo-IO objects like Zip::OutputStream #2775

Merged
