Correctly serialize UTF-16 documents that are longer than libxml2's internal string buffer #2434

Merged (3 commits) on Jan 23, 2022

flavorjones
Member

What problem is this PR intended to solve?

#752 reported a bug in serializing long UTF-16 documents.

The serialized document was corrupted because we were not careful to use the document encoding while collecting multiple libxml2 buffer flushes. As a result, an incorrect number of bytes was written, including some garbage, and incorrect BOMs appeared in the middle of the serialized document.

This fix works by setting the external encoding on the StringIO object used to collect the serialized stream, and then using that encoding when constructing intermediate strings from libxml2's buffer.
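
To illustrate the idea in plain Ruby, here is a minimal sketch that does not use Nokogiri or libxml2. The `collect_chunks` helper and the hand-split byte chunks are hypothetical stand-ins for libxml2's buffer flushes, not the code in this PR. Each raw chunk is tagged with the StringIO's external encoding before it is written, so the collected result is byte-for-byte identical to the original UTF-16 string.

```ruby
require "stringio"

# Hypothetical helper (not part of Nokogiri): collects raw byte chunks into a
# StringIO, tagging each chunk with the IO's external encoding before writing.
def collect_chunks(raw_chunks, encoding)
  io = StringIO.new
  io.set_encoding(encoding)
  raw_chunks.each do |raw|
    # Reinterpret the bytes in the target encoding; no transcoding happens here.
    io.write(raw.dup.force_encoding(io.external_encoding))
  end
  io.string
end

# A UTF-16LE string standing in for a serialized document.
expected = ("<root>" + ("\u00e9" * 50) + "</root>").encode(Encoding::UTF_16LE)
raw = expected.b

# Simulate two buffer flushes. The split is on an even byte offset so that
# each chunk contains only whole UTF-16 code units.
chunks = [raw.byteslice(0, 40), raw.byteslice(40, raw.bytesize - 40)]

result = collect_chunks(chunks, Encoding::UTF_16LE)
```

Because every chunk already carries the same encoding as the StringIO, `write` appends the bytes unchanged instead of attempting a conversion between encodings.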

Have you included adequate test coverage?

Yes!

Does this change affect the behavior of either the C or the Java implementations?

This fixes the CRuby/libxml2 behavior, which now matches the JRuby behavior.

and consolidate and rewrite a few tests
UTF-16 documents that are long enough to trigger an intermediate
libxml2 buffer flush are now serialized correctly.

This change works by setting the external encoding on the StringIO
object, and then using that encoding when constructing intermediate
strings from libxml2's buffer.