fix: long UTF-16 documents serialize correctly
UTF-16 documents that are long enough to trigger an intermediate
libxml2 buffer flush are now serialized correctly.

This change works by setting the external encoding on the StringIO
object, and then using that encoding when constructing intermediate
strings from libxml2's buffer.
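
A minimal Ruby sketch of the approach described above, outside of Nokogiri: a StringIO is opened with an explicit external encoding, and each intermediate chunk is tagged with that encoding before it is written. The `utf16_chunk` helper is hypothetical and only stands in for the strings built from libxml2's buffer; it is not part of Nokogiri.

```ruby
require "stringio"

# Same StringIO construction the commit uses in Node#serialize.
encoding = Encoding.find("UTF-16")
io = StringIO.new(String.new(encoding: encoding), "wb:#{encoding}:#{encoding}")

# Hypothetical stand-in for one flushed buffer: raw UTF-16 bytes tagged
# with the encoding that the IO object reports.
def utf16_chunk(text, io)
  text.encode("UTF-16BE").force_encoding(io.external_encoding)
end

# Two writes simulate a document that spans more than one buffer flush.
io.write(utf16_chunk("AB", io))
io.write(utf16_chunk("CD", io))

result = io.string
puts result.encoding
puts result.bytesize
```

Because every chunk carries the same encoding as the target string, the writes are plain byte appends and no transcoding is attempted between flushes.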
flavorjones committed Jan 23, 2022
1 parent 0af1c5b commit 2e260f5
Showing 4 changed files with 34 additions and 7 deletions.
7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -6,6 +6,8 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

## 1.14.0 / unreleased

+### Notes

#### Faster, more reliable installation: Native Gem for ARM64 Linux

This version of Nokogiri ships full native gem support for the `aarch64-linux` platform, which should support AWS Graviton and other ARM Linux platforms. Please note that glibc >= 2.29 is required for aarch64-linux systems, see [Supported Platforms](https://nokogiri.org/#supported-platforms) for more information.
@@ -16,6 +18,11 @@ This version of Nokogiri ships full native gem support for the `aarch64-linux` p
This version of Nokogiri uses [`jar-dependencies`](https://github.com/mkristian/jar-dependencies) to manage most of the vendored Java dependencies. `nokogiri -v` now outputs maven metadata for all Java dependencies, and `Nokogiri::VERSION_INFO` also contains this metadata. [[#2432](https://github.com/sparklemotion/nokogiri/issues/2432)]


+### Fixed
+
+* [CRuby] UTF-16-encoded documents longer than ~4000 code points now serialize properly. Previously the serialized document was corrupted when it exceeded the length of libxml2's internal string buffer. [[#752](https://github.com/sparklemotion/nokogiri/issues/752)]


## 1.13.1 / 2022-01-13

### Fixed
6 changes: 4 additions & 2 deletions ext/nokogiri/nokogiri.c
@@ -49,7 +49,7 @@ void noko_init_html_sax_push_parser(void);
void noko_init_gumbo(void);
void noko_init_test_global_handlers(void);

-static ID id_read, id_write;
+static ID id_read, id_write, id_external_encoding;


#ifndef HAVE_VASPRINTF
@@ -135,9 +135,10 @@ noko_io_write(void *io, char *c_buffer, int c_buffer_len)
{
VALUE rb_args[2], rb_n_bytes_written;
VALUE rb_io = (VALUE)io;
+rb_encoding *io_encoding = rb_to_encoding(rb_funcall(rb_io, id_external_encoding, 0));

rb_args[0] = rb_io;
-rb_args[1] = rb_str_new(c_buffer, (long)c_buffer_len);
+rb_args[1] = rb_enc_str_new(c_buffer, (long)c_buffer_len, io_encoding);

rb_n_bytes_written = rb_rescue(noko_io_write_check, (VALUE)rb_args, noko_io_write_failed, 0);
if (rb_n_bytes_written == Qundef) { return -1; }
@@ -277,4 +278,5 @@ Init_nokogiri()

id_read = rb_intern("read");
id_write = rb_intern("write");
+id_external_encoding = rb_intern("external_encoding");
}
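
As a rough Ruby-level analogy for the C change above: `rb_str_new` produces a binary (ASCII-8BIT) string, while `rb_enc_str_new` produces a string tagged with the given encoding. The sketch below uses hand-written sample bytes, not real libxml2 output, to show that the same bytes are interpreted differently depending on that tag.

```ruby
# Four raw bytes: "A" and "B" in big-endian UTF-16 (sample data only).
bytes  = "\x00A\x00B".b                                  # binary, as rb_str_new would tag it
tagged = bytes.dup.force_encoding(Encoding::UTF_16BE)    # tagged, as rb_enc_str_new would

puts bytes.encoding == Encoding::BINARY  # => true
puts bytes.length                        # => 4 (counted as bytes)
puts tagged.length                       # => 2 (counted as UTF-16 characters)
puts tagged.encode("UTF-8")              # => AB
```

UTF-16BE is used here instead of UTF-16 only so that the sample can be transcoded for display; the commit itself uses whatever encoding the IO object reports.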
9 changes: 4 additions & 5 deletions lib/nokogiri/xml/node.rb
@@ -1188,12 +1188,11 @@ def serialize(*args, &block)
}
end

-encoding = options[:encoding] || document.encoding
-options[:encoding] = encoding
+options[:encoding] ||= document.encoding
+encoding = Encoding.find(options[:encoding] || "UTF-8")
+
+io = StringIO.new(String.new(encoding: encoding), "wb:#{encoding}:#{encoding}")

-outstring = +""
-outstring.force_encoding(Encoding.find(encoding || "utf-8"))
-io = StringIO.new(outstring)
write_to(io, options, &block)
io.string
end
19 changes: 19 additions & 0 deletions test/xml/test_document_encoding.rb
@@ -43,6 +43,25 @@ class TestDocumentEncoding < Nokogiri::TestCase
assert_equal("UTF-8", Nokogiri::LIBXML_COMPILED_VERSION.encoding.name)
assert_equal("UTF-8", Nokogiri::LIBXSLT_COMPILED_VERSION.encoding.name)
end
+
+it "serializes UTF-16 correctly across libxml2 buffer flushes" do
+# https://github.com/sparklemotion/nokogiri/issues/752
+skip_unless_libxml2
+
+# the document needs to be large enough to trigger a libxml2 buffer flush. the buffer size
+# is determined by MINLEN in xmlIO.c, which is hardcoded to 4000 code points.
+size = 4000
+input = String.new(<<~XML, encoding: "UTF-16")
+<?xml version="1.0" encoding="UTF-16"?>
+<root>
+<bar>#{"A" * size}</bar>
+</root>
+XML
+expected_length = (input.bytesize * 2) + 2 # double character width, add BOM bytes 0xFEFF
+
+output = Nokogiri::XML(input).to_xml
+assert_equal(expected_length, output.bytesize)
+end
end
end
end
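
The expected-length arithmetic in the test can be checked with plain Ruby, without Nokogiri. This sketch assumes an ASCII-only document, so each character is one byte in UTF-8 and two bytes in UTF-16; the sample `text` is illustrative and is not the test's actual input.

```ruby
# ASCII-only sample, one byte per character in UTF-8.
text = "<bar>#{"A" * 4000}</bar>"

# Transcoding to "UTF-16" in Ruby prepends a big-endian byte order mark.
utf16 = text.encode("UTF-16")

# Double the character width, then add the two BOM bytes.
expected_length = (text.bytesize * 2) + 2

puts utf16.bytesize == expected_length  # => true
puts utf16.bytes.first(2).inspect       # => [254, 255]
```

The same formula would not hold for input containing non-ASCII characters, where the UTF-8 byte count and the UTF-16 byte count are no longer related by a factor of two.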