fix: html5 encoding detection case insensitive re: meta tag
flavorjones committed Nov 14, 2022
1 parent cd2700a commit 6636e86
Showing 3 changed files with 10 additions and 2 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -44,6 +44,7 @@ This version of Nokogiri uses [`jar-dependencies`](https://github.com/mkristian/
* [CRuby] The HTML5 parser now correctly handles text at the end of `form` elements.
* [CRuby] `HTML5::Document#fragment` now always uses `body` as the parsing context. Previously, fragments were parsed in the context of the associated document's root node, which allowed for inconsistent parsing. [[#2553](https://github.com/sparklemotion/nokogiri/issues/2553)]
* [CRuby] `Nokogiri::HTML5::Document#url` now correctly returns the URL passed to the constructor method. Previously it always returned `nil`. [[#2583](https://github.com/sparklemotion/nokogiri/issues/2583)]
+* [CRuby] `HTML5` encoding detection is now case-insensitive with respect to `meta` tag charset declaration. [[#2693](https://github.com/sparklemotion/nokogiri/issues/2693)]
* [JRuby] Fixed a bug with adding the same namespace to multiple nodes via `#add_namespace_definition`. [[#1247](https://github.com/sparklemotion/nokogiri/issues/1247)]
* [JRuby] `NodeSet#[]` now raises a TypeError if passed an invalid parameter type. [[#2211](https://github.com/sparklemotion/nokogiri/issues/2211)]
* [CRuby+OSX] Compiling from source on MacOS will use the clang option `-Wno-unknown-warning-option` to avoid errors when Ruby injects options that clang doesn't know about. [[#2689](https://github.com/sparklemotion/nokogiri/issues/2689)]
2 changes: 1 addition & 1 deletion lib/nokogiri/html5.rb
@@ -363,7 +363,7 @@ def reencode(body, content_type = nil)
# look for a charset in a meta tag in the first 1024 bytes
unless encoding
data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, "")
-data.scan(/<meta.*?>/m).each do |meta|
+data.scan(/<meta.*?>/im).each do |meta|
encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
end
end
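
The effect of the added `i` flag can be checked in isolation. The sketch below copies the scan from the hunk above into a standalone helper; `sniff_meta_charset`, `OLD_PATTERN`, and `NEW_PATTERN` are hypothetical names used only for illustration and are not part of Nokogiri's API. It assumes the input is a binary string, as in the tests for this commit.

```ruby
# Hypothetical helper that mirrors the meta-tag scan from the diff.
# It is NOT a Nokogiri method; the tag pattern is passed in so the
# old and new behaviour can be compared side by side.
def sniff_meta_charset(body, meta_pattern)
  # Drop HTML comments from the first 1024 bytes, as the original code does.
  data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, "")
  encoding = nil
  data.scan(meta_pattern).each do |meta|
    # The charset regex was already case-insensitive before this commit.
    encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
  end
  encoding
end

OLD_PATTERN = /<meta.*?>/m   # before the fix: only matches lowercase "<meta"
NEW_PATTERN = /<meta.*?>/im  # after the fix: matches "<META", "<Meta", ...

lower = "<meta charset='utf-8'><span>Se\xC3\xB1or</span>".b
upper = "<META CHARSET='utf-8'><SPAN>Se\xC3\xB1or</SPAN>".b

p sniff_meta_charset(lower, OLD_PATTERN) # "utf-8"
p sniff_meta_charset(upper, OLD_PATTERN) # nil, the uppercase tag is never scanned
p sniff_meta_charset(upper, NEW_PATTERN) # "utf-8"
```

Only the outer tag pattern needed to change: the inner `charset=` regex already carried the `i` flag, but it never ran on an uppercase `<META ...>` tag because the case-sensitive scan did not yield one.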
9 changes: 8 additions & 1 deletion test/html5/test_encoding.rb
@@ -17,13 +17,20 @@ def test_iso8859_encoding
assert_equal("<span>Señor</span>", doc.at("span").to_xml)
end

-def test_charset_encoding
+def test_meta_charset_encoding
utf8 = (+"<meta charset='utf-8'><span>Se\xC3\xB1or</span>")
.force_encoding(Encoding::ASCII_8BIT)
doc = Nokogiri::HTML5(utf8)
assert_equal("<span>Señor</span>", doc.at("span").to_xml)
end

+def test_META_CHARSET_encoding
+utf8 = (+"<META CHARSET='utf-8'><SPAN>Se\xC3\xB1or</SPAN>")
+.force_encoding(Encoding::ASCII_8BIT)
+doc = Nokogiri::HTML5(utf8)
+assert_equal("<span>Señor</span>", doc.at("span").to_xml)
+end
+
def test_bogus_encoding
bogus = (+"<meta charset='bogus'><span>Se\xF1or</span>")
.force_encoding(Encoding::ASCII_8BIT)