fix errors on non-UTF-8 encodings (#4730)
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT, which works correctly.
However, when using FileBlob, the encoding could be ASCII-8BIT, UTF-8
or another, depending on the runtime value of Encoding.default_external.
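
As a rough illustration of the dependency described above, the following sketch reads the same bytes twice while changing Encoding.default_external in between. The temporary file and its contents are hypothetical stand-ins, not part of the Linguist test suite, and the sketch assumes Encoding.default_internal is left unset.

```ruby
require "tempfile"

# Hypothetical sample: a UTF-16LE byte order mark followed by "h".
file = Tempfile.new("encoding-demo")
file.binmode
file.write("\xFF\xFEh\x00".b)
file.close

original = Encoding.default_external
observed = {}
begin
  ["UTF-8", "ASCII-8BIT"].each do |name|
    # File.read without an explicit encoding tags the result with
    # whatever Encoding.default_external is at the time of the call.
    Encoding.default_external = Encoding.find(name)
    observed[name] = File.read(file.path).encoding
  end
ensure
  # Restore the process-wide setting and clean up the temporary file.
  Encoding.default_external = original
  file.unlink
end
```

Here `observed` maps each default external encoding to the encoding of the string that File.read returned, which is why the same FileBlob could behave differently from one environment to another.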

Tests were not catching the error since they were forcing
Encoding.default_external to be ASCII-8BIT (introduced in #1211). So the
error would only be seen in real-world usage (see issue #353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
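
A minimal sketch of the failure and the fix, assuming a small hypothetical UTF-16LE file rather than the real test/fixtures/Data/utf16le fixture. It passes an explicit UTF-8 encoding to imitate a UTF-8 locale, then reads the same file the way the patched FileBlob#data does.

```ruby
require "tempfile"

# Hypothetical stand-in for a UTF-16LE fixture: a BOM followed by "hi\n".
utf16le_bytes = "\xFF\xFEh\x00i\x00\n\x00".b

file = Tempfile.new("utf16le-sample")
file.binmode
file.write(utf16le_bytes)
file.close

# Mimics what File.read returns under a UTF-8 default external encoding.
as_utf8 = File.read(file.path, :encoding => "UTF-8")
# Same call shape as the patched FileBlob#data.
as_binary = File.read(file.path, :encoding => "ASCII-8BIT")
file.unlink

# Regexp matching on a string whose bytes are invalid for its encoding raises.
utf8_error =
  begin
    as_utf8 =~ /\n/
    nil
  rescue ArgumentError => e
    e.message
  end

# A binary string has no invalid byte sequences, so splitting is safe.
binary_lines = as_binary.split(/\n/)
```

The UTF-8 tagged string fails `valid_encoding?`, while the ASCII-8BIT string keeps the bytes untouched and leaves any later encoding detection to the caller.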
smola authored Mar 16, 2020
1 parent f0e2d0d commit 7a7f01f
Showing 4 changed files with 8 additions and 49 deletions.
2 changes: 1 addition & 1 deletion lib/linguist/file_blob.rb
@@ -35,7 +35,7 @@ def symlink?
#
# Returns a String.
def data
- @data ||= File.read(@fullpath)
+ @data ||= File.read(@fullpath, :encoding => "ASCII-8BIT")
end

# Public: Get byte size
4 changes: 2 additions & 2 deletions test/helper.rb
@@ -17,7 +17,7 @@ def fixture_blob(name)

def fixture_blob_memory(name)
filepath = (name =~ /^\//)? name : File.join(fixtures_path, name)
- content = File.read(filepath)
+ content = File.read(filepath, :encoding => "ASCII-8BIT")
Linguist::Blob.new(name, content)
end

@@ -32,7 +32,7 @@ def sample_blob(name)

def sample_blob_memory(name)
filepath = (name =~ /^\//)? name : File.join(samples_path, name)
- content = File.read(filepath)
+ content = File.read(filepath, :encoding => "ASCII-8BIT")
Linguist::Blob.new(name, content)
end

28 changes: 5 additions & 23 deletions test/test_blob.rb
@@ -3,21 +3,6 @@
class TestBlob < Minitest::Test
include Linguist

- def setup
- silence_warnings do
- # git blobs are normally loaded as ASCII-8BIT since they may contain data
- # with arbitrary encoding not known ahead of time
- @original_external = Encoding.default_external
- Encoding.default_external = Encoding.find("ASCII-8BIT")
- end
- end
-
- def teardown
- silence_warnings do
- Encoding.default_external = @original_external
- end
- end
-
def script_blob(name)
blob = sample_blob_memory(name)
blob.instance_variable_set(:@name, 'script')
@@ -62,26 +47,23 @@ def test_lines
assert_equal 474, sample_blob_memory("Emacs Lisp/ess-julia.el").lines.length
end

- def test_lines_maintains_original_encoding
- # Even if the file's encoding is detected as something like UTF-16LE,
- # earlier versions of the gem made implicit guarantees that the encoding of
- # each `line` is in the same encoding as the file was originally read (in
- # practice, UTF-8 or ASCII-8BIT)
- assert_equal Encoding.default_external, fixture_blob_memory("Data/utf16le").lines.first.encoding
- end
-
def test_size
assert_equal 15, sample_blob_memory("Ruby/foo.rb").size
end

def test_loc
assert_equal 2, sample_blob_memory("Ruby/foo.rb").loc
assert_equal 3, fixture_blob_memory("Data/utf16le-windows").loc
assert_equal 3, fixture_blob_memory("Data/utf16le").loc
assert_equal 1, fixture_blob_memory("Data/iso8859-8-i").loc
end

def test_sloc
assert_equal 2, sample_blob_memory("Ruby/foo.rb").sloc
assert_equal 3, fixture_blob_memory("Data/utf16le-windows").sloc
assert_equal 3, fixture_blob_memory("Data/utf16le").sloc
assert_equal 1, fixture_blob_memory("Data/iso8859-8-i").sloc

end

def test_encoding
23 changes: 0 additions & 23 deletions test/test_file_blob.rb
@@ -11,21 +11,6 @@ def silence_warnings
$VERBOSE = original_verbosity
end

- def setup
- silence_warnings do
- # git blobs are normally loaded as ASCII-8BIT since they may contain data
- # with arbitrary encoding not known ahead of time
- @original_external = Encoding.default_external
- Encoding.default_external = Encoding.find("ASCII-8BIT")
- end
- end
-
- def teardown
- silence_warnings do
- Encoding.default_external = @original_external
- end
- end
-
def script_blob(name)
blob = sample_blob(name)
blob.instance_variable_set(:@name, 'script')
@@ -82,14 +67,6 @@ def test_lines
assert_equal 474, sample_blob("Emacs Lisp/ess-julia.el").lines.length
end

- def test_lines_maintains_original_encoding
- # Even if the file's encoding is detected as something like UTF-16LE,
- # earlier versions of the gem made implicit guarantees that the encoding of
- # each `line` is in the same encoding as the file was originally read (in
- # practice, UTF-8 or ASCII-8BIT)
- assert_equal Encoding.default_external, fixture_blob("Data/utf16le").lines.first.encoding
- end
-
def test_size
assert_equal 15, sample_blob("Ruby/foo.rb").size
end