Invalid byte sequence error when reading files with malformed utf-8 sequences #353

Closed
mrorii opened this issue Jan 13, 2013 · 14 comments

@mrorii
Contributor

mrorii commented Jan 13, 2013

For example, running linguist on this file throws an invalid byte sequence error:

$ wget https://raw.github.com/leoniedu/CongressoAberto/a4785785cb37e8095893dc411f0a030a57fd30f8/CongressoAbertoWP/wp-includes/js/swfupload/swfupload.js
$ linguist swfupload.js
/Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:209:in `split': invalid byte sequence in UTF-8 (ArgumentError)
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:209:in `lines'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/lib/linguist/blob_helper.rb:240:in `loc'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/gems/github-linguist-2.4.0/bin/linguist:24:in `<top (required)>'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/bin/linguist:23:in `load'
        from /Users/orii/.rvm/gems/ruby-1.9.3-p286/bin/linguist:23:in `<main>'
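
For readers who do not have the file at hand, the failure can probably be reproduced without linguist. The snippet below is only a sketch: the sample bytes and the `split_lines` helper are made up for illustration, and the newline regex is an assumption about what `lines` does, not a copy of the code in blob_helper.rb.

```ruby
# Hypothetical sample: 0xE9 is "é" in ISO-8859-1 but not a complete
# sequence in UTF-8, so the string is tagged UTF-8 yet holds invalid bytes.
data = "caf\xE9\nsecond line\n".dup.force_encoding(Encoding::UTF_8)

# Split on newlines with a regex. The exception is returned instead of
# raised so that the outcome can be inspected.
def split_lines(str)
  str.split(/\r\n|\r|\n/)
rescue ArgumentError => e
  e
end

result = split_lines(data)

puts "valid UTF-8? #{data.valid_encoding?}"
puts "split result: #{result.class}"
```

If the second line prints `ArgumentError`, the string hit the same kind of check that the stack trace above points at.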
@andygrunwald
Contributor

I can confirm this error.

@gkze

gkze commented Oct 3, 2013

Is this still a bug?

@armw4

armw4 commented Oct 17, 2013

Same thing for me...

/Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:41:in `split': invalid byte sequence in UTF-8 (ArgumentError)
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:41:in `lines'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:100:in `compiled_coffeescript?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:56:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/generated.rb:12:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/blob_helper.rb:277:in `generated?'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:74:in `block in compute_stats'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:69:in `each'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:69:in `compute_stats'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/lib/linguist/repository.rb:43:in `languages'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/gems/github-linguist-2.9.5/bin/linguist:14:in `<top (required)>'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/bin/linguist:23:in `load'
        from /Users/axw2/.rvm/gems/ruby-2.0.0-p247/bin/linguist:23:in `<main>'

@rastkojokic

I can confirm this also...

@ghost

ghost commented Jan 22, 2014

I received this error as well after manually running linguist on my repo.

@andygrunwald
Contributor

I can still reproduce this with ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] on Debian wheezy.

This seems to be the same as #241.

@pchaigno
Contributor

I had the same error several times (last time on pdt-git/public).

I tried using force_encoding() and encode() on line 58 without effect.
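
One possible reason (an assumption, not something confirmed in this thread) is that `force_encoding` only relabels the bytes and never changes them, so invalid sequences stay invalid. The sketch below contrasts that with `String#scrub` (Ruby 2.1+), which replaces the invalid bytes. The sample data is made up, and this is not presented as the fix linguist adopted.

```ruby
# Hypothetical raw bytes as they might come off disk (binary / ASCII-8BIT).
raw = "caf\xE9\nsecond line\n".b

# force_encoding changes the encoding label only; the bytes are untouched.
relabeled = raw.dup.force_encoding(Encoding::UTF_8)
same_bytes = relabeled.bytes == raw.bytes
still_invalid = !relabeled.valid_encoding?

# scrub returns a copy in which each invalid byte sequence is replaced.
scrubbed = relabeled.scrub("?")
lines = scrubbed.split(/\r\n|\r|\n/)

puts "bytes unchanged by force_encoding: #{same_bytes}"
puts "still invalid after relabeling:    #{still_invalid}"
puts "lines after scrub:                 #{lines.inspect}"
```

Scrubbing loses information (the original byte is gone), which may be why a tool that counts or classifies bytes would prefer to keep the data binary instead.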

@arfon
Contributor

arfon commented Oct 31, 2014

Confirmed this is still an issue on 1.9.3-p484

@arfon
Contributor

arfon commented Jan 25, 2015

I'm going to close this. Ruby 2.0 is nearly two years old now and I just don't see us investigating this any time soon, sorry.

If anyone else wants to take a stab at this then please be my guest 😄

@arfon arfon closed this as completed Jan 25, 2015
@pchaigno
Contributor

@arfon I get this error on the original file of this post with Ruby 2.2:

$ ruby --version
ruby 2.2.0p0 (2014-12-25 revision 49005) [x86_64-linux]

@arfon
Contributor

arfon commented Jan 25, 2015

Well then.

@arfon arfon reopened this Jan 25, 2015
@echatman

echatman commented Feb 2, 2015

I have also encountered this many times using ruby 2.2. Here is a recent stack trace:

$ linguist swfupload.js 
/var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:266:in `split': invalid byte sequence in UTF-8 (ArgumentError)
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:266:in `lines'
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/lib/linguist/blob_helper.rb:283:in `loc'
    from /var/lib/gems/2.2.0/gems/github-linguist-4.2.7/bin/linguist:51:in `<top (required)>'
    from /usr/local/bin/linguist:23:in `load'
    from /usr/local/bin/linguist:23:in `<main>'

@pchaigno pchaigno changed the title from "Invalid byte sequence error in Ruby 1.9 when reading files with malformed utf-8 sequences" to "Invalid byte sequence error when reading files with malformed utf-8 sequences" Jun 28, 2016
@lildude lildude added the Bug label Apr 27, 2017
@ApsarasX

ApsarasX commented Jul 6, 2019

I can still confirm this error.
When I run the following command, the program crashes:

github-linguist Inderxer.asp.txt

BTW, the encoding of Inderxer.asp.txt is GB2312.

smola added a commit to smola/linguist that referenced this issue Dec 1, 2019
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)"
when BlobHelper#lines was called. Some problematic files include
UTF-16LE samples such as test/fixtures/Data/utf16le.

Errors were not present when computing stats from git repositories,
since git blobs are always read as ASCII-8BIT and that was working
correctly. However, when using FileBlob, encoding could be ASCII-8BIT,
UTF-8 or other, depending on the runtime value of Encoding.external_encoding.

Tests were not catching the error since they were forcing
Encoding.external_encoding to be ASCII-8BIT (introduced in github-linguist#1211). So the
error would only be seen in wild usage (see issue github-linguist#353).

This commit forces ASCII-8BIT on File.read calls. The error is still
present if using memory blobs with other encodings.
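
The difference the commit message describes can be sketched outside linguist as follows. Everything here is illustrative: the temp file, the `try_lines` helper and the regex are assumptions, `File.binread` stands in for "forcing ASCII-8BIT on File.read", and the explicit `encoding: "UTF-8"` simulates a UTF-8 default (the Ruby setting involved is, as far as I know, `Encoding.default_external`).

```ruby
require "tempfile"

NEWLINES = /\r\n|\r|\n/

# Returns the lines, or a description of the error if splitting fails.
def try_lines(data)
  data.split(NEWLINES)
rescue ArgumentError => e
  "#{e.class}: #{e.message}"
end

results = {}

Tempfile.create("non_utf8_sample") do |f|
  f.binmode
  f.write("caf\xE9\nsecond line\n".b) # bytes that are not valid UTF-8
  f.flush

  as_utf8   = File.read(f.path, encoding: "UTF-8") # tagged UTF-8, bytes unchecked
  as_binary = File.binread(f.path)                 # tagged ASCII-8BIT

  results[:utf8_valid]   = as_utf8.valid_encoding?
  results[:utf8_lines]   = try_lines(as_utf8)
  results[:binary_lines] = try_lines(as_binary)
end

puts "UTF-8 read valid?  #{results[:utf8_valid]}"
puts "UTF-8 read lines:  #{results[:utf8_lines].inspect}"
puts "binary read lines: #{results[:binary_lines].inspect}"
```

A binary string is always considered valid, so the split succeeds there. As the commit message notes, this does not help blobs that are created in memory with some other encoding.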
lildude pushed a commit that referenced this issue Mar 16, 2020
lildude added a commit that referenced this issue Mar 19, 2020
* Prepare 7.9.0 release

* Put back the v7.8.0 version

We need this to ensure the versioning used during testing on GitHub.com doesn't cause caching problems in future

* fix errors on non-UTF-8 encodings

* Decrease expected error count

* Set version to 7.9.0

Co-authored-by: Rick Winfrey
Co-authored-by: Santiago M. Mola
@lildude
Member

lildude commented Mar 19, 2020

This has been resolved by #4730 which is now live on GitHub.com. Closing.

@lildude lildude closed this as completed Mar 19, 2020
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024