Invalid byte sequence error when reading files with malformed utf-8 sequences #353
Comments
I can confirm this error.
Is this still a bug?
Same thing for me...
I can confirm this also...
I received this error as well after manually running linguist on my repo.
I can still reproduce this with ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux] on Debian wheezy. This seems to be the same as #241.
I had the same error several times (last time on pdt-git/public). I tried using
Confirmed this is still an issue on
I'm going to close this. Ruby 2.0 is nearly two years old now and I just don't see us investigating this any time soon, sorry. If anyone else wants to take a stab at this then please be my guest 😄
@arfon I get this error on the original file of this post with Ruby 2.2:
Well then.
I have also encountered this many times using ruby 2.2. Here is a recent stack trace:
I can still confirm this error. github-linguist Inderxer.asp.txt BTW, the encoding for
Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in github-linguist#1211). So the error would only be seen in wild usage (see issue github-linguist#353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings.
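The effect described in that commit message can be sketched in plain Ruby. This is only an illustration under assumptions, not Linguist's actual code: `split_lines` and `read_binary` are hypothetical helpers, the sample bytes are made up, and the process-wide default the message calls `Encoding.external_encoding` is assumed to be what Ruby exposes as `Encoding.default_external`.

```ruby
require "tempfile"

# Hypothetical helper (not Linguist's API): split on any line ending with a regex.
def split_lines(data)
  data.split(/\r\n|\r|\n/, -1)
end

# Hypothetical helper: read raw bytes, ignoring Encoding.default_external.
def read_binary(path)
  File.read(path, encoding: "ASCII-8BIT")
end

# "hi\n" encoded as UTF-16LE with a byte order mark (made-up sample data).
SAMPLE_BYTES = "\xFF\xFEh\x00i\x00\n\x00".b

Tempfile.create("utf16le-sample") do |f|
  f.binmode
  f.write(SAMPLE_BYTES)
  f.flush

  # Tagged as UTF-8, these bytes do not form a valid UTF-8 string.
  as_utf8 = File.read(f.path, encoding: "UTF-8")
  puts "valid UTF-8? #{as_utf8.valid_encoding?}"

  # Regex-based splitting on such a string is where the reported
  # ArgumentError is expected to surface.
  begin
    split_lines(as_utf8)
  rescue ArgumentError => e
    puts "UTF-8 read: #{e.message}"
  end

  # Read as ASCII-8BIT, the same bytes can be split without an encoding check.
  as_binary = read_binary(f.path)
  puts "binary read: #{split_lines(as_binary).length} lines, #{as_binary.encoding}"
end
```

Reading as ASCII-8BIT only sidesteps the encoding check; it does not decode the UTF-16LE content, so any later text processing would still need to handle the real encoding.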
* Prepare 7.9.0 release
* Put back the v7.8.0 version
  We need this to ensure the versioning used during testing on GitHub.com doesn't cause caching problems in future
* fix errors on non-UTF-8 encodings
  Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Some problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. Errors were not present when computing stats from git repositories, since git blobs are always read as ASCII-8BIT and that was working correctly. However, when using FileBlob, encoding could be ASCII-8BIT, UTF-8 or other, depending on the runtime value of Encoding.external_encoding. Tests were not catching the error since they were forcing Encoding.external_encoding to be ASCII-8BIT (introduced in #1211). So the error would only be seen in wild usage (see issue #353). This commit forces ASCII-8BIT on File.read calls. The error is still present if using memory blobs with other encodings.
* Decrease expected error count
* Set version to 7.9.0
Co-authored-by: Rick Winfrey
Co-authored-by: Santiago M. Mola
This has been resolved by #4730, which is now live on GitHub.com. Closing.
For example, running linguist on this file throws an invalid byte sequence error:
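To check whether a given file contains malformed UTF-8 before running linguist on it, a small diagnostic along these lines may help. It is a sketch only: `first_invalid_utf8_offset` is a hypothetical helper, not part of Linguist, and it assumes the file is meant to be UTF-8.

```ruby
# Hypothetical helper: return the byte offset of the first malformed UTF-8
# sequence in the given bytes, or nil if the bytes are valid UTF-8.
def first_invalid_utf8_offset(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  return nil if str.valid_encoding?

  offset = 0
  str.each_char do |ch|
    # Invalid bytes are yielded as their own broken "characters".
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

# Usage with a file path given on the command line (File.binread returns raw bytes).
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  offset = first_invalid_utf8_offset(File.binread(ARGV[0]))
  puts offset ? "first invalid byte at offset #{offset}" : "valid UTF-8"
end
```

A non-nil result means the file would produce a broken string if read as UTF-8, which is the condition under which the error in this issue was reported.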