strip() fails on string read from non-utf8 file #7525

Closed
swadey opened this issue Jul 5, 2014 · 3 comments
Labels
unicode Related to unicode characters and encodings

Comments

@swadey
Contributor

swadey commented Jul 5, 2014

test.txt contains a single character:

\377

I get this error:

julia> f = open("test.txt")
IOStream(<file test.txt>)

julia> x = readline(f)
"\ufffd"

julia> strip(x)
ERROR: BoundsError()
 in getindex at ./array.jl:267
 in getindex at ./utf8.jl:111
 in lstrip at string.jl:1414
 in lstrip at string.jl:1410
 in strip at string.jl:1434

@stevengj
Member

stevengj commented Jul 5, 2014

Can you run readbytes(f) instead so that we can see the actual bytes in the file?

Is it UTF-16, or...?

@swadey
Contributor Author

swadey commented Jul 5, 2014

Sure, it's 0xff that causes the issue. It's from the 20 Newsgroups dataset.

julia> x = readbytes(f)
2-element Array{Uint8,1}:
 0xff
 0x0a
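
For context, the byte 0xff can never appear in well-formed UTF-8, which is why the file's contents do not decode cleanly. The following is only a rough sketch of one way to check for and work around such input before calling strip(); it is not the fix adopted for this issue. It assumes a recent Julia 1.x release (where `UInt8` and `read` replace the `Uint8` and `readbytes` seen in this thread), and `sanitize_utf8` is a hypothetical helper name, not a function in Julia's standard library.

```julia
# Hypothetical helper (not part of Julia): replace every malformed
# character in the decoded bytes with U+FFFD, the replacement character.
function sanitize_utf8(bytes::Vector{UInt8})
    s = String(copy(bytes))      # String() takes ownership of its argument, so copy first
    isvalid(s) && return s       # already well-formed UTF-8, nothing to do
    return join(isvalid(c) ? c : '\ufffd' for c in s)
end

raw = UInt8[0xff, 0x0a]          # the two bytes reported in this thread
                                 # (with a real file: raw = read("test.txt"))

println(isvalid(String, raw))    # reports whether the bytes are well-formed UTF-8
clean = sanitize_utf8(raw)
println(repr(strip(clean)))      # strip() now operates on a valid string
```

Checking `isvalid(String, raw)` up front makes it possible to detect a non-UTF-8 file and either sanitize it, as sketched here, or decode it with the encoding it was actually written in.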

@JeffBezanson
Member

see also #1792
