Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IO::Memory#to_s should honor encoding options #11015

Closed
HertzDevil opened this issue Jul 26, 2021 · 3 comments · Fixed by #11875
Closed

IO::Memory#to_s should honor encoding options #11015

HertzDevil opened this issue Jul 26, 2021 · 3 comments · Fixed by #11875
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:text

Comments

@HertzDevil
Copy link
Contributor

HertzDevil commented Jul 26, 2021

Original: #11011 (comment)

IO::Memory#to_s assumes the caller stores UTF-8 byte sequences, so if some other encoding is explicitly set and the IO::Memory's string methods are used, the result will be incorrect:

io = IO::Memory.new
io.set_encoding "UTF-16LE"
io << "abc"
io.to_s # => "a\u0000b\u0000c\u0000"

Likewise, #to_s(IO) writes the underlying bytes unmodified:

io1 = IO::Memory.new
io1.set_encoding "UTF-32LE"

io2 = IO::Memory.new
io2.set_encoding "UTF-16LE"

io1.write_utf8 "abc😂".to_slice
io1.to_s io2
byte_slice = io2.to_slice
utf16_slice = Slice.new(byte_slice.to_unsafe.unsafe_as(Pointer(UInt16)), byte_slice.size // sizeof(UInt16))

byte_slice                     # => Bytes[97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0, 2, 246, 1, 0]
utf16_slice                    # => Slice[97, 0, 98, 0, 99, 0, 62978, 1]
String.from_utf16(utf16_slice) # => "a\u0000b\u0000c\u0000\u0001"

The first overload effectively calls String.new(to_slice). It should use this String constructor instead, which will perform the decoding on construction, whenever the IO::Memory has a non-default encoding. (If the IO::Memory already uses UTF-8, the returned String will expose invalid characters as U+FFFD automatically.)

The second overload is exactly io.write(to_slice). This one should similarly use the undocumented String.encode:

class IO::Memory
  def to_s(io : IO) : Nil
    String.encode(to_slice, self.encoding, io.encoding, io, io.@encoding.try &.invalid)
  end
end

Such a rewrite will in fact provide the only way to convert between arbitrary encodings without going through UTF-8, unless #11018 is resolved.

@asterite
Copy link
Member

asterite commented Mar 5, 2022

What's the point of appending strings, encoding them, and have to_s decode them? Just use String.build without an encoding in this case. String isn't always utf-8 so the current behavior is fine.

@asterite
Copy link
Member

asterite commented Mar 5, 2022

I guess it makes sense if you write the bytes, then ask for the string...

@straight-shoota
Copy link
Member

String isn't always utf-8 so the current behavior is fine.

But it really is. At least in theory. There may be non-UTF-8 bytes for practical reasons. But the general idea is still that String is UTF-8. It can contain some invalid bytes, but it is most certainly not meant to contain data with an entirely different encoding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:text
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants