`IO::Memory#to_s` should honor encoding options #11015

HertzDevil · 2021-07-26T06:23:14Z

IO::Memory#to_s assumes the caller stores UTF-8 byte sequences, so if some other encoding is explicitly set and the IO::Memory's string methods are used, the result will be incorrect:

```crystal
io = IO::Memory.new
io.set_encoding "UTF-16LE"
io << "abc"
io.to_s # => "a\u0000b\u0000c\u0000"
```

Likewise, #to_s(IO) writes the underlying bytes unmodified:

```crystal
io1 = IO::Memory.new
io1.set_encoding "UTF-32LE"

io2 = IO::Memory.new
io2.set_encoding "UTF-16LE"

io1.write_utf8 "abc😂".to_slice
io1.to_s io2
byte_slice = io2.to_slice
utf16_slice = Slice.new(byte_slice.to_unsafe.unsafe_as(Pointer(UInt16)), byte_slice.size // sizeof(UInt16))

byte_slice                     # => Bytes[97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0, 2, 246, 1, 0]
utf16_slice                    # => Slice[97, 0, 98, 0, 99, 0, 62978, 1]
String.from_utf16(utf16_slice) # => "a\u0000b\u0000c\u0000\u0001"
```

The first overload effectively calls String.new(to_slice). Whenever the IO::Memory has a non-default encoding, it should instead use the String.new(bytes, encoding, invalid) constructor, which decodes the bytes on construction. (If the IO::Memory already uses UTF-8, the returned String will expose invalid characters as U+FFFD automatically.)
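To illustrate the difference for the first overload, here is a minimal sketch that bypasses IO::Memory and works on the raw bytes directly. It assumes the constructor meant above is the encoding-aware `String.new(bytes, encoding)` overload; the byte values are the UTF-16LE form of "abc" from the first example.

```crystal
# UTF-16LE bytes for "abc", as stored by the IO::Memory in the first example.
bytes = Bytes[97, 0, 98, 0, 99, 0]

# Reinterpreting the bytes as UTF-8 keeps the interleaved NUL bytes,
# which is what IO::Memory#to_s currently returns.
raw = String.new(bytes)

# Passing the encoding name decodes the bytes into a regular UTF-8 String.
decoded = String.new(bytes, "UTF-16LE")

p raw     # => "a\u0000b\u0000c\u0000"
p decoded # => "abc"
```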

The second overload is exactly io.write(to_slice). This one should similarly use the undocumented String.encode:

```crystal
class IO::Memory
  def to_s(io : IO) : Nil
    String.encode(to_slice, self.encoding, io.encoding, io, io.@encoding.try &.invalid)
  end
end
```

Such a rewrite will in fact provide the only way to convert between arbitrary encodings without going through UTF-8, unless #11018 is resolved.
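For contrast, the route that does go through UTF-8 can be sketched with documented methods only: decode into a String, then re-encode with `String#encode`. The byte values are the UTF-32LE form of "abc😂" from the second example; this is an illustration of the detour, not the proposed fix.

```crystal
# "abc😂" encoded as UTF-32LE, matching the bytes shown in the second example.
utf32 = Bytes[97, 0, 0, 0, 98, 0, 0, 0, 99, 0, 0, 0, 2, 246, 1, 0]

# Step 1: decode into a regular (UTF-8) String.
text = String.new(utf32, "UTF-32LE")

# Step 2: encode that String into the target encoding.
utf16 = text.encode("UTF-16LE")

puts text
p utf16 # => Bytes[97, 0, 98, 0, 99, 0, 61, 216, 2, 222]
```

The last four bytes are the surrogate pair 0xD83D 0xDE02 for U+1F602, which is what io2 would be expected to contain if #to_s(IO) converted between the two encodings.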


asterite · 2022-03-05T16:10:22Z

What's the point of appending strings, encoding them, and having to_s decode them? Just use String.build without an encoding in this case. String isn't always UTF-8, so the current behavior is fine.

asterite · 2022-03-05T16:19:06Z

I guess it makes sense if you write the bytes, then ask for the string...

straight-shoota · 2022-03-05T20:46:50Z

> String isn't always utf-8 so the current behavior is fine.

But it really is. At least in theory. There may be non-UTF-8 bytes for practical reasons. But the general idea is still that String is UTF-8. It can contain some invalid bytes, but it is most certainly not meant to contain data with an entirely different encoding.

HertzDevil added the kind:bug A bug in the code. Does not apply to documentation, specs, etc. label Jul 26, 2021

straight-shoota added the topic:stdlib:text label Jul 26, 2021

HertzDevil mentioned this issue Jul 26, 2021

Converting strings between arbitrary encodings directly #11018

Open

straight-shoota mentioned this issue Mar 5, 2022

Fix: Honour encoding in IO::Memory#to_s #11875

Merged

straight-shoota closed this as completed in #11875 Mar 17, 2022
