A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

BillyONeal · 2018-07-24T22:07:59Z

UTF-8 <-> UTF-16 is prohibited by http://eel.is/c++draft/locale.codecvt#virtuals-3, since 4 UTF-8 encoding units can produce 2 UTF-16 encoding units.

Unfortunately it is not so easy to fix as we would need the counterpart to do_unshift on the "wide" side of the interface which would be massively ABI breaking.

cubbimew · 2018-07-24T23:59:29Z

I've seen opensource use of std::codecvt_utf8_utf16 with file streams -- maybe the authors are unaware that their code is non-portable, since it works with libstdc++ (fails with libc++ and MSVC). Here's a demo: http://coliru.stacked-crooked.com/a/dd397ba886d9bbff

Can't the standard acknowledge that an implementation is allowed to do the right thing with "otherwise, the behavior is implementation-defined"? Or, perhaps introduce std::u16stream and std::u16filebuf that aren't tied to this restriction?

tahonermann · 2018-07-25T15:06:51Z

I've been aware of this limitation for some time, but I've never researched why the limitation exists. Is there something fundamental about std::basic_filebufthat doesn't work for MxN transcoders? I assume not since libstdc++ apparently works as expected with such transcoders. Perhaps the limitation exists in the standard simply because the desired behavior doesn't match (some) existing implementations?

Regardless, my answer for this is to deprecate std::codecvt in favor of some yet-to-be-determined replacement. My inclination is towards something like text_view, though that doesn't fully suffice as a replacement since text_view doesn't support writeable views (it supports read-only views and write-only iterators), nor can a stream be imbued with such a view (I'm so far unconvinced that a replacement for imbue() functionality is needed or desirable).

BillyONeal · 2018-07-25T19:38:38Z

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor. High quality UTF encoders use techniques to process more than one encoding unit at a time (e.g. most UTF-8 texts have long strings of 7 bit ASCII in them, and a vectorized version of that to UTF-16 is easy-ish to write). The paper claims codecvt is expensive due to the virtual function call, but so long as the virtual call is done on whole buffers rather than on single codepoints that virtual call should be noise; as UTF decoding itself is a fairly expensive operation.

Honestly, I think the ideal interface looks very much like codecvt with a better way of reporting errors and removal of the 1:N restriction. Unfortunately lifting the 1:N restriction is an ABI break for MSVC++ because we store only a single wchar_t inside the basic_filebuf when a codecvt facet is engaged :(

tahonermann · 2018-07-25T20:26:52Z

When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky.

Good question, @jwakely, any comments on this?

text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor.

This is definitely a concern. I've long intended to provide transcoding interfaces that would enable optimizations like you described when invoked with a contiguous range (or segmented contiguous ranges). Alas, I've made no progress on that front.

BillyONeal · 2018-07-25T21:15:50Z

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

tahonermann · 2018-07-25T21:45:11Z

Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P

I think we all play that role more often than not ;)

Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

jwakely · 2018-07-25T22:00:03Z

If libstdc++ gets it right it's not my doing, and I can't really comment on that code. It's indistinguishable from magic as far as I'm concerned.

BillyONeal · 2018-07-25T22:09:56Z

Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work?

I've seen the talk but not attempted implementing his algorithm yet. I wrote these a long time ago when I was trying to replace the guts of our codecvt facets for ABI-broken-world, but ABI-broken-world seems to only get further way so I had to stop that work for some time.

tahonermann added help wanted Extra attention is needed paper needed A paper proposing a specific solution is needed bug Something isn't working labels Aug 6, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

BillyONeal commented Jul 24, 2018

cubbimew commented Jul 24, 2018 •

edited

Loading

tahonermann commented Jul 25, 2018

BillyONeal commented Jul 25, 2018

tahonermann commented Jul 25, 2018

BillyONeal commented Jul 25, 2018

tahonermann commented Jul 25, 2018

jwakely commented Jul 25, 2018 •

edited

Loading

BillyONeal commented Jul 25, 2018

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33

Comments

BillyONeal commented Jul 24, 2018

cubbimew commented Jul 24, 2018 • edited Loading

tahonermann commented Jul 25, 2018

BillyONeal commented Jul 25, 2018

tahonermann commented Jul 25, 2018

BillyONeal commented Jul 25, 2018

tahonermann commented Jul 25, 2018

jwakely commented Jul 25, 2018 • edited Loading

BillyONeal commented Jul 25, 2018

cubbimew commented Jul 24, 2018 •

edited

Loading

jwakely commented Jul 25, 2018 •

edited

Loading