-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A correct codecvt facet that works with basic_filebuf can't do UTF conversions #33
Comments
I've seen opensource use of Can't the standard acknowledge that an implementation is allowed to do the right thing with "otherwise, the behavior is implementation-defined"? Or, perhaps introduce |
I've been aware of this limitation for some time, but I've never researched why the limitation exists. Is there something fundamental about Regardless, my answer for this is to deprecate |
When people say "libstdc++ apparently works" have people actually verified it does the right thing when libstdc++ needs to cross a buffer boundary, or needs to seek in the file, etc.? I think these edge cases are solvable but unless they go out of their way to document that they support it I would be cautious. Triggering a surrogate pair split right on a buffer boundary, for example, needs to be handled correctly for correct behavior but is, at the very least, tricky. text_view looks interesting, but it looks like it operates on a codepoint-by-codepoint basis rather than through bulk submission, so I'd expect performance to be poor. High quality UTF encoders use techniques to process more than one encoding unit at a time (e.g. most UTF-8 texts have long strings of 7 bit ASCII in them, and a vectorized version of that to UTF-16 is easy-ish to write). The paper claims codecvt is expensive due to the virtual function call, but so long as the virtual call is done on whole buffers rather than on single codepoints that virtual call should be noise; as UTF decoding itself is a fairly expensive operation. Honestly, I think the ideal interface looks very much like codecvt with a better way of reporting errors and removal of the 1:N restriction. Unfortunately lifting the 1:N restriction is an ABI break for MSVC++ because we store only a single wchar_t inside the basic_filebuf when a codecvt facet is engaged :( |
Good question, @jwakely, any comments on this?
This is definitely a concern. I've long intended to provide transcoding interfaces that would enable optimizations like you described when invoked with a contiguous range (or segmented contiguous ranges). Alas, I've made no progress on that front. |
Too bad it's easier for people like me to just complain about the problem rather than actually solve it, right? :P Here are some benchmarks if it helps https://gist.github.com/BillyONeal/72dcde394758d4f9d82324774b8107e4 |
I think we all play that role more often than not ;)
Ohhh, neat, thanks! I'm assuming you're familiar with Bob Steagall's recent work? |
If libstdc++ gets it right it's not my doing, and I can't really comment on that code. It's indistinguishable from magic as far as I'm concerned. |
I've seen the talk but not attempted implementing his algorithm yet. I wrote these a long time ago when I was trying to replace the guts of our codecvt facets for ABI-broken-world, but ABI-broken-world seems to only get further way so I had to stop that work for some time. |
UTF-8 <-> UTF-16 is prohibited by http://eel.is/c++draft/locale.codecvt#virtuals-3, since 4 UTF-8 encoding units can produce 2 UTF-16 encoding units.
Unfortunately it is not so easy to fix as we would need the counterpart to do_unshift on the "wide" side of the interface which would be massively ABI breaking.
The text was updated successfully, but these errors were encountered: