Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

About Full Unicode (Plane 1-16) support #2538

Closed
cloudwu opened this issue May 6, 2019 · 11 comments
Closed

About Full Unicode (Plane 1-16) support #2538

cloudwu opened this issue May 6, 2019 · 11 comments

Comments

@cloudwu
Copy link
Contributor

cloudwu commented May 6, 2019

ImWchar is defined as unsigned short (16bit int) now.

https://github.com/ocornut/imgui/blob/master/imgui.h#L124

typedef unsigned short ImWchar;     // A single U16 character for keyboard input/display. We encode them as multi bytes UTF-8 when used in strings.

And AddInputCharactersUTF8 / ImTextStrFromUtf8 / ImTextCountCharsFromUtf8 discard the code points in Plane 1-16 ,

https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1263
https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1678-L1710

void ImGuiIO::AddInputCharactersUTF8(const char* utf8_chars)
{
    while (*utf8_chars != 0)
    {
        unsigned int c = 0;
        utf8_chars += ImTextCharFromUtf8(&c, utf8_chars, NULL);
        if (c > 0 && c <= 0xFFFF)    // <---- DISCARDS c >= 0x10000
            InputQueueCharacters.push_back((ImWchar)c);
    }
}
       if (c < 0x10000)    // FIXME: Losing characters that don't fit in 2 bytes
            *buf_out++ = (ImWchar)c;

ImTextCharToUtf8 can not convert the code points great than 0x10000 , for example , U+20628 (𠘨).

https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1713

    //else if (c < 0x10000)
    {
        if (buf_size < 3) return 0;
        buf[0] = (char)(0xe0 + (c >> 12));
        buf[1] = (char)(0x80 + ((c>> 6) & 0x3f));
        buf[2] = (char)(0x80 + ((c ) & 0x3f));
        return 3;
    }

I think full unciode support is not complicated, all we need is define ImWchar as unsigned int and removing the branch (c < 0x10000).

Is there any other reason to use u16 internal rather than u32 ?

@ocornut
Copy link
Owner

ocornut commented May 6, 2019

Hello,

Is there any other reason to use u16 internal rather than u32 ?

(1)
We use various one-level lookup tables which could be problematic with large codepoints. Although U+20628 would be acceptable there, we're opening ourselves to a whole new range of problem which a full 32-bits range.
Out of experience do you know which ranges are realistically useful for your application?

(2)
Calls dones in ImFileOpen() or Win32's SetClipboardTextFn_DefaultImpl() assume ImWchar==wchar_t with MSVC librairies, which is currently the case, and would need to be fixed.

(3)
Other reasons include that it is untested, hasn't been requested until now, and text functions are often performance bottleneck so any changes to be done with extra care (we're touching twice the amount of memory). This is all bundled in TODO.txt as a general:
- font: what would it take to support codepoint higher than 0xFFFF? (smileys, etc.)

It is certainly doable but performance issues side-effects may hinded adoption. A compile-time toggle may facilitate it.

@cloudwu
Copy link
Contributor Author

cloudwu commented May 6, 2019

For 1, Full unicode support is 21bit (0-10FFFF), but the plane3+ are uncommon, so 18bit would met most of the needs . https://en.wikipedia.org/wiki/Unicode

plane1/2 are also rarely used, I suggest use an unsigned short map ImVector<unsigned short> and a sorted table ImVector<unsigned int> for (10000-10FFFF) instead of ImVector<ImWchar> IndexLookup; , if the code point great than FFFF, we can do a binary search on ImVector<unsigned int> .

For 2, UTF-16 (Windows use) is also multibyte, the code points after Plane 0 (0-FFFF) can be encode to 4 unsigned short (wchar_t) , so I think we should write a function to convert UTF-8 to UTF-16. https://en.wikipedia.org/wiki/UTF-16

For 3, for the performance bottleneck, I think the hotspot would be ImFont::RenderText , but it's base on UTF-8 strings rather than ImWchar strings . If there any other hotspot function ?

@ocornut
Copy link
Owner

ocornut commented May 6, 2019

Thanks for the details.

For 2. Ok. Also, those functions being Windows-only we may use Win32 functionalities as well if we don't encounter other cases.

For 3, One bottleneck for very large UI is CalcTextSizeA(), this is why it pulls the AdvanceX values from a dense buffer IndexAdvanceX[] but generally dear imgui is very perf conscious as we reprocess all text-related stuff. I don't think RenderText would be affected so much apart from IndexLookup[] touching twice the amount of memory, but IndexLookup[] access are marginal compared to ImFontGlyph access which are already occupying 40 bytes either way (increasing ImWchar size won't change that).

I think a first way would to be allow typedef ImWchar unsigned int in imconfig.h and figure out everything to fix by testing with MSVC/Clang/GCC. If the option can be selected at compile-time then I won't have issue with perfs.

@ocornut
Copy link
Owner

ocornut commented May 6, 2019

Just saw your edit, yes, a binary search could make sense for planes over 0.
I have an infrastructure to run some perf testing which could be used to make better measures, if we have the test case. Right now the code has manual inline to avoid overhead in non-optimized builds (e.g. visual studio default settings) so that's one thing to be aware of.

I would suggest working on the basic case first!

@cloudwu
Copy link
Contributor Author

cloudwu commented May 6, 2019

The CalcTextSizeA() may be the bottleneck, but I haven't seen it use the type ImWchar. Am I missing anything ?

@ocornut
Copy link
Owner

ocornut commented May 6, 2019

That's correct, apologies! Only RenderText uses it, via e.g. IndexLookup[].

@cloudwu
Copy link
Contributor Author

cloudwu commented May 6, 2019

I also suggest use a smaller cache (for example, a fixed 4096 slots hash table) and a sorted vector for IndexLookup. If cache missed, do a binary search in sorted vector and update the cache.

It may be more cpu cache friendly than a large (64K) IndexLookup table.

@ocornut
Copy link
Owner

ocornut commented May 6, 2019

The current scheme is quasi ideal for Ascii/Latin based languages meaning anything else is likely to be a trade-off for those users. I'm however happy to have those other techniques and ideas explored, it is just outside of my reach at the moment. If you engage in making a PR I suggest making this in two steps, as supporting for the simple compile-time typedef would be easier to merge fast/first, then we can look into more advanced ideas.

@cloudwu
Copy link
Contributor Author

cloudwu commented May 6, 2019

Ok, I will try to do it tomorrow, thanks.

ocornut added a commit that referenced this issue May 11, 2019
…Character(unsigned int c).

Examples/Backends: Don't filter characters under 0x10000 before calling io.AddInputCharacter(), the filtering is done in io.AddInputCharacter() itself. This is in prevision for fuller Unicode support. (#2538, #2541)
ocornut pushed a commit that referenced this issue Mar 3, 2020
fix build for WideCharToMultiByte
[3181ff1e] Full Unicode Support
[6c9e73ac] Fix ImTextCountUtf8BytesFromChar and ImTextCharToUtf8, these APIs assume the input is an unicode code point, not UTF-16
[ba85665b] Add AddInputCharacterUTF16 for windows backend to handle WM_CHAR
[fafdcaf0] Use Windows API to convert UTF-16 for ImFileOpen
[dc7d5925] Use windows API to convert UTF-16 for clipboard
ocornut pushed a commit that referenced this issue Mar 3, 2020
- Make ImWchar32 unsigned.
 - Fix Win32 version of ImFileOpen by including windows.h sooner.
 - Make ImGuiIO::AddInputCharacterUTF16() more robust by disallowing illegal
surrogate pairs.
 - Allow pushing higher plane codepoints through ImGuiIO::AddInputCharacter().
 - Minor cleaning up in the high-plane Unicode support.
 - Fix Clang -Wunreachable-code warning
@ocornut
Copy link
Owner

ocornut commented Mar 3, 2020

This is now merged, thank you @cloudwu @samhocevar !

@ocornut ocornut closed this as completed Mar 3, 2020
@cloudwu
Copy link
Contributor Author

cloudwu commented Mar 4, 2020

Thank you @ocornut :)

sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020
…rnut#2538)

fix build for WideCharToMultiByte
[3181ff1e] Full Unicode Support
[6c9e73ac] Fix ImTextCountUtf8BytesFromChar and ImTextCharToUtf8, these APIs assume the input is an unicode code point, not UTF-16
[ba85665b] Add AddInputCharacterUTF16 for windows backend to handle WM_CHAR
[fafdcaf0] Use Windows API to convert UTF-16 for ImFileOpen
[dc7d5925] Use windows API to convert UTF-16 for clipboard
sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020
…nut#2815)

- Make ImWchar32 unsigned.
 - Fix Win32 version of ImFileOpen by including windows.h sooner.
 - Make ImGuiIO::AddInputCharacterUTF16() more robust by disallowing illegal
surrogate pairs.
 - Allow pushing higher plane codepoints through ImGuiIO::AddInputCharacter().
 - Minor cleaning up in the high-plane Unicode support.
 - Fix Clang -Wunreachable-code warning
sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants