About Full Unicode (Plane 1-16) support #2538

cloudwu · 2019-05-06T10:18:42Z

ImWchar is defined as unsigned short (16bit int) now.

https://github.com/ocornut/imgui/blob/master/imgui.h#L124

typedef unsigned short ImWchar;     // A single U16 character for keyboard input/display. We encode them as multi bytes UTF-8 when used in strings.

And AddInputCharactersUTF8 / ImTextStrFromUtf8 / ImTextCountCharsFromUtf8 discard the code points in Plane 1-16 ,

https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1263
https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1678-L1710

void ImGuiIO::AddInputCharactersUTF8(const char* utf8_chars)
{
    while (*utf8_chars != 0)
    {
        unsigned int c = 0;
        utf8_chars += ImTextCharFromUtf8(&c, utf8_chars, NULL);
        if (c > 0 && c <= 0xFFFF)    // <---- DISCARDS c >= 0x10000
            InputQueueCharacters.push_back((ImWchar)c);
    }
}

       if (c < 0x10000)    // FIXME: Losing characters that don't fit in 2 bytes
            *buf_out++ = (ImWchar)c;

ImTextCharToUtf8 can not convert the code points great than 0x10000 , for example , U+20628 (𠘨).

https://github.com/ocornut/imgui/blob/master/imgui.cpp#L1713

    //else if (c < 0x10000)
    {
        if (buf_size < 3) return 0;
        buf[0] = (char)(0xe0 + (c >> 12));
        buf[1] = (char)(0x80 + ((c>> 6) & 0x3f));
        buf[2] = (char)(0x80 + ((c ) & 0x3f));
        return 3;
    }

I think full unciode support is not complicated, all we need is define ImWchar as unsigned int and removing the branch (c < 0x10000).

Is there any other reason to use u16 internal rather than u32 ?

The text was updated successfully, but these errors were encountered:

ocornut · 2019-05-06T10:34:30Z

Hello,

Is there any other reason to use u16 internal rather than u32 ?

(1)
We use various one-level lookup tables which could be problematic with large codepoints. Although U+20628 would be acceptable there, we're opening ourselves to a whole new range of problem which a full 32-bits range.
Out of experience do you know which ranges are realistically useful for your application?

(2)
Calls dones in ImFileOpen() or Win32's SetClipboardTextFn_DefaultImpl() assume ImWchar==wchar_t with MSVC librairies, which is currently the case, and would need to be fixed.

(3)
Other reasons include that it is untested, hasn't been requested until now, and text functions are often performance bottleneck so any changes to be done with extra care (we're touching twice the amount of memory). This is all bundled in TODO.txt as a general:
- font: what would it take to support codepoint higher than 0xFFFF? (smileys, etc.)

It is certainly doable but performance issues side-effects may hinded adoption. A compile-time toggle may facilitate it.

cloudwu · 2019-05-06T11:35:26Z

For 1, Full unicode support is 21bit (0-10FFFF), but the plane3+ are uncommon, so 18bit would met most of the needs . https://en.wikipedia.org/wiki/Unicode

plane1/2 are also rarely used, I suggest use an unsigned short map ImVector<unsigned short> and a sorted table ImVector<unsigned int> for (10000-10FFFF) instead of ImVector<ImWchar> IndexLookup; , if the code point great than FFFF, we can do a binary search on ImVector<unsigned int> .

For 2, UTF-16 (Windows use) is also multibyte, the code points after Plane 0 (0-FFFF) can be encode to 4 unsigned short (wchar_t) , so I think we should write a function to convert UTF-8 to UTF-16. https://en.wikipedia.org/wiki/UTF-16

For 3, for the performance bottleneck, I think the hotspot would be ImFont::RenderText , but it's base on UTF-8 strings rather than ImWchar strings . If there any other hotspot function ?

ocornut · 2019-05-06T11:45:27Z

Thanks for the details.

For 2. Ok. Also, those functions being Windows-only we may use Win32 functionalities as well if we don't encounter other cases.

For 3, One bottleneck for very large UI is CalcTextSizeA(), this is why it pulls the AdvanceX values from a dense buffer IndexAdvanceX[] but generally dear imgui is very perf conscious as we reprocess all text-related stuff. I don't think RenderText would be affected so much apart from IndexLookup[] touching twice the amount of memory, but IndexLookup[] access are marginal compared to ImFontGlyph access which are already occupying 40 bytes either way (increasing ImWchar size won't change that).

I think a first way would to be allow typedef ImWchar unsigned int in imconfig.h and figure out everything to fix by testing with MSVC/Clang/GCC. If the option can be selected at compile-time then I won't have issue with perfs.

ocornut · 2019-05-06T11:51:24Z

Just saw your edit, yes, a binary search could make sense for planes over 0.
I have an infrastructure to run some perf testing which could be used to make better measures, if we have the test case. Right now the code has manual inline to avoid overhead in non-optimized builds (e.g. visual studio default settings) so that's one thing to be aware of.

I would suggest working on the basic case first!

cloudwu · 2019-05-06T11:54:17Z

The CalcTextSizeA() may be the bottleneck, but I haven't seen it use the type ImWchar. Am I missing anything ?

ocornut · 2019-05-06T12:08:31Z

That's correct, apologies! Only RenderText uses it, via e.g. IndexLookup[].

cloudwu · 2019-05-06T12:08:42Z

I also suggest use a smaller cache (for example, a fixed 4096 slots hash table) and a sorted vector for IndexLookup. If cache missed, do a binary search in sorted vector and update the cache.

It may be more cpu cache friendly than a large (64K) IndexLookup table.

ocornut · 2019-05-06T12:16:34Z

The current scheme is quasi ideal for Ascii/Latin based languages meaning anything else is likely to be a trade-off for those users. I'm however happy to have those other techniques and ideas explored, it is just outside of my reach at the moment. If you engage in making a PR I suggest making this in two steps, as supporting for the simple compile-time typedef would be easier to merge fast/first, then we can look into more advanced ideas.

cloudwu · 2019-05-06T12:23:13Z

Ok, I will try to do it tomorrow, thanks.

…Character(unsigned int c). Examples/Backends: Don't filter characters under 0x10000 before calling io.AddInputCharacter(), the filtering is done in io.AddInputCharacter() itself. This is in prevision for fuller Unicode support. (#2538, #2541)

fix build for WideCharToMultiByte [3181ff1e] Full Unicode Support [6c9e73ac] Fix ImTextCountUtf8BytesFromChar and ImTextCharToUtf8, these APIs assume the input is an unicode code point, not UTF-16 [ba85665b] Add AddInputCharacterUTF16 for windows backend to handle WM_CHAR [fafdcaf0] Use Windows API to convert UTF-16 for ImFileOpen [dc7d5925] Use windows API to convert UTF-16 for clipboard

- Make ImWchar32 unsigned. - Fix Win32 version of ImFileOpen by including windows.h sooner. - Make ImGuiIO::AddInputCharacterUTF16() more robust by disallowing illegal surrogate pairs. - Allow pushing higher plane codepoints through ImGuiIO::AddInputCharacter(). - Minor cleaning up in the high-plane Unicode support. - Fix Clang -Wunreachable-code warning

…2815)

ocornut · 2020-03-03T18:00:25Z

This is now merged, thank you @cloudwu @samhocevar !

cloudwu · 2020-03-04T03:26:42Z

Thank you @ocornut :)

…litate C-binding. (#2538, #2541, #2815)

#2541, #2815)

…rnut#2538) fix build for WideCharToMultiByte [3181ff1e] Full Unicode Support [6c9e73ac] Fix ImTextCountUtf8BytesFromChar and ImTextCharToUtf8, these APIs assume the input is an unicode code point, not UTF-16 [ba85665b] Add AddInputCharacterUTF16 for windows backend to handle WM_CHAR [fafdcaf0] Use Windows API to convert UTF-16 for ImFileOpen [dc7d5925] Use windows API to convert UTF-16 for clipboard

…nut#2815) - Make ImWchar32 unsigned. - Fix Win32 version of ImFileOpen by including windows.h sooner. - Make ImGuiIO::AddInputCharacterUTF16() more robust by disallowing illegal surrogate pairs. - Allow pushing higher plane codepoints through ImGuiIO::AddInputCharacter(). - Minor cleaning up in the high-plane Unicode support. - Fix Clang -Wunreachable-code warning

…cornut#2538, ocornut#2815)

…litate C-binding. (ocornut#2538, ocornut#2541, ocornut#2815)

…nut#2538, ocornut#2541, ocornut#2815)

Amend ef13d95 + c8ea0a0 (#2541, #2538, #2815)

ocornut added the font/text label May 6, 2019

cloudwu mentioned this issue May 7, 2019

Full Unicode Support #2541

Closed

samhocevar mentioned this issue Sep 29, 2019

Refreshed full Unicode support patch #2815

Closed

ocornut added a commit that referenced this issue Mar 3, 2020

Unicode: Changelog, comments, minimum CI integration. (#2541, #2538, #…

a413324

…2815)

ocornut closed this as completed Mar 3, 2020

ocornut added a commit that referenced this issue Mar 24, 2020

Added IMGUI_USE_WCHAR32 instead of "#define ImWchar ImWchar32" to fac…

670367e

…litate C-binding. (#2538, #2541, #2815)

ocornut added a commit that referenced this issue Mar 25, 2020

Alter definition of IM_UNICODE_ defines to faclitate C-binding. (#2538,

1d4b5de

#2541, #2815)

sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020

Unicode: Changelog, comments, minimum CI integration. (ocornut#2541, o…

76789b3

…cornut#2538, ocornut#2815)

sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020

Added IMGUI_USE_WCHAR32 instead of "#define ImWchar ImWchar32" to fac…

0bbb684

…litate C-binding. (ocornut#2538, ocornut#2541, ocornut#2815)

sergeyn pushed a commit to sergeyn/imgui that referenced this issue Mar 30, 2020

Alter definition of IM_UNICODE_ defines to faclitate C-binding. (ocor…

6d8a5ea

…nut#2538, ocornut#2541, ocornut#2815)

ocornut added a commit that referenced this issue May 20, 2020

IO: AddInputCharacters function ignore 0 input. (#3252)

c8cde28

Amend ef13d95 + c8ea0a0 (#2541, #2538, #2815)

dbr mentioned this issue Jan 18, 2021

Full unicode range support imgui-rs/imgui-rs#424

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About Full Unicode (Plane 1-16) support #2538

About Full Unicode (Plane 1-16) support #2538

cloudwu commented May 6, 2019

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019 •

edited

Loading

cloudwu commented May 6, 2019

ocornut commented Mar 3, 2020

cloudwu commented Mar 4, 2020

About Full Unicode (Plane 1-16) support #2538

About Full Unicode (Plane 1-16) support #2538

Comments

cloudwu commented May 6, 2019

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 • edited Loading

ocornut commented May 6, 2019 • edited Loading

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 • edited Loading

ocornut commented May 6, 2019

cloudwu commented May 6, 2019 • edited Loading

ocornut commented May 6, 2019 • edited Loading

cloudwu commented May 6, 2019

ocornut commented Mar 3, 2020

cloudwu commented Mar 4, 2020

cloudwu commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019 •

edited

Loading

cloudwu commented May 6, 2019 •

edited

Loading

cloudwu commented May 6, 2019 •

edited

Loading

ocornut commented May 6, 2019 •

edited

Loading