UTF-8 characters are not displayed / read properly on Mac OS X Snow Leopard #180

siefca · 2013-11-19T11:54:21Z

Hi,

I'm using Mac OS X Snow Leopard and Ruby 1.9.3-p448 with ncurses library v5.9 from https://github.com/homebrew/homebrew-dupes/commits/master/ncurses.rb

I followed the installation instructions from the wiki. Everything works great, but I am unable to enter any UTF-8 character into a text line in sup (searching, entering Subject, To or other lines etc.). UTF-8 characters in the "pager" (i.e. inbox mode and when viewing messages) are properly displayed; the problem is with the input line.

I thought that's something with input handling (meta key) but copying and pasting UTF-8 characters also generates mojibakes (unreadable characters):

BTW, ViM and other tools – like irb for instance – compiled with the same ncurses library are displaying characters properly.

Please help me and save all the pandas from around the world. :]

PS: I changed scroll_mode.rb a bit and put UTF-8 character into prompt – it's displayed properly:


gauteh · 2013-11-19T12:22:55Z

Are you using latest Sup 0.15?

siefca · 2013-11-19T12:23:45Z

Hmm, maybe nonblocking_getch is too greedy during reading characters that have more than one byte?

siefca · 2013-11-19T12:24:18Z

I'm using sup v0.15.0.

gauteh · 2013-11-19T12:31:07Z

What encoding do you have set? Typically shown before Sup starts or in ~/.sup/log. The get char stuff is a bit hacky because ncurses only gives us ASCII-8bit chars without any encoding information, so we fix_encoding! it. Check out:
lib/sup/textfield.rb:172.

Also; what terminal are you using and what encoding is it set to? Does it match your SHELL encoding variables?

siefca · 2013-11-19T12:45:33Z

Yeah, but when you do IO.select() and then getch you'll always get single byte, not single character.

EDIT: or not :> i'll test it returning .ord or something.

I'm collecting info you requested, 1 sec.

siefca · 2013-11-19T13:02:20Z

$ locale

LANG="pl_PL.UTF-8"
LC_COLLATE="pl_PL.UTF-8"
LC_CTYPE="pl_PL.UTF-8"
LC_MESSAGES="pl_PL.UTF-8"
LC_MONETARY="pl_PL.UTF-8"
LC_NUMERIC="pl_PL.UTF-8"
LC_TIME="pl_PL.UTF-8"
LC_ALL="pl_PL.UTF-8"

$ echo $TERM
xterm

The terminal emulation program is iTerm.

[2013-11-19 13:43:13 +0100] Welcome to Sup! Log level is set to info.
[2013-11-19 13:51:07 +0100] using character set encoding "UTF-8"
[2013-11-19 13:51:08 +0100] dynamically loading setlocale() from libc.dylib
[2013-11-19 13:51:08 +0100] setting locale...
[2013-11-19 13:51:08 +0100] locking /****/.sup/lock...
[2013-11-19 13:51:42 +0100] locking /****/.sup/lock...
[2013-11-19 13:51:42 +0100] no draft source, auto-adding...
[2013-11-19 13:51:42 +0100] starting curses
[2013-11-19 13:51:42 +0100] loading user colors from /****/.sup/colors.yaml
[2013-11-19 13:51:42 +0100] initializing log buffer
[2013-11-19 13:51:42 +0100] Welcome to Sup! Log level is set to debug.
[2013-11-19 13:51:42 +0100] initializing inbox buffer
[2013-11-19 13:51:42 +0100] ready for interaction!
...
[2013-11-19 13:54:43 +0100] stopped cursing
[2013-11-19 13:54:43 +0100] no fatal errors. good job, william.
[2013-11-19 13:54:43 +0100] saving index and sources...
[2013-11-19 13:54:43 +0100] Flushing Xapian updates to disk. This may take a while...
[2013-11-19 13:54:43 +0100] unlocking /****/.sup/lock...

gauteh · 2013-11-19T13:14:08Z

Hm, ok. I'll check later with my own Mac if the behavior is the same. I assume this is iTerm2.

If you know how to fix it please do. The stuff that allows the use of old ncursesw-ruby is unnecessary and can be removed [0]. We require ncursesw and our own ncursesw gem as well as a recent ncurses library.

[0] https://github.com/sup-heliotrope/sup/blob/develop/lib/sup/buffer.rb#L36

By the way, as far as I know: xterm isn't necessarily UTF8, at least on Unix. Might be that Mac OS X Snow Leopard has a termcap/info file that makes it UTF8 though.

siefca · 2013-11-19T13:30:41Z

I've put some debug there and there are two bytes returned instead of one character code.

When typing space Ncurses.getch returns 32

When typing 'ą' letter (a with ogonek) Ncurses.getch returns 196 and then 133 in next call.
Each byte code is returned separately. I'll try to fix it.
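The byte-at-a-time behaviour reported here can be reproduced without curses. What follows is only a sketch of one way to glue the bytes back together; `utf8_sequence_length` and `read_utf8_char` are hypothetical helpers invented for this illustration, and the lambda stands in for successive `Ncurses.getch` calls.

```ruby
# Hypothetical helper: length of a UTF-8 sequence, judged from its lead byte.
def utf8_sequence_length(lead_byte)
  case lead_byte
  when 0x00..0x7F then 1   # plain ASCII, e.g. 32 for space
  when 0xC2..0xDF then 2   # e.g. 196 (0xC4), the first byte reported for 'ą'
  when 0xE0..0xEF then 3
  when 0xF0..0xF4 then 4
  else 1                   # stray or invalid byte: hand it back on its own
  end
end

# Hypothetical helper: read one whole character from a byte source.
# `next_byte` is any callable that returns the next byte as an Integer.
def read_utf8_char(next_byte)
  first = next_byte.call
  bytes = [first]
  (utf8_sequence_length(first) - 1).times { bytes << next_byte.call }
  # pack('C*') yields a binary string, so label it as UTF-8 afterwards.
  bytes.pack('C*').force_encoding(Encoding::UTF_8)
end

input = [196, 133, 32]      # 'ą' followed by a space, as bytes
feed  = -> { input.shift }  # stand-in for repeated getch calls

first_char  = read_utf8_char(feed)
second_char = read_utf8_char(feed)
```

The sketch does not handle a source that runs dry in the middle of a sequence; a real non-blocking reader would have to.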

siefca · 2013-11-19T15:38:18Z

I just found that there is getwch function in curses designed to get wide characters from input, but it's not bound in ncursesw-sup.

gauteh · 2013-11-19T15:48:02Z

I had simply assumed that the wide-character functions were the ones being wrapped when configured for wide chars; apparently not. If ncursesw-sup needs modifications, we'll just release a new version of it along with the next sup.

siefca · 2013-11-19T16:06:08Z

For now I'm playing with Ncurses.getch, and the interesting thing is that it's not returning codepoints (in the case of wide characters) but the individual bytes of the encoded character. Assembling the character with pack('U*') failed; raw pack('C*') succeeded.

Until fix in ncursesw-sup we could have some workaround there.
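The difference between the two pack directives is easy to see in isolation. A minimal illustration using the two byte values reported earlier for 'ą':

```ruby
bytes = [196, 133]   # what getch delivered for 'ą'

# 'U*' treats every integer as a Unicode codepoint, so the result holds
# two characters (U+00C4 and U+0085) instead of one.
as_codepoints = bytes.pack('U*')

# 'C*' treats every integer as a raw byte. The result is a binary string,
# so it still has to be labelled as UTF-8.
as_bytes = bytes.pack('C*').force_encoding(Encoding::UTF_8)

puts as_codepoints.length   # number of characters in the 'U*' result
puts as_bytes.length        # number of characters in the 'C*' result
```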

siefca · 2013-11-19T16:52:57Z

I was able to modify nonblocking_getch from buffer.rb so it returns proper decimal codes for multibyte UTF-8 characters.

However, handle_input from textfield.rb, which uses nonblocking_getch to get characters, passes them in decimal form to Ncurses::Form.form_driver, and that value is now different from what was returned before. I'm worried that the decimal codes overlap with Ncurses control codes. Checking…

gauteh · 2013-11-19T16:55:53Z

Ok, I wrapped the get_wch function here: sup-heliotrope/ncursesw-ruby#12. You could try to build the gem and implement it using that function.

gauteh · 2013-11-19T16:58:38Z

Actually, hang on, it's not complete yet.

siefca · 2013-11-19T17:30:24Z

Unfortunately, passing the decimal values of UTF-8 characters returned by the modified nonblocking_getch conflicts with special codes, e.g. Polish 'ą' (decimal code: 261) is the same as Ncurses::KEY_RIGHT. For this workaround to work I would have to add some big value and then subtract it just before it's passed to form_driver.
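To make the collision concrete, here is a small sketch of the "add a big value" idea. `WIDE_CHAR_OFFSET`, `tag_wide_char` and `untag` are hypothetical names used only for illustration; `KEY_RIGHT` is defined locally with the value ncurses gives it (octal 0405) so the snippet runs without the gem.

```ruby
KEY_RIGHT        = 0o405     # ncurses' KEY_RIGHT, decimal 261
WIDE_CHAR_OFFSET = 100_000   # hypothetical marker, well above any key code

# Mark a codepoint as "this came from a multibyte character".
def tag_wide_char(codepoint)
  codepoint + WIDE_CHAR_OFFSET
end

# Undo the marking and report which kind of value it was.
def untag(code)
  if code >= WIDE_CHAR_OFFSET
    [:char, code - WIDE_CHAR_OFFSET]
  else
    [:key, code]
  end
end

a_ogonek = "\u0105".ord          # codepoint of 'ą'
puts a_ogonek == KEY_RIGHT       # the two values are indistinguishable as plain integers

p untag(tag_wide_char(a_ogonek)) # recovered as a character
p untag(KEY_RIGHT)               # left alone as a key code
```

The obvious cost of this approach is that every consumer of the value has to know about the offset.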

gauteh · 2013-11-19T17:37:16Z

Ok, the wrapper should be ok in sup-heliotrope/ncursesw-ruby#12 now. Does that help?

siefca · 2013-11-19T18:02:57Z

Ok, tested. It's working (get_wch is returning decimal values for UTF-8 characters), but still there is a conflict with Ncurses::KEY_RIGHT and probably other special codes.

siefca · 2013-11-19T18:23:37Z

I see three solutions:

1. Use my implementation of get_wch, which could add some magic value of 100000 in order to detect it later and recover the UTF-8 character after subtraction, or decorate the returned object with some flag that tells it was multibyte.
2. Use native get_wch but somehow tweak the defines of key codes and make them negative. [probably bad idea]
3. Compile bytes of multibyte chars later. [might be messy - spread across a couple of places]

gauteh · 2013-11-19T21:06:13Z

I could also implement get_wch correctly, it returns whether it is a
keycode or not. Would have to change the logic a bit though.

siefca · 2013-11-19T22:31:20Z

That would be great, but will not help much since form_driver might be buggy.

I've created small class called Ncurses::CharCode which stores character codes along with some multibyte flags to let other methods know what's coming on input. That kills a bug with displaying invalid characters but due to mentioned problem in other component it's not solving the problem definitely.
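For readers who have not opened the branch: the idea of carrying a flag next to the code can be sketched as below. This is a hypothetical stand-in written for illustration (the name `InputCode` and its methods are invented here); it is not the Ncurses::CharCode class from the linked branch.

```ruby
# Hypothetical value object: an integer code plus a note on what it means.
InputCode = Struct.new(:code, :kind) do
  def character?
    kind == :character
  end

  def keycode?
    kind == :keycode
  end

  # UTF-8 string for characters, nil for function keys.
  def to_character
    character? ? [code].pack('U') : nil
  end
end

typed = InputCode.new(261, :character)  # 'ą' as a codepoint
arrow = InputCode.new(261, :keycode)    # KEY_RIGHT has the same integer value

puts typed.to_character
puts arrow.to_character.inspect
puts typed == arrow   # Struct equality compares both members
```

Because the kind travels with the code, the two values with the integer 261 no longer compare equal.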

https://github.com/siefca/sup/tree/fix_input_multibyte_characters

The real problem now is in Ncurses::Form.form_driver method. It doesn't work with some characters (or something is wrong with my setup). The multibyte characters are transported correctly but some of them, when passed to this method, aren't displayed.

I've found some forum posts that there is wide- version of libform but I don't know how to check if it's in use or not (or maybe there are some flags that can cause it to properly operate on multibyte characters). Some of discussions:

gauteh · 2013-11-19T23:03:59Z

Yeah, it seems you are right. See latest commit in the ncursesw pull request for where I return the key type as well. It doesn't really help, since the form_driver messes things up.

gauteh · 2013-11-19T23:05:46Z

I think formw is part of libncursesw; but you can modify extconf.rb in the gem to fail if it doesn't find it:

diff --git a/extconf.rb b/extconf.rb
index 1c60bbd..74b5464 100644
--- a/extconf.rb
+++ b/extconf.rb
@@ -134,7 +134,9 @@ end

 puts "checking for the form library..."
 if have_header("form.h")
-  have_library("formw", "new_form")
+  if not have_library("formw", "new_form")
+    raise "formw library not found"
+  end
 else
   raise "form library not found."
 end

siefca · 2013-11-19T23:49:23Z

BTW, is there some version of form_driver that takes wchar_t instead of int?

siefca · 2013-11-20T02:53:57Z

Re, I'm trying to use get_wch but it returns so called wide character, which is hard to transform into some encoding (probably it's UTF-32BE but I'm not sure). Is there any chance you could wrap some function that would return an array with the second element being UTF-8 version of a character or its decimal representation?

gauteh · 2013-11-20T06:24:50Z

I changed it to return the wide char as a Fixnum (gauteh/ncursesw-ruby@4481f31). As far as I can understand the widechar encoding depends on the input from the terminal (terminal codeset + encoding), so it should be interpreted based on that.

gauteh · 2013-11-20T06:35:20Z

Maybe we need to use the form set_field functions when it is a regular char. form_driver does not seem to have a wide char equivalent.

gauteh · 2013-11-20T08:13:55Z

By the way, does it fix your problem if you set:

    LibC.setlocale(6, "")  # LC_ALL == 6

to:

    LibC.setlocale(6, "pl_PL.UTF-8")  # LC_ALL == 6

in bin/sup?

gauteh · 2013-11-20T09:07:37Z

I can make the example examples/form_get_wch.rb which I made from examples/form.rb in ncursesw-sup to work by setting the field, and packing the input as a unicode char (check latest commit on https://github.com/gauteh/ncursesw-ruby/compare/bind_get_wch). If the gem is linked against ncursesw and formw that should work.

http://stackoverflow.com/questions/6976524/convert-unicode-codepoint-to-string-character-in-ruby
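For reference, turning a codepoint delivered as an integer into a Ruby string takes one call. A minimal sketch, assuming the value really is a Unicode codepoint (as discussed above, that depends on the terminal and locale):

```ruby
codepoint = 261   # the value reported for 'ą' earlier in this thread

via_pack = [codepoint].pack('U')            # array of codepoints -> UTF-8 string
via_chr  = codepoint.chr(Encoding::UTF_8)   # single codepoint -> UTF-8 string

puts via_pack
p via_pack.bytes            # the UTF-8 bytes behind the character
p via_pack.unpack('U*')     # and back to codepoints
```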

gauteh · 2013-11-20T10:13:14Z

The IO.select wrapping in https://github.com/siefca/sup/blob/fix_input_multibyte_characters/lib/sup/buffer.rb#L73 should not be necessary for nonblocking anymore. It should also be removed in the final patch.

siefca · 2013-11-22T14:43:13Z

Your idea to add the char using add_wch or to set the field buf and form_post move right is the right path.

gauteh · 2013-11-22T15:32:16Z

Yeah. Actually form_get_wch works for Value2 on iTerm2 now. But not on Linux.

siefca · 2013-11-22T18:43:59Z

Out of curiosity I compiled ncurses with tracing.

This is for ó character that is displayed properly:


called {form_driver(0x101874530,243)
+ called {Data_Entry(0x101874530,{'ó' = 0363})
+ + called {werase(0x101874750)
+ + return }0
+ + called {wmove(0x101874750,0,9)
+ + return }0
+ + called {winch(0x101874750)
+ + return }{' ' = 040}
+ + called {wmove(0x101874750,0,0)
+ + return }0
+ + called {winsch(0x101874750, {'ó' = 0363})
render_char bkg {{ ' ' = 040 }} (0), attrs {A_NORMAL} (0) -> ch {{ 'Ã' = 0303, '³' = 0263 }} (0)
+ + return }0
+ + called {IFN_Next_Character(0x101874530)
+ + return }0
+ return }0
+ called {_nc_Refresh_Current_Field(0x101874530)
+ + called {wsyncup(0x101874750)
+ + return }
+ + called {wtouchln(0x101874750,0,1,0)
+ + return }0
+ + called {wmove(0x101874750,0,1)
+ + return }0
+ + called {wcursyncup(0x101874750)
+ + + called {wmove(0x1006cc3c0,4,19)
+ + + return }0
+ + return }
+ return }0
return }0

This is for ą character which is not displayed:

called {form_driver(0x10187d060,261)
+ called {_nc_Refresh_Current_Field(0x10187d060)
+ + called {wsyncup(0x10187d270)
+ + return }
+ + called {wtouchln(0x10187d270,0,1,0)
+ + return }0
+ + called {wmove(0x10187d270,0,0)
+ + return }0
+ + called {wcursyncup(0x10187d270)
+ + + called {wmove(0x101852700,4,18)
+ + + return }0
+ + return }
+ return }0
return }-8

The magic value of -8 is E_UNKNOWN_COMMAND. Failing line is:

if (!(c & (~(int)MAX_REGULAR_CHARACTER)))

MAX_REGULAR_CHARACTER is 0xff (binary 11111111).

The ą, along with some other UTF-8 characters, will always fail this test. Look:

'ą'.ord.to_s(2)
# => "100000101"

'ł'.ord.to_s(2)
# => "101000010"

'ó'.ord.to_s(2)         # this one passes and will be displayed
# => "11110011"

But that's not the end. Even if a character passes, a second condition is then evaluated:

if (!iscntrl(UChar(c)))

Which fails.

UChar(c) is a synonym for (unsigned char) c, and iscntrl() is a function that checks whether a character is a control character (not printable). It's defined in ctype.c from libc and uses the __ctype_b_C[] array defined in C-ctype_ct.c from the locale subdirectory. By default it detects many characters as control characters (including ą and ł), and by design it should use different arrays depending on the current locale. My guess is that it doesn't do that, or it does but the ranges still overlap.
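The two C checks quoted above can be mimicked in Ruby to see which codepoints get through. This is only an arithmetic sketch: `regular_character?` and `uchar` are hypothetical helpers that mirror the mask test and the cast to unsigned char, and nothing here calls libc's iscntrl.

```ruby
MAX_REGULAR_CHARACTER = 0xff

# Mirrors `!(c & (~(int)MAX_REGULAR_CHARACTER))`:
# true only when no bit above the low eight is set.
def regular_character?(c)
  (c & ~MAX_REGULAR_CHARACTER).zero?
end

# Mirrors the effect of the (unsigned char) cast: only the low 8 bits survive.
def uchar(c)
  c & 0xff
end

{ 'ó' => 243, 'ą' => 261, 'ł' => 322 }.each do |name, code|
  puts "#{name}: regular=#{regular_character?(code)} truncated=#{uchar(code)}"
end
```

One thing the arithmetic shows, independent of any locale table: truncating 261 to eight bits leaves 5, which is an ASCII control code, while 322 becomes 66. Whether that truncation is what actually trips iscntrl in this setup is not established in the thread.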

siefca · 2013-11-22T22:11:02Z

Yeah, LC_CTYPE is a key, but seems it doesn't open the right door. On Mac OS X the mapping may be broken. I'll try mklocale from debian to regenerate CTYPE locale file for UTF-8 or for pl_PL.UTF-8.

BTW:

siefca · 2013-11-22T22:36:44Z

It seems that ncurses is using neither iswcntrl nor iswcntrl_l. These are the wide variants of iscntrl.

siefca · 2013-11-22T23:42:31Z

I patched ncurses to use iswcntrl but no luck. Looking for charmaps.

siefca · 2013-11-23T13:57:03Z

I patched ncurses to use isrune and removed first condition if wide character support is enabled but it seems that Data_Entry() that is called in form_driver() cannot render it and prints control code keys mapped to ASCII. :/

We can assume that form_driver is not multibyte compliant, at least on OS X.

gauteh · 2013-11-23T14:03:39Z

I'm working on a patched version of ncurses with wide versions of both form_driver_w and Data_Entry_w. I'll push it to GitHub in a moment with an example.

siefca · 2013-11-23T14:06:08Z

👍

BTW, if !iscntrl or !iswcntrl is giving false positives, try iswrune.

gauteh · 2013-11-23T14:40:43Z

Check out: gauteh/ncurses@master...form_driver_w

gauteh · 2013-11-23T15:15:17Z

Also example here: https://github.com/gauteh/ncursesw-ruby/tree/add_wch_and_form_driver_w

gauteh · 2013-11-23T15:20:51Z

For me the example form_get_wch.rb from the latest link ☝️ works with form_driver_w in both Linux and Mac OS X.

I installed my version of ncurses using brew:

$ brew edit ncurses

and change url to: https://github.com/gauteh/ncurses/archive/form_driver_w.zip (comment out mirror and sha1s).

$ brew reinstall ncurses
$ brew link ncurses --force

Then check out the branch above; build and install the gem; and test the form_get_wch.rb script.

Since it is relatively easy to use this version of Ncurses through brew we might just check in Sup whether form_driver_w is available, use it or otherwise fall back to form_driver.
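The "use it if available, otherwise fall back" idea could look roughly like this. The two modules are stand-ins invented so the sketch runs without the gem, and the two-argument `form_driver_w(form, code)` signature is an assumption made for illustration; the real binding may take different arguments.

```ruby
# Stand-in for a form library WITHOUT the wide driver.
module PlainForm
  def self.form_driver(_form, code)
    code <= 0xff ? :ok : :unknown_command
  end
end

# Stand-in for a form library WITH the wide driver.
module WideForm
  def self.form_driver(_form, code)
    code <= 0xff ? :ok : :unknown_command
  end

  def self.form_driver_w(_form, _code)
    :ok
  end
end

# Hypothetical dispatcher: prefer the wide driver when the module offers it.
def drive(form_module, form, code)
  if form_module.respond_to?(:form_driver_w)
    form_module.form_driver_w(form, code)
  else
    form_module.form_driver(form, code)
  end
end

p drive(PlainForm, nil, 261)  # falls back to the narrow driver
p drive(WideForm,  nil, 261)  # uses the wide driver
```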

siefca · 2013-11-23T16:19:47Z

Ok, installed. Checking how sup will integrate.

siefca · 2013-11-23T16:30:26Z

!!!!!!!!!!!!!!!!!!!!!!!!

gauteh · 2013-11-23T18:36:34Z

Cool, now it remains to see if get_cursed_value returns something sensible.. 😀

siefca · 2013-11-23T20:00:54Z

Ok, I moved some stuff to another file. I'll check it after I eat something.

siefca · 2013-11-23T20:31:42Z

What do you think is better: replace Form.form_driver by Form.form_driver_w everywhere (and rewrite Form.form_driver_w if there is no support – like it's done now) or rewrite Form.form_driver method if Form.form_driver_w is present?

In first case form_driver_w will be used everywhere, in second form_driver will remain. IMHO the first option since it will clearly tell developers that wide version is used and may not behave like standard version.
How about moving Ncurses::CharCode to ncurses-ruby?

gauteh · 2013-11-23T20:45:07Z

Replace with form_driver_w - as you have done now, but everywhere.

I have submitted form_driver_w upstream, there may be no hope of it getting merged, in which case form_driver_w patched into the brew recipe in Mac OS X might be a workaround that people have to do.

I think it should stay in Sup. It is too specific, and would be too high-level for ncursesw-sup.

siefca · 2013-11-23T20:48:52Z

Ok, ok.

siefca · 2013-11-23T21:39:37Z

I have a strange problem in testing if Ncurses::Form.form_driver_w exists. The module Ncurses::Form seems to not have such method defined when it's queried in module context. I checked with Ncurses::Form.form_driver and it's also missing.

require 'ncursesw'
require 'sup/util'

p Ncurses::Form.form_driver

gives:

undefined method `form_driver' for Ncurses::Form:Module (NoMethodError)

gauteh · 2013-11-23T21:42:19Z

The function isn't defined until after init has been called on Ncurses.

siefca · 2013-11-23T21:48:10Z

Ok, I'll create checker module method and call it after initscr.
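A rough sketch of such a checker, under the assumption stated above that the Form methods only appear once curses has been initialised. `FakeCurses` simulates that behaviour so the snippet runs on its own, and `WideInputCheck` is a hypothetical name. Note that `respond_to?` asks whether the method exists without calling it, unlike `p Ncurses::Form.form_driver`.

```ruby
module FakeCurses
  module Form; end

  # Simulates the behaviour described above: the Form methods are
  # attached only when initialisation runs.
  def self.initscr
    Form.define_singleton_method(:form_driver_w) { |_form, _code| :ok }
  end
end

module WideInputCheck
  # Ask lazily and cache the answer; meant to be called after initscr.
  def self.available?
    @available = FakeCurses::Form.respond_to?(:form_driver_w) if @available.nil?
    @available
  end
end

before_init = FakeCurses::Form.respond_to?(:form_driver_w)
FakeCurses.initscr
after_init = WideInputCheck.available?

puts "before init: #{before_init}, after init: #{after_init}"
```

Caching has a drawback worth noting: if the check is ever called before initialisation, the negative answer sticks.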

siefca · 2013-11-24T00:27:47Z

Ok. Buffer is now form_driver_w-ized.

gauteh · 2013-11-24T00:47:32Z

Nice work, seems to work fine on my Arch Linux:

Let's hold it a while, do some testing, and see if we get a response from Ncurses upstream. If we do, we surely will have to refine it ;)

siefca · 2013-11-24T02:15:44Z

Nice coding with you. Thanks!

gauteh · 2013-11-28T14:09:14Z

Does it work without NCURSES_OPAQUE set? https://github.com/sup-heliotrope/ncursesw-ruby/blob/master/extconf.rb#L154

siefca · 2013-11-28T15:08:14Z

It's working (diacritics), but on Mac OS enabling NCURSES_OPAQUE causes lags during loading threads.

gauteh · 2013-12-16T14:48:21Z

Fixed in 7e7858c.

gauteh mentioned this issue Nov 19, 2013

bind get_wch and form_driver_w for wide char input sup-heliotrope/ncursesw-ruby#12

Closed

siefca mentioned this issue Nov 23, 2013

Fix input multibyte characters #183

Closed

gauteh closed this as completed Dec 16, 2013
