UTF-8 characters are not displayed / read properly on Mac OS X Snow Leopard #180

Closed
siefca opened this issue Nov 19, 2013 · 93 comments

@siefca
Contributor

siefca commented Nov 19, 2013

Hi,

I'm using Mac OS X Snow Leopard and Ruby 1.9.3-p448 with ncurses library v5.9 from https://github.com/homebrew/homebrew-dupes/commits/master/ncurses.rb

I followed the installation instructions from the wiki. Everything works great, but I am unable to enter any UTF-8 character into an input line in sup (searching, entering Subject, To or other lines etc.). UTF-8 characters in the "pager" (i.e. inbox mode and when viewing messages) are displayed properly; the problem is only with the input line.

I thought it was something to do with input handling (the meta key), but copying and pasting UTF-8 characters also produces mojibake (unreadable characters):

[screenshot: sup-utf8]

BTW, ViM and other tools – like irb for instance – compiled with the same ncurses library are displaying characters properly.

Please help me and save all the pandas from around the world. :]

PS: I changed scroll_mode.rb a bit and put UTF-8 character into prompt – it's displayed properly:

[screenshot: sup-utf8-tweak]

@gauteh
Member

gauteh commented Nov 19, 2013

Are you using latest Sup 0.15?

@siefca
Contributor Author

siefca commented Nov 19, 2013

Hmm, maybe nonblocking_getch is too greedy when reading characters that have more than one byte?

@siefca
Contributor Author

siefca commented Nov 19, 2013

I'm using sup v0.15.0.

@gauteh
Member

gauteh commented Nov 19, 2013

What encoding do you have set? Typically shown before Sup starts or in ~/.sup/log. The get char stuff is a bit hacky because ncurses only gives us ASCII-8bit chars without any encoding information, so we fix_encoding! it. Check out:
lib/sup/textfield.rb:172.

Also; what terminal are you using and what encoding is it set to? Does it match your SHELL encoding variables?
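
For readers following along, the re-tagging idea can be sketched in plain Ruby. This is only an illustration and not Sup's actual fix_encoding! implementation; the two byte values are sample input (the UTF-8 encoding of U+0105).

```ruby
# Two raw bytes as a byte-oriented read might deliver them (sample input:
# the UTF-8 encoding of U+0105). pack('C*') yields a binary-tagged string.
raw = [0xC4, 0x85].pack('C*')

# force_encoding changes only the encoding tag, never the bytes themselves.
text = raw.dup.force_encoding(Encoding::UTF_8)

# A lone lead byte is not a valid UTF-8 string on its own.
half = [0xC4].pack('C*').force_encoding(Encoding::UTF_8)

puts raw.length              # counted in bytes
puts text.length             # counted in characters
puts text.valid_encoding?
puts half.valid_encoding?
```

Re-tagging only helps once all bytes of a character have arrived, which is why a read that hands over one byte at a time matters here.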

@siefca
Contributor Author

siefca commented Nov 19, 2013

Yeah, but when you do IO.select() and then getch you'll always get a single byte, not a single character.

EDIT: or not :> I'll test it returning .ord or something.

I'm collecting info you requested, 1 sec.

@siefca
Contributor Author

siefca commented Nov 19, 2013

$ locale

LANG="pl_PL.UTF-8"
LC_COLLATE="pl_PL.UTF-8"
LC_CTYPE="pl_PL.UTF-8"
LC_MESSAGES="pl_PL.UTF-8"
LC_MONETARY="pl_PL.UTF-8"
LC_NUMERIC="pl_PL.UTF-8"
LC_TIME="pl_PL.UTF-8"
LC_ALL="pl_PL.UTF-8"

$ echo $TERM
xterm

The terminal emulation program is iTerm.

[2013-11-19 13:43:13 +0100] Welcome to Sup! Log level is set to info.
[2013-11-19 13:51:07 +0100] using character set encoding "UTF-8"
[2013-11-19 13:51:08 +0100] dynamically loading setlocale() from libc.dylib
[2013-11-19 13:51:08 +0100] setting locale...
[2013-11-19 13:51:08 +0100] locking /****/.sup/lock...
[2013-11-19 13:51:42 +0100] locking /****/.sup/lock...
[2013-11-19 13:51:42 +0100] no draft source, auto-adding...
[2013-11-19 13:51:42 +0100] starting curses
[2013-11-19 13:51:42 +0100] loading user colors from /****/.sup/colors.yaml
[2013-11-19 13:51:42 +0100] initializing log buffer
[2013-11-19 13:51:42 +0100] Welcome to Sup! Log level is set to debug.
[2013-11-19 13:51:42 +0100] initializing inbox buffer
[2013-11-19 13:51:42 +0100] ready for interaction!
...
[2013-11-19 13:54:43 +0100] stopped cursing
[2013-11-19 13:54:43 +0100] no fatal errors. good job, william.
[2013-11-19 13:54:43 +0100] saving index and sources...
[2013-11-19 13:54:43 +0100] Flushing Xapian updates to disk. This may take a while...
[2013-11-19 13:54:43 +0100] unlocking /****/.sup/lock...

@gauteh
Member

gauteh commented Nov 19, 2013

Hm, ok. I'll check later with my own Mac if the behavior is the same. I assume this is iTerm2.

If you know how to fix it please do. The stuff that allows the use of old ncursesw-ruby is unnecessary and can be removed [0]. We require ncursesw and our own ncursesw gem as well as a recent ncurses library.

[0] https://github.com/sup-heliotrope/sup/blob/develop/lib/sup/buffer.rb#L36

By the way, as far as I know: xterm isn't necessarily UTF8, at least on Unix. Might be that Mac OS X Snow Leopard has a termcap/info file that makes it UTF8 though.

@siefca
Contributor Author

siefca commented Nov 19, 2013

I've put some debug output there, and two bytes are returned instead of one character code.

When typing a space, Ncurses.getch returns 32.

When typing the letter 'ą' (a with ogonek), Ncurses.getch returns 196 and then 133 on the next call.
Each byte code is returned separately. I'll try to fix it.
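
One way to reassemble such a byte stream is to look at the first byte of each sequence to learn how many bytes follow. The sketch below is a hypothetical helper written for this thread, not code from sup; `queue` stands in for successive getch results.

```ruby
# Number of bytes in a UTF-8 sequence, judged from its first byte.
def utf8_sequence_length(lead)
  if lead < 0x80
    1                               # 0xxxxxxx: plain ASCII
  elsif (lead & 0xE0) == 0xC0
    2                               # 110xxxxx
  elsif (lead & 0xF0) == 0xE0
    3                               # 1110xxxx
  elsif (lead & 0xF8) == 0xF0
    4                               # 11110xxx
  else
    raise ArgumentError, "not a UTF-8 lead byte: #{lead}"
  end
end

# Pull one complete character off the front of an array of byte values.
def read_utf8_char(bytes)
  lead = bytes.shift
  rest = bytes.shift(utf8_sequence_length(lead) - 1)
  ([lead] + rest).pack('C*').force_encoding(Encoding::UTF_8)
end

queue  = [32, 196, 133]             # space, then the two bytes reported above
first  = read_utf8_char(queue)
second = read_utf8_char(queue)
puts first.inspect
puts second.ord
```

The sketch assumes the continuation bytes are already available; a real input loop would also have to cope with a sequence that is cut off mid-character.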

@siefca
Contributor Author

siefca commented Nov 19, 2013

I just found that there is getwch function in curses designed to get wide characters from input, but it's not bound in ncursesw-sup.

@gauteh
Member

gauteh commented Nov 19, 2013

I had simply assumed that the wide-character functions were the ones being wrapped when the gem is configured for wide chars; apparently not. If ncursesw-sup needs modifications, we'll just release a new version of it along with the next sup.

@siefca
Contributor Author

siefca commented Nov 19, 2013

For now I'm playing with Ncurses.getch, and the interesting thing is that for wide characters it does not return code points but the values of the individual bytes of the encoded character. Assembling the character with pack('U*') failed; raw pack('C*') succeeded.

Until there is a fix in ncursesw-sup we could have a workaround there.
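
The difference between the two pack directives can be reproduced outside sup. A minimal sketch, using the two byte values reported earlier in the thread as input:

```ruby
bytes = [196, 133]

# 'U*' treats each integer as a Unicode code point and encodes each one
# separately, so two integers become two characters.
as_code_points = bytes.pack('U*')

# 'C*' treats each integer as one raw byte; the result is binary-tagged and
# still needs an encoding tag before Ruby sees it as one character.
as_raw_bytes = bytes.pack('C*').force_encoding(Encoding::UTF_8)

puts as_code_points.length   # two characters: U+00C4 and U+0085
puts as_raw_bytes.length     # one character: U+0105
```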

@siefca
Contributor Author

siefca commented Nov 19, 2013

I was able to modify nonblocking_getch from buffer.rb so it returns proper decimal codes for multibyte UTF-8 characters.

However, handle_input from textfield.rb, which uses nonblocking_getch to get characters, passes them in decimal form to Ncurses::Form.form_driver, and that value is now different from what was returned before. I'm worried that the decimal codes overlap with Ncurses control codes. Checking…

@gauteh
Member

gauteh commented Nov 19, 2013

Ok, I wrapped the get_wch function here: sup-heliotrope/ncursesw-ruby#12. You could try to build the gem and implement it using that function.

@gauteh
Member

gauteh commented Nov 19, 2013

Actually, hang on, it's not complete yet.

@siefca
Contributor Author

siefca commented Nov 19, 2013

Unfortunately, passing the decimal values of UTF-8 characters returned by the modified nonblocking_getch conflicts with special key codes, e.g. Polish 'ą' (decimal code: 261) has the same value as Ncurses::KEY_RIGHT. For this workaround to work I would have to add some big value and then subtract it just before the code is passed to form_driver.
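
The overlap is easy to check without loading ncurses. The sketch below hard-codes the key code values as they appear in ncurses' curses.h (KEY_MIN 0401, KEY_RIGHT 0405, KEY_MAX 0777, all octal); treating them as fixed constants is an assumption of the sketch, and a real check should read them from the Ncurses module.

```ruby
# Key code values copied from ncurses' curses.h (octal), hard-coded so the
# sketch runs without the ncursesw gem.
KEY_MIN   = 0o401
KEY_RIGHT = 0o405
KEY_MAX   = 0o777

letters = %w[ą ć ę ł ń ó ś ź ż]

# Code points that fall inside the range reserved for curses key codes.
colliding = letters.select { |ch| (KEY_MIN..KEY_MAX).cover?(ch.ord) }

puts 'ą'.ord == KEY_RIGHT
puts colliding.join(' ')
```

Of the Polish diacritics listed, only 'ó' (243) stays below the key code range, which matches the observation later in the thread that it is the one letter that gets displayed.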

@gauteh
Member

gauteh commented Nov 19, 2013

Ok, the wrapper should be ok in sup-heliotrope/ncursesw-ruby#12 now. Does that help?

@siefca
Contributor Author

siefca commented Nov 19, 2013

Ok, tested. It's working (get_wch is returning decimal values for UTF-8 characters), but still there is a conflict with Ncurses::KEY_RIGHT and probably other special codes.

@siefca
Contributor Author

siefca commented Nov 19, 2013

I see three solutions:

  1. Use my implementation of get_wch, which could add a magic value of 100000 so that a UTF-8 character can be detected later and recovered by subtraction, or decorate the returned object with a flag that says it was multibyte.
  2. Use native get_wch but somehow tweak defines of key codes and make them negative. [probably bad idea]
  3. Compile bytes of multibyte chars later. [might be messy - spread in couple places]
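
Option 1 can be sketched as follows. The method names are hypothetical and this is not the code from any branch mentioned here; it only shows the add-then-subtract idea, which relies on curses key codes being far smaller than the offset.

```ruby
MULTIBYTE_OFFSET = 100_000   # the magic value suggested in option 1

# Mark a code point as "character, not key code" by shifting it upwards.
def tag_character(code_point)
  code_point + MULTIBYTE_OFFSET
end

# Undo the shift just before the value is handed to the form layer.
def untag(value)
  if value >= MULTIBYTE_OFFSET
    [:char, [value - MULTIBYTE_OFFSET].pack('U')]
  else
    [:key, value]
  end
end

p untag(tag_character(261))   # tagged: treated as the character U+0105
p untag(261)                  # untagged: treated as a key code
```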

@gauteh
Member

gauteh commented Nov 19, 2013

I could also implement get_wch correctly; it returns whether the value is a keycode or not. I would have to change the logic a bit, though.
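
In the C API, get_wch reports through its return value whether it read a character (OK), a function key (KEY_CODE_YES) or nothing (ERR). Below is a sketch of how a caller might branch on that. It assumes a binding that hands back a status and a value as a pair, which may not match what the gem ends up returning; the constants are copied from curses.h so the sketch runs on its own.

```ruby
# Return codes of get_wch as defined in ncurses' curses.h.
OK           = 0
KEY_CODE_YES = 0o400
ERR          = -1

# Turn a (status, value) pair into something unambiguous for the caller.
def interpret_wch(status, value)
  case status
  when OK           then [:char, [value].pack('U')]  # value is a code point
  when KEY_CODE_YES then [:key, value]               # value is a KEY_* code
  else                   [:none, nil]                # ERR: nothing read
  end
end

p interpret_wch(OK, 261)
p interpret_wch(KEY_CODE_YES, 261)
p interpret_wch(ERR, 0)
```

With the status kept alongside the value, 261 no longer has to mean two things at once.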

@siefca
Contributor Author

siefca commented Nov 19, 2013

That would be great, but will not help much since form_driver might be buggy.

I've created a small class called Ncurses::CharCode, which stores character codes along with some multibyte flags to let other methods know what is coming in on input. That kills the bug with invalid characters being displayed, but because of the problem in the other component mentioned above it doesn't solve the problem completely.

https://github.com/siefca/sup/tree/fix_input_multibyte_characters

The real problem now is in the Ncurses::Form.form_driver method. It doesn't work with some characters (or something is wrong with my setup). The multibyte characters are passed through correctly, but some of them aren't displayed when passed to this method.

I've found some forum posts saying there is a wide version of libform, but I don't know how to check whether it's in use (or whether there are flags that make it handle multibyte characters properly). Some of the discussions:

@gauteh
Member

gauteh commented Nov 19, 2013

Yeah, it seems you are right. See latest commit in the ncursesw pull request for where I return the key type as well. It doesn't really help, since the form_driver messes things up.

@gauteh
Member

gauteh commented Nov 19, 2013

I think formw is part of libncursesw; but you can modify extconf.rb in the gem to fail if it doesn't find it:

diff --git a/extconf.rb b/extconf.rb
index 1c60bbd..74b5464 100644
--- a/extconf.rb
+++ b/extconf.rb
@@ -134,7 +134,9 @@ end

 puts "checking for the form library..."
 if have_header("form.h")
-  have_library("formw", "new_form")
+  if not have_library("formw", "new_form")
+    raise "formw library not found"
+  end
 else
   raise "form library not found."
 end

@siefca
Contributor Author

siefca commented Nov 19, 2013

BTW, is there some version of form_driver that takes wchar_t instead of int?

@siefca
Contributor Author

siefca commented Nov 20, 2013

Re, I'm trying to use get_wch but it returns a so-called wide character, which is hard to transform into some encoding (it's probably UTF-32BE but I'm not sure). Is there any chance you could wrap a function that would return an array whose second element is the UTF-8 version of the character or its decimal representation?

@gauteh
Member

gauteh commented Nov 20, 2013

I changed it to return the wide char as a Fixnum (gauteh/ncursesw-ruby@4481f31). As far as I can understand the widechar encoding depends on the input from the terminal (terminal codeset + encoding), so it should be interpreted based on that.
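
If the integer turns out to be a Unicode code point, turning it into a UTF-8 string in Ruby is short. The sketch below assumes exactly that, which typically holds for wchar_t in UTF-8 locales on Linux and OS X but is not guaranteed by the C standard. It also shows the UTF-32BE reading guessed at above giving the same result.

```ruby
wide = 261   # sample value: what get_wch might hand back for U+0105

# Integer#chr with an explicit encoding builds a one-character string.
via_chr = wide.chr(Encoding::UTF_8)

# Same value viewed as four big-endian bytes, i.e. one UTF-32BE code unit,
# then transcoded to UTF-8.
via_utf32 = [wide].pack('N').force_encoding('UTF-32BE').encode('UTF-8')

puts via_chr == via_utf32
puts via_chr.bytes.inspect
```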

@gauteh
Member

gauteh commented Nov 20, 2013

Maybe we need to use the form set_field functions when it is a regular char. form_driver does not seem to have a wide char equivalent.

@gauteh
Member

gauteh commented Nov 20, 2013

By the way, does it fix your problem if you set:

    LibC.setlocale(6, "")  # LC_ALL == 6

to:

    LibC.setlocale(6, "pl_PL.UTF-8")  # LC_ALL == 6

in bin/sup?

@gauteh
Member

gauteh commented Nov 20, 2013

I can get the example examples/form_get_wch.rb, which I made from examples/form.rb in ncursesw-sup, to work by setting the field and packing the input as a Unicode char (check the latest commit on https://github.com/gauteh/ncursesw-ruby/compare/bind_get_wch). If the gem is linked against ncursesw and formw, that should work.

@gauteh
Member

gauteh commented Nov 20, 2013

The IO.select wrapping in https://github.com/siefca/sup/blob/fix_input_multibyte_characters/lib/sup/buffer.rb#L73 should not be necessary for nonblocking anymore. It should also be removed in the final patch.

@siefca
Contributor Author

siefca commented Nov 22, 2013

Your idea to add the char using add_wch, or to set the field buf and form_post move right, is the right path.

@gauteh
Member

gauteh commented Nov 22, 2013

Yeah. Actually form_get_wch works for Value2 on iTerm2 now. But not on Linux.

@siefca
Contributor Author

siefca commented Nov 22, 2013

Out of curiosity I compiled ncurses with tracing.

This is for the ó character, which is displayed properly:


called {form_driver(0x101874530,243)
+ called {Data_Entry(0x101874530,{'ó' = 0363})
+ + called {werase(0x101874750)
+ + return }0
+ + called {wmove(0x101874750,0,9)
+ + return }0
+ + called {winch(0x101874750)
+ + return }{' ' = 040}
+ + called {wmove(0x101874750,0,0)
+ + return }0
+ + called {winsch(0x101874750, {'ó' = 0363})
render_char bkg {{ ' ' = 040 }} (0), attrs {A_NORMAL} (0) -> ch {{ 'Ã' = 0303, '³' = 0263 }} (0)
+ + return }0
+ + called {IFN_Next_Character(0x101874530)
+ + return }0
+ return }0
+ called {_nc_Refresh_Current_Field(0x101874530)
+ + called {wsyncup(0x101874750)
+ + return }
+ + called {wtouchln(0x101874750,0,1,0)
+ + return }0
+ + called {wmove(0x101874750,0,1)
+ + return }0
+ + called {wcursyncup(0x101874750)
+ + + called {wmove(0x1006cc3c0,4,19)
+ + + return }0
+ + return }
+ return }0
return }0

This is for the ą character, which is not displayed:

called {form_driver(0x10187d060,261)
+ called {_nc_Refresh_Current_Field(0x10187d060)
+ + called {wsyncup(0x10187d270)
+ + return }
+ + called {wtouchln(0x10187d270,0,1,0)
+ + return }0
+ + called {wmove(0x10187d270,0,0)
+ + return }0
+ + called {wcursyncup(0x10187d270)
+ + + called {wmove(0x101852700,4,18)
+ + + return }0
+ + return }
+ return }0
return }-8

The magic value of -8 is E_UNKNOWN_COMMAND. The failing line is:

if (!(c & (~(int)MAX_REGULAR_CHARACTER)))

MAX_REGULAR_CHARACTER is 0xff (binary 11111111).

The ą, along with some other UTF-8 characters, will always fail this test. Look:

'ą'.ord.to_s(2)
# => "100000101"

'ł'.ord.to_s(2)
# => "101000010"

'ó'.ord.to_s(2)         # this one fits in 8 bits, passes and will be displayed
# => "11110011"

But that's not the end. Even if it passed, this conditional statement is then evaluated:

if (!iscntrl(UChar(c)))

Which fails.

UChar(c) is a synonym for (unsigned char) c, and iscntrl() is a function that checks whether a character is a control character (not printable). It's defined in ctype.c from libc and uses the __ctype_b_C[] array defined in C-ctype_ct.c from the locale subdirectory. By default it classifies many characters as control characters (including ą and ł), and by design it should use different arrays depending on the current locale. My guess is that it doesn't do that, or it does but the ranges still overlap.
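
Both checks can be imitated in Ruby to see what happens to each code point. This is a rough model of the two C expressions quoted above, not a port of the ncurses source: the first method mirrors the mask against MAX_REGULAR_CHARACTER, and the second mirrors what a cast to unsigned char keeps, namely the low 8 bits.

```ruby
MAX_REGULAR_CHARACTER = 0xff

# Mirrors: !(c & (~(int)MAX_REGULAR_CHARACTER))
def passes_range_check?(c)
  (c & ~MAX_REGULAR_CHARACTER).zero?
end

# Mirrors: (unsigned char) c, which keeps only the low 8 bits.
def as_unsigned_char(c)
  c & 0xff
end

%w[ą ł ó].each do |ch|
  c = ch.ord
  printf("%s code=%3d range_ok=%-5s low_byte=%d\n",
         ch, c, passes_range_check?(c), as_unsigned_char(c))
end
```

For 'ą' the low byte comes out as 5, an ASCII control code, so the value that reaches iscntrl() no longer has anything to do with the letter that was typed.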

@siefca
Contributor Author

siefca commented Nov 22, 2013

Yeah, LC_CTYPE is the key, but it seems it doesn't open the right door. On Mac OS X the mapping may be broken. I'll try mklocale from Debian to regenerate the CTYPE locale file for UTF-8 or for pl_PL.UTF-8.

BTW:

@siefca
Contributor Author

siefca commented Nov 22, 2013

I patched ncurses to use iswcntrl but no luck. Looking for charmaps.

@siefca
Contributor Author

siefca commented Nov 23, 2013

I patched ncurses to use isrune and removed the first condition when wide character support is enabled, but it seems that Data_Entry(), which is called in form_driver(), cannot render it and prints control code keys mapped to ASCII. :/

We can assume that form_driver is not multibyte compliant, at least on OS X.

@gauteh
Member

gauteh commented Nov 23, 2013

I'm working on a patched version of ncurses with wide versions of both functions, form_driver_w and Data_Entry_w; I'll push it to GitHub in a moment with an example.

@siefca
Contributor Author

siefca commented Nov 23, 2013

👍

BTW, if !iscntrl or !iswcntrl is giving false positives, try iswrune.

@gauteh
Member

gauteh commented Nov 23, 2013

Check out: gauteh/ncurses@master...form_driver_w

@gauteh
Member

gauteh commented Nov 23, 2013

For me the example form_get_wch.rb from the latest link ☝️ works with form_driver_w in both Linux and Mac OS X.

I installed my version of ncurses using brew:

$ brew edit ncurses

and change the url to: https://github.com/gauteh/ncurses/archive/form_driver_w.zip (comment out the mirror and sha1 lines).

$ brew reinstall ncurses
$ brew link ncurses --force

Then check out the branch above; build and install the gem; and test the form_get_wch.rb script.

Since it is relatively easy to use this version of ncurses through brew, we might just check in Sup whether form_driver_w is available and use it, or otherwise fall back to form_driver.
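
Such a fallback could look roughly like the sketch below. The two stub modules stand in for Ncurses::Form with and without the patch, and the arguments are passed through untouched because the exact parameter list of form_driver_w is not spelled out in this thread. All names other than form_driver and form_driver_w are hypothetical.

```ruby
# Stub for a form library built without the wide-character patch.
module NarrowForm
  def self.form_driver(*_args)
    :narrow
  end
end

# Stub for a form library that also offers the wide entry point.
module WideForm
  def self.form_driver(*_args)
    :narrow
  end

  def self.form_driver_w(*_args)
    :wide
  end
end

# Prefer form_driver_w when the module responds to it, else fall back.
def drive_form(form_module, *args)
  driver = form_module.respond_to?(:form_driver_w) ? :form_driver_w : :form_driver
  form_module.public_send(driver, *args)
end

p drive_form(NarrowForm, nil, 261)
p drive_form(WideForm, nil, 261)
```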

@siefca
Contributor Author

siefca commented Nov 23, 2013

Ok, installed. Checking how sup will integrate.

@siefca
Contributor Author

siefca commented Nov 23, 2013

!!!!!!!!!!!!!!!!!!!!!!!!

[screenshot, 2013-11-23 17:29:41]

@gauteh
Member

gauteh commented Nov 23, 2013

Cool, now it remains to see if get_cursed_value returns something sensible.. 😀

@siefca
Contributor Author

siefca commented Nov 23, 2013

Ok, I moved some stuff to another file. I'll check it after I eat something.

@siefca
Contributor Author

siefca commented Nov 23, 2013

  1. What do you think is better: replace Form.form_driver by Form.form_driver_w everywhere (and rewrite Form.form_driver_w if there is no support – like it's done now) or rewrite Form.form_driver method if Form.form_driver_w is present?

    In the first case form_driver_w will be used everywhere; in the second, form_driver will remain. IMHO the first option is better, since it will clearly tell developers that the wide version is used and may not behave like the standard version.

  2. How about moving Ncurses::CharCode to ncurses-ruby?

@gauteh
Member

gauteh commented Nov 23, 2013

  • Replace with form_driver_w - as you have done now, but everywhere.

I have submitted form_driver_w upstream, there may be no hope of it getting merged, in which case form_driver_w patched into the brew recipe in Mac OS X might be a workaround that people have to do.

  • I think it should stay in Sup. It is too specific, and would be too high-level for ncursesw-sup.

@siefca
Contributor Author

siefca commented Nov 23, 2013

Ok, ok.

@siefca
Contributor Author

siefca commented Nov 23, 2013

I have a strange problem in testing if Ncurses::Form.form_driver_w exists. The module Ncurses::Form seems to not have such method defined when it's queried in module context. I checked with Ncurses::Form.form_driver and it's also missing.

require 'ncursesw'
require 'sup/util'

p Ncurses::Form.form_driver

gives:

undefined method `form_driver' for Ncurses::Form:Module (NoMethodError)

@gauteh
Member

gauteh commented Nov 23, 2013

The function isn't defined until after init has been called on Ncurses.

@siefca
Contributor Author

siefca commented Nov 23, 2013

Ok, I'll create a checker module method and call it after initscr.
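
The timing issue can be illustrated with a stub that, like the behaviour described above, only gains its form methods once an init call has run. Everything here is a stand-in written for illustration; FakeCurses and WideSupport are hypothetical names and do not exist in sup or the gem.

```ruby
# Stand-in for a binding whose form functions appear only after init.
module FakeCurses
  module Form; end

  def self.initscr
    Form.define_singleton_method(:form_driver_w) { |*_args| 0 }
  end
end

# Checker that must be called after initscr to give a meaningful answer.
module WideSupport
  def self.available?(form_module)
    form_module.respond_to?(:form_driver_w)
  end
end

before = WideSupport.available?(FakeCurses::Form)
FakeCurses.initscr
after = WideSupport.available?(FakeCurses::Form)

puts before
puts after
```

Asking too early gives a false negative rather than an error, so the result should not be cached before initialization has happened.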

@siefca
Contributor Author

siefca commented Nov 24, 2013

Ok. Buffer is now form_driver_w-ized.

@gauteh
Member

gauteh commented Nov 24, 2013

Nice work, seems to work fine on my Arch Linux:
[screenshot: 2013-11-24-014314_1366x768_scrot]

Let's hold it a while, do some testing, and see if we get a response from Ncurses upstream. If we do, we surely will have to refine it ;)

@siefca
Contributor Author

siefca commented Nov 24, 2013

Nice coding with you. Thanks!

@gauteh
Member

gauteh commented Nov 28, 2013

Does it work without NCURSES_OPAQUE set? https://github.com/sup-heliotrope/ncursesw-ruby/blob/master/extconf.rb#L154

@siefca
Contributor Author

siefca commented Nov 28, 2013

It's working (diacritics), but on Mac OS enabling NCURSES_OPAQUE causes lags during loading threads.

@gauteh
Member

gauteh commented Dec 16, 2013

Fixed in 7e7858c.

@gauteh gauteh closed this as completed Dec 16, 2013