Issue with Displaying Chinese Characters #7

MacErlang · 2020-02-01T21:41:55Z

Hello,

I am using an iMac, and Sabaki seems to have difficulty displaying Chinese characters properly.
This often occurs with sgf files that I downloaded from the internet. Following comments from
another thread, I have tried to add CA[GB2312] to the file, but it did not work.
A sample file is given below. Can someone enlighten me with a solution to this?

Many thanks in advance,
Shun

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[£®“ª£©∆À°¢µπ∆À”Î“™µπ∆À
1£Æ∆À
     ∞◊∆Â‘⁄∫⁄∆Âœ»œ¬µƒ ±∫Ú£¨…˙À¿»Á∫Œƒÿ£ø
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[µƒ∫⁄1µ„£¨∞◊2∂• ±£¨∫⁄3º¥ «∆À£¨];W[rc]C[∞◊4÷ªµ√Ã·£¨∫⁄5≥‘£¨∞◊∆Â±ª…±°£]
;B[se]C[[“™µ„\]£∫À¿ªÓŒ Ã‚÷Æ÷–”√µΩ°∞∆À°±µƒµÿ∑Ω∫‹∂‡£¨∆À « π∂‘∑Ωµƒ—€±‰≥…ºŸ—€µƒ ÷∂Œ°£]
N[“™µ„])

The actual file is attached below, with added .txt file extension:

__Vs__9.sgf.txt

yishn · 2020-02-01T22:51:16Z

I don't think your file is encoded in GB2312. I've tried opening your file with that encoding in a text editor and got this:

(;AB[pb][pc][pd][pe][qe][rf][sf][qg][pa][qa]AW[ra][rb][qb][qc][qd][re][sd]C[拢庐鈥溌?拢漏鈭喢€掳垄碌蟺鈭喢€鈥澝庘€溾劉碌蟺鈭喢€
1拢脝鈭喢€
     鈭炩棅鈭喢傗€樷亜鈭?鈦勨垎脗艙禄艙卢碌茠聽卤鈭?脷拢篓鈥λ櫭€驴禄脕鈭?艗茠每拢酶
]
AP[MultiGo:4.2.1]SZ[19]MULTIGOGM[1]
;B[sb];W[sc];B[rd]C[碌茠鈭?鈦?1碌鈥灺Ｂㄢ垶鈼?2鈭傗€⒙犅甭Ｂㄢ埆鈦?3潞楼聽芦鈭喢€拢篓];W[rc]C[鈭炩棅4梅陋碌鈭毭兟仿Ｂㄢ埆鈦?5鈮モ€樎Ｂㄢ垶鈼娾垎脗卤陋鈥β甭奥?]
;B[se]C[[鈥溾劉碌鈥瀄]拢鈭?脌驴陋脫艗聽脙鈥毭访喢封€撯€濃垰碌惟掳鈭炩垎脌掳卤碌茠碌每鈭懳┾埆鈥光垈鈥÷Ｂㄢ垎脌聽芦聽蟺鈭傗€樷垜惟碌茠鈥斺偓卤鈥扳墺鈥β号糕€斺偓碌茠聽梅鈭偱捖奥?]
N[鈥溾劉碌鈥瀅)

Not only is this complete gibberish, this is not even valid SGF (in the last line, it's missing a ]).
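For readers following along: the garbled sample in the first post looks consistent with GB2312 bytes that were displayed through the Mac Roman code page. That is an inference from the characters themselves, not something confirmed in this thread. A minimal Python sketch (Python is used only for illustration; Sabaki itself is JavaScript) reproduces the first six garbled characters of the comment from a three-character string:

```python
# Illustration only: assumes the sample was GB2312 bytes shown as Mac Roman.
original = "（一）"                  # fullwidth parentheses around the character for "one"
raw = original.encode("gb2312")      # the bytes as they would be stored in the SGF file

# Reading those bytes with the wrong single-byte code page yields mojibake.
garbled = raw.decode("mac_roman")

# Undoing the wrong decode recovers the original bytes, and therefore the text.
recovered = garbled.encode("mac_roman").decode("gb2312")

print(garbled)
print(recovered)
```

If this guess is right, the attached file itself may be intact, and only the text pasted into the post went through the wrong code page.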

rooklift · 2020-02-02T00:01:17Z

Couldn't find a valid encoding for this, it may have been corrupted somehow before you got it.

MacErlang · 2020-02-02T06:16:02Z

Folks,

I am grateful for your prompt reply/help. I was a bit hasty in the previous post.
I have now found out that the character encoding is GB18030, not GB2312 (which I took
from an earlier related thread). I have the same issue with numerous files. Attached is a
new zip file, containing three files: Test-Original.sgf, Test-GB18030.sgf, and Test-Unix.sgf.
As the names suggest, the first file is the original, which does not display properly in
Sabaki (Version 0.43.3); the second is a revision of the first, obtained by inserting
CA[GB18030] in the first line; and the third is in Unix format, to be explained below.

As you will see, the Chinese characters in the first file are scrambled.

The second file does display Chinese characters correctly in Sabaki. This is good news.
However, it is tedious to make such a revision for a ton of files. So, it appears that Sabaki
does not "automatically" recognize GB18030 characters. I am a novice, and hence am wondering
whether a "simple" remedy exists for this?

The third file was produced by the following process. First, I used BBEdit to create a new, empty
text file, which is by default using the UTF-8 character set and Unix line endings. (Note that
the first two files use ISO-8859-1 character set and Windows line endings.) Then, I dragged the
original file into a Microsoft Edge or Google Chrome browser window. It turns out that, for both
browsers, the Chinese characters in the original file DO get displayed properly! (This does NOT
work for Safari.) So, it appears that these two browsers are able to automatically detect the
GB18030 characters and hence display them properly. Finally, I just copied and pasted the
(legible) browser content into the empty file created by BBEdit and saved it as Test-Unix.sgf.
This resulting file also opens properly in Sabaki without adding any character declaration, as
Sabaki apparently detects UTF-8 file format automatically. Thus, the third file also works, but
it is even more tedious.

So, the question is: what might be a "painless" solution? For example, can Sabaki be made to
recognize and properly display GB18030 characters? This would be highly desirable because
I have found that numerous SGF files on the net have this issue (perhaps because they were
produced by old Windows programs).
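The browser copy-and-paste route described above amounts to decoding the file as GB18030 and saving it again as UTF-8 with Unix line endings. A small Python sketch of that conversion follows; the function name is hypothetical, and it assumes the source really is GB18030 (or its subset GB2312) and carries no `CA[...]` declaration that would contradict the new encoding.

```python
def to_utf8(data: bytes, source_encoding: str = "gb18030") -> bytes:
    """Re-encode SGF bytes as UTF-8 and normalize Windows line endings."""
    # decode() raises UnicodeDecodeError if the bytes are not valid in
    # source_encoding, which is a useful signal that the guess was wrong.
    text = data.decode(source_encoding)
    return text.replace("\r\n", "\n").encode("utf-8")


sample = "(;C[倒扑]\r\n;B[sb])".encode("gb18030")
converted = to_utf8(sample)
print(converted.decode("utf-8"))
```

Applied over a folder of files, this gives the same result as the third test file without going through a browser.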

Your comments and help are again greatly appreciated.

Best,
Shun

Test.zip

yishn · 2020-02-02T08:59:50Z

This probably stems from the fact that we only consider the first 100 bytes for character encoding detection, which in this case do not contain enough Chinese characters. When jschardet is applied to the entire buffer, it correctly detects the encoding as GB2312.
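To illustrate the prefix problem in isolation, the Python sketch below counts the non-ASCII bytes available to a detector inside a given prefix. It does not call jschardet, and the SGF header is a made-up example in which setup stones push the first comment past byte 100.

```python
def non_ascii_count(data: bytes, limit: int) -> int:
    """Count bytes >= 0x80 within the first `limit` bytes of the buffer."""
    return sum(1 for b in data[:limit] if b >= 0x80)


# Hypothetical file: ASCII-only root properties and 19 setup stones come first.
stones = "".join(f"[{column}d]" for column in "abcdefghijklmnopqrs")
header = ("(;GM[1]FF[4]SZ[19]AP[Example:1.0]AB" + stones).encode("ascii")
comment = ("C[" + "倒扑" * 10 + "]").encode("gb2312")
data = header + comment + b")"

print(len(header))                  # length of the ASCII-only part
print(non_ascii_count(data, 100))   # evidence visible in a 100-byte prefix
print(non_ascii_count(data, 1000))  # evidence visible in a 1000-byte prefix
```

With nothing but ASCII in the prefix, no detector can tell GB2312 from UTF-8 or any other ASCII-compatible encoding.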

@fohristiwhirl I believe you introduced the buffer limit. Can you explain your rationale behind it?

rooklift · 2020-02-02T14:23:07Z

I forget. I think the point might have been that SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd] etc etc etc, but the start of the file is more likely to contain names and such.

I seem to recall this was more of an issue for other file formats. e.g. NGF.

If possible, maybe detect charset using some aggregated comments, metadata etc, e.g. tags C, PW, PB, that sort of thing, joined together into a single string?
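The aggregation idea can be sketched in Python as follows. This is only an illustration of the suggestion, not Sabaki's implementation: the regular expression and function name are assumptions, and the scan works on raw bytes, so it can cut a value short when a multi-byte GBK or GB18030 character has `]` (0x5D) or `\` (0x5C) as its second byte. Pure GB2312 text does not have that problem because both bytes of each character are 0xA1 or higher. Later in this thread, detection on such spliced buffers is reported to give wrong results.

```python
import re

# C, PB, or PW followed by one [value]; a backslash escapes the next byte,
# so "\]" inside a value does not end it. The lookbehind skips e.g. GC[...].
PROPERTY = re.compile(rb"(?<![A-Z])(?:C|PB|PW)\[((?:\\.|[^\]\\])*)\]", re.DOTALL)


def collect_text(data: bytes) -> bytes:
    """Join the values of comment and player-name properties into one buffer."""
    return b" ".join(match.group(1) for match in PROPERTY.finditer(data))


sgf = b"(;PB[Black]PW[White]AB[pd]C[note [1\\] here];B[cc]C[second])"
print(collect_text(sgf))
```

The joined buffer would then be handed to the charset detector in place of a fixed-size prefix.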

MacErlang · 2020-02-02T16:19:13Z

@yishn I have checked a few other files, and your assessment seems valid.

MacErlang · 2020-02-04T03:25:49Z

Great. Does this mean that the fix will be in the next release? Thanks, Shun

MacErlang · 2020-03-12T16:19:01Z

Hello,

Just downloaded and installed the new version, and this problem has not been resolved.
In fact, the new version won't even properly display a file that has been explicitly declared to
be GB2312 encoded. Don't know what is going on? Please help!

I have attached two files: one is the original, which won't display properly in either 0.43.3 or 0.50,
and the other has an added GB2312 declaration, which loads properly in 0.43.3 but
NOT in 0.50.

Thanks,
Shun-Chen

1020 Test.zip

yishn · 2020-03-12T16:56:46Z

Weird, the file with the added GB2312 declaration loads fine for me.

Addresses #7

MacErlang · 2020-03-12T20:10:41Z

@yishn I have tested several other files with added CA[GB2312], and none of them
display properly. No idea why my installation is different, as my Mac binary was downloaded
from the release link.

Also, how do I test your new commit with an increased buffer size? Do I need to compile
Sabaki myself? Thanks for your help.

yishn · 2020-03-12T21:36:10Z

Yes, you'd need to build Sabaki yourself.

MacErlang · 2020-03-12T23:43:03Z

@yishn I have compiled Sabaki myself, and the issue persists.
I used the commands:

git clone https://github.com/SabakiHQ/Sabaki
cd Sabaki
npm install
npm run build

The compilation seemed to have worked fine. The executable and a screen dump are here:

https://www.dropbox.com/s/uqtyg9p0uwmos4i/Sabaki%20Compile.zip?dl=0

Thanks for your help,
Shun-Chen

yishn · 2020-03-13T00:18:40Z

After investigation, it seems we're accidentally excluding the decoding library from our bundle. This should be fixed on Sabaki master now. Can you pull, rebuild, and see if the problem is now fixed?

MacErlang · 2020-03-13T01:29:39Z

Just compiled again, and it now works fine with and WITHOUT the GB2312 declaration! Great detective work and many thanks.

MacErlang · 2020-03-15T02:09:22Z

@yishn Sorry to bother you again, but the new version still seems to have issues.
I am attaching two files, one named Original.sgf and the other GB2312.sgf. The original
does not have any character declaration, and the other one does. What appears to be an
anomaly is that Original.sgf loads fine in 0.43.3 but does NOT display properly in 0.50.1. The
file GB2312.sgf does load fine in both versions of Sabaki.

So, there appears to be a discrepancy between the two Sabaki versions. The attached file
is fairly simple, so this is rather strange. Any ideas?

Best,
Shun

Sample.zip

yishn · 2020-03-15T10:30:06Z

Hmm... it seems like detecting encoding on spliced test buffers didn't really work. Now we're just falling back to detecting encoding on the first 1000 bytes of the buffer.

MacErlang · 2020-03-15T15:08:57Z

Just compiled after the new commit. The file Original.sgf still does not display properly.

MacErlang · 2020-03-15T16:20:09Z

In trying to figure out what might have gone wrong, I have inspected lots of files, using the
newly compiled version. What is really weird is that the detection does not seem to be consistent.
I have attached two files, one labeled as Good and the other as Bad. As far as I could tell, the two
files are essentially identical in form, and yet one displays properly and the other does not.
Don't know if this might help you pin down the issue or not.

Shun

Samples.zip

MacErlang · 2020-03-15T16:24:32Z

Let me add that both files in Samples.zip load properly in 0.43.3.

yishn · 2020-03-15T16:56:28Z

In v0.43.3, we're guessing encoding based on the first 100 bytes of the files. After extending the encoding guessing to the first 1000 bytes of the file, it doesn't guess GB2312 anymore because, as @fohristiwhirl pointed out, "SGF naturally contains a bunch of UTF-8 looking stuff like B[cc]; W[dd]".

If we restrict ourselves to the first 100 bytes again, your original file would have issues, because its first 100 bytes don't contain any Chinese.

In the short term, we can probably just pick something between 100 bytes and 1000 bytes and guess the encoding based on that. In the long term, we should let the user pick their own encoding.
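As a sketch of what "let the user pick" could look like, the Python function below applies an explicit user choice first, then a declared `CA[...]` property, and only then falls back to guessing. Everything here is an assumption made for illustration: the function name, the priority order, and the UTF-8-then-GB18030 fallback (chosen only because it fits the files in this thread) are not taken from Sabaki.

```python
import re
from typing import Optional


def decode_sgf(data: bytes, user_encoding: Optional[str] = None) -> str:
    """Decode SGF bytes: user choice first, then CA[...], then a fallback chain."""
    if user_encoding:
        return data.decode(user_encoding, errors="replace")

    # Crude search for a declared charset; it may also match "CA[" in a comment.
    declared = re.search(rb"CA\[([A-Za-z0-9_\-]+)\]", data)
    if declared:
        try:
            return data.decode(declared.group(1).decode("ascii"))
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration: fall through to guessing

    for candidate in ("utf-8", "gb18030"):  # assumed order, strict decoding
        try:
            return data.decode(candidate)
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1")  # never fails, but may show mojibake


print(decode_sgf("(;C[倒扑])".encode("gb2312")))
```

Strict UTF-8 decoding rejects most GB-encoded text, which is why it can safely be tried first; the explicit override covers the files where guessing still goes wrong.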

MacErlang · 2020-03-15T17:16:34Z

Does it make sense to let the detection scheme focus on the C[], PB[], and PW[] fields? These are areas where different encoding might make a difference (especially C[]).

yishn · 2020-03-15T17:28:28Z

Yes, that was what we were doing before, using spliced test buffers. But that doesn't work as evidenced by your previous samples. The detected encodings on the spliced test buffers were completely wrong.

MacErlang · 2020-03-17T22:52:44Z

@yishn I have tested some more files, and I am attaching four of them. These are all original
files I downloaded from the web; the only exception is that I added CA[GB2312] to the file
"倒垂莲（共二变）- GB2312.sgf". All of these files have trouble with either 0.50.1 or 0.43.3, in that
most would hang Sabaki. However, I have found that they ALL load just fine in version 0.35.1.
So, I am wondering whether the "older" scheme used in 0.35.1 might be more robust than what has been
implemented in the recent versions?

BTW, I compiled the latest version, but noticed that the new option on user encoding selection
failed to commit properly.

Best,
Shun

Test Files.zip

yishn · 2020-03-18T00:03:29Z

This has nothing to do with encoding, so please open a new issue on Sabaki's repository about the hanging. FYI, the new option for user encoding selection is not implemented yet. It's an open issue; please subscribe to it for updates.

MacErlang · 2020-03-18T01:17:17Z

Thanks, just posted there.

MacErlang changed the title ~~Issue with Display Chinese Character~~ Issue with Displaying Chinese Characters Feb 1, 2020

yishn transferred this issue from SabakiHQ/Sabaki Feb 3, 2020

yishn added the bug Something isn't working label Feb 3, 2020

yishn closed this as completed in 742d8d1 Feb 3, 2020

MacErlang mentioned this issue Mar 12, 2020

Hello, #8

Closed

yishn reopened this Mar 12, 2020

yishn added a commit that referenced this issue Mar 12, 2020

Increasing size of buffer for first encoding guess

3766246

Addresses #7

yishn closed this as completed Mar 13, 2020

yishn reopened this Mar 15, 2020

yishn closed this as completed in 5c22007 Mar 15, 2020
