Better general support for CJK languages #229
Actually, there are more issues with CJK content. It could be a lack of overall CJK language support in latexdiff. As an example, if I have two tex files that differ in only two sentences, the result is:
\DIFdelbegin \DIFdel{一一一一一一一一一一一一
}\DIFdelend \DIFaddbegin \DIFadd{一一一一一一二一一一一一
}\DIFaddend instead of something like
Thank you for highlighting this deficiency. You reported two issues. I will answer the second one first (about only marking differences in text). Actually, I cannot see the characters in your second example; everything is replaced by dashes. It seems support for this encoding is lacking in either GitHub or my browser. I can see the characters in the first post, though. You are right, latexdiff was initially developed without CJK support in mind, as I don't have access to examples to develop this, and no understanding of the language conventions myself.
(if you can't find them, then your version is definitely too old). If you have the commands above in your latexdiff, your script is Han, and it still does not work, then let me know. As copy/paste in a non-native encoding is tricky, could I ask you to make MWE files available for download rather than copy/paste? Ideally, provide me with the old and new files, e.g. put into a zip file and attached here.
On the first reported issue (overlong lines): unfortunately, this is a limitation of the ulem package, which is used for underlining. You can try taking this to the ulem maintainers, but there is nothing I can do about it. A workaround is to choose another highlighting style with the
Thanks for the reply and effort! I will try the latest code later. As for the character issues, there is nothing wrong with your browser or font at all. "一" is one and "二" is two in Chinese, respectively (also obviously). I thought that would showcase my expectations better. Apparently it has some pitfalls :)
Tried the current code. It's not quite working for Chinese, though. First, the code change you mentioned is 6 years old. I am using the As for the code, I think it's only a small improvement, and for Japanese only. Japanese has three different character sets: Katakana, Hiragana and Kanji, with Kanji being basically the same as Chinese characters (I guess, hence However, Chinese characters are all treated the same in programs, so the result is the same whether the change is made or not. I followed the code and tried the following change:
- my $word_ja='\p{Han}+|\p{InHiragana}+|\p{InKatakana}+';
+ my $word_ja='\p{Han}|\p{InHiragana}|\p{InKatakana}';
This should make word splitting character-based for CJ (not K). I tried it on my actual document and it works quite well (I have to work out why the deletion fonts are smaller, though (Edit: Okay, that's just how
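To illustrate why dropping the `+` quantifier matters, here is a minimal sketch in Python rather than Perl. It is not latexdiff's code: the Unicode ranges below are approximations standing in for `\p{Han}`, `\p{InHiragana}` and `\p{InKatakana}` (Python's built-in `re` has no `\p{...}` support), and the names `RUN_BASED`, `CHAR_BASED` and `tokenize` are hypothetical. The sample strings are modeled on the 一/二 example from this thread.

```python
import re

# Approximate code point ranges (an assumption of this sketch) standing in
# for Perl's \p{Han}, \p{InHiragana} and \p{InKatakana}.
HAN = r"\u4e00-\u9fff"
HIRAGANA = r"\u3040-\u309f"
KATAKANA = r"\u30a0-\u30ff"

# With "+": a whole run of same-script characters becomes one token.
RUN_BASED = re.compile(f"[{HAN}]+|[{HIRAGANA}]+|[{KATAKANA}]+")
# Without "+": every matched character becomes its own token.
CHAR_BASED = re.compile(f"[{HAN}]|[{HIRAGANA}]|[{KATAKANA}]")


def tokenize(pattern, text):
    """Return the list of tokens that `pattern` finds in `text`."""
    return pattern.findall(text)


old = "一" * 12
new = "一" * 6 + "二" + "一" * 5

run_old, run_new = tokenize(RUN_BASED, old), tokenize(RUN_BASED, new)
char_old, char_new = tokenize(CHAR_BASED, old), tokenize(CHAR_BASED, new)

# Run-based: each sentence is a single token, so a one-character edit
# makes the whole token differ.
print(len(run_old), len(run_new), run_old == run_new)

# Character-based: tokens can be compared position by position.
changed = [i for i, (a, b) in enumerate(zip(char_old, char_new)) if a != b]
print(len(char_old), changed)
```

With the run-based pattern, a diff tool that compares tokens has nothing smaller than the whole sentence to work with, which is consistent with the whole line being marked as deleted and re-added. With the character-based pattern, only the changed position stands out.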
@ftilmann, apologies if I'm bombarding you with notifications/emails. I just found that I am not the first to come up with this issue; there is another one regarding Japanese: #145. Thus, I propose that my character-based word splitting strategy should be an option, whether it is the default or not, instead of just changing the behavior for good. It would be better if the new behavior were the default, since CJK does not split words by spaces at all. Users can choose the old behavior if they are not happy. The matching regex needs improving, though: it currently matches part of CJK, but not all.
OK, I have implemented this change now, i.e., character-based processing. From the two issues it seems to me that character-based processing is practically always going to be the right thing to do and the closest equivalent to word-wise processing in phonetic languages (two-character 'words' are an edge case, but I think that's unavoidable). So for now I have just changed the behaviour rather than introducing an option as you suggested. The reason is that, in the end, many options make the program and manual quite complex. If people complain and want the old behaviour back, I can still add the option. Thank you for checking this out and also for reminding me of the old issue.
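As a rough illustration of what character-based processing buys, the sketch below diffs two strings character by character with Python's standard `difflib` and wraps the changes in `\DIFdel{}` / `\DIFadd{}`. This is only a simplified stand-in, not latexdiff's actual algorithm or output format: the function name `char_diff_markup` is hypothetical, and the real tool also emits `\DIFdelbegin`/`\DIFaddbegin` wrappers and handles LaTeX commands.

```python
import difflib


def char_diff_markup(old, new):
    """Mark character-level changes with \\DIFdel{} and \\DIFadd{} (simplified)."""
    out = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Unchanged characters are copied through untouched.
            out.append(old[i1:i2])
        else:
            # "delete", "insert" and "replace" all reduce to an optional
            # deleted span followed by an optional added span.
            if i2 > i1:
                out.append("\\DIFdel{" + old[i1:i2] + "}")
            if j2 > j1:
                out.append("\\DIFadd{" + new[j1:j2] + "}")
    return "".join(out)


old = "一" * 12
new = "一" * 6 + "二" + "一" * 5
result = char_diff_markup(old, new)
print(result)
```

Only the characters that actually changed end up inside the markup, while the unchanged characters stay outside it. Because every character here except one is identical, the matcher may attribute the deletion to a different position than a human reader would pick; that is a property of sequence alignment, not of the markup step.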
Agreed. Oh, the sample test is from 5ad4074. I didn't mean to remove the zip file along with the comment text, though. I am Chinese; the sample is just me figuring out what the aforementioned commit was doing :)
The issue is that the CJK characters enclosed in \DIFadd{} or \DIFdel{} do not wrap at the end of lines. This is the first time I have used latexdiff, so I am not sure if I am missing anything. I am using the ctex package to provide the CJK environment.
I am using the texlive-core package, version 2020.57066, on Arch Linux, so latexdiff here is not the latest git version.
Here is a minimal working example.
The generated PDF file, where the issue is shown:
templ-zh-diff.pdf
This is the TeX file generated by latexdiff; the only difference I made is that the line got longer by repeating the same phrase multiple times:
The new version PDF without latexdiff:
templ-zh-new.pdf
The new version Tex file:
I compiled them with