“encoding” Problem #195

Closed
Key033 opened this issue Apr 4, 2020 · 8 comments · Fixed by #276

@Key033

Key033 commented Apr 4, 2020

The version of latexdiff is

This is LATEXDIFF 1.3.0 (Algorithm::Diff 1.15 so, Perl v5.28.1)
(c) 2004-2018 F J Tilmann
Preamble Internal Type UNDERLINE
Preamble Internal Type SAFE
Preamble Internal Type FLOATSAFE

Running on Windows 10, version 1909.

When I run latexdiff with a command such as "latexdiff old.tex new.tex > diff.tex" or "latexdiff --encoding=utf8 old.tex new.tex > diff.tex", the resulting "diff.tex" is encoded as UTF-16 LE, even though "old.tex" and "new.tex" are both encoded as UTF-8. Non-ASCII UTF-8 characters, such as Chinese and Japanese text, come out garbled.

For example,
"old.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个测试文档。
\end{document}

"new.tex"

\documentclass{article}
\usepackage[UTF8]{ctex}
\begin{document}
你好,这是一个新的测试文档。
\end{document}

"diff.tex"

\documentclass{article}
%DIF LATEXDIFF DIFFERENCE FILE
%DIF DEL old.tex   Sat Apr  4 22:12:08 2020
%DIF ADD new.tex   Sat Apr  4 22:12:03 2020
\usepackage[UTF8]{ctex}
%DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF
%DIF UNDERLINE PREAMBLE %DIF PREAMBLE
\RequirePackage[normalem]{ulem} %DIF PREAMBLE
\RequirePackage{color}\definecolor{RED}{rgb}{1,0,0}\definecolor{BLUE}{rgb}{0,0,1} %DIF PREAMBLE
\providecommand{\DIFadd}[1]{{\protect\color{blue}\uwave{#1}}} %DIF PREAMBLE
\providecommand{\DIFdel}[1]{{\protect\color{red}\sout{#1}}}                      %DIF PREAMBLE
%DIF SAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddbegin}{} %DIF PREAMBLE
\providecommand{\DIFaddend}{} %DIF PREAMBLE
\providecommand{\DIFdelbegin}{} %DIF PREAMBLE
\providecommand{\DIFdelend}{} %DIF PREAMBLE
\providecommand{\DIFmodbegin}{} %DIF PREAMBLE
\providecommand{\DIFmodend}{} %DIF PREAMBLE
%DIF FLOATSAFE PREAMBLE %DIF PREAMBLE
\providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE
\providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE
\providecommand{\DIFaddbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFaddendFL}{} %DIF PREAMBLE
\providecommand{\DIFdelbeginFL}{} %DIF PREAMBLE
\providecommand{\DIFdelendFL}{} %DIF PREAMBLE
%DIF LISTINGS PREAMBLE %DIF PREAMBLE
\RequirePackage{listings} %DIF PREAMBLE
\RequirePackage{color} %DIF PREAMBLE
\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE
%DIF DIFCODE_UNDERLINE %DIF PREAMBLE
  moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE
  moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE
} %DIF PREAMBLE
\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE
	language=DIFcode, %DIF PREAMBLE
	basicstyle=\ttfamily, %DIF PREAMBLE
	columns=fullflexible, %DIF PREAMBLE
	keepspaces=true %DIF PREAMBLE
} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE
\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE
%DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF

\begin{document}
浣犲ソ锛孿DIFdelbegin \DIFdel{杩欐槸涓€涓祴璇曟枃妗c€?
 }\DIFdelend \DIFaddbegin \DIFadd{杩欐槸涓€涓柊鐨勬祴璇曟枃妗c€?
 }\DIFaddend\end{document}
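
The garbled characters in the listing above are consistent with UTF-8 bytes being decoded as a legacy Chinese code page somewhere between latexdiff and the output file. A minimal Python sketch of that effect follows; the choice of GBK (CP936) is an assumption, since the thread does not state the console code page.

```python
# Assumption: the console code page is GBK (CP936); the thread does not say so.
original = "你好"

# These are the bytes a UTF-8 aware program writes to stdout.
utf8_bytes = original.encode("utf-8")

# A consumer that believes the bytes are GBK pairs them up differently.
garbled = utf8_bytes.decode("gbk", errors="replace")

print(garbled)  # 浣犲ソ
```

When the bytes do not pair up cleanly, the decoder has to substitute a replacement character and the original bytes are lost, which would be consistent with the stray "?" at the end of the garbled lines.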
@Key033
Author

Key033 commented Apr 4, 2020

I found that if old.tex and new.tex are saved as UTF-8 with BOM, diff.tex contains the correct characters. It is still encoded as UTF-16, but that can easily be re-encoded to UTF-8.
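
For readers who want to script the re-encoding step, here is a minimal Python sketch. The helper name `reencode_utf16_to_utf8` and the file names are illustrative assumptions, and the demo uses a throwaway file in place of a real diff.tex.

```python
import tempfile
from pathlib import Path


def reencode_utf16_to_utf8(src, dst):
    """Read a UTF-16 file (the BOM selects the byte order) and rewrite it as UTF-8 without BOM."""
    text = Path(src).read_text(encoding="utf-16")
    Path(dst).write_text(text, encoding="utf-8")


# Demo with a throwaway file standing in for diff.tex.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "diff_utf16.tex"
    dst = Path(tmp) / "diff_utf8.tex"
    src.write_text("你好，这是一个新的测试文档。\n", encoding="utf-16")
    reencode_utf16_to_utf8(src, dst)
    result = dst.read_bytes()
```

This only helps when the characters inside the UTF-16 file are already correct, as in the BOM case described above; it cannot undo a wrong decoding that happened earlier.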

@ftilmann
Owner

ftilmann commented Apr 4, 2020

So is it solved? What is BOM?

@Key033
Author

Key033 commented Apr 4, 2020

> So is it solved? What is BOM?

The UTF-8 BOM (byte order mark) is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess that a file is encoded in UTF-8.
Ref: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
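
To make the BOM concrete, here is a small Python sketch that checks for those three bytes. The helper `has_utf8_bom` is a hypothetical name used for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


plain = "你好".encode("utf-8")         # UTF-8 without BOM
with_bom = "你好".encode("utf-8-sig")  # UTF-8 with BOM prepended

print(has_utf8_bom(plain), has_utf8_bom(with_bom))  # False True
print(with_bom[:3].hex())                           # efbbbf
```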

I re-encoded the files with VS Code's "Save with Encoding" function.

And I think there is something wrong with the variable $encoding, but I haven't learned Perl.

@ftilmann
Owner

Thanks for this report. The encoding is mostly dealt with by Perl and (as you could see from my question) I have no real insight into it. So I will not tackle this anytime soon but will leave the issue open in case anyone has an insight.

@henrysky

henrysky commented Aug 3, 2022

I have just encountered this issue. On Windows, you should use the good old CMD or PowerShell 6.2+, because the default PowerShell in Windows 10/11 writes the output file as UTF-16 when you use >. Sometimes re-encoding to UTF-8 is not enough: with latexdiff on PowerShell < 6.2, a character like é in the .tex file turns into gibberish (├⌐) that cannot be recovered even by re-encoding to UTF-8. I would say nothing is wrong with latexdiff or Perl.
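
The ├⌐ symptom can be reproduced outside PowerShell. The following Python sketch models the pipeline under two assumptions that the thread does not confirm: that the shell decodes the program's output with the OEM code page CP437, and that it then writes the result as UTF-16 LE with a BOM.

```python
import codecs

# latexdiff emits the UTF-8 bytes for "é".
utf8_bytes = "é".encode("utf-8")  # b'\xc3\xa9'

# Assumption: the shell decodes those bytes with the OEM code page CP437.
decoded_by_shell = utf8_bytes.decode("cp437")

# Assumption: the redirect then writes the decoded text as UTF-16 LE with BOM.
file_bytes = codecs.BOM_UTF16_LE + decoded_by_shell.encode("utf-16-le")

print(decoded_by_shell)      # ├⌐
print(file_bytes[:2].hex())  # fffe

# Re-encoding the file to UTF-8 faithfully preserves the wrong characters.
reencoded = file_bytes[2:].decode("utf-16-le").encode("utf-8")
print(reencoded == utf8_bytes)  # False
```

In this model the damage happens at the decoding step, before the file is written, which is why converting the finished file from UTF-16 to UTF-8 does not bring back the é.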

@jonschz

jonschz commented Nov 5, 2022

Edit: The command below works, but it also breaks UTF-8 characters. I will stick with cmd and consider adding this to the FAQ.

You can use the following in PowerShell to get a UTF-8 output file, but it will still break when the .tex files contain non-standard characters.

latexdiff a.tex b.tex | Out-File output.tex -Encoding utf8

@jonschz

jonschz commented Nov 5, 2022

Edit

The bigger issue seems to be that PowerShell does not use Unicode when piping the output of one command into another; see https://markw.dev/unicode_powershell/. I was able to get latexdiff to work in PowerShell using the following:

> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
> latexdiff .\latex_test_files\utf8_a.tex .\latex_test_files\utf8_b.tex | Out-File -Encoding utf8 out.tex

I would still recommend using cmd instead, and I will work on the pull request now.

Original text

Addendum: It appears that this is a known problem with Perl in general under Windows.

See e.g. https://stackoverflow.com/a/66281302 and StrawberryPerl/Perl-Dist-Strawberry#18.

See also https://stackoverflow.com/q/4942305; many other languages like Python and Node.js have since solved this issue.

I messed around a bit in Perl, tried some things, but it seems like there is no working pure-Perl solution. It seems like the Perl developers cannot easily change this, either, as it will break legacy code.

Solution for now

It seems best to just use cmd under Windows. Maybe I'll create a pull request to update the documentation.

Future

I have two ideas for how one could mitigate this problem:

  1. One could implement direct output to a file, e.g. latexdiff --outfile=out.tex a.tex b.tex. I suspect this will be quite a bit of work to implement, though.
  2. Another (hypothetical) possibility is to modify the latexdiff.exe wrapper to fix the output. Not sure how complicated that will be.
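
In the spirit of these two ideas, one way to sidestep shell redirection is a small external wrapper that hands the child process a file opened in binary mode, so the bytes reach the disk without any console decoding. The following is a minimal Python sketch and not part of latexdiff: `run_to_file` is a hypothetical helper, and the demo uses a child Python process as a stand-in for latexdiff so that it runs without a TeX installation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_to_file(command, out_path):
    """Run command without a shell and write its raw stdout bytes to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)


# Stand-in demo: the child process emits the UTF-8 bytes for "你好".
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'\\xe4\\xbd\\xa0\\xe5\\xa5\\xbd')",
]
with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "diff.tex"
    run_to_file(child, out_file)
    written = out_file.read_bytes()

# Hypothetical real usage, assuming latexdiff is on PATH:
# run_to_file(["latexdiff", "old.tex", "new.tex"], "diff.tex")
```

Because the output is never decoded along the way, the file keeps whatever encoding latexdiff produced.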
