Punctuation merge at the end of citation suffix doesn’t work with unicode last characters #33
Comments
We're not trying to detect final punctuation (which we could do in a unicode-friendly way using What about curly quotes? Yes, in general, we want a period, since there is no way of knowing ahead of time whether we have e.g. just a quoted word; also in some styles we allow dots after quotes. There is a different part of the code that is responsible for moving periods inside quotes (when that is desired). This code knows to check inside |
Some possible solutions:
Thank you, John, for the further explanation and the suggestions!
I have checked the summary table on quotation marks on Wikipedia, and Hebrew seems to be the only language that uses „ (U+201E) or ‚ (U+201A) on the right side of a quotation; however, Hebrew is written from right to left. So treating those characters as opening a quotation shouldn’t be wrong in any case. And yes, I am primarily thinking of German and its primary quotation marks.
As they don’t conflict, I would prefer that they were always enabled, because I am not sure whether pandoc reads the babel option from the LaTeX preamble. At least in LibreOffice, all German text is flagged by the spellchecker, so the document seems to use English as its default language.
So always accepting those combinations should do the trick, as this new MWE with another CSL shows with yesterday’s nightly build. You can tell from the plain text output, which I chose for readability, that it works in the case where pandoc recognizes the quotation. With the CSL I used before for my test, it somehow didn’t work in this case either, so I thought that wouldn’t be of any use.
This doesn’t have an additional dot.\cite[See][3, and this is important!]{introduction}
But this has.\cite[See][3: „some words from the citation.“]{introduction}
This works again as expected.\cite[See][3: “some words from the citation.”]{introduction}
pandoc -f latex -C --bibliography pandoc_test.bib --csl footnote_style.csl -t plain
(By the way, you can tell from the output that it treats the colon as part of the page number … In BibLaTeX, I can specify the page number part with \pnfmt and wrap the rest in a \passifpages command, but I haven’t found a way here.)
pandoc_test.bib
@incollection{introduction,
author = {Emil Editor},
crossref = {collection},
pages = {3–17},
title = {Introduction to the Essays}}
@collection{collection,
address = {Edinburgh},
editor = {Herbert Herausgeber and Emil Editor},
title = {The Ultimate \TeX nic Bibliographer},
year = {2000}}
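As a side note on the claim that „ (U+201E) and ‚ (U+201A) can safely be treated as opening marks: the Unicode character database already separates these two from the other curly quotes. The following is only a rough sketch of that observation, not anything pandoc does; the function name `is_unambiguous_opener` and the `THREAD_QUOTES` set are made up for illustration.

```python
import unicodedata

# Quotation characters discussed in this thread (illustrative selection only).
THREAD_QUOTES = {"\u201e", "\u201a", "\u201c", "\u201d", "\u2018", "\u2019"}


def is_unambiguous_opener(ch: str) -> bool:
    """Return True for thread quotes whose Unicode category is Ps (open punctuation).

    The low-9 marks U+201E and U+201A are category Ps, while the other curly
    quotes are Pi or Pf, whose role depends on the language (for example,
    U+201C opens a quotation in English but closes one in German).
    """
    return ch in THREAD_QUOTES and unicodedata.category(ch) == "Ps"


if __name__ == "__main__":
    for ch in sorted(THREAD_QUOTES):
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_unambiguous_opener(ch))
```

Under that assumption, only the two low-9 marks can be treated as openers without knowing the language; every other mark needs language information.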
This is likely something that can be fixed in the odt/opendocument writer. If you use
Why do you say so? I don't think it is.
See the manual here:
Thank you for your help in narrowing the focus of this issue, John! 🤓
So, the first part of the reply is about my suggestion to parse
I understand that you might want pandoc to be strict here. However, typing UTF-8 quotes doesn’t automatically prompt the question “Have I selected the correct language for these quotes in the metadata or for this sentence?”, so I would still appreciate it if pandoc could just parse non-conflicting quotation marks in all languages. At least when a certain language is selected in the metadata, all corresponding Unicode quotation marks should be recognized as quotation elements, I suggest. There is an appropriate summary table on quotation marks on Wikipedia.
Actually, if you want to be strict, English curly quotation marks shouldn’t be parsed as quotation environments then. But, as I explained above, I prefer a more relaxed approach here 😉
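To make the suggestion about language-dependent recognition concrete, here is a minimal sketch of what "recognize the quotation marks that belong to the selected language" could look like. It is only an illustration under my own assumptions: the `QUOTE_PAIRS` table and the `quoted_spans` function are hypothetical and cover just German and English, and none of this is how pandoc's readers are actually implemented.

```python
# Hypothetical table of (opening, closing) pairs per base language tag.
QUOTE_PAIRS = {
    "de": [("\u201e", "\u201c"), ("\u201a", "\u2018")],  # „…“ and ‚…‘
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],  # “…” and ‘…’
}


def quoted_spans(text: str, lang: str) -> list:
    """Return the text between each opening mark and the next closing mark.

    Only the pairs registered for the base language of `lang` are considered,
    so "de-DE" uses the "de" pairs. Unknown languages yield no spans.
    """
    base = lang.split("-")[0].lower()
    spans = []
    for opener, closer in QUOTE_PAIRS.get(base, []):
        start = text.find(opener)
        while start != -1:
            end = text.find(closer, start + 1)
            if end == -1:
                break  # unmatched opener: stop looking for this pair
            spans.append(text[start + 1:end])
            start = text.find(opener, end + 1)
    return spans


if __name__ == "__main__":
    print(quoted_spans("Werden „deutsche Anführungszeichen“ akzeptiert?", "de-DE"))
    print(quoted_spans("And what about “English parts”?", "en-GB"))
```

With this strict per-language table, German marks in a text tagged as English are simply not recognized, which is the trade-off between the strict and the relaxed approach discussed above.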
No, it doesn’t include
MWE
% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[ngerman,british]{babel}
\begin{document}
Wird der Text als Deutsch markiert? Werden „deutsche Anführungszeichen“ akzeptiert? \foreignlanguage{british}{And what about “English parts”?}
\end{document}
pandoc -f latex --metadata lang=de-DE -o test.odt
I say so because if it didn’t treat the colon as part of the page number, imho it should put out in the MWE from my last comment:
I have tried with a MWE and pandoc doesn’t parse the |
This is really style-dependent. I don't know about the csl you're using (it's not in the main repository), but I tried your example with both |
Hi, thank you! Actually, I am using I would like to come back to the solutions you suggested for the main topic of this issue in your comment https://github.com/jgm/pandoc/issues/6879#issuecomment-732318725:
I have checked the summary table on quotation marks on Wikipedia and Hebrew seems to be the only language that uses However, as the MWE below shows, the parsing has to be strictly language specific, because otherwise, e. g. an input of
Yes, I am primarily thinking of German, with the primary quotation marks
MWE
I am using the current version,
% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[british,ngerman]{babel}
\begin{document}
Verwendung „deutscher“ Anführungszeichen.
Usage of “English” quotation marks.
\foreignlanguage{ngerman}{Verwendung „deutscher“ Anführungszeichen im deutschen Kontext}
\foreignlanguage{ngerman}{Usage of “English” quotation marks in German context.}
\end{document}
Run through the command
I would expect, if quotation marks were treated strictly language-specific: (Again, I have set the relevant parts in bold and italicised the beginnings of the language environments.)
This would, however, also need a fix of the Quotation writers, because running this expected output back through
Notice how in the back conversion, even in the explicitly German context, the
If this route is chosen, this issue becomes “Reading and Writing of Non-English Curly Quotation Marks”. But you have suggested another solution in the same comment, https://github.com/jgm/pandoc/issues/6879#issuecomment-732318725:
This would be fine also and maybe a much simpler (intermediate) solution, as no full handling of reading and writing of international curly quotation marks has to be supported. In German, possible closing quotation marks are |
One note: the default de-DE locale for citeproc has |
I am using a custom CSL and I am quite free in how I style my citations, as long as the style doesn’t change within the publication. Generally, I am trying to make an exact duplicate of the output of the BibLaTeX footnote-dw style.
I have lots of citations where I give the original (English) text in the footnote, so these are full sentences. When the end of the sentence with its full stop is part of the direct citation, I want the full stop to be inside the curly quotes, and I don’t want another dot after the curly quotes on those occasions. Only when I cite a single word or a phrase that isn’t at the end of the sentence do I leave out the full stop myself and rely on the footnote style to add it at the end of the suffix.
Having a look at that
% !TEX encoding = UTF-8 Unicode
\documentclass{article}
\usepackage[style=footnote-dw]{biblatex}
\addbibresource{mwe.bib}
\begin{document}
This is where I would like to have the full stop inside the curly quotes, also in German.\cite[See][3, consider especially: “It can be difficult to deal with full stops and curly quotes.”]{introduction}
This is where the full stop should go at the very end.\cite[See][4, “different situation”]{introduction}
\end{document}
Running this with LuaLaTeX and Biber results in the following text in the resulting PDF file:
Running it through
I begin to realize that I will probably just have to go through all footnotes manually and do those adjustments … Maybe I’ll do an intermediate step with a text format so I can scope regex substitutions to the footnotes.
You shouldn't hope for exact duplication of what biblatex does.
Yes, I understand. I’m experimenting a little bit with
Is there something further I can do to help you with that?
Actually I see now that the issue comes up in citeproc itself, not in anything pandoc does.
Hi, out of gratitude for this great piece of software I have tried a little to understand how Haskell works, but I’m still not very familiar with it … I think this is the piece of code that decides whether a final dot should be added to a citation, isn’t it? Is it possible that it doesn’t catch cases where the final character is not an ASCII character but a Unicode one?
https://github.com/jgm/pandoc/blob/68b298ed9aee405033da9a2b44ae86f2241a123d/src/Text/Pandoc/Citeproc.hs#L394-L405
I think the merging of punctuation at the end of a citation suffix doesn’t work when the last character is a Unicode character; it does work with ASCII characters. Maybe that (d:c:_) doesn’t allow _ to be a Unicode character? At least in the LuaLaTeX/BibLaTeX/Biber pipeline I don’t get extra dots after footnote citations ending with a curly quote character.
It is difficult to provide a small MWE, because it depends on the CSL: the standard output is in parentheses, without a dot suffix. For the MWE, I will use the CSL provided at https://www.zotero.org/styles?q=id%3Auniversitat-freiburg-geschichte and call the file footnote_style.csl.
For testing purposes, I am running the following LaTeX code through the following command, using yesterday’s nightly build, because in the current version, 2.11.2, there would even be brackets around the suffix, as resolved in 9a40976 and 7db2cf5.
I am getting this output …
The important aspect is that the second suffix is taken as
,Str "\8222",Str "some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation",Str ".",Str "\8220."
I would expect there not to be a dot after \8220, parallel to the way BibLaTeX treats the situation:
,Str "\8222",Str "some",Space,Str "words",Space,Str "from",Space,Str "the",Space,Str "citation",Str ".",Str "\8220"
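To spell out the behaviour I would expect, here is a small sketch of the rule in Python. It is not the code in Citeproc.hs and not how citeproc decides this; the set `CLOSING_QUOTES`, the set `TERMINAL_PUNCTUATION`, and the function `needs_final_period` are names I made up, and the choice of which characters count as closing quotes is an assumption.

```python
# Assumption: characters treated as closing quotes for this illustration.
# U+201C is included because it closes a quotation in German („…“).
CLOSING_QUOTES = {"\u201c", "\u201d", "\u2018", "\u2019", '"', "'", "\u00ab", "\u00bb"}

# Punctuation that already ends a sentence.
TERMINAL_PUNCTUATION = {".", "!", "?"}


def needs_final_period(suffix: str) -> bool:
    """Return True if a period should still be appended after the suffix.

    Trailing closing quotes (ASCII or Unicode) are skipped first, so a suffix
    ending in a full stop followed by a curly quote counts as already
    terminated.
    """
    stripped = suffix.rstrip()
    while stripped and stripped[-1] in CLOSING_QUOTES:
        stripped = stripped[:-1]
    if not stripped:
        return True  # nothing but quotes or whitespace: no terminator found
    return stripped[-1] not in TERMINAL_PUNCTUATION


if __name__ == "__main__":
    print(needs_final_period("3: \u201esome words from the citation.\u201c"))
    print(needs_final_period("3, and this is important!"))
    print(needs_final_period("4, \u201cdifferent situation\u201d"))
```

Because Python strings are sequences of code points, the check on the last character treats U+201C exactly like an ASCII quote, which is the behaviour I was hoping for in the suffix handling.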
pandoc_test.bib