Apostrophes wreak havoc on hyphenation #283

Closed
alerque opened this issue Mar 10, 2016 · 7 comments

@alerque
Member

alerque commented Mar 10, 2016

This is related to #265, but that case is somewhat specific to a language anomaly and the need for a way to set up exceptions. But there is a more general problem.

Basically, any time apostrophes get involved, everything goes to pot. Interestingly, Unicode right single quotation marks fail in a different way than straight apostrophes. Here is an MWE:

\begin[papersize=a6]{document}
\font[family=Libertinus Serif,size=9pt,language=tr]
Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın

Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın

Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının
\end{document}

[image: selection_183]

This is especially puzzling to me because there is no shortage of hyphenation points in either of these words, nor do they differ between the quote styles:

> showHyphenationPoints("Tanrı’nın", "tr")
Tan-rı-’-nın
> showHyphenationPoints("Tanrı'nın", "tr")
Tan-rı-'-nın
> showHyphenationPoints("Rab’bimize", "tr")
Ra-b-’-bi-mize
> showHyphenationPoints("Rab'bimize", "tr")
Ra-b-'-bi-mize
@simoncozens
Member

This is actually essentially the same bug as #265. The problem is about how text is being segmented; SILE isn't seeing "words" any more, only nodes after the text has been shaped by Harfbuzz and then formed into nodes by unicode.lua on the basis of each character's Unicode Character Database linebreak type. Quotation marks have a break type of "qu" and so are being formed into their own token by line 58 of unicode.lua.

> SU.concat(makenodes("Tanrı’nın", { language = "tr" }))
"N<22.28515625pt>^6.15234375-0.146484375v(Tanrı)N<2.2900390625pt>^7.1044921875-0v(’)N<13.7451171875pt>^4.6875-0v(nın)"

Maybe we should pass through quotes just like we do with combining marks. I have a patch which would potentially do this, but I am worried that it would break other languages by allowing strange hyphenation points:

> showHyphenationPoints("sti'cious", "en")
"sti'-cious"

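The segmentation described above, and the effect of passing quotes through, can be modelled in a few lines of Python. This is only a sketch under stated assumptions, not SILE's code: `linebreak_class` is a hypothetical stand-in for the Unicode Character Database lookup and knows only the handful of classes needed here, and the `passthrough` parameter is an invented knob that mimics treating "qu" the same way as "cm".

```python
import unicodedata


def linebreak_class(ch):
    """Rough stand-in for the UCD Line_Break property (assumption, not real UCD data)."""
    if ch in "'\u2019":          # apostrophe and right single quotation mark are class QU
        return "qu"
    if unicodedata.combining(ch):
        return "cm"              # combining marks
    if ch.isspace():
        return "sp"
    return "al"                  # everything else is treated as alphabetic


def segment(text, passthrough=("cm",)):
    """Split text into tokens whenever the line-break class changes.

    Classes listed in `passthrough` never start a new token and never
    update the remembered class, so they stick to the preceding text.
    """
    tokens, current, lasttype = [], "", None
    for ch in text:
        thistype = linebreak_class(ch)
        if thistype == "sp":
            # Spaces end the current token and reset the state.
            if current:
                tokens.append(current)
            current, lasttype = "", None
            continue
        if lasttype is not None and thistype != lasttype and thistype not in passthrough:
            tokens.append(current)
            current = ch
        else:
            current += ch
        if thistype not in passthrough:
            lasttype = thistype
    if current:
        tokens.append(current)
    return tokens


# Quotes form their own token, so the hyphenator never sees a whole word:
print(segment("Tanr\u0131\u2019n\u0131n"))
# Passing "qu" through keeps the word together:
print(segment("Tanr\u0131\u2019n\u0131n", passthrough=("cm", "qu")))
```

With the default setting the word comes out as three tokens (`Tanrı`, `’`, `nın`), which matches the three nodes in the `makenodes` output above. With "qu" passed through it stays as one token that a hyphenator could then treat as a single word.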
@alerque
Member Author

alerque commented Mar 29, 2016

Can you push that to a test branch? I'd be interested in trying such a change. I have an idea we might need to look at some language-specific tweaks in this department.

Another day, another book, another issue that I think is related to this: my verse references are breaking on figure/en-dashes; e.g.:

Deneme olarak metin burada (İbraniler 5:9; 6:1–
7) ve daha fazla metin.

I have a non-breaking space after the book name, but I would like to inhibit (or at least severely penalize) breaks before and after in the same way it doesn't break on : or ). My preference would be to encourage a break on the space between ; 6 or after the parenthesis, but not in the middle of the number range.

Unfortunately this isn't quite standard Turkish usage. According to the official Turkish Language Institute people, Turkish doesn't have a figure dash or en-dash at all; it has either a hyphen or a long (em) dash. However, using hyphens for everything -including insertions like this one- quickly gets messy, and many publishers use a figure dash as well. I'm one of those who think the figure dash adds something over using the same glyph used for hyphenation, so I'd like to be able to define rules for its usage that inhibit it from being a line-breaking character. This seems to me to be about the same problem as I'm having with apostrophes, only in reverse.
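As a rough illustration of the rule being asked for, the Python sketch below lists where a line may break in a verse reference, with and without breaks after a figure dash or en-dash. It is a toy model under stated assumptions: `break_opportunities` and its `inhibit_dash_breaks` flag are hypothetical names, only plain spaces and the two dashes are considered, and real line breaking (in SILE or in the Unicode line breaking algorithm) involves many more rules and penalties rather than a yes/no decision.

```python
DASHES = "\u2012\u2013"  # figure dash and en-dash


def break_opportunities(text, inhibit_dash_breaks=False):
    """Return every index i at which a line could break before text[i].

    A break is allowed after a plain space (not after U+00A0, the
    no-break space) and, unless inhibited, after a figure dash or en-dash.
    """
    breaks = []
    for i in range(1, len(text)):
        prev = text[i - 1]
        if prev == " ":
            breaks.append(i)
        elif prev in DASHES and not inhibit_dash_breaks:
            breaks.append(i)
    return breaks


# The reference from the example, with a no-break space after the book name.
ref = "(\u0130braniler\u00a05:9; 6:1\u20137)"

print(break_opportunities(ref))                            # [16, 20]
print(break_opportunities(ref, inhibit_dash_breaks=True))  # [16]
```

Index 16 is the position after "; " and index 20 is the position after the en-dash, which is the break seen in the sample output. With the dash break inhibited, only the preferred break after the space remains.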

@simoncozens
Member

I'm not going to push it to a test branch, because it will break English, but here's the patch:

diff --git a/languages/unicode.lua b/languages/unicode.lua
index 7bc7aae..4c63384 100644
--- a/languages/unicode.lua
+++ b/languages/unicode.lua
@@ -55,7 +55,7 @@ SILE.nodeMakers.unicode = SILE.nodeMakers.base {
       self:addToken(char,item)
       self:makeToken()
       self:makePenalty(0)
-    elseif lasttype and (thistype ~= lasttype and thistype ~= "cm") then
+    elseif lasttype and (thistype ~= lasttype and thistype ~= "cm" and thistype ~= "qu") then
       self:makeToken()
       self:addToken(char,item)
     else
@@ -65,7 +65,7 @@ SILE.nodeMakers.unicode = SILE.nodeMakers.base {
       end
       self:addToken(char,item)
     end
-    if thistype ~= "cm" then lasttype = chardata[cp] and chardata[cp].linebreak end
+    if thistype ~= "cm" and thistype ~= "qu" then lasttype = chardata[cp] and chardata[cp].linebreak end
   end,
   iterator = function (self, items)
     self:init()

@alerque
Member Author

alerque commented May 27, 2016

Hey that looks more like it!

Is there a general way to control which side of the apostrophe to break on? I'd typically like to keep it on the trailing line before the hyphen instead of bumping it to the new line, but since there are break points on both sides of the apostrophe, it does whatever is best for the word spacing: sometimes they stay and sometimes they don't.
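One way to picture the behaviour asked for here is as a filter over the hyphenation points: drop the break point that sits immediately before an apostrophe, so the apostrophe can only end a line and never start one. The Python sketch below works on strings in the same form `showHyphenationPoints` prints. It is purely illustrative: `keep_apostrophe_on_first_line` is a hypothetical helper, not an existing SILE function, and it would also swallow a literal hyphen that happens to precede an apostrophe in the text.

```python
APOSTROPHES = ("'", "\u2019")  # straight apostrophe and right single quotation mark


def keep_apostrophe_on_first_line(hyphenated, marker="-"):
    """Remove the break point directly before an apostrophe.

    `hyphenated` is a word with `marker` at each candidate break,
    for example "Tan-rı-’-nın". The break after the apostrophe is kept.
    """
    for apostrophe in APOSTROPHES:
        hyphenated = hyphenated.replace(marker + apostrophe, apostrophe)
    return hyphenated


print(keep_apostrophe_on_first_line("Tan-r\u0131-\u2019-n\u0131n"))  # Tan-rı’-nın
print(keep_apostrophe_on_first_line("Ra-b-'-bi-mize"))               # Ra-b'-bi-mize
```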

alerque added a commit to alerque/sile that referenced this issue May 30, 2016
Travis doesn't (yet) have the Libertinus fork version of the font used
in this test. Also in order to trigger the widest range of failures
possible the original test used a very specific page and font size
combination. This adaptation allows it to work on A5 paper by setting the
corresponding font size that trips up the most break points.
@r12a

r12a commented Jun 18, 2018

I'm wondering, given that this seems to be a small number of words in Turkish, whether another possible solution is to use the Unicode WORD JOINER character?

@alerque
Member Author

alerque commented Jun 18, 2018

@r12a The list of possible words is by no means small (I could find hundreds of words to use as examples), and one of the principles I'm working from is that clean source text shouldn't need special treatment to be typeset. Obviously in the case of language exceptions having control characters like that might be acceptable, but we're not talking about exceptions here: this is the rule.

@r12a

r12a commented Jun 19, 2018

Ok, thanks for clarifying.
