Apostrophes wreak havoc on hyphenation #283

alerque · 2016-03-10T09:05:48Z

This is related to #265, but that case is somewhat specific to a language anomaly and needing a way to set up exceptions. But there is a more general problem.

Basically any time apostrophes get involved everything goes to pot. Interestingly, Unicode right single quotation marks fail in a different way than straight apostrophes. Here is an MWE:

\begin[papersize=a6]{document}
\font[family=Libertinus Serif,size=9pt,language=tr]
Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın Rab'bimize Tanrı'nın

Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın Rab’bimize Tanrı’nın

Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının Rabbimize Tanrının
\end{document}

This is especially puzzling to me because there is no shortage of hyphenation points in either of these words, nor are they different for the different quote styles:

> showHyphenationPoints("Tanrı’nın", "tr")
Tan-rı-’-nın
> showHyphenationPoints("Tanrı'nın", "tr")
Tan-rı-'-nın
> showHyphenationPoints("Rab’bimize", "tr")
Ra-b-’-bi-mize
> showHyphenationPoints("Rab'bimize", "tr")
Ra-b-'-bi-mize

simoncozens · 2016-03-23T10:58:03Z

This is actually essentially the same bug as #265. The problem is about how text is being segmented; SILE isn't seeing "words" any more, only nodes after the text has been shaped by Harfbuzz and then formed into nodes by unicode.lua on the basis of each character's Unicode Character Database linebreak type. Quotation marks have a break type of "qu" and so are being formed into their own token by line 58 of unicode.lua.
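
The segmentation described above can be sketched roughly as follows. This is Python rather than SILE's Lua, and it is only an illustration: the `LINEBREAK` table is a hand-written stand-in for the Unicode Character Database lookup, and `linebreak_class` and `split_by_class` are hypothetical names, not functions from unicode.lua.

```python
# Hand-written stand-in for the Unicode linebreak property (assumption:
# SILE reads the real values from its character data tables).
LINEBREAK = {"'": "qu", "\u2019": "qu"}


def linebreak_class(ch):
    # Anything not listed is treated as an ordinary alphabetic character.
    return LINEBREAK.get(ch, "al")


def split_by_class(text):
    """Start a new token whenever the linebreak class changes."""
    tokens = []
    last = None
    for ch in text:
        cls = linebreak_class(ch)
        if tokens and cls == last:
            tokens[-1] += ch  # same class: extend the current token
        else:
            tokens.append(ch)  # class changed: begin a new token
        last = cls
    return tokens


print(split_by_class("Tanr\u0131\u2019n\u0131n"))
```

With this toy table the word comes out as three pieces (the stem, the quotation mark on its own, and the suffix), so a hyphenator that is handed one piece at a time never sees the whole word.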

> SU.concat(makenodes("Tanrı’nın", { language = "tr" }))
"N<22.28515625pt>^6.15234375-0.146484375v(Tanrı)N<2.2900390625pt>^7.1044921875-0v(’)N<13.7451171875pt>^4.6875-0v(nın)"

Maybe we should pass through quotes just like we do with combining marks. I have a patch which would potentially do this, but I am worried that it would break other languages by allowing strange hyphenation points:

> showHyphenationPoints("sti'cious", "en")
"sti'-cious"

alerque · 2016-03-29T09:23:46Z

Can you push that to a test branch? I'd be interested in trying such a change. I have an idea we might need to look at some language-specific tweaks in this department.

Another day, another book, another issue that I think is related to this: my verse references are breaking on figure/en-dashes; e.g.:

Deneme olarak metin burada (İbraniler 5:9; 6:1–
7) ve daha fazla metin.

I have a non-breaking space after the book name, but I would like to inhibit (or at least severely penalize) breaks before and after – in the same way it doesn't break on : or ). My preference would be to encourage a break on the space between ; 6 or after the parenthesis, but not in the middle of the number range.

Unfortunately this isn't quite standard Turkish usage. According to the official Turkish Language Institute, Turkish doesn't have a figure dash or en-dash at all. It has either a hyphen or a long (em) dash. However, using hyphens for everything -including insertions like this one- quickly gets messy, and many publishers use a figure dash as well. I'm one of those who think the figure dash adds something over using the same glyph used for hyphenation, ergo I'd like to be able to define rules for its usage that inhibit it being a line breaking character. This seems to me to be about the same problem as I'm having with apostrophes, only in reverse.

simoncozens · 2016-04-10T01:40:04Z

I'm not going to push it to a test branch, because it will break English, but here's the patch:

diff --git a/languages/unicode.lua b/languages/unicode.lua
index 7bc7aae..4c63384 100644
--- a/languages/unicode.lua
+++ b/languages/unicode.lua
@@ -55,7 +55,7 @@ SILE.nodeMakers.unicode = SILE.nodeMakers.base {
       self:addToken(char,item)
       self:makeToken()
       self:makePenalty(0)
-    elseif lasttype and (thistype ~= lasttype and thistype ~= "cm") then
+    elseif lasttype and (thistype ~= lasttype and thistype ~= "cm" and thistype ~= "qu") then
       self:makeToken()
       self:addToken(char,item)
     else
@@ -65,7 +65,7 @@ SILE.nodeMakers.unicode = SILE.nodeMakers.base {
       end
       self:addToken(char,item)
     end
-    if thistype ~= "cm" then lasttype = chardata[cp] and chardata[cp].linebreak end
+    if thistype ~= "cm" and thistype ~= "qu" then lasttype = chardata[cp] and chardata[cp].linebreak end
   end,
   iterator = function (self, items)
     self:init()
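
For readers who do not want to trace the Lua, here is a rough Python sketch of what the two changed conditions do. It is not SILE code: `make_tokens`, `linebreak_class`, and the tiny `LINEBREAK` table are hypothetical stand-ins, and the other branches of the real node maker (such as the one that emits a penalty) are left out.

```python
# Toy linebreak table (assumption: the real values come from the Unicode
# Character Database). U+0301 is a combining acute accent.
LINEBREAK = {"'": "qu", "\u2019": "qu", "\u0301": "cm"}


def linebreak_class(ch):
    return LINEBREAK.get(ch, "al")


def make_tokens(text, transparent=("cm",)):
    """Split text on linebreak-class changes, ignoring 'transparent' classes."""
    tokens, current, lasttype = [], "", None
    for ch in text:
        thistype = linebreak_class(ch)
        if lasttype and thistype != lasttype and thistype not in transparent:
            # Class changed: close the current token and start a new one.
            if current:
                tokens.append(current)
            current = ch
        else:
            current += ch
        # Transparent characters do not update lasttype, so the characters
        # after them are still compared with what came before.
        if thistype not in transparent:
            lasttype = thistype
    if current:
        tokens.append(current)
    return tokens


word = "Tanr\u0131\u2019n\u0131n"
print(make_tokens(word))                            # before the patch
print(make_tokens(word, transparent=("cm", "qu")))  # after the patch
```

With only "cm" transparent the word splits into three tokens; adding "qu" keeps it as a single token, which is also why an English string such as "sti'cious" would then be hyphenated as one word.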

alerque · 2016-05-27T13:52:31Z

Hey that looks more like it!

Is there a general way to control which side of the apostrophe to break on? I'd typically like to keep it on the trailing line before the hyphen instead of bumping it to the new line, but since there are break points on both sides of the apostrophe, it does whatever is best for the word spacing, and sometimes they stay and sometimes they don't.

Travis doesn't (yet) have the Libertinus fork version of the font used in this test. Also, in order to trigger the widest range of failures possible, the original test used a very specific page and font size combination. This adaptation allows it to work on A5 paper by setting the corresponding font size that trips up the most break points.

r12a · 2018-06-18T15:18:03Z

I'm wondering, given that this seems to be a small number of words in Turkish, whether another possible solution is to use the Unicode WORD JOINER character?

alerque · 2018-06-18T19:02:17Z

@r12a The list of possible words is by no means small (I could find hundreds of words to use as examples), and one of the principles I'm working from is that clean source text shouldn't need special treatment to be typeset. Obviously in the case of language exceptions having control characters like that might be acceptable, but we're not talking about exceptions here; this is the rule.

r12a · 2018-06-19T09:06:09Z

Ok, thanks for clarifying.

simoncozens closed this as completed in 9d6691f May 27, 2016
