Footnotes give text a different color. #45

bertcoerver · 2019-12-30T13:30:44Z

Hello,

I'm having the following problem when converting books, related to (references to) footnotes.

For some of my converted books, each first footnote in a chapter causes all the preceding text in that chapter to have a different color from the default color. After the first footnote, the text has the default color and any subsequent footnotes in that chapter work as intended.

The original epub with a footnote:

The converted epub with all the text preceding the first footnote in the chapter in a different color:

pgaskin · 2020-01-04T20:18:55Z

I'm guessing this is due to a self-closing A tag (e.g. <a id="footnote1" />), which is against the spec, but some badly-designed books use it. When it is parsed by kepubify (or any other semi-strict parser), it parses it as an open tag only, and closes it in the last possible place.

I could add a workaround to kepubify, but that's almost outside the scope of kepubify. I'll see if the KEPUB reader accepts invalid void elements when I have time, and if it does, I'll also make kepubify do it as well.

pgaskin · 2020-01-12T04:00:36Z

This is what happens when it is parsed:

https://play.golang.org/p/2VHmxGqKm5P

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(`<!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test"/>test 2<p>test 3</p></body></html>`))
	html, _ := doc.Html()
	fmt.Println(html)
}

// output: <!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test">test 2<p>test 3</p></a></body></html>

I'm going to test it with the KEPUB renderer now.

Update: The KEPUB renderer doesn't seem to be as strict, so I'll see what I can do with fixing this.

pgaskin · 2020-01-12T04:29:58Z

I won't be fixing this issue immediately, as it requires restructuring how the parsing is done.

The issue is that kepubify always parses the HTML as HTML5, which means it follows HTML5 parsing rules, namely about treating self-closing tags which aren't void elements as if the closing doesn't exist (see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-non-void-html-element-start-tag-with-trailing-solidus). The KEPUB renderer either parses it according to the DOCTYPE or always as XHTML (I haven't checked yet).

To fix this problem, I have a few options:

Fork golang.org/x/net/html and allow self-closing a tags by not ignoring the /. This would also allow me to remove the ugly workaround for self-closing title tags.
Parse everything as XHTML. This is probably the least optimal solution, as it would prevent me from using goquery, and would also break proper HTML5 books, which are a lot more common.
Parse everything according to the DOCTYPE. This would work, but would result in either a lot of code duplication or a lot of abstraction (or embedding a C library like lxml, which I'd rather not do).
Fix this specific case with a regexp. I'd rather not add any more regexp-hacks to kepubify, as there are already enough as it is.

I'm most likely going to do either the first or the fourth option. I'll probably leave it for the next version, though, and improve the whole content manipulation code while I'm at it.

Update: I've decided to go with the first option. Here's the mostly working patch I have so far, but I still need to write tests for it and incorporate a few other changes.

diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
 
 func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
                        p.tokenizer.NextIsNotRawText()
                        return true
                case a.Script, a.Title:
+                       // MOD(geek1011): Allow title to be self-closing. See the mod below
+                       //     for A elements for more details. This is often found in XHTML
+                       //     EPUBs, where generators don't add any text to the title tag
+                       //     and use an XML renderer (which then self-closes it, which is
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
 
 func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
                        p.tokenizer.NextIsNotRawText()
                        return true
                case a.Script, a.Title:
+                       // MOD(geek1011): Allow title to be self-closing. See the mod below
+                       //     for A elements for more details. This is often found in XHTML
+                       //     EPUBs, where generators don't add any text to the title tag
+                       //     and use an XML renderer (which then self-closes it, which is
+                       //     ignored in the HTML5 spec, and results in everything after
+                       //     becoming the title when parsed using a compliant HTML5 parser).
+                       //
+                       //     Also allow script to be self-closing, as I've seen it in quite
+                       //     a few cases when HTML is generated using an XML encoder and
+                       //     it's a script[src].
+                       if p.lenientSelfClosing && p.hasSelfClosingToken {
+                               // Add the element, but immediately remove it from the stack (
+                               // this is necessary because it would usually be removed after
+                               // the raw text, which we're skipping).
+                               p.addElement()
+                               p.oe.pop()
+
+                               // This doesn't actually do anything here, but we acknowledge it
+                               // for consistency.
+                               p.acknowledgeSelfClosingTag()
+
+                               // There isn't an end tag for Title, so it considers it raw text
+                               // afterwards (even if we don't add the element) unless we tell
+                               // it not to consider it as such.
+                               p.tokenizer.NextIsNotRawText()
+
+                               return true
+                       }
+                       // END MOD
+
                        p.addElement()
                        p.setOriginalIM()
                        p.im = textIM
@@ -975,6 +1009,31 @@ func inBodyIM(p *parser) bool {
                                }
                        }
                        p.reconstructActiveFormattingElements()
+                       // MOD(geek1011): Allow A to be self-closing. THIS IS NOT SPEC-
+                       //     COMPLIANT, but won't cause any issues in basically any real-
+                       //     world case, as people don't go doing things like
+                       //     `<a href="/whatever" />link text</a>` and expect it to work (
+                       //     the spec says to ignore the self-closing on any non-void
+                       //     element). This fixes cases where people (or generators) do
+                       //     things like `<a id="aRandomID" />some more text` and expect
+                       //     it to be treated like XHTML/XML, where the self-closing would
+                       //     work.
+                       //
+                       //     Based on the case for void elements (br, wbr, input, ...),
+                       //     but only closes them if they have the self closing token (/>)
+                       //     rather than closing them and ignoring content in all cases.
+                       //
+                       //     This is done after reconstructActiveFormattingElements (but
+                       //     before addFormattingElement, which would be useless as there
+                       //     isn't any content in a self-closing A) to prevent breaking
+                       //     open formatting elements before the A tag.
+                       if p.lenientSelfClosing && p.hasSelfClosingToken {
+                               p.addElement()
+                               p.oe.pop()
+                               p.acknowledgeSelfClosingTag()
+                               break
+                       }
+                       // END MOD
                        p.addFormattingElement()
                case a.B, a.Big, a.Code, a.Em, a.Font, a.I, a.S, a.Small, a.Strike, a.Strong, a.Tt, a.U:
                        p.reconstructActiveFormattingElements()
@@ -1093,6 +1152,15 @@ func inBodyIM(p *parser) bool {
                case a.Caption, a.Col, a.Colgroup, a.Frame, a.Head, a.Tbody, a.Td, a.Tfoot, a.Th, a.Thead, a.Tr:
                        // Ignore the token.
                default:
+                       // MOD(geek1011): Allow span to be self-closing. See the mod above
+                       //     for A elements for more details.
+                       if p.lenientSelfClosing && p.tok.DataAtom == a.Span && p.hasSelfClosingToken {
+                               p.addElement()
+                               p.oe.pop()
+                               p.acknowledgeSelfClosingTag()
+                               break
+                       }
+                       // END MOD
                        p.reconstructActiveFormattingElements()
                        p.addElement()
                }
@@ -2331,6 +2399,15 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
        }
 }
 
+// ParseOptionLenientSelfClosing controls whether to allow additional non-void
+// elements to be self closing (<whatever ... />). This is mainly for better
+// compatibility with XHTML found in EPUBs. MOD(geek1011)
+func ParseOptionLenientSelfClosing(enable bool) ParseOption {
+       return func(p *parser) {
+               p.lenientSelfClosing = enable
+       }
+}
+
 // ParseWithOptions is like Parse, with options.
 func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
        p := &parser{

This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.

This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See #45 and #28 for a few examples.

pgaskin added bug epub issue Issue with the source epub undecided labels Jan 4, 2020

pgaskin removed the undecided label Jan 12, 2020

pgaskin mentioned this issue Jan 12, 2020

Improve HTML parsing and manipulation #47

Closed

pgaskin closed this as completed in 1282954 Jan 14, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Footnotes give text a different color. #45

Footnotes give text a different color. #45

bertcoerver commented Dec 30, 2019

pgaskin commented Jan 4, 2020 •

edited

Loading

pgaskin commented Jan 12, 2020 •

edited

Loading

pgaskin commented Jan 12, 2020 •

edited

Loading

Footnotes give text a different color. #45

Footnotes give text a different color. #45

Comments

bertcoerver commented Dec 30, 2019

pgaskin commented Jan 4, 2020 • edited Loading

pgaskin commented Jan 12, 2020 • edited Loading

pgaskin commented Jan 12, 2020 • edited Loading

pgaskin commented Jan 4, 2020 •

edited

Loading

pgaskin commented Jan 12, 2020 •

edited

Loading

pgaskin commented Jan 12, 2020 •

edited

Loading