Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Footnotes give text a different color. #45

Closed
bertcoerver opened this issue Dec 30, 2019 · 3 comments
Closed

Footnotes give text a different color. #45

bertcoerver opened this issue Dec 30, 2019 · 3 comments
Labels
bug epub issue Issue with the source epub

Comments

@bertcoerver
Copy link

Hello,

I'm having the following problem when converting books, related to (references to) footnotes.

For some of my converted books, each first footnote in a chapter causes all the preceding text in that chapter to have a different color from the default color. After the first footnote, the text has the default color and any subsequent footnotes in that chapter work as intended.

The original epub with a footnote:

Screenshot 2019-12-30 at 14 27 01

The converted epub with all the text preceding the first footnote in the chapter in a different color:

Screenshot 2019-12-30 at 14 27 39

@pgaskin
Copy link
Owner

pgaskin commented Jan 4, 2020

I'm guessing this is due to a self-closing A tag (e.g. <a id="footnote1" />), which is against the spec, but some badly-designed books use it. When it is parsed by kepubify (or any other semi-strict parser), it parses it as an open tag only, and closes it in the last possible place.

I could add a workaround to kepubify, but that's almost outside the scope of kepubify. I'll see if the KEPUB reader accepts invalid void elements when I have time, and if it does, I'll also make kepubify do it as well.

@pgaskin pgaskin added bug epub issue Issue with the source epub undecided labels Jan 4, 2020
@pgaskin
Copy link
Owner

pgaskin commented Jan 12, 2020

This is what happens when it is parsed:

https://play.golang.org/p/2VHmxGqKm5P

package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(`<!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test"/>test 2<p>test 3</p></body></html>`))
	html, _ := doc.Html()
	fmt.Println(html)
}

// output: <!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test">test 2<p>test 3</p></a></body></html>

I'm going to test it with the KEPUB renderer now.

Update: The KEPUB renderer doesn't seem to be as strict, so I'll see what I can do with fixing this.

@pgaskin pgaskin removed the undecided label Jan 12, 2020
@pgaskin
Copy link
Owner

pgaskin commented Jan 12, 2020

I won't be fixing this issue immediately, as it requires restructuring how the parsing is done.

The issue is that kepubify always parses the HTML as HTML5, which means it follows HTML5 parsing rules, namely about treating self-closing tags which aren't void elements as if the closing doesn't exist (see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-non-void-html-element-start-tag-with-trailing-solidus). The KEPUB renderer either parses it according to the DOCTYPE or always as XHTML (I haven't checked yet).

To fix this problem, I have a few options:

  • Fork golang.org/x/net/html and allow self-closing a tags by not ignoring the /. This would also allow me to remove the ugly workaround for self-closing title tags.
  • Parse everything as XHTML. This is probably the least optimal solution, as it would prevent me from using goquery, and would also break proper HTML5 books, which are a lot more common.
  • Parse everything according to the DOCTYPE. This would work, but would result in either a lot of code duplication or a lot of abstraction (or embedding a C library like lxml, which I'd rather not do).
  • Fix this specific case with a regexp. I'd rather not add any more regexp-hacks to kepubify, as there are already enough as it is.

I'm most likely going to do either the first or the fourth option. I'll probably leave it for the next version, though, and improve the whole content manipulation code while I'm at it.


Update: I've decided to go with the first option. Here's the mostly working patch I have so far, but I still need to write tests for it and incorporate a few other changes.

diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
 
 func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
                        p.tokenizer.NextIsNotRawText()
                        return true
                case a.Script, a.Title:
+                       // MOD(geek1011): Allow title to be self-closing. See the mod below
+                       //     for A elements for more details. This is often found in XHTML
+                       //     EPUBs, where generators don't add any text to the title tag
+                       //     and use an XML renderer (which then self-closes it, which is
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
        // context is the context element when parsing an HTML fragment
        // (section 12.4).
        context *Node
+       // lenientSelfClosing controls whether to allow additional non-void elements
+       // to be self closing (<whatever ... />). This is mainly for better
+       // compatibility with XHTML found in EPUBs. MOD(geek1011)
+       lenientSelfClosing bool
 }
 
 func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
                        p.tokenizer.NextIsNotRawText()
                        return true
                case a.Script, a.Title:
+                       // MOD(geek1011): Allow title to be self-closing. See the mod below
+                       //     for A elements for more details. This is often found in XHTML
+                       //     EPUBs, where generators don't add any text to the title tag
+                       //     and use an XML renderer (which then self-closes it, which is
+                       //     ignored in the HTML5 spec, and results in everything after
+                       //     becoming the title when parsed using a compliant HTML5 parser).
+                       //
+                       //     Also allow script to be self-closing, as I've seen it in quite
+                       //     a few cases when HTML is generated using an XML encoder and
+                       //     it's a script[src].
+                       if p.lenientSelfClosing && p.hasSelfClosingToken {
+                               // Add the element, but immediately remove it from the stack (
+                               // this is necessary because it would usually be removed after
+                               // the raw text, which we're skipping).
+                               p.addElement()
+                               p.oe.pop()
+
+                               // This doesn't actually do anything here, but we acknowledge it
+                               // for consistency.
+                               p.acknowledgeSelfClosingTag()
+
+                               // There isn't an end tag for Title, so it considers it raw text
+                               // afterwards (even if we don't add the element) unless we tell
+                               // it not to consider it as such.
+                               p.tokenizer.NextIsNotRawText()
+
+                               return true
+                       }
+                       // END MOD
+
                        p.addElement()
                        p.setOriginalIM()
                        p.im = textIM
@@ -975,6 +1009,31 @@ func inBodyIM(p *parser) bool {
                                }
                        }
                        p.reconstructActiveFormattingElements()
+                       // MOD(geek1011): Allow A to be self-closing. THIS IS NOT SPEC-
+                       //     COMPLIANT, but won't cause any issues in basically any real-
+                       //     world case, as people don't go doing things like
+                       //     `<a href="/whatever" />link text</a>` and expect it to work (
+                       //     the spec says to ignore the self-closing on any non-void
+                       //     element). This fixes cases where people (or generators) do
+                       //     things like `<a id="aRandomID" />some more text` and expect
+                       //     it to be treated like XHTML/XML, where the self-closing would
+                       //     work.
+                       //
+                       //     Based on the case for void elements (br, wbr, input, ...),
+                       //     but only closes them if they have the self closing token (/>)
+                       //     rather than closing them and ignoring content in all cases.
+                       //
+                       //     This is done after reconstructActiveFormattingElements (but
+                       //     before addFormattingElement, which would be useless as there
+                       //     isn't any content in a self-closing A) to prevent breaking
+                       //     open formatting elements before the A tag.
+                       if p.lenientSelfClosing && p.hasSelfClosingToken {
+                               p.addElement()
+                               p.oe.pop()
+                               p.acknowledgeSelfClosingTag()
+                               break
+                       }
+                       // END MOD
                        p.addFormattingElement()
                case a.B, a.Big, a.Code, a.Em, a.Font, a.I, a.S, a.Small, a.Strike, a.Strong, a.Tt, a.U:
                        p.reconstructActiveFormattingElements()
@@ -1093,6 +1152,15 @@ func inBodyIM(p *parser) bool {
                case a.Caption, a.Col, a.Colgroup, a.Frame, a.Head, a.Tbody, a.Td, a.Tfoot, a.Th, a.Thead, a.Tr:
                        // Ignore the token.
                default:
+                       // MOD(geek1011): Allow span to be self-closing. See the mod above
+                       //     for A elements for more details.
+                       if p.lenientSelfClosing && p.tok.DataAtom == a.Span && p.hasSelfClosingToken {
+                               p.addElement()
+                               p.oe.pop()
+                               p.acknowledgeSelfClosingTag()
+                               break
+                       }
+                       // END MOD
                        p.reconstructActiveFormattingElements()
                        p.addElement()
                }
@@ -2331,6 +2399,15 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
        }
 }
 
+// ParseOptionLenientSelfClosing controls whether to allow additional non-void
+// elements to be self closing (<whatever ... />). This is mainly for better
+// compatibility with XHTML found in EPUBs. MOD(geek1011)
+func ParseOptionLenientSelfClosing(enable bool) ParseOption {
+       return func(p *parser) {
+               p.lenientSelfClosing = enable
+       }
+}
+
 // ParseWithOptions is like Parse, with options.
 func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
        p := &parser{

pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
pgaskin added a commit to pgaskin/net that referenced this issue Jan 12, 2020
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
pgaskin added a commit that referenced this issue Jun 11, 2021
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See #45 and #28 for a few examples.
pgaskin added a commit that referenced this issue Jun 11, 2021
This option slightly relaxes parsing rules about which elements can
be self-closing to fix common issues with XHTML in EPUBs generated
by XML tooling.

See #45 and #28 for a few examples.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug epub issue Issue with the source epub
Projects
None yet
Development

No branches or pull requests

2 participants