-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Footnotes give text a different color. #45
Comments
I'm guessing this is due to a self-closing A tag (e.g. I could add a workaround to kepubify, but that's almost outside the scope of kepubify. I'll see if the KEPUB reader accepts invalid void elements when I have time, and if it does, I'll also make kepubify do it as well. |
This is what happens when it is parsed: https://play.golang.org/p/2VHmxGqKm5P package main
import (
"fmt"
"strings"
"github.com/PuerkitoBio/goquery"
)
func main() {
doc, _ := goquery.NewDocumentFromReader(strings.NewReader(`<!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test"/>test 2<p>test 3</p></body></html>`))
html, _ := doc.Html()
fmt.Println(html)
}
// output: <!DOCTYPE html><html><head><title></title></head><body>test 1<a id="test">test 2<p>test 3</p></a></body></html> I'm going to test it with the KEPUB renderer now. Update: The KEPUB renderer doesn't seem to be as strict, so I'll see what I can do with fixing this. |
I won't be fixing this issue immediately, as it requires restructuring how the parsing is done. The issue is that kepubify always parses the HTML as HTML5, which means it follows HTML5 parsing rules, namely about treating self-closing tags which aren't void elements as if the closing doesn't exist (see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-non-void-html-element-start-tag-with-trailing-solidus). The KEPUB renderer either parses it according to the DOCTYPE or always as XHTML (I haven't checked yet). To fix this problem, I have a few options:
I'm most likely going to do either the first or the fourth option. I'll probably leave it for the next version, though, and improve the whole content manipulation code while I'm at it. Update: I've decided to go with the first option. Here's the mostly working patch I have so far, but I still need to write tests for it and incorporate a few other changes. diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
// context is the context element when parsing an HTML fragment
// (section 12.4).
context *Node
+ // lenientSelfClosing controls whether to allow additional non-void elements
+ // to be self closing (<whatever ... />). This is mainly for better
+ // compatibility with XHTML found in EPUBs. MOD(geek1011)
+ lenientSelfClosing bool
}
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
// context is the context element when parsing an HTML fragment
// (section 12.4).
context *Node
+ // lenientSelfClosing controls whether to allow additional non-void elements
+ // to be self closing (<whatever ... />). This is mainly for better
+ // compatibility with XHTML found in EPUBs. MOD(geek1011)
+ lenientSelfClosing bool
}
func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
p.tokenizer.NextIsNotRawText()
return true
case a.Script, a.Title:
+ // MOD(geek1011): Allow title to be self-closing. See the mod below
+ // for A elements for more details. This is often found in XHTML
+ // EPUBs, where generators don't add any text to the title tag
+ // and use an XML renderer (which then self-closes it, which is
:...skipping...
diff --git a/html/parse.go b/html/parse.go
index 2cd12fc..6ab9e7b 100644
--- a/html/parse.go
+++ b/html/parse.go
@@ -49,6 +49,10 @@ type parser struct {
// context is the context element when parsing an HTML fragment
// (section 12.4).
context *Node
+ // lenientSelfClosing controls whether to allow additional non-void elements
+ // to be self closing (<whatever ... />). This is mainly for better
+ // compatibility with XHTML found in EPUBs. MOD(geek1011)
+ lenientSelfClosing bool
}
func (p *parser) top() *Node {
@@ -652,6 +656,36 @@ func inHeadIM(p *parser) bool {
p.tokenizer.NextIsNotRawText()
return true
case a.Script, a.Title:
+ // MOD(geek1011): Allow title to be self-closing. See the mod below
+ // for A elements for more details. This is often found in XHTML
+ // EPUBs, where generators don't add any text to the title tag
+ // and use an XML renderer (which then self-closes it, which is
+ // ignored in the HTML5 spec, and results in everything after
+ // becoming the title when parsed using a compliant HTML5 parser).
+ //
+ // Also allow script to be self-closing, as I've seen it in quite
+ // a few cases when HTML is generated using an XML encoder and
+ // it's a script[src].
+ if p.lenientSelfClosing && p.hasSelfClosingToken {
+ // Add the element, but immediately remove it from the stack (
+ // this is necessary because it would usually be removed after
+ // the raw text, which we're skipping).
+ p.addElement()
+ p.oe.pop()
+
+ // This doesn't actually do anything here, but we acknowledge it
+ // for consistency.
+ p.acknowledgeSelfClosingTag()
+
+ // There isn't an end tag for Title, so it considers it raw text
+ // afterwards (even if we don't add the element) unless we tell
+ // it not to consider it as such.
+ p.tokenizer.NextIsNotRawText()
+
+ return true
+ }
+ // END MOD
+
p.addElement()
p.setOriginalIM()
p.im = textIM
@@ -975,6 +1009,31 @@ func inBodyIM(p *parser) bool {
}
}
p.reconstructActiveFormattingElements()
+ // MOD(geek1011): Allow A to be self-closing. THIS IS NOT SPEC-
+ // COMPLIANT, but won't cause any issues in basically any real-
+ // world case, as people don't go doing things like
+ // `<a href="/whatever" />link text</a>` and expect it to work (
+ // the spec says to ignore the self-closing on any non-void
+ // element). This fixes cases where people (or generators) do
+ // things like `<a id="aRandomID" />some more text` and expect
+ // it to be treated like XHTML/XML, where the self-closing would
+ // work.
+ //
+ // Based on the case for void elements (br, wbr, input, ...),
+ // but only closes them if they have the self closing token (/>)
+ // rather than closing them and ignoring content in all cases.
+ //
+ // This is done after reconstructActiveFormattingElements (but
+ // before addFormattingElement, which would be useless as there
+ // isn't any content in a self-closing A) to prevent breaking
+ // open formatting elements before the A tag.
+ if p.lenientSelfClosing && p.hasSelfClosingToken {
+ p.addElement()
+ p.oe.pop()
+ p.acknowledgeSelfClosingTag()
+ break
+ }
+ // END MOD
p.addFormattingElement()
case a.B, a.Big, a.Code, a.Em, a.Font, a.I, a.S, a.Small, a.Strike, a.Strong, a.Tt, a.U:
p.reconstructActiveFormattingElements()
@@ -1093,6 +1152,15 @@ func inBodyIM(p *parser) bool {
case a.Caption, a.Col, a.Colgroup, a.Frame, a.Head, a.Tbody, a.Td, a.Tfoot, a.Th, a.Thead, a.Tr:
// Ignore the token.
default:
+ // MOD(geek1011): Allow span to be self-closing. See the mod above
+ // for A elements for more details.
+ if p.lenientSelfClosing && p.tok.DataAtom == a.Span && p.hasSelfClosingToken {
+ p.addElement()
+ p.oe.pop()
+ p.acknowledgeSelfClosingTag()
+ break
+ }
+ // END MOD
p.reconstructActiveFormattingElements()
p.addElement()
}
@@ -2331,6 +2399,15 @@ func ParseOptionEnableScripting(enable bool) ParseOption {
}
}
+// ParseOptionLenientSelfClosing controls whether to allow additional non-void
+// elements to be self closing (<whatever ... />). This is mainly for better
+// compatibility with XHTML found in EPUBs. MOD(geek1011)
+func ParseOptionLenientSelfClosing(enable bool) ParseOption {
+ return func(p *parser) {
+ p.lenientSelfClosing = enable
+ }
+}
+
// ParseWithOptions is like Parse, with options.
func ParseWithOptions(r io.Reader, opts ...ParseOption) (*Node, error) {
p := &parser{ |
This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
This option slightly relaxes parsing rules about which elements can be self-closing to fix common issues with XHTML in EPUBs generated by XML tooling. See pgaskin/kepubify#45 and pgaskin/kepubify#28 for a few examples.
Hello,
I'm having the following problem when converting books, related to (references to) footnotes.
For some of my converted books, each first footnote in a chapter causes all the preceding text in that chapter to have a different color from the default color. After the first footnote, the text has the default color and any subsequent footnotes in that chapter work as intended.
The original epub with a footnote:
The converted epub with all the text preceding the first footnote in the chapter in a different color:
The text was updated successfully, but these errors were encountered: