Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape apostrophes by default? #167

Open
ashawley opened this issue Oct 15, 2017 · 3 comments
Open

Escape apostrophes by default? #167

ashawley opened this issue Oct 15, 2017 · 3 comments

Comments

@ashawley
Copy link
Member

ashawley commented Oct 15, 2017

There is a fix related to apostrophes and XMLEventReader in #72 that is slated to appear in version 1.1.0. However, while fixing the defect there was not a change to the handling of apostrophes for the entire library. Fundamentally, this is represented by the behavior of Utitlity.escape method:

scala> scala.xml.Utility.escape("'")
res0: String = '

Users are curious why the apostrophe isn't generally being escaped to become ':

scala> scala.xml.Utility.escape("'")
res1: String = '

The reason apostrophe isn't escaped is because the scala-xml code has been around for a long time, and although it is primarily an XML library its purpose was one of mixed use with HTML. The HTML 4.0 standard still does not define an apos entity, see https://www.w3.org/TR/html401/sgml/entities.html

The apos entity has been in XML 1.0 since the beginning, see https://www.w3.org/TR/1998/REC-xml-19980210#sec-predefined-ent

Most HTML these days is XHTML, and the apos entity was defined in XHTML 1.0, since it needed to conform to the XML standard, see https://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#a_dtd_Special_characters

It seems like apos was explicitly commented out for some historical issue about HTML4 and a Web browser called Internet Explorer -- see the enigmatic comment added in e1fadeb.

Ten years later, can we bring back XML escape support for apos?

There is some commentary on the Entities representing special characters in XHTML at Wikipedia, but I couldn't find an an analysis of browser support for apos.

Presumably, all browsers support apos if the document is properly declared as XHTML. So, it seems that fixing this would require accepting that scala-xml no longer supports HTML4 or earlier?

If that's the case, than the only other issue is finding out the second-order consequences of "fixing" this, and how this would affect users. Byte-for-byte, there XML would suddenly look a little different.

There are at least 4 different contexts for an apostrophe:

  1. In an element's text (PCDATA), e.g. <p>The ' character</p>
  2. CDATA, e.g. <![CDATA[the ' character]]>
  3. In an attribute value, e.g. <a href="#the-'-character">
  4. As an attribute quote character, e.g. <a href='foo.html'>

And then there are at least 3 different programming modes

  1. XML literals
  2. XML files
  3. Programmatically-created XML (e.g. Elem(null, "p", Null, TopScope, Text("The ' character")))

And then there at least 3 different types of parsing and reading:

  1. scala.xml.factory.XMLLoader
  2. scala.xml.pull.XMLEventReader
  3. scala.xml.parsing.ConstructingParser
@ashawley ashawley added this to the 1.1.1 milestone Oct 15, 2017
@mbeckerle
Copy link

I would be happy with a setting/flag optional argument for XML loader calls indicating a desire for stricter XML 1.0 semantics, vs. lax XML/HTML4 compatibility semantics with the default being the lax behavior as that is what is current.

Our usage is only for XML, no HTML involved, so we would just always pass this flag in strict setting.

@ashawley ashawley modified the milestones: 1.1.1, 1.2.0 Jun 27, 2018
@rkoeninger
Copy link

In Utility, apos is explicity exlucded from the escMap function:

object Escapes {
/**
* For reasons unclear escape and unescape are a long ways from
* being logical inverses.
*/
val pairs = Map(
"lt" -> '<',
"gt" -> '>',
"amp" -> '&',
"quot" -> '"',
"apos" -> '\''
)
val escMap = (pairs - "apos") map { case (s, c) => c -> ("&%s;" format s) }
val unescMap = pairs
}

And then in MarkupParser, the escape pairs, including apos, are used as the unescape function:

import Utility.Escapes.{ pairs => unescape }

So it always handles apos when parsing.

The additional flag would be passed to Utility.serialize and functions that call it:

def serialize(
x: Node,
pscope: NamespaceBinding = TopScope,
sb: StringBuilder = new StringBuilder,
stripComments: Boolean = false,
decodeEntities: Boolean = true,
preserveWhitespace: Boolean = false,
minimizeTags: MinimizeMode.Value = MinimizeMode.Default): StringBuilder =

Considering how many optional arguments the function now takes, maybe there should be a Options or Settings class to contain them. There could be an overload to take such an object.

Also, should apos escaping default to true to match the expected behavior these days or remain false to preserve current behavior?

@ashawley ashawley modified the milestones: 1.2.0, 2.0 Mar 2, 2019
@SethTisue SethTisue removed this from the 2.0 milestone Mar 3, 2021
@SethTisue
Copy link
Member

removing this from the 2.0 milestone since we're nearing release (#432) and it doesn't seem like a blocker

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants