feat: support for XML 1.1
BREAKING CHANGE: previous versions of saxes would parse files with an XML
declaration set to 1.1 as 1.0 documents. The support for 1.1 entails that if a
document has an XML declaration that specifies version 1.1 it is parsed as a 1.1
document.
lddubeau committed Jul 30, 2019
1 parent 1a25e8b commit 36704fb
Showing 5 changed files with 331 additions and 39 deletions.
67 changes: 44 additions & 23 deletions README.md
@@ -16,11 +16,10 @@ Saxes does not support Node versions older than 8.
well-formedness. Sax, even in its so-called "strict mode", is not strict. It
silently accepts structures that are not well-formed XML. Projects that need
better compliance with well-formedness constraints cannot use sax as-is.
Saxes aims for conformance with [XML 1.0 fifth
edition](https://www.w3.org/TR/2008/REC-xml-20081126/) and [XML Namespaces 1.0
third edition](http://www.w3.org/TR/2009/REC-xml-names-20091208/).

Consequently, saxes does not support HTML, or pseudo-XML, or bad XML.
Consequently, saxes does not support HTML, or pseudo-XML, or bad XML. Saxes
will report well-formedness errors in all these cases but it won't try to
extract data from malformed documents like sax does.

* Saxes is much much faster than sax, mostly because of a substantial redesign
of the internal parsing logic. The speed improvement is not merely due to
@@ -45,28 +44,23 @@ Saxes does not support Node versions older than 8.
* Saxes does not have facilities for limiting the size of the data chunks passed
to event handlers. See the FAQ entry for more details.

## Limitations

This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs encountered.
## Conformance

However, this parser does not parse the contents of DTDs. So malformedness
errors caused by errors in DTDs cannot be reported.
Saxes supports:

Also, the parser continues to parse even upon encountering errors, and does its
best to continue reporting errors. You should heed all errors
reported.
* [XML 1.0 fifth edition](https://www.w3.org/TR/2008/REC-xml-20081126/)
* [XML 1.1 second edition](https://www.w3.org/TR/2006/REC-xml11-20060816/)
* [Namespaces in XML 1.0 (Third Edition)](https://www.w3.org/TR/2009/REC-xml-names-20091208/).
* [Namespaces in XML 1.1 (Second Edition)](https://www.w3.org/TR/2006/REC-xml-names11-20060816/).
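
One practical difference between the XML 1.0 and XML 1.1 specifications listed above is end-of-line handling: XML 1.1 additionally treats NEL (U+0085) and LS (U+2028) as line ends. The following is a minimal sketch of the two normalization rules at the string level. The function names `normalizeEOL10` and `normalizeEOL11` are hypothetical and are not part of saxes, which performs this conversion one code point at a time while reading.

```javascript
// Hypothetical helpers for illustration only; saxes does not export these.

// XML 1.0: CR NL and a lone CR are both converted to NL.
function normalizeEOL10(text) {
  return text.replace(/\r\n?/g, "\n");
}

// XML 1.1: CR NL, CR NEL, a lone CR, NEL and LS are all converted to NL.
function normalizeEOL11(text) {
  return text.replace(/\r[\n\u0085]?|[\u0085\u2028]/g, "\n");
}
```

Under the XML 1.0 rule, NEL and LS pass through untouched, so the same input can produce different line counts depending on which version applies.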

**HOWEVER, ONCE AN ERROR HAS BEEN ENCOUNTERED YOU CANNOT RELY ON THE DATA
PROVIDED THROUGH THE OTHER EVENT HANDLERS.**
## Limitations

After an error, saxes tries to make sense of your document, but it may interpret
it incorrectly. For instance ``<foo a=bc="d"/>`` is invalid XML. Did you mean to
have ``<foo a="bc=d"/>`` or ``<foo a="b" c="d"/>`` or some other variation?
Saxes takes an honest stab at figuring out your mangled XML. That's as good as
it gets.
This is a non-validating parser so it only verifies whether the document is
well-formed. We do aim to raise errors for all malformed constructs
encountered. However, this parser does not thoroughly parse the contents of
DTDs. So most malformedness errors caused by errors in DTDs cannot be reported.

## Regarding `<!DOCTYPE`s and `<!ENTITY`s
## Regarding `<!DOCTYPE` and `<!ENTITY`

The parser will handle the basic XML entities in text nodes and attribute
values: `&amp; &lt; &gt; &apos; &quot;`. It's possible to define additional
@@ -138,10 +132,16 @@ Settings supported:
namespaces known before parsing the XML file. It is not legal to pass
bindings for the namespaces `"xml"` or `"xmlns"`.

* `defaultXMLVersion` - The default version of the XML specification to use if
the document contains no XML declaration. If the document does contain an XML
declaration, then this setting is ignored. Must be `"1.0"` or `"1.1"`. The
default is `"1.0"`.
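
The precedence described for `defaultXMLVersion` can be sketched as a small function. This is an illustration under stated assumptions, not saxes code: `resolveXMLVersion` is a hypothetical name, and rejecting an unsupported default with a thrown `Error` is an assumption about how a caller might enforce the "must be" rule.

```javascript
// Hypothetical helper for illustration only; not part of the saxes API.
// `declaredVersion` is the version found in the XML declaration, or
// undefined when the document has no XML declaration.
function resolveXMLVersion(declaredVersion, defaultXMLVersion = "1.0") {
  // Assumption: an unsupported default is treated as a caller error.
  if (defaultXMLVersion !== "1.0" && defaultXMLVersion !== "1.1") {
    throw new Error("defaultXMLVersion must be \"1.0\" or \"1.1\"");
  }
  // An XML declaration always wins; the default is only a fallback.
  return declaredVersion === undefined ? defaultXMLVersion : declaredVersion;
}
```

For example, a document that starts with `<?xml version="1.0"?>` is still handled as XML 1.0 even when the setting is `"1.1"`.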

### Methods

`write` - Write bytes onto the stream. You don't have to do this all at
once. You can keep writing as much as you want.
`write` - Write bytes onto the stream. You don't have to pass the whole document
in one `write` call. You can read your source chunk by chunk and call `write`
with each chunk.

`close` - Close the stream. Once closed, no more data may be written until it is
done processing the buffer, which is signaled by the `end` event.
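
Feeding a document chunk by chunk and then closing the stream can be sketched as follows. The helpers `chunksOf` and `feed` are hypothetical names introduced for illustration; the only assumption about the parser is that it exposes the `write` and `close` methods described above. Keeping surrogate pairs together in one chunk is a precaution taken by this sketch, not a documented requirement.

```javascript
// Hypothetical helpers for illustration only; not part of the saxes API.

// Split `source` into chunks of about `size` UTF-16 code units (size >= 1),
// without cutting a surrogate pair in half.
function* chunksOf(source, size) {
  let start = 0;
  while (start < source.length) {
    let end = Math.min(start + size, source.length);
    const last = source.charCodeAt(end - 1);
    // If the chunk would end on a high surrogate, include its low surrogate.
    if (last >= 0xD800 && last <= 0xDBFF && end < source.length) {
      end++;
    }
    yield source.slice(start, end);
    start = end;
  }
}

// Call `write` once per chunk, then `close` to signal the end of input.
function feed(parser, source, size) {
  for (const chunk of chunksOf(source, size)) {
    parser.write(chunk);
  }
  parser.close();
}
```

In real use the chunks would more likely come from a file or network stream than from a string already in memory.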
@@ -168,6 +168,27 @@ generated by the parser happens, the declaration has been processed if present
at all. Otherwise, you have a malformed document, and as stated above, you
cannot rely on the parser data!

### Error Handling

The parser continues to parse even upon encountering errors, and does its best
to continue reporting errors. You should heed all errors reported. After an
error, however, saxes may interpret your document incorrectly. For instance
``<foo a=bc="d"/>`` is invalid XML. Did you mean to have ``<foo a="bc=d"/>`` or
``<foo a="b" c="d"/>`` or some other variation? For the sake of continuing to
provide errors, saxes will continue parsing the document, but the structure it
reports may be incorrect. It is only after the errors are fixed in the document
that saxes can provide a reliable interpretation of the document.

That leaves you with two rules of thumb when using saxes:

* Pay attention to the errors that saxes reports. The default `onerror` handler
throws, so by default, you cannot miss errors.

* **ONCE AN ERROR HAS BEEN ENCOUNTERED, STOP RELYING ON THE EVENT HANDLERS OTHER
THAN `onerror`.** As explained above, when saxes runs into a well-formedness
problem, it makes a guess in order to continue reporting more errors. The guess
may be wrong.
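
One way to apply the second rule when you replace the default `onerror` handler is to stop forwarding the other events after the first error. The sketch below assumes only that handlers are installed by assigning `on<eventname>` properties on the parser. `guardHandlers` is a hypothetical helper, not something saxes provides.

```javascript
// Hypothetical wrapper for illustration only; not part of the saxes API.
// Installs `handlers` on `parser` so that, after the first error, only
// `onerror` is still forwarded.
function guardHandlers(parser, handlers) {
  let failed = false;
  for (const name of Object.keys(handlers)) {
    if (name === "onerror") {
      continue;
    }
    parser[name] = (...args) => {
      // Data reported after an error may be based on a wrong guess.
      if (!failed) {
        handlers[name](...args);
      }
    };
  }
  parser.onerror = (err) => {
    failed = true;
    if (handlers.onerror) {
      handlers.onerror(err);
    }
  };
  return parser;
}
```

With this wrapper, every error is still collected, but data events stop reaching your code as soon as the document is known to be malformed.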

### Events

To listen to an event, override `on<eventname>`. The list of supported events
1 change: 1 addition & 0 deletions lib/saxes.d.ts
@@ -7,6 +7,7 @@ declare namespace saxes {
fragment?: boolean;
fileName?: string;
additionalNamespaces?: Record<string, string>;
defaultXMLVersion?: "1.0" | "1.1";
}

export interface XMLDecl {
142 changes: 129 additions & 13 deletions lib/saxes.js
@@ -1,8 +1,11 @@
"use strict";

const { isS, isChar, isNameStartChar, isNameChar, S_LIST, NAME_RE } =
require("xmlchars/xml/1.0/ed5");
const { isNCNameStartChar, isNCNameChar, NC_NAME_RE } = require("xmlchars/xmlns/1.0/ed3");
const {
isS, isChar: isChar10, isNameStartChar, isNameChar, S_LIST, NAME_RE,
} = require("xmlchars/xml/1.0/ed5");
const { isChar: isChar11, isRestrictedChar } = require("xmlchars/xml/1.1/ed2");
const { isNCNameStartChar, isNCNameChar, NC_NAME_RE } =
require("xmlchars/xmlns/1.0/ed3");

const XML_NAMESPACE = "http://www.w3.org/XML/1998/namespace";
const XMLNS_NAMESPACE = "http://www.w3.org/2000/xmlns/";
@@ -101,6 +104,8 @@ const GREATER = 0x3E;
const QUESTION = 0x3F;
const OPEN_BRACKET = 0x5B;
const CLOSE_BRACKET = 0x5D;
const NEL = 0x85;
const LS = 0x2028; // Line Separator

function isQuote(c) {
return c === DQUOTE || c === SQUOTE;
@@ -259,6 +264,10 @@ const FORBIDDEN_BRACKET_BRACKET = 2;
* @property {string} [fileName] A file name to use for error reporting. "File name" is a loose
* concept. You could use a URL to some resource, or any descriptive name you
* like.
*
* @property {"1.0" | "1.1"} [defaultXMLVersion] The default XML version to
* use. If unspecified, and there is no XML declaration, the default
* version is "1.0".
*/

class SaxesParser {
@@ -360,7 +369,6 @@ class SaxesParser {
this.nameCheck = isNCNameChar;
this.isName = isNCName;
this.processAttribs = this.processAttribsNS;
this.pushAttrib = this.pushAttribNS;

this.ns = Object.assign({ __proto__: null }, rootNS);
const additional = this.opt.additionalNamespaces;
@@ -374,9 +382,14 @@ class SaxesParser {
this.nameCheck = isNameChar;
this.isName = isName;
this.processAttribs = this.processAttribsPlain;
this.pushAttrib = this.pushAttribPlain;
}

let { defaultXMLVersion } = this.opt;
if (defaultXMLVersion === undefined) {
defaultXMLVersion = "1.0";
}
this.setXMLVersion(defaultXMLVersion);

this.trackPosition = this.opt.position !== false;
/** The line number the parser is currently looking at. */
this.line = 1;
@@ -575,11 +588,13 @@ class SaxesParser {
* Get a single code point out of the current chunk. This updates the current
* position if we do position tracking.
*
* This is the algorithm to use for XML 1.0.
*
* @private
*
* @returns {number} The character read.
*/
getCode() {
getCode10() {
const { chunk, i } = this;
// Using charCodeAt and handling the surrogates ourselves is faster
// than using codePointAt.
@@ -614,7 +629,68 @@ class SaxesParser {
skip++;
}

if (!isChar(code)) {
if (!isChar10(code)) {
this.fail("disallowed character.");
}
}

this.i += skip;

return code;
}


/**
* Get a single code point out of the current chunk. This updates the current
* position if we do position tracking.
*
* This is the algorithm to use for XML 1.1.
*
* @private
*
* @returns {number} The character read.
*/
getCode11() {
const { chunk, i } = this;
// Using charCodeAt and handling the surrogates ourselves is faster
// than using codePointAt.
let code = chunk.charCodeAt(i);

let skip = 1;
switch (code) {
case CR: { // 0xD
// We may get NaN if we read past the end of the chunk, which is
// fine.
const next = chunk.charCodeAt(i + 1);
if (next === NL || next === NEL) {
// A CR NL or CR NEL sequence is converted to NL so we have to skip over
// the next character. We already know it has a size of 1 so ++ is fine
// here.
skip++;
}
// Otherwise, a CR is just converted to NL, no skip.
}
/* yes, fall through */
case NEL: // 0x85
case LS: // 0x2028
case NL: // 0xA
code = NL;
this.line++;
this.column = 0;
break;

default:
this.column++;
if (code >= 0xD800 && code <= 0xDBFF) {
code = 0x10000 + ((code - 0xD800) * 0x400) +
(chunk.charCodeAt(i + 1) - 0xDC00);
this.column++;
skip++;
}

// In XML 1.1 the character we read must satisfy the Char production but
// not the RestrictedChar production.
if (!isChar11(code) || isRestrictedChar(code)) {
this.fail("disallowed character.");
}
}
@@ -769,6 +845,22 @@ class SaxesParser {
return undefined;
}

/** @private */
setXMLVersion(version) {
if (version === "1.0") {
this.isChar = isChar10;
this.getCode = this.getCode10;
this.pushAttrib =
this.xmlnsOpt ? this.pushAttribNS10 : this.pushAttribPlain;
}
else {
this.isChar = isChar11;
this.getCode = this.getCode11;
this.pushAttrib =
this.xmlnsOpt ? this.pushAttribNS11 : this.pushAttribPlain;
}
}

// STATE HANDLERS

/** @private */
@@ -1380,13 +1472,19 @@ class SaxesParser {

if (c) {
switch (this.xmlDeclName) {
case "version":
if (!/^1\.[0-9]+$/.test(this.xmlDeclValue)) {
case "version": {
this.xmlDeclExpects = ["encoding", "standalone"];
const version = this.xmlDeclValue;
this.xmlDecl.version = version;
// This is the test specified by XML 1.0 but it is fine for XML 1.1.
if (!/^1\.[0-9]+$/.test(version)) {
this.fail("version number must match /^1\\.[0-9]+$/.");
}
this.xmlDeclExpects = ["encoding", "standalone"];
this.xmlDecl.version = this.xmlDeclValue;
else {
this.setXMLVersion(version);
}
break;
}
case "encoding":
if (!/^[A-Za-z][A-Za-z0-9._-]*$/.test(this.xmlDeclValue)) {
this.fail("encoding value must match \
@@ -1561,7 +1659,25 @@ class SaxesParser {
}

/** @private */
pushAttribNS(name, value) {
pushAttribNS10(name, value) {
const { prefix, local } = this.qname(name);
this.attribList.push({ name, prefix, local, value, uri: undefined });
if (prefix === "xmlns") {
const trimmed = value.trim();
if (trimmed === "") {
this.fail("invalid attempt to undefine prefix in XML 1.0");
}
this.tag.ns[local] = trimmed;
nsPairCheck(this, local, trimmed);
}
else if (name === "xmlns") {
const trimmed = value.trim();
this.tag.ns[""] = trimmed;
nsPairCheck(this, "", trimmed);
}
}

pushAttribNS11(name, value) {
const { prefix, local } = this.qname(name);
this.attribList.push({ name, prefix, local, value, uri: undefined });
if (prefix === "xmlns") {
@@ -2060,7 +2176,7 @@ class SaxesParser {
}

// The character reference is required to match the CHAR production.
if (!isChar(num)) {
if (!this.isChar(num)) {
this.fail("malformed character entity.");
return `&${entity};`;
}