[RFC] Clarify and restrict unicode support
This proposal alters the parser grammar to be more specific about which Unicode characters are allowed as source, restricts the characters interpreted as white space or line breaks, and clarifies line break behavior relative to error reporting with a non-normative note.
leebyron committed Sep 24, 2015
1 parent 8516a85 commit 11fba02
Showing 3 changed files with 78 additions and 30 deletions.
13 changes: 8 additions & 5 deletions spec/Appendix A -- Notation Conventions.md
@@ -166,13 +166,16 @@ Example_param :
This specification describes the semantic value of many grammar productions in
the form of a list of algorithmic steps.

For example, this describes how a parser should interpret a Unicode escape
sequence which appears in a string literal:
For example, this describes how a parser should interpret a string literal:

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
StringValue :: `""`

* Let {codePoint} be the number represented by the four-digit hexadecimal sequence.
* The string value is the Unicode character represented by {codePoint}.
* Return an empty Unicode character sequence.

StringValue :: `"` StringCharacter+ `"`

* Return the Unicode character sequence of all {StringCharacter}
Unicode character values.


## Algorithms
20 changes: 9 additions & 11 deletions spec/Appendix B -- Grammar Summary.md
@@ -1,31 +1,29 @@
# B. Appendix: Grammar Summary

SourceCharacter :: "Any Unicode code point"
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/


## Ignored Tokens

Ignored ::
- UnicodeBOM
- WhiteSpace
- LineTerminator
- Comment
- Comma

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

WhiteSpace ::
- "Horizontal Tab (U+0009)"
- "Vertical Tab (U+000B)"
- "Form Feed (U+000C)"
- "Space (U+0020)"
- "No-break Space (U+00A0)"

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)"
- "Line Separator (U+2028)"
- "Paragraph Separator (U+2029)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" "New Line (U+000A)"

Comment ::
- `#` CommentChar*
Comment :: `#` CommentChar*

CommentChar :: SourceCharacter but not LineTerminator

@@ -76,10 +74,10 @@ StringValue ::

StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \ EscapedUnicode
- \u EscapedUnicode
- \ EscapedCharacter

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
EscapedUnicode :: /[0-9A-Fa-f]{4}/

EscapedCharacter :: one of `"` \ `/` b f n r t

75 changes: 61 additions & 14 deletions spec/Section 2 -- Language.md
@@ -13,45 +13,61 @@ double-colon `::`).

## Source Text

SourceCharacter :: "Any Unicode character"
SourceCharacter :: /[\u0009\u000A\u000D\u0020-\uFFFF]/

GraphQL documents are expressed as a sequence of
[Unicode](http://unicode.org/standard/standard.html) characters. However, with
few exceptions, most of GraphQL is expressed only in the original ASCII range
so as to be as widely compatible with as many existing tools, languages, and
serialization formats as possible. Other than within comments, Non-ASCII Unicode
characters are only found within {StringValue}.
few exceptions, most of GraphQL is expressed only in the original non-control
ASCII range so as to be as widely compatible with as many existing tools,
languages, and serialization formats as possible and avoid display issues in
text editors and source control.
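
As a rough illustration of the restricted {SourceCharacter} pattern, the sketch below scans a document for the first character outside the allowed range. The function name `first_invalid_index` is hypothetical and not part of the specification. Python iterates strings by code point, so characters above U+FFFF are rejected here; an implementation working on UTF-16 code units may behave differently.

```python
import re

# Mirrors the SourceCharacter pattern: tab, new line, carriage return,
# and U+0020 through U+FFFF. Other control characters are excluded.
SOURCE_CHARACTER = re.compile(r"[\u0009\u000A\u000D\u0020-\uFFFF]")


def first_invalid_index(source: str) -> int:
    """Return the index of the first character that is not a
    SourceCharacter, or -1 if every character is allowed.

    Hypothetical helper for illustration only.
    """
    for index, char in enumerate(source):
        if not SOURCE_CHARACTER.fullmatch(char):
            return index
    return -1
```

A parser could run a check like this before tokenizing and report the offending position as a syntax error.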


### Unicode

UnicodeBOM :: "Byte Order Mark (U+FEFF)"

Non-ASCII Unicode characters may freely appear within {StringValue} and
{Comment} portions of GraphQL.

The "Byte Order Mark" is a special Unicode character which
may appear at the beginning of a file containing Unicode which programs may use
to determine the fact that the text stream is Unicode, what endianness the text
stream is in, and which of several Unicode encodings to interpret.


### White Space

WhiteSpace ::
- "Horizontal Tab (U+0009)"
- "Vertical Tab (U+000B)"
- "Form Feed (U+000C)"
- "Space (U+0020)"
- "No-break Space (U+00A0)"

White space is used to improve legibility of source text and act as separation
between tokens, and any amount of white space may appear before or after any
token. White space between tokens is not significant to the semantic meaning of
a GraphQL query document, however white space characters may appear within a
{String} or {Comment} token.

Note: GraphQL intentionally does not consider Unicode "Zs" category characters
as white-space, avoiding misinterpretation by text editors and source
control tools.

### Line Terminators

LineTerminator ::
- "New Line (U+000A)"
- "Carriage Return (U+000D)"
- "Line Separator (U+2028)"
- "Paragraph Separator (U+2029)"
- "Carriage Return (U+000D)" [ lookahead ! "New Line (U+000A)" ]
- "Carriage Return (U+000D)" "New Line (U+000A)"

Like white space, line terminators are used to improve the legibility of source
text, any amount may appear before or after any other token and have no
significance to the semantic meaning of a GraphQL query document. Line
terminators are not found within any other token.

Note: Any error reporting which provides the line number in the source of the
offending syntax should use the preceding number of {LineTerminator} to produce
the line number.
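
One way to read this note is sketched below: count the {LineTerminator} matches that precede an offset, treating a carriage return followed by a new line as a single terminator. The helper name `line_number` and the choice of 1-based numbering are assumptions made for illustration.

```python
import re

# One match per LineTerminator. "\r\n" is listed first so that a
# carriage return followed by a new line counts only once.
LINE_TERMINATOR = re.compile(r"\r\n|\r|\n")


def line_number(source: str, offset: int) -> int:
    """Return the 1-based line number of the character at `offset`.

    Hypothetical helper: the line number is one plus the number of
    LineTerminator matches found before the offset.
    """
    return 1 + len(LINE_TERMINATOR.findall(source[:offset]))
```

For the source `"a\r\nb\rc\nd"`, the characters `a`, `b`, `c`, and `d` fall on lines 1 through 4 respectively under this reading.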


### Comments

@@ -101,9 +117,11 @@ defined here in a lexical grammar by patterns of source Unicode characters.
Tokens are later used as terminal symbols in a GraphQL query document syntactic
grammars.


### Ignored Tokens

Ignored ::
- UnicodeBOM
- WhiteSpace
- LineTerminator
- Comment
@@ -639,17 +657,46 @@ StringValue ::

StringCharacter ::
- SourceCharacter but not `"` or \ or LineTerminator
- \ EscapedUnicode
- \u EscapedUnicode
- \ EscapedCharacter

EscapedUnicode :: u /[0-9A-Fa-f]{4}/
EscapedUnicode :: /[0-9A-Fa-f]{4}/

EscapedCharacter :: one of `"` \ `/` b f n r t

Strings are lists of characters wrapped in double-quotes `"`. (ex.
Strings are sequences of characters wrapped in double-quotes (`"`). (ex.
`"Hello World"`). White space and other otherwise-ignored characters are
significant within a string value.

Note: Unicode characters are allowed within String value literals. However,
GraphQL source must not contain some ASCII control characters, so escape
sequences must be used to represent these characters.

**Semantics**

StringValue :: `""`

* Return an empty Unicode character sequence.

StringValue :: `"` StringCharacter+ `"`

* Return the Unicode character sequence of all {StringCharacter}
Unicode character values.

StringCharacter :: SourceCharacter but not `"` or \ or LineTerminator

* Return the character value of {SourceCharacter}.

StringCharacter :: \u EscapedUnicode

* Return the character value represented by the UTF-16 hexadecimal
identifier {EscapedUnicode}.

StringCharacter :: \ EscapedCharacter

* Return the character value of {EscapedCharacter}.
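
The semantics above can be sketched as a small evaluator. This is a minimal illustration under stated assumptions, not the reference implementation: the name `string_value` is hypothetical, the input is assumed to be a single already-delimited literal, {SourceCharacter} range checking is omitted, and each `\u` escape is converted on its own without combining surrogate pairs.

```python
# Character values for EscapedCharacter: one of " \ / b f n r t
ESCAPED_CHARACTER = {
    '"': '"', "\\": "\\", "/": "/",
    "b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t",
}
HEX_DIGITS = set("0123456789abcdefABCDEF")


def string_value(literal: str) -> str:
    """Evaluate a StringValue literal, including its surrounding quotes.

    Hypothetical helper for illustration only.
    """
    if len(literal) < 2 or literal[0] != '"' or literal[-1] != '"':
        raise ValueError("string literal must be wrapped in double quotes")
    body = literal[1:-1]
    chars = []
    i = 0
    while i < len(body):
        char = body[i]
        # A bare quote or line terminator is not a valid StringCharacter.
        if char == '"' or char in "\r\n":
            raise ValueError(f"unescaped character at index {i}")
        if char != "\\":
            chars.append(char)  # plain SourceCharacter
            i += 1
            continue
        escape = body[i + 1 : i + 2]
        if escape == "u":
            # \u EscapedUnicode: exactly four hexadecimal digits.
            digits = body[i + 2 : i + 6]
            if len(digits) != 4 or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"bad unicode escape at index {i}")
            chars.append(chr(int(digits, 16)))
            i += 6
        elif escape in ESCAPED_CHARACTER:
            chars.append(ESCAPED_CHARACTER[escape])
            i += 2
        else:
            raise ValueError(f"bad escape sequence at index {i}")
    return "".join(chars)
```

For example, the literal `"Hello World"` evaluates to the eleven-character sequence `Hello World`, and `""` evaluates to an empty sequence.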


#### Enum Value

EnumValue : Name but not `true`, `false` or `null`
