From d2b0589f9b17f764155c3ab0b7e1ccadf3a5e5d3 Mon Sep 17 00:00:00 2001
From: Ron Buckton
Date: Tue, 1 Oct 2019 09:28:17 -0700
Subject: [PATCH] Add spec text for regexp-match-indices

---
 spec.html | 156 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 135 insertions(+), 21 deletions(-)

diff --git a/spec.html b/spec.html
index 6ac0cd65f27..a0303503b1c 100644
--- a/spec.html
+++ b/spec.html
@@ -30941,7 +30941,10 @@

Notation

A CharSet is a mathematical set of characters, either code units or code points depending upon the state of the _Unicode_ flag. “All characters” means either all code unit values or all code point values, also depending upon the state of _Unicode_.
  •
-   A State is an ordered pair (_endIndex_, _captures_) where _endIndex_ is an integer and _captures_ is a List of _NcapturingParens_ values. States are used to represent partial match states in the regular expression matching algorithms. The _endIndex_ is one plus the index of the last input character matched so far by the pattern, while _captures_ holds the results of capturing parentheses. The _n_th element of _captures_ is either a List that represents the value obtained by the _n_th set of capturing parentheses or *undefined* if the _n_th set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
+   A Range is an ordered pair (_startIndex_, _endIndex_) that represents the range of characters included in a capture, where _startIndex_ is an integer representing the start index (inclusive) of the range within _Input_ and _endIndex_ is an integer representing the end index (exclusive) of the range within _Input_. For any Range, these indices must satisfy the invariant that _startIndex_ ≤ _endIndex_.
+ •
+   A State is an ordered pair (_endIndex_, _captures_) where _endIndex_ is an integer and _captures_ is a List of _NcapturingParens_ values. States are used to represent partial match states in the regular expression matching algorithms. The _endIndex_ is one plus the index of the last input character matched so far by the pattern, while _captures_ holds the results of capturing parentheses. The _n_th element of _captures_ is either the Range obtained by the _n_th set of capturing parentheses or *undefined* if the _n_th set of capturing parentheses hasn't been reached yet. Due to backtracking, many States may be in use at any time during the matching process.
  • A MatchResult is either a State or the special token ~failure~ that indicates that the match failed.
@@ -31550,12 +31553,12 @@

    Atom

       1. Let _ye_ be _y_'s _endIndex_.
       1. If _direction_ is equal to +1, then
         1. Assert: _xe_ ≤ _ye_.
-        1. Let _s_ be a new List whose elements are the characters of _Input_ at indices _xe_ (inclusive) through _ye_ (exclusive).
+        1. Let _r_ be the Range (_xe_, _ye_).
       1. Else,
         1. Assert: _direction_ is equal to -1.
         1. Assert: _ye_ ≤ _xe_.
-        1. Let _s_ be a new List whose elements are the characters of _Input_ at indices _ye_ (inclusive) through _xe_ (exclusive).
-      1. Set _cap_[_parenIndex_ + 1] to _s_.
+        1. Let _r_ be the Range (_ye_, _xe_).
+      1. Set _cap_[_parenIndex_ + 1] to _r_.
       1. Let _z_ be the State (_ye_, _cap_).
       1. Call _c_(_z_) and return its result.
     1. Call _m_(_x_, _d_) and return its result.
@@ -31707,14 +31710,16 @@

    Runtime Semantics: BackreferenceMatcher ( _n_, _direction_ )

   1. Return an internal Matcher closure that takes two arguments, a State _x_ and a Continuation _c_, and performs the following steps:
     1. Let _cap_ be _x_'s _captures_ List.
-    1. Let _s_ be _cap_[_n_].
-    1. If _s_ is *undefined*, return _c_(_x_).
+    1. Let _r_ be _cap_[_n_].
+    1. If _r_ is *undefined*, return _c_(_x_).
     1. Let _e_ be _x_'s _endIndex_.
-    1. Let _len_ be the number of elements in _s_.
+    1. Let _rs_ be _r_'s _startIndex_.
+    1. Let _re_ be _r_'s _endIndex_.
+    1. Let _len_ be _re_ - _rs_.
     1. Let _f_ be _e_ + _direction_ × _len_.
     1. If _f_ < 0 or _f_ > _InputLength_, return ~failure~.
     1. Let _g_ be min(_e_, _f_).
-    1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_s_[_i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~.
+    1. If there exists an integer _i_ between 0 (inclusive) and _len_ (exclusive) such that Canonicalize(_Input_[_rs_ + _i_]) is not the same character value as Canonicalize(_Input_[_g_ + _i_]), return ~failure~.
     1. Let _y_ be the State (_f_, _cap_).
     1. Call _c_(_y_) and return its result.
@@ -31949,6 +31954,37 @@

    ClassEscape

    + +

    RegExp Abstract Operations

    + + +

    Match Records

    +

    A Match is a Record value used to encapsulate the start and end indices of a regular expression match or capture.

    +

    Match Records have the fields listed in the table below.

+    <table>
+      <thead>
+        <tr><th>Field Name</th><th>Value</th><th>Meaning</th></tr>
+      </thead>
+      <tbody>
+        <tr><td>[[StartIndex]]</td><td>An integer ≥ 0.</td><td>The number of code units from the start of a string at which the match begins (inclusive).</td></tr>
+        <tr><td>[[EndIndex]]</td><td>An integer ≥ [[StartIndex]].</td><td>The number of code units from the start of a string at which the match ends (exclusive).</td></tr>
+      </tbody>
+    </table>
    +
    +
    +
    +

    The RegExp Constructor

    The RegExp constructor:

    @@ -32153,9 +32189,7 @@

    Runtime Semantics: RegExpBuiltinExec ( _R_, _S_ )

       1. Assert: _r_ is a State.
       1. Set _matchSucceeded_ to *true*.
   1. Let _e_ be _r_'s _endIndex_ value.
-  1. If _fullUnicode_ is *true*, then
-    1. _e_ is an index into the _Input_ character list, derived from _S_, matched by _matcher_. Let _eUTF_ be the smallest index into _S_ that corresponds to the character at element _e_ of _Input_. If _e_ is greater than or equal to the number of elements in _Input_, then _eUTF_ is the number of code units in _S_.
-    1. Set _e_ to _eUTF_.
+  1. If _fullUnicode_ is *true*, set _e_ to ! GetStringIndex(_S_, _Input_, _e_).
   1. If _global_ is *true* or _sticky_ is *true*, then
     1. Perform ? Set(_R_, `"lastIndex"`, _e_, *true*).
   1. Let _n_ be the number of elements in _r_'s _captures_ List. (This is the same value as the Notation section's _NcapturingParens_.)
@@ -32164,27 +32198,42 @@

    Runtime Semantics: RegExpBuiltinExec ( _R_, _S_ )

   1. Assert: The value of _A_'s `"length"` property is _n_ + 1.
   1. Perform ! CreateDataProperty(_A_, `"index"`, _lastIndex_).
   1. Perform ! CreateDataProperty(_A_, `"input"`, _S_).
-  1. Let _matchedSubstr_ be the matched substring (i.e. the portion of _S_ between offset _lastIndex_ inclusive and offset _e_ exclusive).
-  1. Perform ! CreateDataProperty(_A_, `"0"`, _matchedSubstr_).
-  1. If _R_ contains any |GroupName|, then
+  1. Let _indices_ be a new empty List.
+  1. Let _match_ be the Match { [[StartIndex]]: _lastIndex_, [[EndIndex]]: _e_ }.
+  1. Add _match_ as the last element of _indices_.
+  1. Let _matchedValue_ be ! GetMatchString(_S_, _match_).
+  1. Perform ! CreateDataProperty(_A_, `"0"`, _matchedValue_).
+  1. If _R_ contains any |GroupName|, then
+    1. Let _groupNames_ be a new empty List.
     1. Let _groups_ be ObjectCreate(*null*).
   1. Else,
     1. Let _groups_ be *undefined*.
+    1. Let _groupNames_ be *undefined*.
   1. Perform ! CreateDataProperty(_A_, `"groups"`, _groups_).
   1. For each integer _i_ such that _i_ > 0 and _i_ ≤ _n_, do
     1. Let _captureI_ be _i_th element of _r_'s _captures_ List.
-    1. If _captureI_ is *undefined*, let _capturedValue_ be *undefined*.
-    1. Else if _fullUnicode_ is *true*, then
-      1. Assert: _captureI_ is a List of code points.
-      1. Let _capturedValue_ be the String value whose code units are the UTF16Encoding of the code points of _captureI_.
+    1. If _captureI_ is *undefined*, then
+      1. Let _capturedValue_ be *undefined*.
+      1. Add *undefined* as the last element of _indices_.
     1. Else,
-      1. Assert: _fullUnicode_ is *false*.
-      1. Assert: _captureI_ is a List of code units.
-      1. Let _capturedValue_ be the String value consisting of the code units of _captureI_.
+      1. Let _captureStart_ be _captureI_'s _startIndex_.
+      1. Let _captureEnd_ be _captureI_'s _endIndex_.
+      1. If _fullUnicode_ is *true*, then
+        1. Set _captureStart_ to ! GetStringIndex(_S_, _Input_, _captureStart_).
+        1. Set _captureEnd_ to ! GetStringIndex(_S_, _Input_, _captureEnd_).
+      1. Let _capture_ be the Match { [[StartIndex]]: _captureStart_, [[EndIndex]]: _captureEnd_ }.
+      1. Append _capture_ to _indices_.
+      1. Let _capturedValue_ be ! GetMatchString(_S_, _capture_).
     1. Perform ! CreateDataProperty(_A_, ! ToString(_i_), _capturedValue_).
     1. If the _i_th capture of _R_ was defined with a |GroupName|, then
       1. Let _s_ be the StringValue of the corresponding |RegExpIdentifierName|.
       1. Perform ! CreateDataProperty(_groups_, _s_, _capturedValue_).
+      1. Assert: _groupNames_ is a List.
+      1. Append _s_ to _groupNames_.
+    1. Else,
+      1. If _groupNames_ is a List, append *undefined* to _groupNames_.
+  1. Let _indicesArray_ be MakeIndicesArray(_S_, _indices_, _groupNames_).
+  1. Perform ! CreateDataProperty(_A_, `"indices"`, _indicesArray_).
   1. Return _A_.
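The capture loop in this hunk can be sketched in TypeScript as follows. This is an illustration only, not spec text: `CaptureRange`, `MatchRecord`, `getMatchString`, and `collectCaptures` are hypothetical names, and the sketch assumes a non-Unicode pattern so that Range indices are already code unit offsets into the string.

```typescript
// Illustrative stand-ins for the spec's Range and Match Record (assumed names).
interface CaptureRange { startIndex: number; endIndex: number }
interface MatchRecord { startIndex: number; endIndex: number }

// Counterpart of GetMatchString: the portion of s between the two offsets.
function getMatchString(s: string, match: MatchRecord): string {
  return s.substring(match.startIndex, match.endIndex);
}

// Turns the matcher's capture Ranges into two parallel lists: the captured
// strings and the Match Records later used to build the "indices" array.
// A group that did not participate stays undefined in both lists.
function collectCaptures(
  s: string,
  captures: Array<CaptureRange | undefined>
): { values: Array<string | undefined>; indices: Array<MatchRecord | undefined> } {
  const values: Array<string | undefined> = [];
  const indices: Array<MatchRecord | undefined> = [];
  for (let i = 0; i < captures.length; i++) {
    const range = captures[i];
    if (range === undefined) {
      values.push(undefined);
      indices.push(undefined);
    } else {
      const match: MatchRecord = {
        startIndex: range.startIndex,
        endIndex: range.endIndex,
      };
      indices.push(match);
      values.push(getMatchString(s, match));
    }
  }
  return { values, indices };
}
```

Because the matcher now stores only index pairs, the captured string is derived from the input on demand rather than copied into the State during matching.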
    @@ -32203,6 +32252,71 @@

    AdvanceStringIndex ( _S_, _index_, _unicode_ )

    1. Return _index_ + _cp_.[[CodeUnitCount]]. + + +

    GetStringIndex ( _S_, _Input_, _e_ )

    +

    The abstract operation GetStringIndex with arguments _S_, _Input_, and _e_ performs the following steps:

+
+
+  1. Assert: Type(_S_) is String.
+  1. Assert: _Input_ is a List of the code points of _S_ interpreted as a UTF-16 encoded string.
+  1. Assert: _e_ is an integer value ≥ 0 and ≤ the number of elements in _Input_.
+  1. Let _eUTF_ be the smallest index into _S_ that corresponds to the character at element _e_ of _Input_. If _e_ is greater than or equal to the number of elements in _Input_, then _eUTF_ is the number of code units in _S_.
+  1. Return _eUTF_.
+
+
    + + +
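As a rough illustration of GetStringIndex, the TypeScript sketch below walks the string one code point at a time and returns the code unit offset of the _e_th code point. The function name and signature are assumptions made for this example; in particular it derives the code point sequence from the string itself instead of taking a separate _Input_ list, and it returns the string length when `e` reaches or exceeds the number of code points.

```typescript
// Hypothetical helper mirroring GetStringIndex: maps an index into the list
// of code points of s to the matching code unit index in s.
function getStringIndex(s: string, e: number): number {
  let codeUnitIndex = 0;
  let codePointIndex = 0;
  while (codePointIndex < e && codeUnitIndex < s.length) {
    const first = s.charCodeAt(codeUnitIndex);
    const isHighSurrogate = first >= 0xd800 && first <= 0xdbff;
    let width = 1;
    if (isHighSurrogate && codeUnitIndex + 1 < s.length) {
      const second = s.charCodeAt(codeUnitIndex + 1);
      // A high surrogate followed by a low surrogate is one code point
      // that occupies two code units.
      if (second >= 0xdc00 && second <= 0xdfff) {
        width = 2;
      }
    }
    codeUnitIndex += width;
    codePointIndex++;
  }
  return codeUnitIndex;
}
```

For a string made of `a`, one supplementary-plane character, and `b`, code point index 2 (the `b`) maps to code unit index 3, since the middle character takes two code units.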

    GetMatchString ( _S_, _match_ )

    +

    The abstract operation GetMatchString with arguments _S_ and _match_ performs the following steps:

+
+
+  1. Assert: Type(_S_) is String.
+  1. Assert: _match_ is a Match Record.
+  1. Assert: _match_.[[StartIndex]] is an integer value ≥ 0 and ≤ the length of _S_.
+  1. Assert: _match_.[[EndIndex]] is an integer value ≥ _match_.[[StartIndex]] and ≤ the length of _S_.
+  1. Return the portion of _S_ between offset _match_.[[StartIndex]] inclusive and offset _match_.[[EndIndex]] exclusive.
+
+
    + + +

    GetMatchIndicesArray ( _S_, _match_ )

    +

    The abstract operation GetMatchIndicesArray with arguments _S_ and _match_ performs the following steps:

+
+
+  1. Assert: Type(_S_) is String.
+  1. Assert: _match_ is a Match Record.
+  1. Assert: _match_.[[StartIndex]] is an integer value ≥ 0 and ≤ the length of _S_.
+  1. Assert: _match_.[[EndIndex]] is an integer value ≥ _match_.[[StartIndex]] and ≤ the length of _S_.
+  1. Return CreateArrayFromList(« _match_.[[StartIndex]], _match_.[[EndIndex]] »).
+
+
    + + +

    MakeIndicesArray ( _S_, _indices_, _groupNames_ )

    +

    The abstract operation MakeIndicesArray with arguments _S_, _indices_, and _groupNames_ performs the following steps:

+
+
+  1. Assert: Type(_S_) is String.
+  1. Assert: _indices_ is a List.
+  1. Assert: _groupNames_ is a List or is *undefined*.
+  1. Let _n_ be the number of elements in _indices_.
+  1. Assert: _n_ < 2<sup>32</sup>-1.
+  1. Let _A_ be ! ArrayCreate(_n_).
+  1. Assert: The value of _A_'s `"length"` property is _n_.
+  1. If _groupNames_ is not *undefined*, then
+    1. Let _groups_ be ! ObjectCreate(*null*).
+  1. Else,
+    1. Let _groups_ be *undefined*.
+  1. Perform ! CreateDataProperty(_A_, `"groups"`, _groups_).
+  1. For each integer _i_ such that _i_ ≥ 0 and _i_ < _n_, do
+    1. Let _matchIndices_ be _indices_[_i_].
+    1. If _matchIndices_ is not *undefined*, then
+      1. Let _matchIndicesArray_ be ! GetMatchIndicesArray(_S_, _matchIndices_).
+    1. Else,
+      1. Let _matchIndicesArray_ be *undefined*.
+    1. Perform ! CreateDataProperty(_A_, ! ToString(_i_), _matchIndicesArray_).
+    1. If _i_ > 0 and _groupNames_ is not *undefined* and _groupNames_[_i_ - 1] is not *undefined*, then
+      1. Perform ! CreateDataProperty(_groups_, _groupNames_[_i_ - 1], _matchIndicesArray_).
+  1. Return _A_.
+
+
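A minimal TypeScript sketch of MakeIndicesArray and GetMatchIndicesArray follows, to show the shape of the resulting `indices` value. All type and function names here are assumptions for illustration. The sketch assumes `indices[0]` is the overall match and that `groupNames` holds one entry per capture group, so the name for `indices[i]` sits at `groupNames[i - 1]`.

```typescript
// Illustrative types (assumed names, not spec text).
interface MatchRecord { startIndex: number; endIndex: number }
type Pair = [number, number];
interface IndicesArray extends Array<Pair | undefined> {
  groups: Record<string, Pair | undefined> | undefined;
}

// Counterpart of GetMatchIndicesArray: a two-element [start, end] array.
function getMatchIndicesArray(match: MatchRecord): Pair {
  return [match.startIndex, match.endIndex];
}

// Counterpart of MakeIndicesArray under the alignment assumption above.
function makeIndicesArray(
  indices: Array<MatchRecord | undefined>,
  groupNames: Array<string | undefined> | undefined
): IndicesArray {
  // A null-prototype object, as produced by ObjectCreate(null) in the spec.
  const groups: Record<string, Pair | undefined> | undefined =
    groupNames === undefined ? undefined : Object.create(null);
  const a = [] as unknown as IndicesArray;
  a.groups = groups;
  for (let i = 0; i < indices.length; i++) {
    const matchIndices = indices[i];
    const pair: Pair | undefined =
      matchIndices === undefined ? undefined : getMatchIndicesArray(matchIndices);
    a.push(pair);
    // Entry 0 is the whole match and never has a group name.
    const groupName =
      i > 0 && groupNames !== undefined ? groupNames[i - 1] : undefined;
    if (groups !== undefined && groupName !== undefined) {
      groups[groupName] = pair;
    }
  }
  return a;
}
```

Each named entry in `groups` refers to the same pair object stored at the corresponding numeric index, which matches how the algorithm passes a single _matchIndicesArray_ value to both CreateDataProperty calls.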