Update README.md
[ci skip]
jaynetics committed Jun 18, 2023
1 parent 4e6ef1a commit 9e62735
Showing 1 changed file with 27 additions and 38 deletions.
65 changes: 27 additions & 38 deletions README.md
@@ -14,18 +14,13 @@ A Ruby gem for tokenizing, parsing, and transforming regular expressions.
* Runs on Ruby 2.x, 3.x and JRuby runtimes
* Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)


_For examples of regexp_parser in use, see [Example Projects](#example-projects)._


---
## Requirements

* Ruby >= 2.0
* Ragel >= 6.0, but only if you want to build the gem or work on the scanner.


---
## Install

Install the gem with:
@@ -39,8 +34,6 @@ Or, add it to your project's `Gemfile`:
See the badge at the top of this README or [rubygems](https://rubygems.org/gems/regexp_parser)
for the latest version number.


---
## Usage

The three main modules are **Scanner**, **Lexer**, and **Parser**. Each of them
@@ -85,17 +78,17 @@ Regexp::Parser.parse(
)
```

---
## Components

### Scanner

A Ragel-generated scanner that recognizes the cumulative syntax of all
supported syntax versions. It breaks a given expression's text into the
smallest parts, and identifies their type, token, text, and start/end
offsets within the pattern.
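
To make the shape of the scanner's output concrete, here is a minimal sketch of a scanner for a tiny subset of regexp syntax, written with Ruby's standard `StringScanner`. It is not the gem's Ragel-generated scanner: `TOY_RULES`, `toy_scan`, and the type/token names are hypothetical and chosen only to show the five-part `[type, token, text, start, end]` tuples.

```ruby
require 'strscan'

# Toy rule table: [type, token, pattern]. Hypothetical names, not the gem's.
TOY_RULES = [
  [:group,      :open,        /\(/],
  [:group,      :close,       /\)/],
  [:quantifier, :zero_or_one, /\?/],
  [:literal,    :literal,     /[^()?]+/]
].freeze

# Break the source into the smallest parts this toy understands and record
# type, token, text, and start/end offsets for each part.
def toy_scan(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    ts = scanner.pos
    # The first rule whose pattern matches at the current position wins.
    type, token, = TOY_RULES.find { |_type, _token, pattern| scanner.scan(pattern) }
    raise ArgumentError, "unexpected input at offset #{ts}" unless type
    tokens << [type, token, scanner.matched, ts, scanner.pos]
  end
  tokens
end

toy_scan('(cat?)').each { |tuple| p tuple }
```

Offsets here are byte positions, which coincide with character positions only for ASCII input.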


#### Example

The following scans the given pattern and prints out the type, token, text and
start/end offsets for each token found.

@@ -129,8 +122,8 @@ Regexp::Scanner.scan(/(cat?([bhm]at)){3,5}/).map { |token| token[2] }
# => ["(", "cat", "?", "(", "[", "b", "h", "m", "]", "at", ")", ")", "{3,5}"]
```


#### Notes

* The scanner performs basic syntax error checking, like detecting missing
balancing punctuation and premature end of pattern. Flavor validity checks
are performed in the lexer, which uses a syntax object.
@@ -151,13 +144,14 @@ Regexp::Scanner.scan(/(cat?([bhm]at)){3,5}/).map { |token| token[2] }
the documentation or are undocumented, like `{}` and `]` _(unescaped)_.
The scanner will try to support as many of these cases as possible.

---
### Syntax

Defines the supported tokens for a specific engine implementation (aka a
flavor). Syntax classes act as lookup tables, and are layered to create
flavor variations. Syntax only comes into play in the lexer.
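
The idea of layered lookup tables can be sketched in plain Ruby. The classes below are hypothetical and much simpler than the gem's real Syntax classes: each toy flavor lists only the tokens it adds, and a lookup merges the tables from the oldest layer down. The token names are illustrative; the one factual assumption is that possessive quantifiers and named groups arrived with Ruby 1.9.

```ruby
# Hypothetical, simplified flavor tables; the gem's real Syntax classes differ.
class ToySyntax
  # Tokens introduced by this layer only. Subclasses override this.
  def self.added
    {}
  end

  # Merge feature tables from the root of the hierarchy down to this class.
  def self.features
    inherited = superclass.respond_to?(:features) ? superclass.features : {}
    inherited.merge(added) { |_type, old, new| old | new }
  end

  # Look up whether a token of the given type is part of this flavor.
  def self.implements?(type, token)
    features.fetch(type, []).include?(token)
  end
end

class ToyRuby18 < ToySyntax
  def self.added
    { quantifier: %i[zero_or_one zero_or_more one_or_more] }
  end
end

# Layered on top of 1.8: inherits its tokens and adds new ones.
class ToyRuby19 < ToyRuby18
  def self.added
    { quantifier: %i[zero_or_one_possessive], group: %i[named] }
  end
end

puts ToyRuby18.implements?(:quantifier, :zero_or_one_possessive)
puts ToyRuby19.implements?(:quantifier, :zero_or_one_possessive)
puts ToyRuby19.implements?(:quantifier, :zero_or_one)
```

Keeping each layer to its own additions means a new flavor only has to describe what changed relative to its parent.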

#### Example

The following fetches syntax objects for Ruby 2.0, 1.9, 1.8, and
checks a few of their implementation features.

@@ -195,15 +189,15 @@ ruby_20.features # => { anchor: [...], ... }
```

#### Notes

* Variations on a token, for example a named group with angle brackets (< and >)
vs one with a pair of single quotes, are specified with an underscore followed
by two characters appended to the base token. In the previous named group example,
the tokens would be :named_ab (angle brackets) and :named_sq (single quotes).
These variations are normalized by the syntax to :named.
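
As a quick check of why this normalization is reasonable, plain Ruby (no regexp_parser required) treats both spellings as the same kind of named group. The pattern, group name, and input string below are made up for illustration.

```ruby
angle  = /(?<year>\d{4})/   # named group written with angle brackets
single = /(?'year'\d{4})/   # the same group written with single quotes

angle_match  = angle.match('released in 2023')
single_match = single.match('released in 2023')

# Both spellings define a group called "year" and capture the same text.
puts angle.names.inspect
puts single.names.inspect
puts angle_match[:year] == single_match[:year]
```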


---
### Lexer

Sits on top of the scanner and performs lexical analysis on the tokens that
it emits. Among its tasks are: breaking quantified literal runs, collecting the
emitted token attributes into Token objects, calculating their nesting depth,
@@ -213,8 +207,8 @@ the given syntax version.
See the [Token Objects](https://github.com/ammar/regexp_parser/wiki/Token-Objects)
wiki page for more information on Token objects.
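
Two of the tasks named above, breaking quantified literal runs and calculating nesting depth, can be sketched on plain arrays of token text. This is a simplified illustration, not the gem's lexer: the constants and the `break_quantified_runs` and `with_levels` helpers are hypothetical, and the quantifier pattern is deliberately loose. The input is the scanner output for `/(cat?([bhm]at)){3,5}/` shown earlier in this README.

```ruby
# Toy post-processing of scanner-style token texts; not the gem's lexer.
OPENERS     = ['(', '['].freeze
CLOSERS     = [')', ']'].freeze
QUANTIFIERS = /\A(?:[?*+]|\{\d*,?\d*\})\z/.freeze # loose match, for the sketch

# Split a multi-character literal when a quantifier follows it, because the
# quantifier applies only to the last character ("cat?" is "ca" then "t?").
def break_quantified_runs(texts)
  texts.each_with_index.flat_map do |text, i|
    literal = !OPENERS.include?(text) && !CLOSERS.include?(text) &&
              !QUANTIFIERS.match?(text)
    quantified = QUANTIFIERS.match?(texts[i + 1].to_s)
    literal && quantified && text.length > 1 ? [text[0..-2], text[-1]] : [text]
  end
end

# Pair each text with its nesting depth. Openers and closers sit at the
# depth of the construct they delimit; their contents sit one level deeper.
def with_levels(texts)
  depth = 0
  texts.map do |text|
    depth -= 1 if CLOSERS.include?(text)
    pair = [text, depth]
    depth += 1 if OPENERS.include?(text)
    pair
  end
end

texts = ['(', 'cat', '?', '(', '[', 'b', 'h', 'm', ']', 'at', ')', ')', '{3,5}']
leveled = with_levels(break_quantified_runs(texts))
leveled.each { |text, level| puts "#{'  ' * level}#{text}" }
```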


#### Example

The following example lexes the given pattern, checks it against the Ruby 1.9
syntax, and prints the token objects' text indented to their level.

@@ -252,22 +246,21 @@ Regexp::Lexer.scan(/(cat?([b]at)){3,5}/).map { |token| token.text }
```

#### Notes

* The syntax argument is optional. It defaults to the version of the Ruby
interpreter in use, as returned by RUBY_VERSION.

* The lexer normalizes some tokens, as noted in the Syntax section above.


---
### Parser

Sits on top of the lexer and transforms the "stream" of Token objects emitted
by it into a tree of Expression objects represented by an instance of the
`Expression::Root` class.

See the [Expression Objects](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
wiki page for attributes and methods.


#### Example

This example uses the tree traversal method `#each_expression`
@@ -326,16 +319,13 @@ end
_See the traverse.rb and strfregexp.rb files under `lib/regexp_parser/expression/methods`
for more information on these methods._

---


## Supported Syntax

The three modules support all the regular expression syntax features of Ruby 1.8,
1.9, 2.x and 3.x:

_Note that not all of these are available in all versions of Ruby_


| Syntax Feature | Examples | &#x22ef; |
| ------------------------------------- | ------------------------------------------------------- |:--------:|
| **Alternation** | `a\|b\|c` | &#x2713; |
@@ -379,7 +369,7 @@ _Note that not all of these are available in all versions of Ruby_
| &emsp;&nbsp;_**Meta** \[2\]_ | `\M-c`, `\M-\C-C`, `\M-\cC`, `\C-\M-C`, `\c\M-C` | &#x2713; |
| &emsp;&nbsp;_**Octal**_ | `\0`, `\01`, `\012` | &#x2713; |
| &emsp;&nbsp;_**Unicode**_ | `\uHHHH`, `\u{H+ H+}` | &#x2713; |
| **Unicode Properties** | _<sub>([Unicode 13.0.0])</sub>_ | &#x22f1; |
| **Unicode Properties** | _<sub>([Unicode 15.0.0])</sub>_ | &#x22f1; |
| &emsp;&nbsp;_**Age**_ | `\p{Age=5.2}`, `\P{age=7.0}`, `\p{^age=8.0}` | &#x2713; |
| &emsp;&nbsp;_**Blocks**_ | `\p{InArmenian}`, `\P{InKhmer}`, `\p{^InThai}` | &#x2713; |
| &emsp;&nbsp;_**Classes**_ | `\p{Alpha}`, `\P{Space}`, `\p{^Alnum}` | &#x2713; |
@@ -388,7 +378,7 @@ _Note that not all of these are available in all versions of Ruby_
| &emsp;&nbsp;_**Scripts**_ | `\p{Arabic}`, `\P{Hiragana}`, `\p{^Greek}` | &#x2713; |
| &emsp;&nbsp;_**Simple**_ | `\p{Dash}`, `\p{Extender}`, `\p{^Hyphen}` | &#x2713; |

[Unicode 13.0.0]: https://www.unicode.org/versions/Unicode13.0.0/
[Unicode 15.0.0]: https://www.unicode.org/versions/Unicode15.0.0/

**\[1\]**: Ruby does not support lazy or possessive interval quantifiers.
Any `+` or `?` that follows an interval quantifier will be treated as another,
@@ -400,11 +390,12 @@ escapes when used in Regexp literals](https://github.com/ruby/ruby/commit/11ae58
so they will only reach the scanner and will only be emitted if a String or a Regexp
that has been built with the `::new` constructor is scanned.

##### Inapplicable Features
### Inapplicable Features

Some modifiers, like `o` and `s`, apply to the **Regexp** object itself and do not
appear in its source. Other such modifiers include the encoding modifiers `e` and `n`
[See](http://www.ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Encoding).
Some Regexp options are not relevant to parsing. The option `o` modifies how Ruby
deduplicates the **Regexp** object and does not appear in its source or options.
Other such modifiers include the encoding modifiers `e`, `n`, `s` and `u`
[See](https://ruby-doc.org/3.2.2/Regexp.html#class-Regexp-label-Encoding).
These are not seen by the scanner.
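
The claim about `o` can be checked in plain Ruby without regexp_parser. The pattern below is arbitrary; the point is only that a literal with `o` exposes the same source and options as one without it, so nothing is left for a scanner to see.

```ruby
plain = /ab+c/
once  = /ab+c/o

# The `o` flag leaves no trace in the pattern text or the options bitmask.
puts once.source
puts once.source == plain.source
puts once.options == plain.options
```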

The following features are not currently enabled for Ruby by its regular
@@ -418,8 +409,8 @@ See something missing? Please submit an [issue](https://github.com/ammar/regexp_
_**Note**: Attempting to process expressions with unsupported syntax features can raise
an error, or incorrectly return tokens/objects as literals._


## Testing

To run the tests, simply run `rake` from the root directory.

The default task generates the scanner's code from the Ragel source files and runs
@@ -433,14 +424,13 @@ rake ragel:rb && rspec spec/scanner/properties_spec.rb
```

## Building

Building the scanner and the gem requires [Ragel](http://www.colm.net/open-source/ragel/)
to be installed. The build tasks will automatically invoke the 'ragel:rb' task to generate
the Ruby scanner code.


The project uses the standard rubygems package tasks, so:


To build the gem, run:
```
rake build
@@ -451,8 +441,10 @@ To install the gem from the cloned project, run:
rake install
```

## References

### Example Projects

## Example Projects
Projects using regexp_parser.

- [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool
@@ -476,31 +468,28 @@ uses regexp_parser to lint Regexps.
- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper
that uses regexp_parser to generate examples of postal codes.


## References
Documentation and books used while working on this project.

### Ruby Flavors

#### Ruby Flavors
* Oniguruma Regular Expressions (Ruby 1.9.x) [link](https://github.com/kkos/oniguruma/blob/master/doc/RE)
* Onigmo Regular Expressions (Ruby >= 2.0) [link](https://github.com/k-takata/Onigmo/blob/master/doc/RE)

### Regular Expressions

#### Regular Expressions
* Mastering Regular Expressions, By Jeffrey E.F. Friedl (2nd Edition) [book](http://oreilly.com/catalog/9781565922570/)
* Regular Expression Flavor Comparison [link](http://www.regular-expressions.info/refflavors.html)
* Enumerating the strings of regular languages [link](http://www.cs.dartmouth.edu/~doug/nfa.ps.gz)
* Stack Overflow Regular Expressions FAQ [link](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075)

### Unicode

#### Unicode
* Unicode Explained, By Jukka K. Korpela. [book](http://oreilly.com/catalog/9780596101213)
* Unicode Derived Properties [link](http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt)
* Unicode Property Aliases [link](http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt)
* Unicode Regular Expressions [link](http://www.unicode.org/reports/tr18/)
* Unicode Standard Annex #44 [link](http://www.unicode.org/reports/tr44/)

## Copyright

---
##### Copyright
_Copyright (c) 2010-2023 Ammar Ali. See LICENSE file for details._
