Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ripgrep is slower than grep when searching for whitespace between words guarded by unicode word boundaries #1760

Closed
awalgarg opened this issue Dec 10, 2020 · 2 comments
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.

Comments

@awalgarg
Copy link

awalgarg commented Dec 10, 2020

What version of ripgrep are you using?

$ rg --version
ripgrep 12.1.1
-SIMD -AVX (compiled)
+SIMD +AVX (runtime)

How did you install ripgrep?

# pacman -S ripgrep

What operating system are you using ripgrep on?

$ uname -mr
5.9.9-arch1-1 x86_64

Describe your bug.

When searching a file for a pattern of the following form, ripgrep is significantly slower compared to grep, unless --no-unicode is used.

\bfoo\b \bbar\b

Removing any of the \bs above or removing the space in the middle removes this problem. The issue can also be reproduced after replacing the space with other whitespace patterns such as \t and \r.

What are the steps to reproduce the behavior?

$ git clone [email protected]:git/git.git
[...]
$ cd git
$ git ls-files | xargs cat > foo
[...]
$ ls -alh foo
-rw-r--r-- 1 user user 36M Dec 11 03:06 foo
$ file foo
foo: UTF-8 Unicode text
$ time grep -ic '\bfoo\b \bbar\b' foo
114

________________________________________________________
Executed in   79.27 millis    fish           external
   usr time   66.16 millis  1167.00 micros   65.00 millis
   sys time   13.09 millis  106.00 micros   12.98 millis

$ time rg -ic '\bfoo\b \bbar\b' foo
114

________________________________________________________
Executed in    1.60 secs   fish           external
   usr time  1589.21 millis  1117.00 micros  1588.09 millis
   sys time    6.66 millis    0.00 micros    6.66 millis

With --no-unicode, ripgrep is as fast as grep or faster.

@BurntSushi BurntSushi added the bug A bug. label Dec 11, 2020
@BurntSushi
Copy link
Owner

Thanks for the detailed and easy to reproduce report! The cause of this is two-fold (repeating what I said in our email exchange):

  1. ripgrep enables Unicode by default and a Unicode-aware \b (word boundary) has a slower implementation path.
  2. ripgrep's --debug output shows that no literals are extracted from this regex, even though this looks like a pretty simple case with obvious inner literals. It looks like this occurs because the inner literal detector gives up completely when it sees a word boundary.

(1) is hard to fix, although, if you're fine with an ASCII word boundary, then using --no-unicode (or (?-u)\bfoo\b \bbar\b as the regex) will work around the performance problem since ASCII word boundaries are implemented more efficiently.

(2) should be simpler to fix, although the inner literal detector is pretty hairy at the moment. Perhaps supporting simple cases like this would be easy.

In general though, sadly, regexes with Unicode-aware \b will be slower to execute. The literal optimization is just something helps in a lot of cases.

@BurntSushi
Copy link
Owner

BurntSushi commented Oct 9, 2023

This will be fixed in the ripgrep 14 release:

$ hyperfine "rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt" "rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt"
Benchmark 1: rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt
  Time (mean ± σ):      1.034 s ±  0.011 s    [User: 1.030 s, System: 0.004 s]
  Range (min … max):    1.021 s …  1.053 s    10 runs

Benchmark 2: rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt
  Time (mean ± σ):       6.3 ms ±   0.3 ms    [User: 4.6 ms, System: 1.6 ms]
  Range (min … max):     5.6 ms …   7.3 ms    343 runs

Summary
  'rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt' ran
  164.95 ± 7.70 times faster than 'rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt'

This was not fixed by making \b itself faster, but rather, by improving
inner literal extraction. In particular, if the regex doesn't have any
literals extracted, then search time can still be quite slow:

$ time rg-13.0.0 -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
57538

real    0.427
user    0.423
sys     0.003
maxmem  46 MB
faults  0
$ time rg -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
57538

real    0.337
user    0.333
sys     0.003
maxmem  46 MB
faults  0

But then again, so is grep, because grep doesn't benefit from any
literal optimizations either:

$ time grep -E -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
62396

real    1.316
user    1.292
sys     0.007
maxmem  13 MB
faults  7

(The count mismatch is quite interesting!)

And if you disable Unicode, then everyone gets faster:

$ time LC_ALL=C grep -E -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
58798

real    0.093
user    0.090
sys     0.003
maxmem  13 MB
faults  0

$ time rg-13.0.0 --no-unicode -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
58798

real    0.054
user    0.054
sys     0.000
maxmem  46 MB
faults  0

$ time rg --no-unicode -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
58798

real    0.055
user    0.055
sys     0.000
maxmem  46 MB
faults  0

BurntSushi added a commit that referenced this issue Oct 10, 2023
Like the previous CHANGELOG entry, this marks a bug that was fixed
likely with the introduction of regex 1.9:

    $ hyperfine "rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt" "rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt"
    Benchmark 1: rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt
      Time (mean ± σ):      1.034 s ±  0.011 s    [User: 1.030 s, System: 0.004 s]
      Range (min … max):    1.021 s …  1.053 s    10 runs

    Benchmark 2: rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt
      Time (mean ± σ):       6.3 ms ±   0.3 ms    [User: 4.6 ms, System: 1.6 ms]
      Range (min … max):     5.6 ms …   7.3 ms    343 runs

    Summary
      'rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt' ran
      164.95 ± 7.70 times faster than 'rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt'

This was not fixed by making \b itself faster, but rather, by improving
inner literal extraction. In particular, if the regex doesn't have any
literals extracted, then search time can still be quite slow:

    $ time rg-13.0.0 -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    57538

    real    0.427
    user    0.423
    sys     0.003
    maxmem  46 MB
    faults  0
    $ time rg -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    57538

    real    0.337
    user    0.333
    sys     0.003
    maxmem  46 MB
    faults  0

But then again, so is grep, because grep doesn't benefit from any
literal optimizations either:

    $ time grep -E -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt
    62396

    real    1.316
    user    1.292
    sys     0.007
    maxmem  13 MB
    faults  7

The count mismatch should probably be investigated.

Fixes #1760
@BurntSushi BurntSushi added the rollup A PR that has been merged with many others in a rollup. label Oct 10, 2023
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Nov 28, 2023
14.0.2 (2023-11-27)
===================
This is a patch release with a few small bug fixes.

Bug fixes:

* [BUG #2654](BurntSushi/ripgrep#2654):
  Fix `deb` release sha256 sum file.
* [BUG #2658](BurntSushi/ripgrep#2658):
  Fix partial regression in the behavior of `--null-data --line-regexp`.
* [BUG #2659](BurntSushi/ripgrep#2659):
  Fix Fish shell completions.
* [BUG #2662](BurntSushi/ripgrep#2662):
  Fix typo in documentation for `-i/--ignore-case`.


14.0.1 (2023-11-26)
===================
This a patch release meant to fix `cargo install ripgrep` on Windows.

Bug fixes:

* [BUG #2653](BurntSushi/ripgrep#2653):
  Include `pkg/windows/Manifest.xml` in crate package.


14.0.0 (2023-11-26)
===================
ripgrep 14 is a new major version release of ripgrep that has some new
features, performance improvements and a lot of bug fixes.

The headlining feature in this release is hyperlink support. In this release,
they are an opt-in feature but may change to an opt-out feature in the future.
To enable them, try passing `--hyperlink-format default`. If you use [VS Code],
then try passing `--hyperlink-format vscode`. Please [report your experience
with hyperlinks][report-hyperlinks], positive or negative.

[VS Code]: https://code.visualstudio.com/
[report-hyperlinks]: BurntSushi/ripgrep#2611

Another headlining development in this release is that it contains a rewrite
of its regex engine. You generally shouldn't notice any changes, except for
some searches may get faster. You can read more about the [regex engine rewrite
on my blog][regex-internals]. Please [report your performance improvements or
regressions that you notice][report-perf].

[report-perf]: BurntSushi/ripgrep#2652

Finally, ripgrep switched the library it uses for argument parsing. Users
should not notice a difference in most cases (error messages have changed
somewhat), but flag overrides should generally be more consistent. For example,
things like `--no-ignore --ignore-vcs` work as one would expect (disables all
filtering related to ignore rules except for rules found in version control
systems such as `git`).

[regex-internals]: https://blog.burntsushi.net/regex-internals/

**BREAKING CHANGES**:

* `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to
  `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`.
  Instead, they only partially override `-C`.

Build process changes:

* ripgrep's shell completions and man page are now created by running ripgrep
with a new `--generate` flag. For example, `rg --generate man` will write a
man page in `roff` format on stdout. The release archives have not changed.
* The optional build dependency on `asciidoc` or `asciidoctor` has been
dropped. Previously, it was used to produce ripgrep's man page. ripgrep now
owns this process itself by writing `roff` directly.

Performance improvements:

* [PERF #1746](BurntSushi/ripgrep#1746):
  Make some cases with inner literals faster.
* [PERF #1760](BurntSushi/ripgrep#1760):
  Make most searches with `\b` look-arounds (among others) much faster.
* [PERF #2591](BurntSushi/ripgrep#2591):
  Parallel directory traversal now uses work stealing for faster searches.
* [PERF #2642](BurntSushi/ripgrep#2642):
  Parallel directory traversal has some contention reduced.

Feature enhancements:

* Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo,
  Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V
* [FEATURE #665](BurntSushi/ripgrep#665):
  Add a new `--hyperlink-format` flag that turns file paths into hyperlinks.
* [FEATURE #1709](BurntSushi/ripgrep#1709):
  Improve documentation of ripgrep's behavior when stdout is a tty.
* [FEATURE #1737](BurntSushi/ripgrep#1737):
  Provide binaries for Apple silicon.
* [FEATURE #1790](BurntSushi/ripgrep#1790):
  Add new `--stop-on-nonmatch` flag.
* [FEATURE #1814](BurntSushi/ripgrep#1814):
  Flags are now categorized in `-h/--help` output and ripgrep's man page.
* [FEATURE #1838](BurntSushi/ripgrep#1838):
  An error is shown when searching for NUL bytes with binary detection enabled.
* [FEATURE #2195](BurntSushi/ripgrep#2195):
  When `extra-verbose` mode is enabled in zsh, show extra file type info.
* [FEATURE #2298](BurntSushi/ripgrep#2298):
  Add instructions for installing ripgrep using `cargo binstall`.
* [FEATURE #2409](BurntSushi/ripgrep#2409):
  Added installation instructions for `winget`.
* [FEATURE #2425](BurntSushi/ripgrep#2425):
  Shell completions (and man page) can be created via `rg --generate`.
* [FEATURE #2524](BurntSushi/ripgrep#2524):
  The `--debug` flag now indicates whether stdin or `./` is being searched.
* [FEATURE #2643](BurntSushi/ripgrep#2643):
  Make `-d` a short flag for `--max-depth`.
* [FEATURE #2645](BurntSushi/ripgrep#2645):
  The `--version` output will now also contain PCRE2 availability information.

Bug fixes:

* [BUG #884](BurntSushi/ripgrep#884):
  Don't error when `-v/--invert-match` is used multiple times.
* [BUG #1275](BurntSushi/ripgrep#1275):
  Fix bug with `\b` assertion in the regex engine.
* [BUG #1376](BurntSushi/ripgrep#1376):
  Using `--no-ignore --ignore-vcs` now works as one would expect.
* [BUG #1622](BurntSushi/ripgrep#1622):
  Add note about error messages to `-z/--search-zip` documentation.
* [BUG #1648](BurntSushi/ripgrep#1648):
  Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail.
* [BUG #1701](BurntSushi/ripgrep#1701):
  Fix bug where some flags could not be repeated.
* [BUG #1757](BurntSushi/ripgrep#1757):
  Fix bug when searching a sub-directory didn't have ignores applied correctly.
* [BUG #1891](BurntSushi/ripgrep#1891):
  Fix bug when using `-w` with a regex that can match the empty string.
* [BUG #1911](BurntSushi/ripgrep#1911):
  Disable mmap searching in all non-64-bit environments.
* [BUG #1966](BurntSushi/ripgrep#1966):
  Fix bug where ripgrep can panic when printing to stderr.
* [BUG #2046](BurntSushi/ripgrep#2046):
  Clarify that `--pre` can accept any kind of path in the documentation.
* [BUG #2108](BurntSushi/ripgrep#2108):
  Improve docs for `-r/--replace` syntax.
* [BUG #2198](BurntSushi/ripgrep#2198):
  Fix bug where `--no-ignore-dot` would not ignore `.rgignore`.
* [BUG #2201](BurntSushi/ripgrep#2201):
  Improve docs for `-r/--replace` flag.
* [BUG #2288](BurntSushi/ripgrep#2288):
  `-A` and `-B` now only each partially override `-C`.
* [BUG #2236](BurntSushi/ripgrep#2236):
  Fix gitignore parsing bug where a trailing `\/` resulted in an error.
* [BUG #2243](BurntSushi/ripgrep#2243):
  Fix `--sort` flag for values other than `path`.
* [BUG #2246](BurntSushi/ripgrep#2246):
  Add note in `--debug` logs when binary files are ignored.
* [BUG #2337](BurntSushi/ripgrep#2337):
  Improve docs to mention that `--stats` is always implied by `--json`.
* [BUG #2381](BurntSushi/ripgrep#2381):
  Make `-p/--pretty` override flags like `--no-line-number`.
* [BUG #2392](BurntSushi/ripgrep#2392):
  Improve global git config parsing of the `excludesFile` field.
* [BUG #2418](BurntSushi/ripgrep#2418):
  Clarify sorting semantics of `--sort=path`.
* [BUG #2458](BurntSushi/ripgrep#2458):
  Make `--trim` run before `-M/--max-columns` takes effect.
* [BUG #2479](BurntSushi/ripgrep#2479):
  Add documentation about `.ignore`/`.rgignore` files in parent directories.
* [BUG #2480](BurntSushi/ripgrep#2480):
  Fix bug when using inline regex flags with `-e/--regexp`.
* [BUG #2505](BurntSushi/ripgrep#2505):
  Improve docs for `--vimgrep` by mentioning footguns and some work-arounds.
* [BUG #2519](BurntSushi/ripgrep#2519):
  Fix incorrect default value in documentation for `--field-match-separator`.
* [BUG #2523](BurntSushi/ripgrep#2523):
  Make executable searching take `.com` into account on Windows.
* [BUG #2574](BurntSushi/ripgrep#2574):
  Fix bug in `-w/--word-regexp` that would result in incorrect match offsets.
* [BUG #2623](BurntSushi/ripgrep#2623):
  Fix a number of bugs with the `-w/--word-regexp` flag.
* [BUG #2636](BurntSushi/ripgrep#2636):
  Strip release binaries for macOS.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug A bug. rollup A PR that has been merged with many others in a rollup.
Projects
None yet
Development

No branches or pull requests

2 participants