-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ripgrep is slower than grep when searching for whitespace between words guarded by unicode word boundaries #1760
Comments
Thanks for the detailed and easy to reproduce report! The cause of this is two-fold (repeating what I said in our email exchange):
(1) is hard to fix, although, if you're fine with an ASCII word boundary, then using (2) should be simpler to fix, although the inner literal detector is pretty hairy at the moment. Perhaps supporting simple cases like this would be easy. In general though, sadly, regexes with Unicode-aware |
This will be fixed in the ripgrep 14 release:
This was not fixed by making \b itself faster, but rather, by improving
But then again, so is grep, because grep doesn't benefit from any
(The count mismatch is quite interesting!) And if you disable Unicode, then everyone gets faster:
|
Like the previous CHANGELOG entry, this marks a bug that was fixed likely with the introduction of regex 1.9: $ hyperfine "rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt" "rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt" Benchmark 1: rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt Time (mean ± σ): 1.034 s ± 0.011 s [User: 1.030 s, System: 0.004 s] Range (min … max): 1.021 s … 1.053 s 10 runs Benchmark 2: rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt Time (mean ± σ): 6.3 ms ± 0.3 ms [User: 4.6 ms, System: 1.6 ms] Range (min … max): 5.6 ms … 7.3 ms 343 runs Summary 'rg -ic '\bfoo\b \bbar\b' git-3a06386e.txt' ran 164.95 ± 7.70 times faster than 'rg-13.0.0 -ic '\bfoo\b \bbar\b' git-3a06386e.txt' This was not fixed by making \b itself faster, but rather, by improving inner literal extraction. In particular, if the regex doesn't have any literals extracted, then search time can still be quite slow: $ time rg-13.0.0 -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt 57538 real 0.427 user 0.423 sys 0.003 maxmem 46 MB faults 0 $ time rg -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt 57538 real 0.337 user 0.333 sys 0.003 maxmem 46 MB faults 0 But then again, so is grep, because grep doesn't benefit from any literal optimizations either: $ time grep -E -ic '\b[a-z]{3}\b\s\b[a-z]{3}\b' git-3a06386e.txt 62396 real 1.316 user 1.292 sys 0.007 maxmem 13 MB faults 7 The count mismatch should probably be investigated. Fixes #1760
14.0.2 (2023-11-27) =================== This is a patch release with a few small bug fixes. Bug fixes: * [BUG #2654](BurntSushi/ripgrep#2654): Fix `deb` release sha256 sum file. * [BUG #2658](BurntSushi/ripgrep#2658): Fix partial regression in the behavior of `--null-data --line-regexp`. * [BUG #2659](BurntSushi/ripgrep#2659): Fix Fish shell completions. * [BUG #2662](BurntSushi/ripgrep#2662): Fix typo in documentation for `-i/--ignore-case`. 14.0.1 (2023-11-26) =================== This a patch release meant to fix `cargo install ripgrep` on Windows. Bug fixes: * [BUG #2653](BurntSushi/ripgrep#2653): Include `pkg/windows/Manifest.xml` in crate package. 14.0.0 (2023-11-26) =================== ripgrep 14 is a new major version release of ripgrep that has some new features, performance improvements and a lot of bug fixes. The headlining feature in this release is hyperlink support. In this release, they are an opt-in feature but may change to an opt-out feature in the future. To enable them, try passing `--hyperlink-format default`. If you use [VS Code], then try passing `--hyperlink-format vscode`. Please [report your experience with hyperlinks][report-hyperlinks], positive or negative. [VS Code]: https://code.visualstudio.com/ [report-hyperlinks]: BurntSushi/ripgrep#2611 Another headlining development in this release is that it contains a rewrite of its regex engine. You generally shouldn't notice any changes, except for some searches may get faster. You can read more about the [regex engine rewrite on my blog][regex-internals]. Please [report your performance improvements or regressions that you notice][report-perf]. [report-perf]: BurntSushi/ripgrep#2652 Finally, ripgrep switched the library it uses for argument parsing. Users should not notice a difference in most cases (error messages have changed somewhat), but flag overrides should generally be more consistent. For example, things like `--no-ignore --ignore-vcs` work as one would expect (disables all filtering related to ignore rules except for rules found in version control systems such as `git`). [regex-internals]: https://blog.burntsushi.net/regex-internals/ **BREAKING CHANGES**: * `rg -C1 -A2` used to be equivalent to `rg -A2`, but now it is equivalent to `rg -B1 -A2`. That is, `-A` and `-B` no longer completely override `-C`. Instead, they only partially override `-C`. Build process changes: * ripgrep's shell completions and man page are now created by running ripgrep with a new `--generate` flag. For example, `rg --generate man` will write a man page in `roff` format on stdout. The release archives have not changed. * The optional build dependency on `asciidoc` or `asciidoctor` has been dropped. Previously, it was used to produce ripgrep's man page. ripgrep now owns this process itself by writing `roff` directly. Performance improvements: * [PERF #1746](BurntSushi/ripgrep#1746): Make some cases with inner literals faster. * [PERF #1760](BurntSushi/ripgrep#1760): Make most searches with `\b` look-arounds (among others) much faster. * [PERF #2591](BurntSushi/ripgrep#2591): Parallel directory traversal now uses work stealing for faster searches. * [PERF #2642](BurntSushi/ripgrep#2642): Parallel directory traversal has some contention reduced. Feature enhancements: * Added or improved file type filtering for Ada, DITA, Elixir, Fuchsia, Gentoo, Gradle, GraphQL, Markdown, Prolog, Raku, TypeScript, USD, V * [FEATURE #665](BurntSushi/ripgrep#665): Add a new `--hyperlink-format` flag that turns file paths into hyperlinks. * [FEATURE #1709](BurntSushi/ripgrep#1709): Improve documentation of ripgrep's behavior when stdout is a tty. * [FEATURE #1737](BurntSushi/ripgrep#1737): Provide binaries for Apple silicon. * [FEATURE #1790](BurntSushi/ripgrep#1790): Add new `--stop-on-nonmatch` flag. * [FEATURE #1814](BurntSushi/ripgrep#1814): Flags are now categorized in `-h/--help` output and ripgrep's man page. * [FEATURE #1838](BurntSushi/ripgrep#1838): An error is shown when searching for NUL bytes with binary detection enabled. * [FEATURE #2195](BurntSushi/ripgrep#2195): When `extra-verbose` mode is enabled in zsh, show extra file type info. * [FEATURE #2298](BurntSushi/ripgrep#2298): Add instructions for installing ripgrep using `cargo binstall`. * [FEATURE #2409](BurntSushi/ripgrep#2409): Added installation instructions for `winget`. * [FEATURE #2425](BurntSushi/ripgrep#2425): Shell completions (and man page) can be created via `rg --generate`. * [FEATURE #2524](BurntSushi/ripgrep#2524): The `--debug` flag now indicates whether stdin or `./` is being searched. * [FEATURE #2643](BurntSushi/ripgrep#2643): Make `-d` a short flag for `--max-depth`. * [FEATURE #2645](BurntSushi/ripgrep#2645): The `--version` output will now also contain PCRE2 availability information. Bug fixes: * [BUG #884](BurntSushi/ripgrep#884): Don't error when `-v/--invert-match` is used multiple times. * [BUG #1275](BurntSushi/ripgrep#1275): Fix bug with `\b` assertion in the regex engine. * [BUG #1376](BurntSushi/ripgrep#1376): Using `--no-ignore --ignore-vcs` now works as one would expect. * [BUG #1622](BurntSushi/ripgrep#1622): Add note about error messages to `-z/--search-zip` documentation. * [BUG #1648](BurntSushi/ripgrep#1648): Fix bug where sometimes short flags with values, e.g., `-M 900`, would fail. * [BUG #1701](BurntSushi/ripgrep#1701): Fix bug where some flags could not be repeated. * [BUG #1757](BurntSushi/ripgrep#1757): Fix bug when searching a sub-directory didn't have ignores applied correctly. * [BUG #1891](BurntSushi/ripgrep#1891): Fix bug when using `-w` with a regex that can match the empty string. * [BUG #1911](BurntSushi/ripgrep#1911): Disable mmap searching in all non-64-bit environments. * [BUG #1966](BurntSushi/ripgrep#1966): Fix bug where ripgrep can panic when printing to stderr. * [BUG #2046](BurntSushi/ripgrep#2046): Clarify that `--pre` can accept any kind of path in the documentation. * [BUG #2108](BurntSushi/ripgrep#2108): Improve docs for `-r/--replace` syntax. * [BUG #2198](BurntSushi/ripgrep#2198): Fix bug where `--no-ignore-dot` would not ignore `.rgignore`. * [BUG #2201](BurntSushi/ripgrep#2201): Improve docs for `-r/--replace` flag. * [BUG #2288](BurntSushi/ripgrep#2288): `-A` and `-B` now only each partially override `-C`. * [BUG #2236](BurntSushi/ripgrep#2236): Fix gitignore parsing bug where a trailing `\/` resulted in an error. * [BUG #2243](BurntSushi/ripgrep#2243): Fix `--sort` flag for values other than `path`. * [BUG #2246](BurntSushi/ripgrep#2246): Add note in `--debug` logs when binary files are ignored. * [BUG #2337](BurntSushi/ripgrep#2337): Improve docs to mention that `--stats` is always implied by `--json`. * [BUG #2381](BurntSushi/ripgrep#2381): Make `-p/--pretty` override flags like `--no-line-number`. * [BUG #2392](BurntSushi/ripgrep#2392): Improve global git config parsing of the `excludesFile` field. * [BUG #2418](BurntSushi/ripgrep#2418): Clarify sorting semantics of `--sort=path`. * [BUG #2458](BurntSushi/ripgrep#2458): Make `--trim` run before `-M/--max-columns` takes effect. * [BUG #2479](BurntSushi/ripgrep#2479): Add documentation about `.ignore`/`.rgignore` files in parent directories. * [BUG #2480](BurntSushi/ripgrep#2480): Fix bug when using inline regex flags with `-e/--regexp`. * [BUG #2505](BurntSushi/ripgrep#2505): Improve docs for `--vimgrep` by mentioning footguns and some work-arounds. * [BUG #2519](BurntSushi/ripgrep#2519): Fix incorrect default value in documentation for `--field-match-separator`. * [BUG #2523](BurntSushi/ripgrep#2523): Make executable searching take `.com` into account on Windows. * [BUG #2574](BurntSushi/ripgrep#2574): Fix bug in `-w/--word-regexp` that would result in incorrect match offsets. * [BUG #2623](BurntSushi/ripgrep#2623): Fix a number of bugs with the `-w/--word-regexp` flag. * [BUG #2636](BurntSushi/ripgrep#2636): Strip release binaries for macOS.
What version of ripgrep are you using?
How did you install ripgrep?
What operating system are you using ripgrep on?
Describe your bug.
When searching a file for a pattern of the following form, ripgrep is significantly slower compared to grep, unless
--no-unicode
is used.Removing any of the
\b
s above or removing the space in the middle removes this problem. The issue can also be reproduced after replacing the space with other whitespace patterns such as\t
and\r
.What are the steps to reproduce the behavior?
With
--no-unicode
, ripgrep is as fast as grep or faster.The text was updated successfully, but these errors were encountered: