Fix unicode dot #376
See maciejhirsz#375 for an example of undefined behavior caused by this fast path. TL;DR: the ASCII fast path stops matching after the first matching byte, which splits multi-byte codepoints. Combined with `Lexer::remainder` (or even just capturing the string like in the issue), this lets non-UTF-8 strings escape into user code. This is UNSOUND.
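To make the failure mode concrete, here is a minimal sketch using only the standard library. `ascii_dot_end` is a hypothetical stand-in for a byte-wise fast path and is not the code Logos generates; the example only uses checked conversions, so it shows why the resulting slice is invalid without triggering any UB itself.

```rust
/// Hypothetical stand-in for a byte-wise "dot" fast path:
/// it always consumes exactly one byte, whatever that byte is.
fn ascii_dot_end(_input: &str, start: usize) -> usize {
    start + 1
}

fn main() {
    // 'é' is encoded as two bytes in UTF-8 (0xC3 0xA9), '!' as one.
    let input = "é!";
    let end = ascii_dot_end(input, 0);

    // The match ends in the middle of 'é', not on a codepoint boundary.
    assert!(!input.is_char_boundary(end));

    // Safe APIs refuse to build a `&str` from either half.
    assert!(input.get(..end).is_none());
    assert!(std::str::from_utf8(&input.as_bytes()[..end]).is_err());
    assert!(std::str::from_utf8(&input.as_bytes()[end..]).is_err());
}
```

A lexer that builds these slices with unchecked conversions would hand exactly such invalid `&str` values to user code.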
Looks great, thanks! Sadly, benchmarks show a dramatic decrease in performance :/ So, while this is a nice fix, I would prefer not to make Logos much slower ^^' https://github.com/maciejhirsz/logos/actions/runs/7925237413?pr=376
As suggested by @RustyYato
In general there is an issue between repeating matches and non-repeating matches, as well as shared logic between the dot Consider these two:
Now what this PR does is fall back on full unicode checking in both cases. If you can make the check for tail end of range being |
@RustyYato I did try your suggestion. On local benchmarks, the performance impact was much smaller, so let's see.
Just speculating here, but unicode checking expands into a lot of code (more than it needs to, and if I were a better programmer it would be smarter), so this is likely to swing a lot based on your icache.
@maciejhirsz indeed... Weirdly enough, the benchmarks ran with much closer timings on my computer than in GitHub CI. My previous commit does not seem to have changed the benchmarks (or maybe I missed something). On my computer:
In CI:
I'll try to do that :-)
This reverts commit 80bd23f.
Tests are still passing
Looks like I managed (?) to fix the performance issues while keeping the fix :-)
Maybe it would be worth double-checking that, @RustyYato or @maciejhirsz. I think I might also want to move the tests to another file.
LGTM aside from the start of the range needing to be checked as per comment.
Just made the necessary changes in my last commits :-)
Merging this. Thank you, @RustyYato and @maciejhirsz, for your help!
Thanks! My suggestion wouldn't have affected the runtime performance of Logos, but it looks like you managed to fix that anyways.
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [logos](https://logos.maciej.codes/) ([source](https://togithub.com/maciejhirsz/logos)) | dependencies | patch | `0.14.0` -> `0.14.1` |

### Release Notes

maciejhirsz/logos (logos)

### [`v0.14.1`](https://togithub.com/maciejhirsz/logos/releases/tag/v0.14.1): 0.14.1 - Debug feature and fixes

#### What's Changed

- fix(doc): reset logos2 to logos by @jeertmans in https://github.com/maciejhirsz/logos/pull/372
- chore(book): add JSON-borrowed parser example by @jeertmans in https://github.com/maciejhirsz/logos/pull/373
- Add `Rc<T>` and `Arc<T>` sources by @InfiniteCoder01 in https://github.com/maciejhirsz/logos/pull/340
- Fix unicode dot by @RustyYato in https://github.com/maciejhirsz/logos/pull/376
- chore(docs): cleanup examples by @jeertmans in https://github.com/maciejhirsz/logos/pull/381
- chore(lib): add debug feature by @jeertmans in https://github.com/maciejhirsz/logos/pull/382
- Cleanup unused Source features by @kmicklas in https://github.com/maciejhirsz/logos/pull/335
- chore(deps): bump peaceiris/actions-mdbook from 1 to 2 by @dependabot in https://github.com/maciejhirsz/logos/pull/387
- Fix `Lexer::clone` leak and UB + tests by @Jakobeha in https://github.com/maciejhirsz/logos/pull/390
- fix(lib): correctly handle miss for loop in loop by @lukas-code in https://github.com/maciejhirsz/logos/pull/393
- chore(lib): remove error branch from LUT if it is unreachable by @RustyYato in https://github.com/maciejhirsz/logos/pull/386
- fix(docs): typo by @joerivanruth in https://github.com/maciejhirsz/logos/pull/396
- chore(docs): Adds graph debug documentation to book by @afreeland in https://github.com/maciejhirsz/logos/pull/379
- chore: drop python linting frmo pre-commit-config by @LeoDog896 in https://github.com/maciejhirsz/logos/pull/403
- refactor: don't use deprecated max_value() method by @LeoDog896 in https://github.com/maciejhirsz/logos/pull/404
- chore(version): bump logos version to 0.14.1 by @jeertmans in https://github.com/maciejhirsz/logos/pull/409
- fix(docs): change old 0.14.0 by @jeertmans in https://github.com/maciejhirsz/logos/pull/410

#### New Contributors

- @InfiniteCoder01 made their first contribution in https://github.com/maciejhirsz/logos/pull/340
- @RustyYato made their first contribution in https://github.com/maciejhirsz/logos/pull/376
- @Jakobeha made their first contribution in https://github.com/maciejhirsz/logos/pull/390
- @lukas-code made their first contribution in https://github.com/maciejhirsz/logos/pull/393
- @joerivanruth made their first contribution in https://github.com/maciejhirsz/logos/pull/396
- @afreeland made their first contribution in https://github.com/maciejhirsz/logos/pull/379
- @LeoDog896 made their first contribution in https://github.com/maciejhirsz/logos/pull/403

**Full Changelog**: maciejhirsz/logos@v0.14...v0.14.1
Fixes #375
Even when using `str`, `regex(".")` cannot use the ASCII fast path because the input string may contain multibyte codepoints. Using the ASCII fast path would split the bytes of these codepoints up, which leads to UB (as soon as the illegal string is witnessed). This UB can be seen in several ways, from capturing the strings (like in #375), using `Lexer::remainder`, etc.

The only real fix is not to take the ASCII fast path.
I've added tests to ensure this isn't hit again in the refactor 🙂.
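
As an illustration of what "not taking the ASCII fast path" means for `.`, here is a minimal sketch of a codepoint-aware step. The helpers `utf8_width` and `dot_end` are hypothetical names chosen for this example and are not part of Logos or its generated code; the sketch also assumes the usual regex convention that `.` does not match `\n`.

```rust
/// Length in bytes of the UTF-8 sequence that starts with `first`,
/// or `None` if `first` cannot start a sequence (e.g. a continuation byte).
fn utf8_width(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1),
        0xC2..=0xDF => Some(2),
        0xE0..=0xEF => Some(3),
        0xF0..=0xF4 => Some(4),
        _ => None,
    }
}

/// Hypothetical codepoint-aware "dot": matches one whole codepoint other
/// than '\n' starting at byte offset `start`, returning the end offset.
fn dot_end(input: &str, start: usize) -> Option<usize> {
    let first = *input.as_bytes().get(start)?;
    if first == b'\n' {
        return None;
    }
    // `input` is valid UTF-8, so a leading byte is always followed by
    // the right number of continuation bytes.
    let end = start + utf8_width(first)?;
    debug_assert!(input.is_char_boundary(end));
    Some(end)
}

fn main() {
    let input = "é!\n";
    let end = dot_end(input, 0).expect("'é' should match");
    assert_eq!(&input[..end], "é"); // the whole codepoint is consumed
    assert_eq!(dot_end(input, end), Some(end + 1)); // '!' is a single byte
    assert_eq!(dot_end(input, end + 1), None); // '\n' is not matched
    assert_eq!(dot_end(input, 1), None); // offset 1 is inside 'é'
}
```

Because every match ends on a codepoint boundary, both the captured slice and the remainder stay valid UTF-8.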