Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PCRE crashing on invalid UTF-8 #13240

Merged

Conversation

straight-shoota
Copy link
Member

Fixes #13237

  • Raises Regex::Error when pcre_exec errors. This was previously ignored in the PCRE bindings and the methods just returned false.
  • Drops NO_UTF_CHECK for both compile and exec/match functions in PCRE and PCRE2
  • Adds MATCH_INVALID_UTF for PCRE2 (this is unavailable in PCRE)

Due to MATCH_INVALID_UTF in PCRE2 we get different behaviour in both engine versions when the subject string contains invalid UTF-8. PCRE will always raise. PCRE2 will try to match but invalid bytes can never be part of a match.

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. breaking-change topic:stdlib:text labels Mar 29, 2023
@straight-shoota straight-shoota self-assigned this Mar 29, 2023
@beta-ziliani
Copy link
Member

beta-ziliani commented Mar 29, 2023

I'll be a broken record, but I think it's best to leave PCRE alone and only introduce the change in PCRE2

EDIT: This bug is also present in PCRE, even if it's only reported in PCRE2 (thx Johannes!)

@beta-ziliani
Copy link
Member

it seems the generated compiler is not able to compile regexes, if I'm understanding the errors correctly

@straight-shoota
Copy link
Member Author

straight-shoota commented Mar 29, 2023

No the segfault like https://github.com/crystal-lang/crystal/actions/runs/4554373556/jobs/8032148978?pr=13240 happens already in std_spec before a new compiler is built.
It only fails in the ubuntu build image, not in alpine. I also cannot reproduce it in my local ubuntu machine (but locally in the image, so it's not just in CI 😅).

@straight-shoota
Copy link
Member Author

The libpcre2 version in the build image is 10.34 while locally I'm using 10.42. The same memory error happens with 10.34 on alpine.
It seems to be a bug in libpcre2 10.34 and appears to be fixed in 10.35, possibly this: https://github.com/PCRE2Project/pcre2/blob/512be06078bec9a3a941765900d566060aa5af8b/ChangeLog#L635-L636

I added a workaround to detect the library version and skip this spec when it's below 10.35 🤞

(I was planning to add a way to query the version at runtime anyway in a follow-up, so this is not as much overhead for this simple spec as it may seem)

@straight-shoota
Copy link
Member Author

Next problem: PCRE2_MATCH_INVALID_UTF was only introduced in 10.34 (https://github.com/PCRE2Project/pcre2/blob/512be06078bec9a3a941765900d566060aa5af8b/ChangeLog#L708-L710) which breaks the aarch64 CI because the jhass/crystal:1.0.0-build image is Ubuntu 18.04 with libpcre2 version 10.31.

So I guess we'll need a version guard for using PCRE2_MATCH_INVALID_UTF and the behaviour of invalid UTF-8 will be different between PCRE and PCRE2 < 10.34 on one side and PCRE2 >= 10.34 on the other.😑

src/regex/pcre2.cr Show resolved Hide resolved
src/regex/pcre2.cr Outdated Show resolved Hide resolved
@straight-shoota straight-shoota changed the title Add error handling to pcre_match Fix PCRE crashing on invalid UTF-8 Mar 30, 2023
Copy link
Member

@beta-ziliani beta-ziliani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks as good as it can, given the constraints 👍

@straight-shoota straight-shoota added this to the 1.8.0 milestone Apr 2, 2023
@straight-shoota straight-shoota merged commit ec097ca into crystal-lang:master Apr 3, 2023
@straight-shoota straight-shoota deleted the fix/regex-invalid-utf8 branch April 3, 2023 09:00
@HertzDevil
Copy link
Contributor

It seems the 1.7.3 CI is now permanently stuck on generating the docs

@straight-shoota
Copy link
Member Author

MATCH_INVALID_UTF was eventually dropped in #13313

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:text
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex may crash when matching invalid UTF-8 string
4 participants