Regex search #6706
I'm running this branch in a custom docker-compose build based on the

I'd recommend being explicit about which type of regex patterns you support. The most common standard types are POSIX (very limited), Extended, and Perl, though there are many variations such as Java and RE2 (Golang) that tweak the syntax slightly, usually only for the more advanced features. A link in the documentation to the regex engine's definition would be helpful as well.

Working:
Not working:
Bugs

Pattern Parenthesis

It's common to use parentheses in regex patterns to form conditional groups. For example:

The parentheses currently require manual escaping, however, because they're being captured by the filter logic outside the regex. This also appears to be an existing bug in how the non-regex filters work, but it's much more significant for regex patterns because parentheses are used so frequently in them. For example, this fails to match anything

I did verify that this same parenthesis bug affects quoted non-regex patterns too. Doing

Multi-line mode

I'm not sure how to test multi-line mode exactly. It works fine if I put the
The pattern I suspect you want

Hello @mtalexan and thanks for testing 👍🏻
Excellent, thanks for the clarification. For reference, I'm testing with PostgreSQL.
It seems only parentheses have weird escaping rules. Pattern escape characters like

The very strange behavior with parentheses and escaping definitely needs to be documented. No one will ever figure out how to use it without a lot of trial and error, and parentheses are such a standard feature of regex patterns that you can guarantee almost everyone will encounter the issue. I'd suggest adding a mention to the documentation on filtering, along with links to the regex syntaxes supported by the different databases.

Is there anything else you'd like me to test? It looks like the same code path is used for all filters, so I assume there's not much point in re-testing with per-feed filters, label filters, or saved user filters. Correct?
Oh, I forgot to mention that I'd tested negation of a regex filter in my original set of tests. It's working fine too.
As I work with the regexes more, I'm finding more and more cases where it's not clear whether there was a regex error or whether the pattern is matching unexpected things. For example, if I use the word-boundary escape character (

This for example doesn't work:

Related: it appears constraint escapes aren't working. Class-shorthand escapes and backreferences are working, however.
Thanks for the additional tests and feedback, @mtalexan 👍🏻 Indeed, we do not give the user any feedback when the search expression is wrong, for instance with unbalanced parentheses or an invalid regex. We should add that. Adding error feedback in particular should not be too difficult, but should still be left to a follow-up PR.
I'm definitely on board with splitting the work up. Having access to regex parsing at all is a major feature that shouldn't be held up by reduced usability.
Did you mean to include the
Added Example :
I meant
Hum, PostgreSQL does not support
Note that SQLite's performance is surprisingly good, even with many articles, with or without regex
Tags fixed (not tested). Tests welcome
And regex added for tags
Check support for multiline mode in MySQL
Allow searching for instance `/<pre>/` Fix FreshRSS#6775 (comment)
Example to search entries whose title starts with the *Lol* word, with any number of *o*: `intitle:/^Lo+l/i`

As opposed to normal searches, HTML special characters are not escaped in regex searches, to allow searching HTML code, like: `/Hello <span>world<\/span>/`
This also has the effect that I have to use the HTML encoding of any character literals I'm looking for in the body as well, right? So `&amp;` and `&quot;` to find the post-rendered characters `&` and `"`, for example.
Might it be worth doing this only if you have an `r` character on the regex? Like `/Hello <span>world<\/span>/r`?
Indeed, the search is supposed to match the HTML content verbatim. There is no need for any extra encoding, though: write the text exactly as it appears in the original HTML document.
We could consider adding some modifiers, but let's see if there is a sufficient need first. Another modifier I am considering is an automated transformation from PCRE syntax (subset) to the other variants such as PostgreSQL.
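To make the verbatim matching concrete, here is a minimal Python sketch. It is not FreshRSS code: `stored_html` and `regex_matches` are hypothetical names, and Python's `re` module stands in for the database regex engines. The point it illustrates is that the pattern runs against the HTML source as stored, so an entity that is written out in the source has to be written the same way in the pattern.

```python
import re

# Hypothetical stored article content: the raw HTML exactly as the feed delivered it.
stored_html = 'Hello <span>world</span> &amp; "friends"'


def regex_matches(pattern: str, content: str) -> bool:
    """Return True if the pattern matches anywhere in the raw content."""
    return re.search(pattern, content) is not None


# Tags are searchable because nothing is escaped before matching.
print(regex_matches(r'Hello <span>world<\/span>', stored_html))

# The source contains the entity `&amp;`, so the pattern must spell it out too.
print(regex_matches(r'world<\/span> &amp;', stored_html))

# The rendered form (a bare `&` after the tag) does not appear in the source.
print(regex_matches(r'world<\/span> & ', stored_html))
```

The double quotes in this sample are stored unencoded, so they are matched literally; a feed that stores them as `&quot;` would need that spelling in the pattern instead.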
I think you might run into issues trying to translate between the regex versions. If you settle on the PCRE syntax there are a lot of advanced `(?...)` features that aren't present in Postgres. Postgres also has explicit start and end word boundary markers, while PCRE has only a single word boundary marker. Then there's MySQL, which supports only a very small subset of regex features.
I could see maybe trying to do something with the constraint escapes for Postgres, but that's about the only thing I can see that could possibly be translated and it would probably just confuse the situation with the databases having different regex formats.
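As a rough illustration of what translating only the constraint escapes could look like, here is a Python sketch. It is an assumption-laden example, not something proposed in this PR: the function name `pcre_to_postgres_boundaries` and the mapping table are hypothetical. It relies on PostgreSQL's `\y` (start or end of a word) and `\Y` (not at a word boundary) being the closest counterparts of PCRE's `\b` and `\B`. PostgreSQL's start-only `\m` and end-only `\M` have no single PCRE equivalent, so they are not produced.

```python
import re

# Assumed mapping from PCRE word-boundary escapes to PostgreSQL constraint escapes.
_PCRE_TO_PG = {"b": "y", "B": "Y"}


def pcre_to_postgres_boundaries(pattern: str) -> str:
    """Rewrite \\b and \\B to \\y and \\Y, leaving every other escape unchanged."""

    def swap(match: "re.Match[str]") -> str:
        escaped = match.group(1)
        return "\\" + _PCRE_TO_PG.get(escaped, escaped)

    # Consume each backslash escape as a pair, so that an escaped backslash
    # followed by a plain `b` is not mistaken for a word boundary.
    return re.sub(r"\\(.)", swap, pattern, flags=re.DOTALL)


print(pcre_to_postgres_boundaries(r"\bfoo\b"))
print(pcre_to_postgres_boundaries(r"\\b"))
```

The sketch deliberately ignores character classes, where `\b` means backspace in PCRE and constraint escapes are not allowed in PostgreSQL. Handling that correctly would need a real tokenizer, which is part of why translating between the dialects gets complicated quickly.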
Hi @Alkarex, I was trying to find issues/discussions where the regex filters were mentioned, and I ended up here. I'm trying to assign labels automatically to incoming feed items using intitle regex filters, but I'm seeing erratic behaviour that is difficult for me to pin down, and I was hoping you could give me a hint about what's happening.

So I tried the same as you did in your first comment for the FreshRSS releases feed:

If I use the filter

Any idea why this is happening in my instance? I'm using the default Docker deploy.
What version of FreshRSS are you running? Try
Version 1.24.3.
Awesome! With this one (1.25.0-dev), it works as expected:

Problem solved then, thanks a lot for the quick reply :D

EDIT: duh, I just realized from your first comment that regex search is only available from FreshRSS 1.25.0+... sorry for not checking that before.
fix #3549
Support regex filters and regex search (working with PostgreSQL and MySQL and SQLite).

Works like any regular expression. Must be enclosed in `/ /`
Currently supporting the prefix `intitle:` or no prefix at all (will add `author:` etc. soon)
Case sensitive by default, but can be made case-insensitive with the `i` modifier, like `/Alice/i`
Supports multiline mode with the `m` modifier, like `/^Alice/m`

Example of filter to keep only the entries whose title ends with `.0` from https://github.com/FreshRSS/FreshRSS/releases.atom , marking as read the entries not matching this filter:

Requires PHP 8.0+ due to use of the new function `str_contains`, as planned requirement for FreshRSS 1.25.0+ #6711
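
The search syntax described above can be sketched in Python as follows. This is only an illustration of the `/.../` delimiters, the optional `intitle:` prefix, and the `i` and `m` modifiers. It is not the PHP implementation from this PR: `RegexFilter`, `parse_regex_filter`, and `title_matches` are hypothetical names, and Python's `re` dialect stands in for the database engines.

```python
import re
from typing import NamedTuple, Optional


class RegexFilter(NamedTuple):
    prefix: Optional[str]  # "intitle" or None when no prefix is given
    pattern: str           # the text between the slashes
    flags: int             # Python re flags derived from the modifiers


# Assumed token grammar: optional `intitle:` prefix, a /.../ body in which a
# slash must be escaped as `\/`, then any combination of the i and m modifiers.
_TOKEN = re.compile(
    r"^(?:(?P<prefix>intitle):)?/(?P<pattern>(?:\\.|[^/\\])+)/(?P<mods>[im]*)$"
)


def parse_regex_filter(token: str) -> Optional[RegexFilter]:
    """Split a search token into prefix, pattern, and flags, or return None."""
    m = _TOKEN.match(token)
    if m is None:
        return None
    flags = 0
    if "i" in m.group("mods"):
        flags |= re.IGNORECASE  # case-sensitive unless `i` is given
    if "m" in m.group("mods"):
        flags |= re.MULTILINE   # `^` and `$` match at line boundaries
    return RegexFilter(m.group("prefix"), m.group("pattern"), flags)


def title_matches(token: str, title: str) -> bool:
    """Apply a parsed filter to one title; unparsable tokens never match."""
    parsed = parse_regex_filter(token)
    if parsed is None:
        return False
    return re.search(parsed.pattern, title, parsed.flags) is not None


print(parse_regex_filter("intitle:/^Lo+l/i"))
print(title_matches("intitle:/^Lo+l/i", "looool cats"))
print(title_matches("/Alice/", "alice"))
```

The sketch parses the prefix but matches against a single string only. Which fields an unprefixed regex is matched against is not modelled here.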