Regex search #6706

Alkarex · 2024-08-12T20:19:11Z

Support regex filters and regex search (working with PostgreSQL and MySQL and SQLite).

Works like any regular expression. Must be enclosed in / /

Currently supporting the prefix intitle: or no prefix at all (will add author: etc. soon)

Case sensitive by default, but can be made case-insensitive with i modifier like /Alice/i

Supports multiline mode with m modifier like /^Alice/m

Example of filter to keep only the entries, which title ends with .0 from https://github.com/FreshRSS/FreshRSS/releases.atom , marking as read the entries not matching this filter:

Requires PHP 8.0+ due to use of new function str_contains, as planned requirement for FreshRSS 1.25.0+ #6711

fix FreshRSS#3549

mtalexan · 2024-08-27T04:40:49Z

I'm running this branch in a custom docker-compose build based on the Docker-Alpine. I see you pretty much only have intitle: and the prefix-less (context) matching implemented so far so I'm focusing on testing regex features and combos of the regex feature with other pre-existing filter types.

I'd recommend being explicit about what type of regex patterns you support. The most common standard types are Posix (very limited), Extended, and Perl, though there are lots of variations like Java, RE2 (Golang), etc that slightly tweak the syntax, usually primarily only of more advanced features. Maybe if you add a link in the documentation to the regex engine's definition that would be helpful as well.

Working:

intitle with regex
intitle with case insensitive regex
content with regex
content with case insensitive regex
intitle with regex and content with non-regex
intitle with case insensitive regex and non-regex
intitle with regex ORed with intitle regex
intitle without regex
content without regex
intitle regex and tag (non-regex) pattern
character exclusion regex patterns (i.e. [^a-z])
Wildcard match patterns (.* between words)
Start of boundary patterns (/^Israel/)
One-or-more wildcard match patterns ([^a-z]+)
End of boundary match patterns (routing\.$)
Alternatives matching, though see the issues with grouping. messages?( routing|$)

Not working:

grouping in regex patterns (sort of, see below)
multi-line mode (see below)
(user error?) explicit and between patterns. i.e. intitle:/election/i and intitle:/year/i
(pre-existing?) inurl (non-regex) can't match anything (i.e. `inurl:'p88#')
(pre-existing?) matches are applied to the raw text, not rendered HTML. i.e. 'hit 100,000 downloads' where the 100,000 downloads part is a hyperlink fails to match.
(not supported?) Look ahead patterns (/election (?!year)/i)

Bugs

Pattern Parenthesis

It's common to use parentheses in regex patterns to form conditional groups. For example: intitle:/messages( routing|$)/i to match either messages at the end of a line, or messages routing anywhere in the title.

The parenthesis currently require manual escaping however, they're being captured by the filter logic outside the regex. This also appears to be an existing bug in how the non-regex filters work as well, but it's much more significant for regex patterns because parentheses are used so frequently in them.

For example, this fails to match anything intitle:/messages( routing|$)/i, but this intitle:/messages$ routing|$$/i matches two different articles, one with messages routing in the middle of the title, and another that ends with messages. I also verified that literal parentheses aren't being matched by unescaped parentheses in the pattern, intitle:/(Zachary/i fails to match a title containing (Zachary Cohen/CNN) in it. In fact, matching a literal parenthesis requires escaping it from both the filter engine ($) and the regex pattern ([(]), resulting in needing to specify [\(] and [$].

I did verify that this same parenthesis bug affects quoted non-regex patterns though too. Doing infile:'(Zachary' is also failing to match the title with (Zachary Cohen/CNN) in it, and I have to do infile:'\(Zachary' to get it to match. It's just less severe there because literal parentheses are much less common than needing to use a parenthesis in a regex pattern.

Multi-line mode

I'm not sure how to test multi-line mode exactly. It works fine if I put the m on an otherwise working case sensitive or insensitive pattern, but it doesn't add any functionality to do so. With text like this in the middle of the content of an item:

Better than Signal? Looks promising.
A few bugs and UX issues but great foundation. Love that it’s public domain.

The pattern /Looks promising\./ works, as does /looks promising\./i and I can put an m on the end of either and it still works (/Looks promising\./m or /looks promising\./im).
But if I try to match across the lines: /Looks promising\..*A few bugs and UX issues/m it fails to match. Or if I try to explicitly match the newlines it also doesn't work: /Looks promising\.(\r)?\n(\r)?A few/im (though maybe there's some escaping issues here). Here's no way to insert a literal newline in the ad-hoc search box that I've been doing most of my testing with that I'm aware of, so I'd have to have some way to match past a line ending.

I suspect you want . to match line endings as well, and probably need to set a special option for the regex engine to let . match those end-of-line terminators when multi-line mode is enabled. For some reason that's frequently something regex engines require you to separately enable from multi-line mode, even though they'd really only ever expect to be used together.

Alkarex · 2024-08-27T06:48:41Z

Hello @mtalexan and thanks for testing 👍🏻
Quick answer already:

there is no explicit AND operator; just use a space instead.
The regex engine is not fully consistent and depends on which database you are using:
- PHP preg_match for auto-mark-as-read and auto-favourite, as well as for searching with an SQLite database
- For PostgreSQL search
- For MySQL search
Searches are indeed applied to the raw HTML content
The multiline mode allows searching for instance for a text at the start or end of a line in the raw HTML content

mtalexan · 2024-08-28T02:19:34Z

Excellent, thanks for the clarification. For reference, I'm testing with PostgreSQL.

Lookahead is working, it just needed the parentheses escaped.
Literal parentheses can be escaped with \\(, i.e. an escaped \\, but for some reason a single \ not preceeding a ( never needs to be escaped.
Multi-line matching is working, it's just usually hard to find a use for since multi-line content usually has HTML or something included that's harder to match against. The . is non-newline-sensitive like you'd expect/want.
Literal / in a pattern can successfully be escaped (\/)

It seems only parentheses have weird escaping rules. Pattern escape characters like \d don't need the \ part escaped. Only a ( or ) in the regex needs to be escaped, or a \ before a ( or ) that is supposed to be in the regex pattern (i.e. \$ and \$ -> $ and $ in the pattern). That's incredibly strange behavior because it means \ isn't a consistent escape character like almost every other parser ever written, instead there's weird multi-character matching that must be going on instead of nested interpretation(?).

The super strange behavior with parenthesis and escaping definitely need to be documented. No one will ever figure out how to use it without a lot of trial-and-error, and it's such a standard feature of regex patterns to use parentheses you can guarantee almost everyone will encounter the issue. I'd suggest dropping the mention in the documentation on filtering along with the links to the regex syntaxes supported by the different databases.

Is there anything else you'd like me to test?
The existing non-regex patterns still work the same, a solid sample of the range of all supported regex features work for both case-insensitive and multi-line mode, and can be used in combination with other regex and/or non-regex patterns.

It looks like the same code path is used for all cases of filters being used, so I assume there's not much point in re-testing with per-feed filters, label filters, or saved user filters. Correct?

mtalexan · 2024-08-28T02:24:52Z

Oh, forgot to mention I'd tested negation of a regex filter in my original set of tests. It's working fine too.

mtalexan · 2024-08-28T04:28:52Z

As I work on using the regexes more, I'm finding more and more cases where it's not clear if there was a regex error, or if the pattern is matching unexpected things. For example, if I use the word-boundary escape character (\y), it doesn't seem to be working and the only way I can determine that is because the same top 10 things appear in my view as if I had no search filter.

This for example doesn't work: intitle:/\yart/i, and it causes 10 items to appear in the view that don't match anything close to what I'm looking for (the case-insensitive prefix "art", making use of the \y` constraint escape for a word boundary). Recognizing the top list of items are unchanged despite the whole list blanking and re-loading, and none of them match what I'm trying to search for isn't exactly easy to determine.

Related: it appears constraint escapes aren't working. Class-Shorthand escapes and Backreferences are working however.

Alkarex · 2024-08-28T11:35:56Z

Thanks for the additional tests and feedback, @mtalexan 👍🏻
Yes, the regex applied to other operators such as author:XXX will work identically.

Indeed, we do not have any feedback to the user when the search expression is wrong, for instance with wrong parentheses, invalid regex, etc. But we should add that.
And the whole parsing of the search query has grown from very simple rules to more complicated ones, and everything should be re-written with a proper parser. The question is how much should be done now, to avoid delaying the landing of the regex features (which we could brand as beta for now).

Adding an error feedback in particular should not be too difficult, but should still be left to a follow-up PR.

mtalexan · 2024-08-28T12:27:35Z

Thanks for the additional tests and feedback, @mtalexan 👍🏻 Yes, the regex applied to other operators such as author:XXX will work identically.

Indeed, we do not have any feedback to the user when the search expression is wrong, for instance with wrong parentheses, invalid regex, etc. But we should add that. And the whole parsing of the search query has grown from very simple rules to more complicated ones, and everything should be re-written with a proper parser. The question is how much should be done now, to avoid delaying the landing of the regex features (which we could brand as beta for now).

Adding an error feedback in particular should not be too difficult, but should still be left to a follow-up PR.

I'm definitely on board with splitting the work up. Having access to regex parsing at all is a major feature that shouldn't be held up by reduced usability.

mtalexan · 2024-08-28T14:20:36Z

Did you mean to include the \b (non-printing bell character) in the regex example in filtering Readmes? intitle:/^Lo+l\b/i

Alkarex · 2024-08-28T14:21:17Z

Added author: and inurl: and some documentation (excluding the parentheses aspects for now). Additional tests welcome.

Example : author:/^Alice Doe$/im as we split multiple authors in different lines

Alkarex · 2024-08-28T14:22:00Z

Did you mean to include the \b (non-printing bell character) in the regex example in filtering Readmes? intitle:/^Lo+l\b/i

I meant \b like in word boundary. Should be tested, by the way

Alkarex · 2024-08-28T14:25:12Z

Hum, PostgreSQL does not support \b
https://www.postgresql.org/docs/current/functions-matching.html#POSIX-LIMITS-COMPATIBILITY

Alkarex · 2024-08-30T21:09:21Z

luckily there's no real harm for my use case since I only expect a few 10s of thousands of articles

Note that SQLite's performance is surprisingly good, even with many articles, with or without regex

mtalexan · 2024-09-02T01:39:49Z

@Alkarex it appears the tag searching has the same issues as the inurl:, it doesn't work if quotes are included. #6761

FreshRSS#6761

Alkarex · 2024-09-02T16:09:09Z

Tags fixed (not tested). Tests welcome

Alkarex · 2024-09-02T16:09:21Z

And regex added for tags

Check support for multiline mode in MySQL

Allow searching for instance `/<pre>/` Fix FreshRSS#6775 (comment)

mtalexan · 2024-09-06T04:05:47Z

docs/en/users/10_filter.md

+
+Example to search entries, which title starts with the *Lol* word, with any number of *o*: `intitle:/^Lo+l/i`
+
+As opposed to normal searches, HTML special characters are not escaped in regex searches, to allow searching HTML code, like: `/Hello <span>world<\/span>/`


This also has the effect that I have to use the HTML encoding of any character literals I'm looking for in the body as well, right? So & and " to find the post-rendered characters & and " for example.

Might it be worth doing this only if you have an r character on the regex? Like /Hello <span>world<\/span>/r?

Indeed, the search is supposed to match the HTML content verbatim. No need to extra encode anything though, just exactly at it appears in the original HTML document.

We could consider adding some modifiers, but let's see if there is a sufficient need first. Another modifier I am considering is an automated transformation from PCRE syntax (subset) to the other variants such as PostgreSQL.

I think you might run into issues trying to translate between the regex versions. If you settle on the PCRE syntax there are a lot of advanced (?...) features that aren't present in Postgres. Postgres also has explicit start and end word boundary markers, while PCRE has only a singular word boundary marker. Then there's MySQL, which supports only a very small subset of regex features.

I could see maybe trying to do something with the constraint escapes for Postgres, but that's about the only thing I can see that could possibly be translated and it would probably just confuse the situation with the databases having different regex formats.

Alkarex · 2024-09-07T07:40:03Z

#6785

arseru · 2024-10-08T16:33:56Z

Hi @Alkarex, I was trying to find issues/discussions where the regex filters were mentioned, and I ended up here. I have an issue where I'm trying to assign labels automatically to incoming feed items by using intitle regex filters, but it has some erratic behaviour that it's difficult for me to identify, and I was hoping you could give me a hint of what's happening.

So I tried the same as you did in your first comment for the FreshRSS releases feed:

If I use the filter intitle:/\.0$/, I get no results:

Any idea why is this happening in my instance? I'm using the default Docker deploy.

Alkarex · 2024-10-08T16:55:31Z

Any idea why is this happening in my instance? I'm using the default Docker deploy.

What version of FreshRSS are you running? Try freshrss/freshrss:edge and make sure it was pulled recently

arseru · 2024-10-08T17:10:04Z

What version of FreshRSS are you running?

Version 1.24.3.

Try freshrss/freshrss:edge and make sure it was pulled recently

Awesome! With this one (1.25.0-dev), it works as expected:

Problem solved then, thanks a lot for the quick reply :D

EDIT: duh, I just realized from your first comment that regex search is only available from FreshRSS 1.25.0+... sorry for not checking that before.

Alkarex · 2024-10-21T13:50:55Z

@mtalexan Patch to allow parentheses in regex searches #6926
Tests welcome

Alkarex added 2 commits August 12, 2024 22:15

Regex search

d55308f

fix FreshRSS#3549

Fix PHPStan

ac03302

Alkarex added the Search 🔍 label Aug 12, 2024

Alkarex added this to the 1.25.0 milestone Aug 12, 2024

Fix escape

1658286

Alkarex mentioned this pull request Aug 12, 2024

[Feature] extended keyword filtering #3549

Closed

Alkarex added 5 commits August 12, 2024 22:44

Fix ungreedy

b0b9c45

Initial support for regex search in PostgreSQL and MySQL

fda9bc7

Improvements, support MySQL

7264dc8

Fix multiline

42552ad

Add support for SQLite

febc4b3

Alkarex marked this pull request as ready for review August 13, 2024 20:29

Alkarex added 3 commits August 13, 2024 22:42

A few tests

144a3f8

Merge branch 'edge' into regex

ae51726

Merge branch 'edge' into regex

11545a1

Merge branch 'edge' into regex

7a3f11b

Merge branch 'edge' into regex

b3dac34

Alkarex added 2 commits August 28, 2024 16:10

Added author: and inurl: support, documentation

f3906e7

author example

1bb88af

Alkarex added 2 commits September 2, 2024 10:59

Merge branch 'edge' into regex

6b91c4e

Fix quoted tags + regex for tags

7125782

FreshRSS#6761

Alkarex linked an issue Sep 2, 2024 that may be closed by this pull request

[Bug] quoted tag search terms don't work #6761

Closed

Alkarex mentioned this pull request Sep 2, 2024

[Bug] quoted tag search terms don't work #6761

Closed

Alkarex added 6 commits September 3, 2024 09:11

Fix wrong regex detection

f91bbfb

Add MariaDB

a9066fc

Fix logic

c1ee94b

Merge branch 'edge' into regex

b9f9bc3

Increase requirements for MySQL and MariaDB

23747ee

Check support for multiline mode in MySQL

Remove sanitizeRegexes()

21fcc43

Alkarex mentioned this pull request Sep 5, 2024

[Bug] Filters not working correctly when one of the words is an URL #6775

Closed

Alkarex added 2 commits September 5, 2024 17:35

Allow searching HTML code

4f35243

Allow searching for instance `/<pre>/` Fix FreshRSS#6775 (comment)

Doc regex search HTML

77dc595

Alkarex linked an issue Sep 5, 2024 that may be closed by this pull request

[Bug] Filters not working correctly when one of the words is an URL #6775

Closed

Alkarex added 2 commits September 5, 2024 18:06

Fix Doctype

c1bbd0e

Merge branch 'edge' into regex

918292c

mtalexan reviewed Sep 6, 2024

View reviewed changes

Merge branch 'edge' into pr/Alkarex/6706

7089a88

Alkarex merged commit 1a552bd into FreshRSS:edge Sep 6, 2024
2 checks passed

Alkarex deleted the regex branch September 6, 2024 07:36

Alkarex mentioned this pull request Sep 6, 2024

Upgrade code to php 8.1 #6748

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex search #6706

Regex search #6706

Alkarex commented Aug 12, 2024 •

edited

Loading

mtalexan commented Aug 27, 2024 •

edited

Loading

Alkarex commented Aug 27, 2024

mtalexan commented Aug 28, 2024 •

edited

Loading

mtalexan commented Aug 28, 2024

mtalexan commented Aug 28, 2024

Alkarex commented Aug 28, 2024

mtalexan commented Aug 28, 2024

mtalexan commented Aug 28, 2024

Alkarex commented Aug 28, 2024

Alkarex commented Aug 28, 2024 •

edited

Loading

Alkarex commented Aug 28, 2024

Alkarex commented Aug 30, 2024

mtalexan commented Sep 2, 2024

Alkarex commented Sep 2, 2024

Alkarex commented Sep 2, 2024

mtalexan Sep 6, 2024

Alkarex Sep 6, 2024

mtalexan Sep 7, 2024

Alkarex commented Sep 7, 2024

arseru commented Oct 8, 2024 •

edited

Loading

Alkarex commented Oct 8, 2024

arseru commented Oct 8, 2024 •

edited

Loading

Alkarex commented Oct 21, 2024


		Example to search entries, which title starts with the Lol word, with any number of o: `intitle:/^Lo+l/i`

		As opposed to normal searches, HTML special characters are not escaped in regex searches, to allow searching HTML code, like: `/Hello <span>world<\/span>/`

Regex search #6706

Regex search #6706

Conversation

Alkarex commented Aug 12, 2024 • edited Loading

mtalexan commented Aug 27, 2024 • edited Loading

Bugs

Pattern Parenthesis

Multi-line mode

Alkarex commented Aug 27, 2024

mtalexan commented Aug 28, 2024 • edited Loading

mtalexan commented Aug 28, 2024

mtalexan commented Aug 28, 2024

Alkarex commented Aug 28, 2024

mtalexan commented Aug 28, 2024

mtalexan commented Aug 28, 2024

Alkarex commented Aug 28, 2024

Alkarex commented Aug 28, 2024 • edited Loading

Alkarex commented Aug 28, 2024

Alkarex commented Aug 30, 2024

mtalexan commented Sep 2, 2024

Alkarex commented Sep 2, 2024

Alkarex commented Sep 2, 2024

mtalexan Sep 6, 2024

Choose a reason for hiding this comment

Alkarex Sep 6, 2024

Choose a reason for hiding this comment

mtalexan Sep 7, 2024

Choose a reason for hiding this comment

Alkarex commented Sep 7, 2024

arseru commented Oct 8, 2024 • edited Loading

Alkarex commented Oct 8, 2024

arseru commented Oct 8, 2024 • edited Loading

Alkarex commented Oct 21, 2024

Alkarex commented Aug 12, 2024 •

edited

Loading

mtalexan commented Aug 27, 2024 •

edited

Loading

mtalexan commented Aug 28, 2024 •

edited

Loading

Alkarex commented Aug 28, 2024 •

edited

Loading

arseru commented Oct 8, 2024 •

edited

Loading

arseru commented Oct 8, 2024 •

edited

Loading