Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: correct regex pattern for SELECT clause extraction #123

Merged
merged 2 commits into from
Nov 8, 2024

Conversation

lu-pl
Copy link
Contributor

@lu-pl lu-pl commented Oct 28, 2024

Fixes #122.

The new regex non-greedily matches everything after "select" and before "where".

[\s\S]*?basically means .*? but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

@lu-pl lu-pl marked this pull request as draft October 28, 2024 13:04
@lu-pl lu-pl force-pushed the lupl/fix-select-regex branch from 8d55fce to 8144471 Compare November 6, 2024 16:40
@lu-pl lu-pl marked this pull request as ready for review November 6, 2024 16:41
@lu-pl lu-pl force-pushed the lupl/fix-select-regex branch from 8144471 to 47bc4d2 Compare November 6, 2024 16:48
@lu-pl lu-pl requested a review from b1rger November 6, 2024 16:50
rdfproxy/utils/sparql_utils.py Outdated Show resolved Hide resolved
"""Replace the SELECT clause of a query with repl."""
if re.search(r"select\s.+", query, re.I) is None:
pattern = r"(select\s+[\s\S]*?)(?=\s+where)"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the first group?
why do we need [\s\S]*? why not .*??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the first group?

True, the first group isn't strictly necessary. I found the regex easier to write that way and I also find it easier to read. But I can remove the group.

why do we need [\s\S]*? why not .*??

[\s\S]*? is almost equivalent to .*? but is more general because it also matches linebreaks without the re.DOTALL flag.

E.g. (select\s.*?)(?=\s+where) wouldn't match the SELECT clause in:

select

*
where {
?s ?p ?o . }

I don't know if this will even matter, because the planned query checking feature will normalize and trim incoming queries anyway (something that should definitely happen!), but even with sanitized queries I feel like the [\s\S] variant is more exact.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked, two tests fail with .*?!

As mentioned, as soon as query checking and sanitization is in place, this won't be an issue anymore, but for this PR, .*? is incorrect.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can choose between \s\S and . + re.DOTALL I'd prefer the latter

if re.search(r"select\s.+", query, re.I) is None:
pattern = r"(select\s+[\s\S]*?)(?=\s+where)"

if re.search(pattern=pattern, string=query, flags=re.IGNORECASE) is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have the stylistic changes in a separate PR. I think it is oke to move the pattern to pattern =, but leave the other changes for another time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from fixing and assigning the regex pattern, I would see those changes more as a refactoring.

re.I is synonymous with re.IGNORECASE, kwargs are used instead of positional args and re.sub is returned directly instead of binding and returning its result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want me to, I can open a separate PR though, no prob.

raise Exception("Unable to obtain SELECT clause.")

count_query = re.sub(
pattern=r"select\s.+",
return re.sub(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, thats a stylistic change

repl=repl,
string=query,
count=1,
flags=re.I,
flags=re.IGNORECASE,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@lu-pl lu-pl force-pushed the lupl/fix-select-regex branch 3 times, most recently from 9278cfc to 57b3385 Compare November 8, 2024 07:23
lu-pl added 2 commits November 8, 2024 08:26
The new regex non-greedily matches everything after "select" and before "where".

"[\s\S]*?" basically means ".*?" but is more general because it also
matches linebreaks without the re.DOTALL flag.
See https://docs.python.org/3/library/re.html#re.DOTALL.

Fixes #122.
The tests run variations of the example given in #122.
Every test implemented here fails without the fix introduced with
this PR.
@lu-pl lu-pl force-pushed the lupl/fix-select-regex branch from 57b3385 to f644312 Compare November 8, 2024 07:26
@lu-pl lu-pl requested a review from b1rger November 8, 2024 08:34
@lu-pl lu-pl merged commit d8912cf into main Nov 8, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex in replace_query_select is buggy
2 participants