fix: correct regex pattern for SELECT clause extraction #123
rdfproxy/utils/sparql_utils.py
Outdated
"""Replace the SELECT clause of a query with repl."""
if re.search(r"select\s.+", query, re.I) is None:
pattern = r"(select\s+[\s\S]*?)(?=\s+where)"
Why do we need the first group?
Why do we need [\s\S]*? and why not .*? instead?
> why do we need the first group?
True, the first group isn't strictly necessary. I found the regex easier to write that way and I also find it easier to read. But I can remove the group.
> why do we need [\s\S]*? why not .*?
[\s\S]*? is almost equivalent to .*? but is more general because it also matches linebreaks without the re.DOTALL flag.
E.g. (select\s.*?)(?=\s+where) wouldn't match the SELECT clause in:
select
*
where {
?s ?p ?o . }
I don't know if this will even matter, because the planned query checking feature will normalize and trim incoming queries anyway (something that should definitely happen!), but even with sanitized queries I feel like the [\s\S] variant is more exact.
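The difference described above can be checked with a short sketch. The query below is a hypothetical one whose projection is spread over several lines; it is not taken from the thread or from the test suite.

```python
import re

# Hypothetical query: the projection variables sit on separate lines.
QUERY = "select ?s\n?p\n?o\nwhere { ?s ?p ?o . }"

# Without re.DOTALL, "." does not match "\n", so ".*?" cannot reach
# the line break that directly precedes "where".
dot_pattern = r"(select\s.*?)(?=\s+where)"
dot_match = re.search(dot_pattern, QUERY, flags=re.IGNORECASE)

# [\s\S] matches any character, line breaks included, with no extra flag.
any_pattern = r"(select\s+[\s\S]*?)(?=\s+where)"
any_match = re.search(any_pattern, QUERY, flags=re.IGNORECASE)

print(dot_match)                 # None
print(repr(any_match.group(1)))  # 'select ?s\n?p\n?o'
```

The lookahead keeps the whitespace before "where" out of the match, so a later substitution leaves the rest of the query untouched.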
I just checked, and two tests fail with the .*? variant!
As mentioned, as soon as query checking and sanitization is in place, this won't be an issue anymore, but for this PR, .*? is incorrect.
If we can choose between [\s\S] and . + re.DOTALL, I'd prefer the latter.
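For comparison, here is a minimal sketch of the two spellings side by side, assuming the capture group is dropped as discussed earlier in the thread. The query is a made-up multi-line example, and neither pattern is claimed to be the one that was finally merged.

```python
import re

# Hypothetical multi-line query used only for this comparison.
QUERY = "select ?s\n?p\n?o\nwhere { ?s ?p ?o . }"

# Variant 1: character class that matches everything, no extra flag needed.
char_class = re.search(
    r"select\s+[\s\S]*?(?=\s+where)",
    QUERY,
    flags=re.IGNORECASE,
)

# Variant 2: plain "." combined with re.DOTALL, so "." also matches "\n".
dotall = re.search(
    r"select\s+.*?(?=\s+where)",
    QUERY,
    flags=re.IGNORECASE | re.DOTALL,
)

# Both variants select the same SELECT clause.
print(char_class.group(0) == dotall.group(0))  # True
```

The trade-off is mostly about readability: the flag states the intent in the call, while the character class keeps the behaviour inside the pattern itself.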
rdfproxy/utils/sparql_utils.py
Outdated
if re.search(r"select\s.+", query, re.I) is None:
pattern = r"(select\s+[\s\S]*?)(?=\s+where)"

if re.search(pattern=pattern, string=query, flags=re.IGNORECASE) is None:
I would prefer to have the stylistic changes in a separate PR. I think it is okay to move the pattern into a pattern = assignment, but leave the other changes for another time.
Apart from fixing and assigning the regex pattern, I would see those changes more as a refactoring.
re.I is synonymous with re.IGNORECASE, kwargs are used instead of positional args, and re.sub is returned directly instead of binding and returning its result.
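A small sketch can confirm that these three points do not change behaviour. The function names bind_then_return and return_directly, as well as the sample query, are hypothetical and only serve the comparison.

```python
import re

query = "SELECT ?s WHERE { ?s ?p ?o . }"
pattern = r"select\s+[\s\S]*?(?=\s+where)"

# re.I is an alias of re.IGNORECASE, so both names refer to the same flag.
same_flag = re.I is re.IGNORECASE

# Positional and keyword arguments lead to the same match.
positional = re.search(pattern, query, re.I)
keyword = re.search(pattern=pattern, string=query, flags=re.IGNORECASE)


def bind_then_return(query: str, repl: str) -> str:
    # Hypothetical helper: bind the result first, then return it.
    result = re.sub(pattern, repl, query, count=1, flags=re.I)
    return result


def return_directly(query: str, repl: str) -> str:
    # Hypothetical helper: return the re.sub result without binding it.
    return re.sub(
        pattern=pattern,
        repl=repl,
        string=query,
        count=1,
        flags=re.IGNORECASE,
    )


print(same_flag)                                   # True
print(positional.group(0) == keyword.group(0))     # True
print(bind_then_return(query, "SELECT *") == return_directly(query, "SELECT *"))  # True
```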
If you want me to, I can open a separate PR though, no prob.
rdfproxy/utils/sparql_utils.py
Outdated
raise Exception("Unable to obtain SELECT clause.")

count_query = re.sub(
pattern=r"select\s.+",
return re.sub(
Same as above, that's a stylistic change.
rdfproxy/utils/sparql_utils.py
Outdated
repl=repl,
string=query,
count=1,
flags=re.I,
flags=re.IGNORECASE,
ditto
The new regex non-greedily matches everything after "select" and before "where". "[\s\S]*?" basically means ".*?" but is more general because it also matches linebreaks without the re.DOTALL flag. See https://docs.python.org/3/library/re.html#re.DOTALL. Fixes #122.
The tests run variations of the example given in #122. Every test implemented here fails without the fix introduced with this PR.
Fixes #122.
The new regex non-greedily matches everything after "select" and before "where".
[\s\S]*? basically means .*? but is more general because it also matches linebreaks without the re.DOTALL flag. See https://docs.python.org/3/library/re.html#re.DOTALL.
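Putting the pieces from the diff together, the fixed helper might look roughly like the sketch below. The function name replace_select_clause, the sample query, and the count expression are assumptions for illustration; only the pattern, the docstring, the error message, and the re.sub keyword arguments appear in the diff. The comparison with the old pattern shows one way select\s.+ can go wrong on a single-line query; whether that is exactly the case reported in #122 is not shown on this page.

```python
import re


def replace_select_clause(query: str, repl: str) -> str:
    """Replace the SELECT clause of a query with repl."""
    # Lazily match from "select" up to the whitespace before "where".
    pattern = r"(select\s+[\s\S]*?)(?=\s+where)"

    if re.search(pattern=pattern, string=query, flags=re.IGNORECASE) is None:
        raise Exception("Unable to obtain SELECT clause.")

    return re.sub(
        pattern=pattern,
        repl=repl,
        string=query,
        count=1,
        flags=re.IGNORECASE,
    )


# Hypothetical single-line query and replacement clause.
one_line = "select ?s ?p where { ?s ?p ?o . }"
count_clause = "select (count(*) as ?cnt)"

# Old pattern: ".+" is greedy and swallows the WHERE clause on the same line.
old_result = re.sub(r"select\s.+", count_clause, one_line, count=1, flags=re.I)

# New pattern: stops before "where", so the graph pattern survives.
new_result = replace_select_clause(one_line, count_clause)

print(old_result)  # select (count(*) as ?cnt)
print(new_result)  # select (count(*) as ?cnt) where { ?s ?p ?o . }
```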