-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix edge cases in the parsing of quoted values #630
Conversation
tests/test_grammar.py
Outdated
@@ -137,6 +137,23 @@ | |||
("str_quoted_too_many_1", "''a'", GrammarParseError), | |||
("str_quoted_too_many_2", "'a''", GrammarParseError), | |||
("str_quoted_too_many_3", "''a''", GrammarParseError), | |||
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the obvious first test for trailing backslash. do you have it somewhere else?
("str_quoted_trailing_backslash", "'abc\\'", "abc\"),
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, I realized such tests were missing when working on #621 (I agree this one should have been there earlier, and some of the other ones as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the missing tests here?
They will make it easier for me to form a mental model of what the behavior is after this change.
some input strings single quoted strings.
''
'a'
'\a'
'a\'
'"'
'\''
'c:\\''
Here are some nested quoted strings cases, some may be handled in #621:
'${f:"b"}' # nested double quotes
'${f:\'b\'}' # nested escaped single quotes
Same as above, but for an escaped interpolation. what is the expected behavior here?
enerally speaking, is parsing inside interpolations the same as parsing outside?
'\${f:"b"}'
'\${f:\'b\'}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the missing tests here?
Oh, I now realize that I had probably misunderstood your previous question "This seems like the obvious first test for trailing backslash. do you have it somewhere else?" and as a result my answer was a bit off.
The example you gave:
("str_quoted_trailing_backslash", "'abc\\'", "abc\"),
is actually invalid because "'abc\\'"
is the string with characters 'abc\'
and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).
The test for a trailing backslash is the one on the line you highlighted:
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
where "'abc\\\\'"
is the string with characters 'abc\\'
which is now a valid quoted string.
(some examples)
I added all your examples in 7f23c1b
(some are named with "tmp" because they are redundant with existing tests, I still kept them here to make it easier for you to check them out, but I will remove them afterwards)
All of them behave as expected -- or at least how I would expect :)
Generally speaking, is parsing inside interpolations the same as parsing outside?
Depends on what you mean. Generally not, because outside of interpolations we do nothing except un-escaping (because of escaped interpolations) and otherwise just copy any character that we see (there are no constraints). But when we are inside an interpolation (i.e., parsing resolver arguments), the logic is different (to detect dicts, lists, quoted strings, etc.)
But when we enter a quoted string (which would be found inside an interpolation) we start parsing its content as if we were outside of an interpolation (it's the actual implementation: we re-start a parsing from scratch on the content of the quoted string -- after the un-escaping pass for escaped quotes). In that sense, the parsing of a quoted string inside an interpolation is the same as the parsing outside of interpolations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is actually invalid because "'abc\'" is the string with characters 'abc' and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).
I think I meant the characters 'abc\\'
. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?
Thanks for clarifying the parsing logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I meant the characters
'abc\\'
. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?
Ok so the string with characters 'abc\\'
is the one from the same test:
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
The reason why I didn't use r-strings is because the last string ("abc\\"
) can't be converted into an r-string, and I found that using an r-string only for the first one could be confusing (when trying to compare before/after). It would look like this:
("str_quoted_trailing_esc_1", r"'abc\\'", "abc\\"),
which I find a bit confusing because at first sight it looks like we have the same number of \
in both strings.
tests/test_grammar.py
Outdated
@@ -137,6 +137,23 @@ | |||
("str_quoted_too_many_1", "''a'", GrammarParseError), | |||
("str_quoted_too_many_2", "'a''", GrammarParseError), | |||
("str_quoted_too_many_3", "''a''", GrammarParseError), | |||
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), | |||
("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"), | |||
("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, this is interpreted as abc\def
in the current grammar, right?
I understand this can simplify things, but I think there is value in simple and consistent handling of backslash escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that all backslashes should be escaped? We could enforce that, but combined with the escaping in Python strings and the current "double un-escaping" issue (#615), this could make things quite ugly:
OmegaConf.register_new_resolver("listdir", os.listdir)
cfg = OmegaConf.create({
"all_files": "${listdir: '${drive}:\\\\\\\\path\\\\\\\\to\\\\\\\\folder'}",
"drive": "C",
})
cfg.all_files # will list content of C:\path\to\folder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)
The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.
The double unescaping is an implementation detail, I am not sure it's really needed to make things work.
With those two statements, the above should become which seems reasonable to me.
cfg = OmegaConf.create({
"all_files": "${listdir: r'${drive}:\\path\\to\\folder'}",
"drive": "C",
})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)
Not sure about Hydra. But in OmegaConf a single backslash is ok: the fact that we un-escape double \\
into \
doesn't mean we have to touch single \
.
The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.
Maybe. There are a few places where raw strings don't work though, and there's also shell escaping (e.g. with Hydra).
The double unescaping is an implementation detail, I am not sure it's really needed to make things work.
I'm not sure either but at this point I've given up on fixing it (besides going the other way that you're suggesting, i.e. what I did in #621 where we only escape backslashes before quotes or interpolations).
If you have an idea that you believe may work, let me know what it is and I can try to either implement it or show you why it doesn't work :)
With those two statements, the above should become which seems reasonable to me. (...)
Just a note that I fixed your example (the r
was misplaced) -- it doesn't change your point.
I think that forcing double \\
should be part of a different PR, if that's what we want to do.
Edit: as a minor note, this would be a breaking change to existing string interpolations in 2.0: {"path": "${drive}:\test"}
, as I'm assuming that if we consistently enforce \\
in quoted strings, we should also do it in top-level strings (i.e. outside of interpolations)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)
Not sure about Hydra. But in OmegaConf a single backslash is ok: the fact that we un-escape double
\\
into\
doesn't mean we have to touch single\
.
Ok.
The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.
Maybe. There are a few places where raw strings don't work though, and there's also shell escaping (e.g. with Hydra).
Escaping is not a thing in bash, They do have a syntax for literal strings though:
https://unix.stackexchange.com/questions/30903/how-to-escape-quotes-in-shell
The double unescaping is an implementation detail, I am not sure it's really needed to make things work.
I'm not sure either but at this point I've given up on fixing it (besides going the other way that you're suggesting, i.e. what I did in #621 where we only escape backslashes before quotes or interpolations).
If you have an idea that you believe may work, let me know what it is and I can try to either implement it or show you why it doesn't work :)
Can you create a small example and explain how it leads to double unescaping?
This will help me play with it.
With those two statements, the above should become which seems reasonable to me. (...)
Just a note that I fixed your example (the
r
was misplaced) -- it doesn't change your point.I think that forcing double
\\
should be part of a different PR, if that's what we want to do.Edit: as a minor note, this would be a breaking change to existing string interpolations in 2.0:
{"path": "${drive}:\test"}
, as I'm assuming that if we consistently enforce\\
in quoted strings, we should also do it in top-level strings (i.e. outside of interpolations)
I don't think we should make big changes (if any) outside of quoted strings and I think I am convinced we should not require escaping of free backslash if we can make it work and be consistent.
I introduced escaping in the Hydra grammar to support single quote in a single quoted string and double quote in a double quoted string.
The introduction of nested interpolations added more use cases to escaping:
- Nesting of quoted strings for the purposes of nested interpolations.
This is a pretty rare one. and in most cases using a different quote style should be enough.
'hello ${identity:"James Bond"}'
# however this turns weird if you have another level:
'hello ${identity:"James ${identity:"bond"}"}'
# or this?
'hello ${identity:"James ${identity:'bond'}"}'
# Both are wrong in their own way.
# But this should probably work:
'hello ${identity:"James ${identity:\'bond\'}"}'
The above does not require additional double escaping because we are using the other quote style as well.
If we didn't:
'hello ${identity:\'James ${identity:\\'bond\\'}'\}'
- Escaping the interpolation itself:
\${aa}
\${foo.bar}
\${oc.env:USER}
multiple escaping should never be needed here. I think it's required because of your current implementation of the unescaping mechanism.
# no escaping:
${foo:${bar:zoo}}
# both
\${foo:\${bar:zoo}}
# inner only:
${foo:\${bar:zoo}}
This can be potentially be handled by introducing a token like ESC_INTER_OPEN: \${
By the way, this should also work outside of quoted strings because support interpolations in unquoted strings.
PS: None of the examples above are tested so mistakes are likely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).
Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?
In interpretation 1, how do you write b so that it resolves to \10?
No can do.
Do you have a test that ensuring that this is covered?
(in interpretation 2, you can write b: '\${a}' so that it resolves to ${a})
Can you explain the process getting from \\\${a}
to \${a}
(in the new code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).
Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?
I had a long reply with a detailed example and my laptop crashed on me :/ So here's the short version, let me know if unclear:
- "The new code" (in this PR) is just a grammar change (it doesn't change how the visitor works)
- Escaped interpolations
\${
are only un-escaped in top-level strings, so they get un-escaped only once. However, currently the second phase may indeed trigger multiple un-escaping of\\
into\
in case of nested quoted strings (this is one thing [ON HOLD] Simpler escaping in the grammar, and fix to quoted values #621 was meant to avoid).
In interpretation 1, how do you write b so that it resolves to \10?
No can do.
Do you have a test that ensuring that this is covered?
Yes, this is this test:
("str_top_esc_backslash", r"Esc: \\${str}", r"Esc: \hi"),
(in interpretation 2, you can write b: '${a}' so that it resolves to ${a})
Can you explain the process getting from
\\\${a}
to\${a}
(in the new code).
It's the same in current master:
- The lexer turns
\\\${a}
into this sequence of tokens: ESC (matching\\
), ESC_INTER (matching\${
), TOP_STR (machinga}
- This is recognized as a
toplevelStr
by the parser - The visitor code visiting this node is simply
return self._unescape(ctx.getChildren())
which will concatenate the un-escaped string representation of these 3 tokens, i.e.\
,${
anda}
, yielding the final result\${a}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).
Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?I had a long reply with a detailed example and my laptop crashed on me :/ So here's the short version, let me know if unclear:
- "The new code" (in this PR) is just a grammar change (it doesn't change how the visitor works)
- Escaped interpolations
\${
are only un-escaped in top-level strings, so they get un-escaped only once. However, currently the second phase may indeed trigger multiple un-escaping of\\
into\
in case of nested quoted strings (this is one thing [ON HOLD] Simpler escaping in the grammar, and fix to quoted values #621 was meant to avoid).
Okay, so once this is in, #621 is supposed to address the repeated un-escaping issue?
Can you explain the process getting from
\\\${a}
to\${a}
(in the new code).It's the same in current master:
- The lexer turns
\\\${a}
into this sequence of tokens: ESC (matching\\
), ESC_INTER (matching\${
), TOP_STR (machinga}
- This is recognized as a
toplevelStr
by the parser- The visitor code visiting this node is simply
return self._unescape(ctx.getChildren())
which will concatenate the un-escaped string representation of these 3 tokens, i.e.\
,${
anda}
, yielding the final result\${a}
Gotcha, this answers my question.
Can you explain why we need a toplevelStr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so once this is in, #621 is supposed to address the repeated un-escaping issue?
That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).
Can you explain why we need a toplevelStr?
This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said
. Here, there would be two top-level strings: "Hi
and ", he said
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).
Cool. I think we should try to fix it. I am asking all those questions because I am trying to reach the best solution, not because I think this is not important enough to fix.
This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said. Here, there would be two top-level strings: "Hi and ", he said.
Gotcha.
tests/test_grammar.py
Outdated
@@ -137,6 +137,23 @@ | |||
("str_quoted_too_many_1", "''a'", GrammarParseError), | |||
("str_quoted_too_many_2", "'a''", GrammarParseError), | |||
("str_quoted_too_many_3", "''a''", GrammarParseError), | |||
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), | |||
("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"), | |||
("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).
Cool. I think we should try to fix it. I am asking all those questions because I am trying to reach the best solution, not because I think this is not important enough to fix.
This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said. Here, there would be two top-level strings: "Hi and ", he said.
Gotcha.
tests/test_grammar.py
Outdated
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), | ||
("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"), | ||
("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"), | ||
("str_quoted_no_esc_2", '"abc\\\\def"', "abc\\def"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make those tests easier to understand, can you use raw strings?
("str_quoted_no_esc_2", '"abc\\\\def"', "abc\\def"), | |
("str_quoted_no_esc_2", r'"abc\\def"', r"abc\def"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 3457d0c
(note: one thing a bit annoying with raw strings is they don't "work" for strings that end with a \
or contain an escaped quote of the same quote type as the enclosing quotes, e.g. "abc\"def"
=> this is why there still remain some non-raw strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests looks good.
I am starting to think that maybe we should drive those specific tests from an input text file to eliminate the Python shenanigans.
I added a comment about possibly adding more tests here. (maybe they already covered).
tests/test_grammar.py
Outdated
@@ -137,6 +137,23 @@ | |||
("str_quoted_too_many_1", "''a'", GrammarParseError), | |||
("str_quoted_too_many_2", "'a''", GrammarParseError), | |||
("str_quoted_too_many_3", "''a''", GrammarParseError), | |||
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the missing tests here?
They will make it easier for me to form a mental model of what the behavior is after this change.
some input strings single quoted strings.
''
'a'
'\a'
'a\'
'"'
'\''
'c:\\''
Here are some nested quoted strings cases, some may be handled in #621:
'${f:"b"}' # nested double quotes
'${f:\'b\'}' # nested escaped single quotes
Same as above, but for an escaped interpolation. what is the expected behavior here?
enerally speaking, is parsing inside interpolations the same as parsing outside?
'\${f:"b"}'
'\${f:\'b\'}'
("str_quoted_tmp_2", r"'a\'", GrammarParseError), | ||
("str_quoted_inside_quote_different", "'\"'", '"'), | ||
("str_quoted_inside_quote_same", r"'\''", "'"), | ||
("str_quoted_extra_quote", r"'c:\\''", GrammarParseError), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works?
("str_quoted_extra_quote", r"'c:\\''", GrammarParseError), | |
("str_quoted_trailing_backslash", r"'c:\\'", "c:\"), | |
("str_quoted_extra_quote", r"'c:\\''", GrammarParseError), |
Loos like rstrings are working for it:
print(r"'c:\\'")
'c:\\'
I would expect \\
to be interpreted as a token equal to \
, so the content of the quoted string should be c:\
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would work (after fixing "c:\"
which isn't a valid Python string) but there's already a similar test so no need to add it:
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
r-strings work as long as the "escaped" quote isn't the same type as the enclosing quote.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That test is similar, but is testing for a double trailing backslash.
Since this is all about backslash and escaping, I think an additional basic test for a single backslash is worth it.
And now I realize that the double backslash is due to Python interpretation of escaping (leaving it as is to make the point of how confusing this is that it trips me even when I am fully in context).
Why not use r-strings here?
In [3]: print(r"'abc\\'")
'abc\\'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it's a bit tricky to get those \
right :)
Why not use r-strings here?
I explained my reasoning in #630 (comment) (I don't mind changing it though if you like the "partial" r-string version better)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, let's use r-strings whenever possible, where it's not possible add a comment.
This part of the tests could be better tested by reading raw data from a file instead of encoding the data data in Python but I am not sure it's worth it to do it just for the escaping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done in 9011928
I used a couple of tricks to use raw strings everywhere -- I can revert those if you don't like them:
Using .strip()
for strings ending in with a \
, e.g.
("str_quoted_trailing_esc_1", r"'abc\\'", r" abc\ ".strip()),
(technically the leading whitespace is not needed, but I find it looks better)
Using an f-string to include tabs in r-strings, e.g.
("str_esc_ws_2", fr"\ \{TAB}\{TAB}", f" {TAB}{TAB}"),
where TAB
is a constant defined as
TAB = "\t" # to be used in raw strings, e.g. `fr"C:\{TAB}foo"`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reviewing I was scratching my head a bit before figuring out the strip trick.
I think it's fine, but add a comment.
maybe something like:
# raw strings does not allow trailing \, adding a space and stripping it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 6015a71
tests/test_grammar.py
Outdated
@@ -137,6 +137,23 @@ | |||
("str_quoted_too_many_1", "''a'", GrammarParseError), | |||
("str_quoted_too_many_2", "'a''", GrammarParseError), | |||
("str_quoted_too_many_3", "''a''", GrammarParseError), | |||
("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is actually invalid because "'abc\'" is the string with characters 'abc' and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).
I think I meant the characters 'abc\\'
. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?
Thanks for clarifying the parsing logic.
7f23c1b
to
0d7b358
Compare
Just rebased on top of master to solve a (minor) merge conflict |
aa8110f
to
9011928
Compare
will take a look soon, this week is the week of PTED21 so we have been focused on it. |
("str_quoted_tmp_2", r"'a\'", GrammarParseError), | ||
("str_quoted_inside_quote_different", "'\"'", '"'), | ||
("str_quoted_inside_quote_same", r"'\''", "'"), | ||
("str_quoted_extra_quote", r"'c:\\''", GrammarParseError), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reviewing I was scratching my head a bit before figuring out the strip trick.
I think it's fine, but add a comment.
maybe something like:
# raw strings does not allow trailing \, adding a space and stripping it.
tests/test_grammar.py
Outdated
@@ -227,10 +253,14 @@ | |||
# Interpolations in quoted strings. | |||
("str_quoted_inter", "'${null}'", "None"), | |||
("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"), | |||
("str_quoted_esc_single_2", "'\"\\\\\\\\\\${foo}\\ '", r'"\${foo}\ '), | |||
("str_quoted_esc_single_2", r"""'"\\\\\${foo}\ '""", r'"\${foo}\ '), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this one? I find it surprising that we need 5 slashes before the $ and 1 slash after the }.
Is this something that will be addressed by the followup double escaping diff? if so, what will it look like after we eliminate the double unescaping?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is related to the double-unescaping. Starting from 5 backslashes:
- First un-escaping of quoted string:
\\\\\
==>\\\
- Second un-escaping of top-level strings for escaped interpolations:
\\\${
=>\${
The slash after the }
wasn't really useful to test I think, so I removed it in b678b7c (at the same time, I removed the extra quote at the beginning of the string and instead added another test specifically for quotes, which is clearer)
@@ -250,7 +280,7 @@ | |||
("nested_select", "${options.${choice}}", "A"), | |||
("nested_select_getitem", "${options[${choice}]}", "A"), | |||
("nested_relative", "${${rel_opt}.b}", "B"), | |||
("str_quoted_nested", r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'", "ABCDEFGH"), | |||
("str_quoted_nested", r"'AB${test:\'CD${test:\\\'EF\\\'}GH\'}'", "ABCDEFGH"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this one?
Is this something that will be addressed by the followup double escaping diff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I would expect this one to remain unchanged. Here's what happens:
- Un-escaping of quoted string:
AB${test:\'CD${test:\\\'EF\\\'}GH\'}
==>AB${test:'CD${test:\'EF\'}GH'}
- Un-escaping of top-level strings: no change because the only top-level string is
AB
, the rest is an interpolation and interpolations aren't touched in this step - We then start processing the interpolation
- Un-escaping of quoted string:
CD${test:\'EF\'}GH
==>CD${test:'EF'}GH
- Un-escaping of top-level strings: no change (the top-level strings are
CD
andGH
) - We then process the interpolation which is a normal one (yielding the string
EF
)
The reason why it worked before with fewer backslashes in the middle is because a string like 'foo\\'bar'
was accepted as a valid quoted string by the grammar, while this is not the case anymore: now this would be parsed as a quoted string 'foo\\'
followed by bar'
which is an invalid expression. I don't think we could have kept the previous behavior without facing the subtle bugs of #617.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- So the reason we are unescaping in top level strings is just to support escaping interpolations there?
If so, would it simplify things to just say that we only apply the grammar inside interpolations and thus only support escaping of nested interpolations? - I am a bit confused by this:
input: r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'"
actual string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'
This is a string outside of interpolation, and it's quoted. why is the result not quoted as well?
I am assuming that this string is interpreted as a value of a config node, so there is no yaml involved here.
- I don't like how we need to unescape the nested quoted strings.
feels like it adds a lot of complexity to the mental model required to understand the behavior of quoted strings. before diving into this one, I want to understand 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- So the reason we are unescaping in top level strings is just to support escaping interpolations there?
Yes.
If so, would it simplify things to just say that we only apply the grammar inside interpolations and thus only support escaping of nested interpolations?
I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR}
if we can't escape interpolations in top-level strings?
- I am a bit confused by this:
input: r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'" actual string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'
This is a string outside of interpolation, and it's quoted. why is the result not quoted as well?
I am assuming that this string is interpreted as a value of a config node, so there is no yaml involved here.
Oh! Careful, there are several tests in the file, and this one belongs to the tests for the singleElement
parser rule (not configValue
), so this is parsed as a quoted string like you would use as a resolver argument.
- I don't like how we need to unescape the nested quoted strings.
feels like it adds a lot of complexity to the mental model required to understand the behavior of quoted strings. before diving into this one, I want to understand 2.
Let me know how you feel now that hopefully point 2 is clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR} if we can't escape interpolations in top-level strings?
How would you do it in OmegaConf 2.0 (not supported at all as far as I can tell).
In principle, you could use decode (although I think it's not working right now for this)
a: ${oc.decode:'Please use \${VAR}'}
Oh! Careful, there are several tests in the file, and this one belongs to the tests for the singleElement parser rule (not configValue), so this is parsed as a quoted string like you would use as a resolver argument.
This is one of the reasons I don't like the current testing approach: it makes it hard to understand what the test data is being used for.
Let me know how you feel now that hopefully point 2 is clearer.
Okay, so back to nested quoted strings:
outer quoted string:
'AB${test:\'CD${test:\\'EF\\'}GH\'}'
At some point we considered using lexer modes to support something like this:
'AB${test:'CD${test:'EF'}GH'}'
(Correct me if I am wrong and lexer modes will not actually enable this).
This is becoming much more attractive after the back and forth we had on this issue.
Can we pull this off?
With the exception of the time investment, do you see any downsides to suspect additional corner cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR} if we can't escape interpolations in top-level strings?
How would you do it in OmegaConf 2.0 (not supported at all as far as I can tell).
Indeed, that was a gap in 2.0. Though you could use
${}
as long as it didn't look like an interpolation, for instance"Please use ${VAR!}"
would work.
This is one of many enhancements in 2.1, but it's not even a requested one so I would be okay with an alternative implementation if it's reasonably clean and simplify out lives.
In principle, you could use decode (although I think it's not working right now for this)
a: ${oc.decode:'Please use \${VAR}'}
The example you provide doesn't work because the string that is given as input to
oc.decode
has charactersPlease use ${VAR}
so it crashes trying to resolve the interpolation (inoc.decode
).
This is a feature of oc.decode:
- I am not sure why it should resolve the interpolation in the first place.
- the interpolation is escaped, so it should understand that and not try to resolve it (in principle, I am not talking about the current implementation).
I'm actually surprised that it's allowed by the grammar because I didn't realize that you could have interpolations in unquoted strings (
oc.decode
uses thesingleElement
parser rule). So for instance you can write this:
I used a quoted string, so I am not sure why you were surprised.
cfg = OmegaConf.create({"x": "foo", "y": "${identity:abc ${x}}"}) assert cfg.y == "abc foo"I'm not sure this is even tested right now, I'll double check later.
I have done it before, I usually quote just for clarity in scenarios where the string contains spaces.
Let me know if you find why this works :).
This works with
oc.decode
:cfg = OmegaConf.create({"x": r"""${oc.decode:'"Please use \\\\\${VAR}"'}"""}) assert cfg.x == "Please use ${VAR}"That being said, even if we decide that this is how to do it, it still relies on the escaping of interpolations at the top-level, because the parsing of a quoted string triggers a top-level parsing after the removal of quotes & un-escaping. So I'm not sure what we would gain.
Obviously this is not great. but if we can make a: ${oc.decode:'Please use \${VAR}'}
work I would vote to simplify top-level string and maybe even drop it completely.
Okay, so back to nested quoted strings:
outer quoted string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'
At some point we considered using lexer modes to support something like this:
'AB${test:'CD${test:'EF'}GH'}'
(Correct me if I am wrong and lexer modes will not actually enable this).
This is becoming much more attractive after the back and forth we had on this issue.
Can we pull this off?
With the exception of the time investment, do you see any downsides to suspect additional corner cases?Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:
'AB${test:"CD${test:'EF'}GH"}'
I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).
I think it should work fine (it used to), just flagging that it means we would now parse the content of quoted strings when we parse an expression (not really a problem, just to make it clear), instead of triggering this parsing only on evaluation.
The main downside I see is the one you had raised initially: it may be an unusual approach that could surprise people.
After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle, you could use decode (although I think it's not working right now for this)
a: ${oc.decode:'Please use \${VAR}'}
The example you provide doesn't work because the string that is given as input to
oc.decode
has charactersPlease use ${VAR}
so it crashes trying to resolve the interpolation (inoc.decode
).This is a feature of oc.decode:
- I am not sure why it should resolve the interpolation in the first place.
- the interpolation is escaped, so it should understand that and not try to resolve it (in principle, I am not talking about the current implementation).
-
oc.decode
currently relies on the same parsing (& visitor) code as the one used to parse (& resolve) thesingleElement
rule of the grammar, which resolves interpolations. -
But
oc.decode
doesn't see the escaped interpolation: it sees the string with charactersPlease use ${VAR}
where the interpolation isn't escaped. The reason is that the quoted string'Please use \${VAR}'
is first resolved intoPlease use \${VAR}
then passed tooc.decode
. So what you're trying to achieve would be done in a simpler way with anidentity
resolver that would return this string unchanged (instead ofoc.decode
which parses it according to the grammar rules).
I'm actually surprised that it's allowed by the grammar because I didn't realize that you could have interpolations in unquoted strings (
oc.decode
uses thesingleElement
parser rule). So for instance you can write this:I used a quoted string, so I am not sure why you were surprised.
Hopefully the above comment makes it clearer: oc.decode
sees the string with characters Please use ${VAR}
which is unquoted.
cfg = OmegaConf.create({"x": "foo", "y": "${identity:abc ${x}}"}) assert cfg.y == "abc foo"I'm not sure this is even tested right now, I'll double check later.
I have done it before, I usually quote just for clarity in scenarios where the string contains spaces.
Let me know if you find why this works :)
Oh, a quick look at the grammar makes it clear why it works. I was just surprised because for some reason I thought it wouldn't. I just want to make sure it's properly tested.
This works with
oc.decode
:cfg = OmegaConf.create({"x": r"""${oc.decode:'"Please use \\\\\${VAR}"'}"""}) assert cfg.x == "Please use ${VAR}"That being said, even if we decide that this is how to do it, it still relies on the escaping of interpolations at the top-level, because the parsing of a quoted string triggers a top-level parsing after the removal of quotes & un-escaping. So I'm not sure what we would gain.
Obviously this is not great. but if we can make
a: ${oc.decode:'Please use \${VAR}'}
work I would vote to simplify top-level string and maybe even drop it completely.
I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of \\
from quoted strings (as in #621) then we can probably shave off a few \
from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.
At some point we considered using lexer modes to support something like this:
'AB${test:'CD${test:'EF'}GH'}'
Can we pull this off?
Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:
'AB${test:"CD${test:'EF'}GH"}'
I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).
Oh yes you are right, thanks! I'm not sure where this idea of alternating came from -- for some reason I thought I remembered you had to alternate, but I think I just imagined it.
After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.
Alright, I'm happy to do it, but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance "there is a single quote ' in the middle"
. I haven't thought in depth about it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- But
oc.decode
doesn't see the escaped interpolation: it sees the string with charactersPlease use ${VAR}
where the interpolation isn't escaped. The reason is that the quoted string'Please use \${VAR}'
is first resolved intoPlease use \${VAR}
then passed tooc.decode
. So what you're trying to achieve would be done in a simpler way with anidentity
resolver that would return this string unchanged (instead ofoc.decode
which parses it according to the grammar rules).
The line of reasoning here seems wrong or missing a step.
But in any case, we shouldn't force decode to do what its' not meant to do.
IF identity works, we can introduce it under some name that makes sense. The use case for not resolving interpolations is very rare and I don't mind if it takes a bit of extra work.
Obviously this is not great. but if we can make
a: ${oc.decode:'Please use \${VAR}'}
work I would vote to simplify top-level string and maybe even drop it completely.I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of
\\
from quoted strings (as in #621) then we can probably shave off a few\
from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.
Can you explain what you mean by "since the parsing of quoted strings relies on it as well"?
As far as I understand, quoted strings are only special inside interpolations.
Outside interpolations we can have whatever we want, including quotes.
At some point we considered using lexer modes to support something like this:
'AB${test:'CD${test:'EF'}GH'}'
Can we pull this off?
Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:
'AB${test:"CD${test:'EF'}GH"}'
I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).Oh yes you are right, thanks! I'm not sure where this idea of alternating came from -- for some reason I thought I remembered you had to alternate, but I think I just imagined it.
Maybe alternating was something we considered before we found out about lexer modes.
After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.
Alright, I'm happy to do it,
Great, so just to make sure I understand:
In the case of something like:
${foo: 'outer ${bar: 'inner' }' }
To be clear, this is the inner interpolation: ${bar: 'inner' }
It lives in the quotes string 'outer ${bar: 'inner' }'
,
which is a parameter of the resolver ${foo: 'outer ${bar: 'inner' }' }
What would the tokens be be?
Does it mean that we are promoting quoted string from a token to a parser rule?
but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance
"there is a single quote ' in the middle"
. I haven't thought in depth about it though.
I think we still need those: how would you create a quoted string that contains both quotes?
single: ', double: "
In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to follow-up on:
Oh, a quick look at the grammar makes it clear why it works. I was just surprised because for some reason I thought it wouldn't. I just want to make sure it's properly tested.
==> It's indeed properly tested (there is a section "Interpolations in unquoted strings"), so no worries there (except for my failing memory).
- But
oc.decode
doesn't see the escaped interpolation: it sees the string with charactersPlease use ${VAR}
where the interpolation isn't escaped. The reason is that the quoted string'Please use \${VAR}'
is first resolved intoPlease use \${VAR}
then passed tooc.decode
. So what you're trying to achieve would be done in a simpler way with anidentity
resolver that would return this string unchanged (instead ofoc.decode
which parses it according to the grammar rules).The line of reasoning here seems wrong or missing a step.
It may become clearer with a simpler example:
cfg = OmegaConf.create({"x": "foo", "y": "${identity:'x is ${x}'}"})
cfg.y # 'x is foo'
What happens is that in the visitor, when we are to resolve ${identity:'x is ${x}'}
in visitInterpolationResolver()
, we first obtain the name of the resolver (identity
) then obtain the list of arguments to the resolver from visitSequence()
, which calls visitElement()
on each argument. The quoted value 'x is ${x}'
is a primitive and its resolution triggers a new parsing pass over its content (x is ${x}
) where the interpolation is resolved, resulting in the string x is foo
. Then this string is passed to the resolver function associated to identity
, which returns it unchanged.
The main point here is that the quoted string is evaluted before it is passed to the resolver. So, now, if you escape the interpolation:
cfg = OmegaConf.create({"x": "foo", "y": r"${identity:'x is \${x}'}"})
cfg.y # 'x is ${x}'
==> the exact same steps are applied, except that the resolution of the quoted value 'x is \${x}'
results in the string x is ${x}
which is passed as input to the resolver function associated to identity
.
Hopefully this makes it clearer why oc.deode
sees the string Please use ${VAR}
as input in our previous example.
Obviously this is not great. but if we can make
a: ${oc.decode:'Please use \${VAR}'}
work I would vote to simplify top-level string and maybe even drop it completely.I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of
\\
from quoted strings (as in #621) then we can probably shave off a few\
from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.Can you explain what you mean by "since the parsing of quoted strings relies on it as well"?
As far as I understand, quoted strings are only special inside interpolations.
Outside interpolations we can have whatever we want, including quotes.
What I mean is that the resolution of a quoted string currently works as follows:
- Remove the enclosing quotes (let's say these are double quotes
"
) - Un-escape
\\
and\"
(not escaped interpolations) - Trigger a new parsing + resolution of the resulting string with the
configValue
rule (which relies on top-level strings)
This will change with the changes discussed below though, but maybe we can discuss it in another PR related to these changes.
Maybe alternating was something we considered before we found out about lexer modes.
Great, so just to make sure I understand:
In the case of something like:
${foo: 'outer ${bar: 'inner' }' }
To be clear, this is the inner interpolation:
${bar: 'inner' }
It lives in the quotes string'outer ${bar: 'inner' }'
,
which is a parameter of the resolver${foo: 'outer ${bar: 'inner' }' }
What would the tokens be be?
Does it mean that we are promoting quoted string from a token to a parser rule?
Yes.
I have a backup of my old branch which should make things clear:
Lexer: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationLexer.g4
Parser: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationParser.g4
but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance
"there is a single quote ' in the middle"
. I haven't thought in depth about it though.I think we still need those: how would you create a quoted string that contains both quotes?
Yes indeed, I thought of the same example afterwards. Although I'm fine with keeping escaped quotes, just note that since it's a pretty rare situation we could also instead use resolvers (and we could do the same for escaped interpolations). Something like:
"single ' quote, double ${oc.esc.quote:double} quote, escaped ${oc.esc.interpolation:open} interpolation }"
which would be resolved into single ' quote, double " quote, escaped ${ interpolation }
.
(just throwing it out there as an idea, I don't feel strongly either way)
In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.
Agreed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I mean is that the resolution of a quoted string currently works as follows:
- Remove the enclosing quotes (let's say these are double quotes
"
)- Un-escape
\\
and\"
(not escaped interpolations)- Trigger a new parsing + resolution of the resulting string with the
configValue
rule (which relies on top-level strings)This will change with the changes discussed below though, but maybe we can discuss it in another PR related to these changes.
Yes, it will be easier to discuss in context, and we are considering implementation changes to this anyway.
Does it mean that we are promoting quoted string from a token to a parser rule?
Yes.
I have a backup of my old branch which should make things clear:
Lexer: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationLexer.g4
Parser: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationParser.g4
Great.
but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance
"there is a single quote ' in the middle"
. I haven't thought in depth about it though.
I think we still need those: how would you create a quoted string that contains both quotes?
Yes indeed, I thought of the same example afterwards. Although I'm fine with keeping escaped quotes, just note that since it's a pretty rare situation we could also instead use resolvers (and we could do the same for escaped interpolations). Something like:"single ' quote, double ${oc.esc.quote:double} quote, escaped ${oc.esc.interpolation:open} interpolation }"
I agree that it's a rare case. but the standard approach for it (in most grammars) is to escape and it might be worth supporting directly (people would intuitively try to escape a quote like that).
My gut feeling is that if we switch to the proposed approach, the rare cases where we need to escape will be much easier to understand, so they would no longer be a concern.
In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.
Agreed.
Will make a final pass soon, not sure I will have time today (Friday is when I have the most meetings).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You know what, I think this is good to go.
Go ahead.
Fixes #617
Let's agree on this fix before continuing the discussion in #621.