Fix edge cases in the parsing of quoted values #630

odelalleau · 2021-03-19T20:37:36Z

Fixes #617

Let's agree on this fix before continuing the discussion in #621.

omegaconf/grammar/OmegaConfGrammarLexer.g4

tests/test_grammar.py

omegaconf/grammar/OmegaConfGrammarLexer.g4

omry · 2021-03-23T21:22:51Z

tests/test_grammar.py

@@ -137,6 +137,23 @@
    ("str_quoted_too_many_1", "''a'", GrammarParseError),
    ("str_quoted_too_many_2", "'a''", GrammarParseError),
    ("str_quoted_too_many_3", "''a''", GrammarParseError),
+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),


This seems like the obvious first test for trailing backslash. do you have it somewhere else?

("str_quoted_trailing_backslash", "'abc\\'", "abc\"),

Nope, I realized such tests were missing when working on #621 (I agree this one should have been there earlier, and some of the other ones as well)

Can you add the missing tests here?
They will make it easier for me to form a mental model of what the behavior is after this change.
some input strings single quoted strings.

'' 'a' '\a' 'a\' '"' '\'' 'c:\\''

Here are some nested quoted strings cases, some may be handled in #621:

'${f:"b"}' # nested double quotes '${f:\'b\'}' # nested escaped single quotes

Same as above, but for an escaped interpolation. what is the expected behavior here?
enerally speaking, is parsing inside interpolations the same as parsing outside?

'\${f:"b"}' '\${f:\'b\'}'

Can you add the missing tests here?

Oh, I now realize that I had probably misunderstood your previous question "This seems like the obvious first test for trailing backslash. do you have it somewhere else?" and as a result my answer was a bit off.

The example you gave:

("str_quoted_trailing_backslash", "'abc\\'", "abc\"),

is actually invalid because "'abc\\'" is the string with characters 'abc\' and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).
The test for a trailing backslash is the one on the line you highlighted:

("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),

where "'abc\\\\'" is the string with characters 'abc\\' which is now a valid quoted string.

(some examples)

I added all your examples in 7f23c1b
(some are named with "tmp" because they are redundant with existing tests, I still kept them here to make it easier for you to check them out, but I will remove them afterwards)
All of them behave as expected -- or at least how I would expect :)

Generally speaking, is parsing inside interpolations the same as parsing outside?

Depends on what you mean. Generally not, because outside of interpolations we do nothing except un-escaping (because of escaped interpolations) and otherwise just copy any character that we see (there are no constraints). But when we are inside an interpolation (i.e., parsing resolver arguments), the logic is different (to detect dicts, lists, quoted strings, etc.)

But when we enter a quoted string (which would be found inside an interpolation) we start parsing its content as if we were outside of an interpolation (it's the actual implementation: we re-start a parsing from scratch on the content of the quoted string -- after the un-escaping pass for escaped quotes). In that sense, the parsing of a quoted string inside an interpolation is the same as the parsing outside of interpolations.

is actually invalid because "'abc\'" is the string with characters 'abc' and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).

I think I meant the characters 'abc\\'. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?

Thanks for clarifying the parsing logic.

I think I meant the characters 'abc\\'. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?

Ok so the string with characters 'abc\\' is the one from the same test:

("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),

The reason why I didn't use r-strings is because the last string ("abc\\") can't be converted into an r-string, and I found that using an r-string only for the first one could be confusing (when trying to compare before/after). It would look like this:

("str_quoted_trailing_esc_1", r"'abc\\'", "abc\\"),

which I find a bit confusing because at first sight it looks like we have the same number of \ in both strings.

omry · 2021-03-23T21:26:05Z

tests/test_grammar.py

@@ -137,6 +137,23 @@
    ("str_quoted_too_many_1", "''a'", GrammarParseError),
    ("str_quoted_too_many_2", "'a''", GrammarParseError),
    ("str_quoted_too_many_3", "''a''", GrammarParseError),
+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
+    ("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"),
+    ("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"),


As far as I understand, this is interpreted as abc\def in the current grammar, right?
I understand this can simplify things, but I think there is value in simple and consistent handling of backslash escaping.

Are you saying that all backslashes should be escaped? We could enforce that, but combined with the escaping in Python strings and the current "double un-escaping" issue (#615), this could make things quite ugly:

OmegaConf.register_new_resolver("listdir", os.listdir) cfg = OmegaConf.create({ "all_files": "${listdir: '${drive}:\\\\\\\\path\\\\\\\\to\\\\\\\\folder'}", "drive": "C", }) cfg.all_files # will list content of C:\path\to\folder

Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)

The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.
The double unescaping is an implementation detail, I am not sure it's really needed to make things work.

With those two statements, the above should become which seems reasonable to me.

cfg = OmegaConf.create({ "all_files": "${listdir: r'${drive}:\\path\\to\\folder'}", "drive": "C", })

Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)

Not sure about Hydra. But in OmegaConf a single backslash is ok: the fact that we un-escape double \\ into \ doesn't mean we have to touch single \.

The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.

Maybe. There are a few places where raw strings don't work though, and there's also shell escaping (e.g. with Hydra).

The double unescaping is an implementation detail, I am not sure it's really needed to make things work.

I'm not sure either but at this point I've given up on fixing it (besides going the other way that you're suggesting, i.e. what I did in #621 where we only escape backslashes before quotes or interpolations).
If you have an idea that you believe may work, let me know what it is and I can try to either implement it or show you why it doesn't work :)

With those two statements, the above should become which seems reasonable to me. (...)

Just a note that I fixed your example (the r was misplaced) -- it doesn't change your point.

I think that forcing double \\ should be part of a different PR, if that's what we want to do.

Edit: as a minor note, this would be a breaking change to existing string interpolations in 2.0: {"path": "${drive}:\test"}, as I'm assuming that if we consistently enforce \\ in quoted strings, we should also do it in top-level strings (i.e. outside of interpolations)

Isn't that the case now in Hydra? (and for that matter, in OmegaConf as well?)

Not sure about Hydra. But in OmegaConf a single backslash is ok: the fact that we un-escape double \\ into \ doesn't mean we have to touch single \.

Ok.

The escaping in Python strings is not our concern. A user can use raw string to make things cleaner.

Maybe. There are a few places where raw strings don't work though, and there's also shell escaping (e.g. with Hydra).

Escaping is not a thing in bash, They do have a syntax for literal strings though:
https://unix.stackexchange.com/questions/30903/how-to-escape-quotes-in-shell

The double unescaping is an implementation detail, I am not sure it's really needed to make things work.

I'm not sure either but at this point I've given up on fixing it (besides going the other way that you're suggesting, i.e. what I did in #621 where we only escape backslashes before quotes or interpolations).
If you have an idea that you believe may work, let me know what it is and I can try to either implement it or show you why it doesn't work :)

Can you create a small example and explain how it leads to double unescaping?
This will help me play with it.

With those two statements, the above should become which seems reasonable to me. (...)

Just a note that I fixed your example (the r was misplaced) -- it doesn't change your point.

I think that forcing double \\ should be part of a different PR, if that's what we want to do.

Edit: as a minor note, this would be a breaking change to existing string interpolations in 2.0: {"path": "${drive}:\test"}, as I'm assuming that if we consistently enforce \\ in quoted strings, we should also do it in top-level strings (i.e. outside of interpolations)

I don't think we should make big changes (if any) outside of quoted strings and I think I am convinced we should not require escaping of free backslash if we can make it work and be consistent.

I introduced escaping in the Hydra grammar to support single quote in a single quoted string and double quote in a double quoted string.
The introduction of nested interpolations added more use cases to escaping:

Nesting of quoted strings for the purposes of nested interpolations.
This is a pretty rare one. and in most cases using a different quote style should be enough.

'hello ${identity:"James Bond"}' # however this turns weird if you have another level: 'hello ${identity:"James ${identity:"bond"}"}' # or this? 'hello ${identity:"James ${identity:'bond'}"}' # Both are wrong in their own way. # But this should probably work: 'hello ${identity:"James ${identity:\'bond\'}"}'

The above does not require additional double escaping because we are using the other quote style as well.
If we didn't:

'hello ${identity:\'James ${identity:\\'bond\\'}'\}'

Escaping the interpolation itself:

\${aa} \${foo.bar} \${oc.env:USER}

multiple escaping should never be needed here. I think it's required because of your current implementation of the unescaping mechanism.

# no escaping: ${foo:${bar:zoo}} # both \${foo:\${bar:zoo}} # inner only: ${foo:\${bar:zoo}}

This can be potentially be handled by introducing a token like ESC_INTER_OPEN: \${
By the way, this should also work outside of quoted strings because support interpolations in unquoted strings.

PS: None of the examples above are tested so mistakes are likely.

When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).

Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?

In interpretation 1, how do you write b so that it resolves to \10?

No can do.
Do you have a test that ensuring that this is covered?

(in interpretation 2, you can write b: '\${a}' so that it resolves to ${a})

Can you explain the process getting from \\\${a} to \${a} (in the new code).

When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).

Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?

I had a long reply with a detailed example and my laptop crashed on me :/ So here's the short version, let me know if unclear:

"The new code" (in this PR) is just a grammar change (it doesn't change how the visitor works)

Escaped interpolations \${ are only un-escaped in top-level strings, so they get un-escaped only once. However, currently the second phase may indeed trigger multiple un-escaping of \\ into \ in case of nested quoted strings (this is one thing [ON HOLD] Simpler escaping in the grammar, and fix to quoted values #621 was meant to avoid).

In interpretation 1, how do you write b so that it resolves to \10?

No can do.
Do you have a test that ensuring that this is covered?

Yes, this is this test:

("str_top_esc_backslash", r"Esc: \\${str}", r"Esc: \hi"),

(in interpretation 2, you can write b: '${a}' so that it resolves to ${a})

Can you explain the process getting from \\\${a} to \${a} (in the new code).

It's the same in current master:

The lexer turns \\\${a} into this sequence of tokens: ESC (matching \\), ESC_INTER (matching \${), TOP_STR (maching a}

This is recognized as a toplevelStr by the parser

The visitor code visiting this node is simply return self._unescape(ctx.getChildren()) which will concatenate the un-escaped string representation of these 3 tokens, i.e. \, ${ and a}, yielding the final result \${a}

When the grammar visitor sees a QUOTED_VALUE, it removes the enclosing quotes, does an un-escape pass (currently: un-escaping " into " and \ into , for strings with enclosing double quotes), then re-runs the whole parsing logic over the resulting string => this is when the lexer will see a ${ as an ESC_INTER token (assuming it's at top level).

Are there scenarios in the new code where the second phase is happening more than once on the same characters?
(e.g, what happens in nested quoted strings)?

I had a long reply with a detailed example and my laptop crashed on me :/ So here's the short version, let me know if unclear:

"The new code" (in this PR) is just a grammar change (it doesn't change how the visitor works)

Escaped interpolations \${ are only un-escaped in top-level strings, so they get un-escaped only once. However, currently the second phase may indeed trigger multiple un-escaping of \\ into \ in case of nested quoted strings (this is one thing [ON HOLD] Simpler escaping in the grammar, and fix to quoted values #621 was meant to avoid).

Okay, so once this is in, #621 is supposed to address the repeated un-escaping issue?

Can you explain the process getting from \\\${a} to \${a} (in the new code).

It's the same in current master:

The lexer turns \\\${a} into this sequence of tokens: ESC (matching \\), ESC_INTER (matching \${), TOP_STR (maching a}

This is recognized as a toplevelStr by the parser

The visitor code visiting this node is simply return self._unescape(ctx.getChildren()) which will concatenate the un-escaped string representation of these 3 tokens, i.e. \, ${ and a}, yielding the final result \${a}

Gotcha, this answers my question.
Can you explain why we need a toplevelStr?

Okay, so once this is in, #621 is supposed to address the repeated un-escaping issue?

That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).

Can you explain why we need a toplevelStr?

This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said. Here, there would be two top-level strings: "Hi and ", he said.

That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).

Cool. I think we should try to fix it. I am asking all those questions because I am trying to reach the best solution, not because I think this is not important enough to fix.

This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said. Here, there would be two top-level strings: "Hi and ", he said.

Gotcha.

omry · 2021-04-05T20:50:06Z

tests/test_grammar.py

@@ -137,6 +137,23 @@
    ("str_quoted_too_many_1", "''a'", GrammarParseError),
    ("str_quoted_too_many_2", "'a''", GrammarParseError),
    ("str_quoted_too_many_3", "''a''", GrammarParseError),
+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
+    ("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"),
+    ("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"),


That was my original plan, though I'm open to other approaches for un-escaping (including not actually fixing it).

Cool. I think we should try to fix it. I am asking all those questions because I am trying to reach the best solution, not because I think this is not important enough to fix.

This is for string interpolations, so that users can use any character they want outside of the interpolation part, including quotes. You might remember an example we discussed back in the days, that was something like: "Hi ${name}", he said. Here, there would be two top-level strings: "Hi and ", he said.

Gotcha.

omry · 2021-04-05T20:54:48Z

tests/test_grammar.py

+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),
+    ("str_quoted_trailing_esc_2", "'abc\\\\\\\\'", "abc\\\\"),
+    ("str_quoted_no_esc_1", '"abc\\def"', "abc\\def"),
+    ("str_quoted_no_esc_2", '"abc\\\\def"', "abc\\def"),


To make those tests easier to understand, can you use raw strings?

Suggested change

("str_quoted_no_esc_2", '"abc\\\\def"', "abc\\def"),

("str_quoted_no_esc_2", r'"abc\\def"', r"abc\def"),

Done in 3457d0c

(note: one thing a bit annoying with raw strings is they don't "work" for strings that end with a \ or contain an escaped quote of the same quote type as the enclosing quotes, e.g. "abc\"def" => this is why there still remain some non-raw strings)

omry

Tests looks good.
I am starting to think that maybe we should drive those specific tests from an input text file to eliminate the Python shenanigans.

I added a comment about possibly adding more tests here. (maybe they already covered).

omry · 2021-04-08T18:16:59Z

tests/test_grammar.py

@@ -137,6 +137,23 @@
    ("str_quoted_too_many_1", "''a'", GrammarParseError),
    ("str_quoted_too_many_2", "'a''", GrammarParseError),
    ("str_quoted_too_many_3", "''a''", GrammarParseError),
+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),


Can you add the missing tests here?
They will make it easier for me to form a mental model of what the behavior is after this change.
some input strings single quoted strings.

'' 'a' '\a' 'a\' '"' '\'' 'c:\\''

Here are some nested quoted strings cases, some may be handled in #621:

'${f:"b"}' # nested double quotes '${f:\'b\'}' # nested escaped single quotes

Same as above, but for an escaped interpolation. what is the expected behavior here?
enerally speaking, is parsing inside interpolations the same as parsing outside?

'\${f:"b"}' '\${f:\'b\'}'

omry · 2021-04-14T21:25:38Z

tests/test_grammar.py

+    ("str_quoted_tmp_2", r"'a\'", GrammarParseError),
+    ("str_quoted_inside_quote_different", "'\"'", '"'),
+    ("str_quoted_inside_quote_same", r"'\''", "'"),
+    ("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),


This works?

Suggested change

("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),

("str_quoted_trailing_backslash", r"'c:\\'", "c:\"),

("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),

Loos like rstrings are working for it:

print(r"'c:\\'") 'c:\\'

I would expect \\ to be interpreted as a token equal to \, so the content of the quoted string should be c:\

That would work (after fixing "c:\" which isn't a valid Python string) but there's already a similar test so no need to add it:

("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),

r-strings work as long as the "escaped" quote isn't the same type as the enclosing quote.

That test is similar, but is testing for a double trailing backslash.
Since this is all about backslash and escaping, I think an additional basic test for a single backslash is worth it.

And now I realize that the double backslash is due to Python interpretation of escaping (leaving it as is to make the point of how confusing this is that it trips me even when I am fully in context).
Why not use r-strings here?

In [3]: print(r"'abc\\'") 'abc\\'

Yeah it's a bit tricky to get those \ right :)

Why not use r-strings here?

I explained my reasoning in #630 (comment) (I don't mind changing it though if you like the "partial" r-string version better)

yeah, let's use r-strings whenever possible, where it's not possible add a comment.
This part of the tests could be better tested by reading raw data from a file instead of encoding the data data in Python but I am not sure it's worth it to do it just for the escaping.

Ok, done in 9011928
I used a couple of tricks to use raw strings everywhere -- I can revert those if you don't like them:

Using .strip() for strings ending in with a \, e.g.

("str_quoted_trailing_esc_1", r"'abc\\'", r" abc\ ".strip()),

(technically the leading whitespace is not needed, but I find it looks better)

Using an f-string to include tabs in r-strings, e.g.

("str_esc_ws_2", fr"\ \{TAB}\{TAB}", f" {TAB}{TAB}"),

where TAB is a constant defined as

TAB = "\t" # to be used in raw strings, e.g. `fr"C:\{TAB}foo"`

When reviewing I was scratching my head a bit before figuring out the strip trick.
I think it's fine, but add a comment.
maybe something like:

# raw strings does not allow trailing \, adding a space and stripping it.

Done in 6015a71

omry · 2021-04-14T21:28:04Z

tests/test_grammar.py

@@ -137,6 +137,23 @@
    ("str_quoted_too_many_1", "''a'", GrammarParseError),
    ("str_quoted_too_many_2", "'a''", GrammarParseError),
    ("str_quoted_too_many_3", "''a''", GrammarParseError),
+    ("str_quoted_trailing_esc_1", "'abc\\\\'", "abc\\"),


is actually invalid because "'abc\'" is the string with characters 'abc' and is an invalid quoted string according to the grammar, since the end quote is escaped (=> it's missing another end quote).

I think I meant the characters 'abc\\'. This is again getting confusing because of Python interpreting the backslashes as well. Is this one of the cases where r strings are not working?

Thanks for clarifying the parsing logic.

Fixes omry#617

odelalleau · 2021-04-16T22:15:13Z

Just rebased on top of master to solve a (minor) merge conflict

omry · 2021-04-21T19:51:57Z

will take a look soon, this week is the week of PTED21 so we have been focused on it.

tests/test_grammar.py

omry · 2021-04-22T20:16:17Z

tests/test_grammar.py

+    ("str_quoted_tmp_2", r"'a\'", GrammarParseError),
+    ("str_quoted_inside_quote_different", "'\"'", '"'),
+    ("str_quoted_inside_quote_same", r"'\''", "'"),
+    ("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),


When reviewing I was scratching my head a bit before figuring out the strip trick.
I think it's fine, but add a comment.
maybe something like:

# raw strings does not allow trailing \, adding a space and stripping it.

omry · 2021-04-22T20:22:16Z

tests/test_grammar.py

@@ -227,10 +253,14 @@
    # Interpolations in quoted strings.
    ("str_quoted_inter", "'${null}'", "None"),
    ("str_quoted_esc_single_1", r"'ab\'cd\'\'${str}'", "ab'cd''hi"),
-    ("str_quoted_esc_single_2", "'\"\\\\\\\\\\${foo}\\ '", r'"\${foo}\ '),
+    ("str_quoted_esc_single_2", r"""'"\\\\\${foo}\ '""", r'"\${foo}\ '),


Can you explain this one? I find it surprising that we need 5 slashes before the $ and 1 slash after the }.
Is this something that will be addressed by the followup double escaping diff? if so, what will it look like after we eliminate the double unescaping?

Yes this is related to the double-unescaping. Starting from 5 backslashes:

First un-escaping of quoted string: \\\\\ ==> \\\

Second un-escaping of top-level strings for escaped interpolations: \\\${ => \${

The slash after the } wasn't really useful to test I think, so I removed it in b678b7c (at the same time, I removed the extra quote at the beginning of the string and instead added another test specifically for quotes, which is clearer)

omry · 2021-04-22T20:40:56Z

tests/test_grammar.py

@@ -250,7 +280,7 @@
    ("nested_select", "${options.${choice}}", "A"),
    ("nested_select_getitem", "${options[${choice}]}", "A"),
    ("nested_relative", "${${rel_opt}.b}", "B"),
-    ("str_quoted_nested", r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'", "ABCDEFGH"),
+    ("str_quoted_nested", r"'AB${test:\'CD${test:\\\'EF\\\'}GH\'}'", "ABCDEFGH"),


Can you explain this one?
Is this something that will be addressed by the followup double escaping diff?

No, I would expect this one to remain unchanged. Here's what happens:

Un-escaping of quoted string: AB${test:\'CD${test:\\\'EF\\\'}GH\'} ==> AB${test:'CD${test:\'EF\'}GH'}

Un-escaping of top-level strings: no change because the only top-level string is AB, the rest is an interpolation and interpolations aren't touched in this step

We then start processing the interpolation

Un-escaping of quoted string: CD${test:\'EF\'}GH ==> CD${test:'EF'}GH

Un-escaping of top-level strings: no change (the top-level strings are CD and GH)

We then process the interpolation which is a normal one (yielding the string EF)

The reason why it worked before with fewer backslashes in the middle is because a string like 'foo\\'bar' was accepted as a valid quoted string by the grammar, while this is not the case anymore: now this would be parsed as a quoted string 'foo\\' followed by bar' which is an invalid expression. I don't think we could have kept the previous behavior without facing the subtle bugs of #617.

So the reason we are unescaping in top level strings is just to support escaping interpolations there?
If so, would it simplify things to just say that we only apply the grammar inside interpolations and thus only support escaping of nested interpolations?

I am a bit confused by this:

input: r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'" actual string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'

This is a string outside of interpolation, and it's quoted. why is the result not quoted as well?
I am assuming that this string is interpreted as a value of a config node, so there is no yaml involved here.

I don't like how we need to unescape the nested quoted strings.
feels like it adds a lot of complexity to the mental model required to understand the behavior of quoted strings. before diving into this one, I want to understand 2.

So the reason we are unescaping in top level strings is just to support escaping interpolations there?

Yes.

If so, would it simplify things to just say that we only apply the grammar inside interpolations and thus only support escaping of nested interpolations?

I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR} if we can't escape interpolations in top-level strings?

I am a bit confused by this:

input: r"'AB${test:\'CD${test:\\'EF\\'}GH\'}'" actual string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'

This is a string outside of interpolation, and it's quoted. why is the result not quoted as well?
I am assuming that this string is interpreted as a value of a config node, so there is no yaml involved here.

Oh! Careful, there are several tests in the file, and this one belongs to the tests for the singleElement parser rule (not configValue), so this is parsed as a quoted string like you would use as a resolver argument.

I don't like how we need to unescape the nested quoted strings.
feels like it adds a lot of complexity to the mental model required to understand the behavior of quoted strings. before diving into this one, I want to understand 2.

Let me know how you feel now that hopefully point 2 is clearer.

I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR} if we can't escape interpolations in top-level strings?

How would you do it in OmegaConf 2.0 (not supported at all as far as I can tell).

In principle, you could use decode (although I think it's not working right now for this)

a: ${oc.decode:'Please use \${VAR}'}

Oh! Careful, there are several tests in the file, and this one belongs to the tests for the singleElement parser rule (not configValue), so this is parsed as a quoted string like you would use as a resolver argument.

This is one of the reasons I don't like the current testing approach: it makes it hard to understand what the test data is being used for.

Let me know how you feel now that hopefully point 2 is clearer.

Okay, so back to nested quoted strings:

outer quoted string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'

At some point we considered using lexer modes to support something like this:

'AB${test:'CD${test:'EF'}GH'}'

(Correct me if I am wrong and lexer modes will not actually enable this).
This is becoming much more attractive after the back and forth we had on this issue.
Can we pull this off?
With the exception of the time investment, do you see any downsides to suspect additional corner cases?

I may misunderstand what you're suggesting, but how would you set a config option that evaluates to the string with characters Please use ${VAR} if we can't escape interpolations in top-level strings?

How would you do it in OmegaConf 2.0 (not supported at all as far as I can tell).

Indeed, that was a gap in 2.0. Though you could use ${} as long as it didn't look like an interpolation, for instance "Please use ${VAR!}" would work.

This is one of many enhancements in 2.1, but it's not even a requested one so I would be okay with an alternative implementation if it's reasonably clean and simplify out lives.

In principle, you could use decode (although I think it's not working right now for this)

a: ${oc.decode:'Please use \${VAR}'}

The example you provide doesn't work because the string that is given as input to oc.decode has characters Please use ${VAR} so it crashes trying to resolve the interpolation (in oc.decode).

This is a feature of oc.decode:

I am not sure why it should resolve the interpolation in the first place.

the interpolation is escaped, so it should understand that and not try to resolve it (in principle, I am not talking about the current implementation).

I'm actually surprised that it's allowed by the grammar because I didn't realize that you could have interpolations in unquoted strings (oc.decode uses the singleElement parser rule). So for instance you can write this:

I used a quoted string, so I am not sure why you were surprised.

cfg = OmegaConf.create({"x": "foo", "y": "${identity:abc ${x}}"}) assert cfg.y == "abc foo"

I'm not sure this is even tested right now, I'll double check later.

I have done it before, I usually quote just for clarity in scenarios where the string contains spaces.
Let me know if you find why this works :).

This works with oc.decode:

cfg = OmegaConf.create({"x": r"""${oc.decode:'"Please use \\\\\${VAR}"'}"""}) assert cfg.x == "Please use ${VAR}"

That being said, even if we decide that this is how to do it, it still relies on the escaping of interpolations at the top-level, because the parsing of a quoted string triggers a top-level parsing after the removal of quotes & un-escaping. So I'm not sure what we would gain.

Obviously this is not great. but if we can make a: ${oc.decode:'Please use \${VAR}'} work I would vote to simplify top-level string and maybe even drop it completely.

Okay, so back to nested quoted strings:

outer quoted string: 'AB${test:\'CD${test:\\'EF\\'}GH\'}'

At some point we considered using lexer modes to support something like this:

'AB${test:'CD${test:'EF'}GH'}'

(Correct me if I am wrong and lexer modes will not actually enable this).
This is becoming much more attractive after the back and forth we had on this issue.
Can we pull this off?
With the exception of the time investment, do you see any downsides to suspect additional corner cases?

Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:

'AB${test:"CD${test:'EF'}GH"}'

I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).

I think it should work fine (it used to), just flagging that it means we would now parse the content of quoted strings when we parse an expression (not really a problem, just to make it clear), instead of triggering this parsing only on evaluation.
The main downside I see is the one you had raised initially: it may be an unusual approach that could surprise people.

After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.

In principle, you could use decode (although I think it's not working right now for this)

a: ${oc.decode:'Please use \${VAR}'}

The example you provide doesn't work because the string that is given as input to oc.decode has characters Please use ${VAR} so it crashes trying to resolve the interpolation (in oc.decode).

This is a feature of oc.decode:

I am not sure why it should resolve the interpolation in the first place.

the interpolation is escaped, so it should understand that and not try to resolve it (in principle, I am not talking about the current implementation).

oc.decode currently relies on the same parsing (& visitor) code as the one used to parse (& resolve) the singleElement rule of the grammar, which resolves interpolations.

But oc.decode doesn't see the escaped interpolation: it sees the string with characters Please use ${VAR} where the interpolation isn't escaped. The reason is that the quoted string 'Please use \${VAR}' is first resolved into Please use \${VAR} then passed to oc.decode. So what you're trying to achieve would be done in a simpler way with an identity resolver that would return this string unchanged (instead of oc.decode which parses it according to the grammar rules).

I'm actually surprised that it's allowed by the grammar because I didn't realize that you could have interpolations in unquoted strings (oc.decode uses the singleElement parser rule). So for instance you can write this:

I used a quoted string, so I am not sure why you were surprised.

Hopefully the above comment makes it clearer: oc.decode sees the string with characters Please use ${VAR} which is unquoted.

cfg = OmegaConf.create({"x": "foo", "y": "${identity:abc ${x}}"}) assert cfg.y == "abc foo"

I'm not sure this is even tested right now, I'll double check later.

I have done it before, I usually quote just for clarity in scenarios where the string contains spaces.
Let me know if you find why this works :)

Oh, a quick look at the grammar makes it clear why it works. I was just surprised because for some reason I thought it wouldn't. I just want to make sure it's properly tested.

This works with oc.decode:

cfg = OmegaConf.create({"x": r"""${oc.decode:'"Please use \\\\\${VAR}"'}"""}) assert cfg.x == "Please use ${VAR}"

That being said, even if we decide that this is how to do it, it still relies on the escaping of interpolations at the top-level, because the parsing of a quoted string triggers a top-level parsing after the removal of quotes & un-escaping. So I'm not sure what we would gain.

Obviously this is not great. but if we can make a: ${oc.decode:'Please use \${VAR}'} work I would vote to simplify top-level string and maybe even drop it completely.

I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of \\ from quoted strings (as in #621) then we can probably shave off a few \ from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.

At some point we considered using lexer modes to support something like this:

'AB${test:'CD${test:'EF'}GH'}'

Can we pull this off?

Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:

'AB${test:"CD${test:'EF'}GH"}'

I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).

Oh yes you are right, thanks! I'm not sure where this idea of alternating came from -- for some reason I thought I remembered you had to alternate, but I think I just imagined it.

After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.

Alright, I'm happy to do it, but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance "there is a single quote ' in the middle". I haven't thought in depth about it though.

But oc.decode doesn't see the escaped interpolation: it sees the string with characters Please use ${VAR} where the interpolation isn't escaped. The reason is that the quoted string 'Please use \${VAR}' is first resolved into Please use \${VAR} then passed to oc.decode. So what you're trying to achieve would be done in a simpler way with an identity resolver that would return this string unchanged (instead of oc.decode which parses it according to the grammar rules).

The line of reasoning here seems wrong or missing a step.
But in any case, we shouldn't force decode to do what its' not meant to do.
IF identity works, we can introduce it under some name that makes sense. The use case for not resolving interpolations is very rare and I don't mind if it takes a bit of extra work.

Obviously this is not great. but if we can make a: ${oc.decode:'Please use \${VAR}'} work I would vote to simplify top-level string and maybe even drop it completely.

I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of \\ from quoted strings (as in #621) then we can probably shave off a few \ from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.

Can you explain what you mean by "since the parsing of quoted strings relies on it as well"?
As far as I understand, quoted strings are only special inside interpolations.
Outside interpolations we can have whatever we want, including quotes.

At some point we considered using lexer modes to support something like this:

'AB${test:'CD${test:'EF'}GH'}'

Can we pull this off?

Almost -- you actually need to alternate quote styles to make it work, otherwise there is no way to differentiate (in the greedy way the lexer works) a closing quote vs. one that starts a new segment. So it should instead be:

'AB${test:"CD${test:'EF'}GH"}'

I am not sure there is no way to differentiate:
Once we are in a quoted string (say single), hitting another single quote closes it.
However, if we hit a double quote we are good, but if we hit an interpolation token (IE, non-greedy),
we can push a mode that allows us to start a new single quoted string.
when we seen an interpolation end, we pop (By the way, this might make it impossible for regex to handle with the grammar).

Oh yes you are right, thanks! I'm not sure where this idea of alternating came from -- for some reason I thought I remembered you had to alternate, but I think I just imagined it.

Maybe alternating was something we considered before we found out about lexer modes.

After seeing what where the alternative path leads, I am much more open to a bit of weirdness in exchange for something that is much easier to grok.

Alright, I'm happy to do it,

Great, so just to make sure I understand:
In the case of something like:
${foo: 'outer ${bar: 'inner' }' }

To be clear, this is the inner interpolation: ${bar: 'inner' }
It lives in the quotes string 'outer ${bar: 'inner' }',
which is a parameter of the resolver ${foo: 'outer ${bar: 'inner' }' }

What would the tokens be be?
Does it mean that we are promoting quoted string from a token to a parser rule?

but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance "there is a single quote ' in the middle". I haven't thought in depth about it though.

I think we still need those: how would you create a quoted string that contains both quotes?

single: ', double: "

In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.

Just to follow-up on:

Oh, a quick look at the grammar makes it clear why it works. I was just surprised because for some reason I thought it wouldn't. I just want to make sure it's properly tested.

==> It's indeed properly tested (there is a section "Interpolations in unquoted strings"), so no worries there (except for my failing memory).

But oc.decode doesn't see the escaped interpolation: it sees the string with characters Please use ${VAR} where the interpolation isn't escaped. The reason is that the quoted string 'Please use \${VAR}' is first resolved into Please use \${VAR} then passed to oc.decode. So what you're trying to achieve would be done in a simpler way with an identity resolver that would return this string unchanged (instead of oc.decode which parses it according to the grammar rules).

The line of reasoning here seems wrong or missing a step.

It may become clearer with a simpler example:

cfg = OmegaConf.create({"x": "foo", "y": "${identity:'x is ${x}'}"}) cfg.y # 'x is foo'

What happens is that in the visitor, when we are to resolve ${identity:'x is ${x}'} in visitInterpolationResolver(), we first obtain the name of the resolver (identity) then obtain the list of arguments to the resolver from visitSequence(), which calls visitElement() on each argument. The quoted value 'x is ${x}' is a primitive and its resolution triggers a new parsing pass over its content (x is ${x}) where the interpolation is resolved, resulting in the string x is foo. Then this string is passed to the resolver function associated to identity, which returns it unchanged.

The main point here is that the quoted string is evaluted before it is passed to the resolver. So, now, if you escape the interpolation:

cfg = OmegaConf.create({"x": "foo", "y": r"${identity:'x is \${x}'}"}) cfg.y # 'x is ${x}'

==> the exact same steps are applied, except that the resolution of the quoted value 'x is \${x}' results in the string x is ${x} which is passed as input to the resolver function associated to identity.

Hopefully this makes it clearer why oc.deode sees the string Please use ${VAR} as input in our previous example.

Obviously this is not great. but if we can make a: ${oc.decode:'Please use \${VAR}'} work I would vote to simplify top-level string and maybe even drop it completely.

I don't think we can make this one work (for the reason explained above), but if we remove the un-escaping of \\ from quoted strings (as in #621) then we can probably shave off a few \ from my example.
I'm not really sure how we would "simplify top-level string and maybe even drop it completely", since the parsing of quoted strings relies on it as well.

Can you explain what you mean by "since the parsing of quoted strings relies on it as well"?
As far as I understand, quoted strings are only special inside interpolations.
Outside interpolations we can have whatever we want, including quotes.

What I mean is that the resolution of a quoted string currently works as follows:

Remove the enclosing quotes (let's say these are double quotes ")

Un-escape \\ and \" (not escaped interpolations)

Trigger a new parsing + resolution of the resulting string with the configValue rule (which relies on top-level strings)

This will change with the changes discussed below though, but maybe we can discuss it in another PR related to these changes.

Maybe alternating was something we considered before we found out about lexer modes.

Great, so just to make sure I understand:
In the case of something like:
${foo: 'outer ${bar: 'inner' }' }

To be clear, this is the inner interpolation: ${bar: 'inner' }
It lives in the quotes string 'outer ${bar: 'inner' }',
which is a parameter of the resolver ${foo: 'outer ${bar: 'inner' }' }

What would the tokens be be?
Does it mean that we are promoting quoted string from a token to a parser rule?

Yes.
I have a backup of my old branch which should make things clear:
Lexer: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationLexer.g4
Parser: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationParser.g4

but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance "there is a single quote ' in the middle". I haven't thought in depth about it though.

I think we still need those: how would you create a quoted string that contains both quotes?

Yes indeed, I thought of the same example afterwards. Although I'm fine with keeping escaped quotes, just note that since it's a pretty rare situation we could also instead use resolvers (and we could do the same for escaped interpolations). Something like:

"single ' quote, double ${oc.esc.quote:double} quote, escaped ${oc.esc.interpolation:open} interpolation }"

which would be resolved into single ' quote, double " quote, escaped ${ interpolation }.
(just throwing it out there as an idea, I don't feel strongly either way)

In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.

Agreed.

What I mean is that the resolution of a quoted string currently works as follows:

Remove the enclosing quotes (let's say these are double quotes ")

Un-escape \\ and \" (not escaped interpolations)

Trigger a new parsing + resolution of the resulting string with the configValue rule (which relies on top-level strings)

This will change with the changes discussed below though, but maybe we can discuss it in another PR related to these changes.

Yes, it will be easier to discuss in context, and we are considering implementation changes to this anyway.

Does it mean that we are promoting quoted string from a token to a parser rule?

Yes.
I have a backup of my old branch which should make things clear:
Lexer: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationLexer.g4
Parser: https://github.com/odelalleau/omegaconf/blob/backup_2020_08_09/omegaconf/grammar/InterpolationParser.g4

Great.

but then we also need to decide if we still want to support escaped quotes in quoted strings. I think we don't need them anymore because we should always be able to use a different enclosing quote style instead, for instance "there is a single quote ' in the middle". I haven't thought in depth about it though.
I think we still need those: how would you create a quoted string that contains both quotes?
Yes indeed, I thought of the same example afterwards. Although I'm fine with keeping escaped quotes, just note that since it's a pretty rare situation we could also instead use resolvers (and we could do the same for escaped interpolations). Something like:

"single ' quote, double ${oc.esc.quote:double} quote, escaped ${oc.esc.interpolation:open} interpolation }"

I agree that it's a rare case. but the standard approach for it (in most grammars) is to escape and it might be worth supporting directly (people would intuitively try to escape a quote like that).
My gut feeling is that if we switch to the proposed approach, the rare cases where we need to escape will be much easier to understand, so they would no longer be a concern.

In any case, before you start working on it, what do you think about merging what we have now? It will be easier starting from a clean slate.

Agreed.

Will make a final pass soon, not sure I will have time today (Friday is when I have the most meetings).

omry

You know what, I think this is good to go.
Go ahead.

odelalleau mentioned this pull request Mar 19, 2021

[ON HOLD] Simpler escaping in the grammar, and fix to quoted values #621

Closed

odelalleau requested a review from omry March 22, 2021 15:31

omry requested changes Mar 23, 2021

View reviewed changes

omegaconf/grammar/OmegaConfGrammarLexer.g4 Show resolved Hide resolved

tests/test_grammar.py Outdated Show resolved Hide resolved

omry reviewed Mar 23, 2021

View reviewed changes

odelalleau requested a review from omry March 24, 2021 20:14

omry reviewed Apr 5, 2021

View reviewed changes

omry reviewed Apr 8, 2021

View reviewed changes

omry reviewed Apr 14, 2021

View reviewed changes

odelalleau added 4 commits April 16, 2021 18:14

Fix edge cases in the parsing of quoted values

32d84f9

Fixes omry#617

Add more quoted strings tests

4c61ce5

Use raw strings for more readable tests

4a72b1c

Add more tests

0d7b358

odelalleau force-pushed the better_escaping_v3 branch from 7f23c1b to 0d7b358 Compare April 16, 2021 22:14

Use raw strings as much as possible in test_grammar.py

9011928

odelalleau force-pushed the better_escaping_v3 branch from aa8110f to 9011928 Compare April 17, 2021 12:45

odelalleau requested a review from omry April 21, 2021 17:27

omry reviewed Apr 22, 2021

View reviewed changes

odelalleau added 2 commits April 22, 2021 17:12

Explain the stripping hack

6015a71

Simplify some tests

b678b7c

omry approved these changes Apr 23, 2021

View reviewed changes

odelalleau merged commit e56d1f8 into omry:master Apr 26, 2021

odelalleau deleted the better_escaping_v3 branch April 26, 2021 11:52

odelalleau mentioned this pull request Apr 28, 2021

New syntax for nested quoted values #695

Merged

	("str_quoted_no_esc_2", '"abc\\\\def"', "abc\\def"),
	("str_quoted_no_esc_2", r'"abc\\def"', r"abc\def"),

	("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),
	("str_quoted_trailing_backslash", r"'c:\\'", "c:\"),
	("str_quoted_extra_quote", r"'c:\\''", GrammarParseError),

Fix edge cases in the parsing of quoted values #630

Fix edge cases in the parsing of quoted values #630

Conversation

odelalleau commented Mar 19, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

odelalleau Mar 23, 2021 • edited Loading

Choose a reason for hiding this comment

omry Mar 23, 2021 • edited Loading

Choose a reason for hiding this comment

odelalleau Mar 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

odelalleau Mar 31, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

omry left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

omry Apr 14, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

odelalleau commented Apr 16, 2021

omry commented Apr 21, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

omry Apr 23, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

omry left a comment

Choose a reason for hiding this comment

odelalleau commented Mar 19, 2021 •

edited

Loading

odelalleau Mar 23, 2021 •

edited

Loading

omry Mar 23, 2021 •

edited

Loading

odelalleau Mar 23, 2021 •

edited

Loading

odelalleau Mar 31, 2021 •

edited

Loading

omry left a comment •

edited

Loading

omry Apr 14, 2021 •

edited

Loading

omry Apr 23, 2021 •

edited

Loading