Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

odelalleau · 2021-04-07T20:11:07Z

Describe the bug
The following code triggers a GrammarParseError:

cfg = OmegaConf.create({"x": "\\${not an interpolation}"})
cfg.x

Expected behavior

There should be no grammar syntax constraint after a \${ since it is not a real interpolation.

Discussion

What happens is that when we dereference cfg.x, the resulting string ("${not an interpolation}") is wrapped within a node (AnyNode because here x is untyped), whose constructor calls ValueNode._set_value(), that checks if the the value is an interpolation... and this check raises an exception because it's a badly formed interpolation.

The bandaid fix is to set strict_interpolation_validation=False at

omegaconf/omegaconf/nodes.py

Line 37 in a03f681

value, strict_interpolation_validation=True

It would break the tests that specifically test for this check, but overall things would still work fine.

This bandaid fix, however, wouldn't change the fact that cfg._get_node("x")._dereference_node()._is_interpolation() is True, i.e., when we dereference cfg.x we obtain a node that believes it is an interpolation (!). Right now this isn't really a problem because we don't do anything with this node except getting its value, but it may bite us later.

Right now I don't have a solution that I really like in mind, so posting this issue for discussion.
Here are options I've thought about so far:

The bandaid fix mentioned above (currently my preferred option for 2.1)
Creating a subclass of StringNode specifically when an interpolation result is a string with ${. This subclass would override what it takes to avoid a crash. One issue with this is that if cfg.x is actually typed, we'd like to return another type of node (ex: IntegerNode), so there should be an exception instead (it seems doable, just a bit clumsy)
Adding a flag to StringNode that says "I am not an interpolation" even if the value looks like one. This would be similar to the previous idea, but without adding an extra class.
Add a more general flag to all nodes, is_interpolation={True,False,None}, where None would be the current behavior (check if the value is a string looking like an interpolation), False would make it possible to "declare" the value as not being an interpolation (even if it looks like one), and True would declare it as being an interpolation (regardless of the value). Besides fixing this issue, this could enable more things like (i) caching the interpolation status in that flag for faster checks, or (ii) enabling new kinds of interpolations that aren't tied to the string content (e.g. to point to config nodes that aren't accessible by a path from the root config, as I once did in a draft PR). I kinda like this option but I expect it to be non trivial and uncover issues I haven't thought about yet, so I'd rather not go there in 2.1

Thoughts?

The text was updated successfully, but these errors were encountered:

omry · 2021-04-08T02:12:15Z

The bandaid fix as you call it is also a performance concern.

The idea with having a fast is_interpolation that can return False positives was to improve performance by avoiding full grammar parse to detect interpolations.

The first solution should be acceptable if the performance is good in some real world scenarios.

The second solution sounds terrible.

The last two are going in a direction of caching the parsing results.
I think this is a good direction, we can cache the output of a strict get_value_kind and even the parse tree itself if it makes sense.
This should definitely not happen in 2.1 though, I think it's a pretty big change and it's not trivial to achieve that without potentially paying a significant cost during config initialization (because we are doing more real parses).

Right now the logic in get_value_kind is:

    # We identify potential interpolations by the presence of "${" in the string.
    # Note that escaped interpolations (ex: "esc: \${bar}") are identified as
    # interpolations: this is intended, since they must be processed as interpolations
    # for the string to be properly un-escaped.
    # Keep in mind that invalid interpolations will only be detected when
    # `strict_interpolation_validation` is True.
    if isinstance(value, str) and "${" in value:
        if strict_interpolation_validation:
            # First try the cheap regex matching that detects common interpolations.
            if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
                # If no match, do the more expensive grammar parsing to detect errors.
                parse(value)
        return ValueKind.INTERPOLATION

One thing that looks weird to me is: what is the purpose of calling parse(value)?
Is the intention if get_value_kind to raise an exception?
I think we have discussed it, but looking now it looks weird. My expectation is that if the string is not valid then ValueKind should be VALUE.

Second thing that jumps out:
why do we need the flag strict_interpolation_validation?
What would be the cost of just running the regex for any string that passes the initial ${ sniff test?

odelalleau · 2021-04-08T02:42:11Z

The bandaid fix as you call it is also a performance concern.

It's actually improving performance (strict_interpolation_validation=False is faster, it just won't detect grammar errors).

This should definitely not happen in 2.1 though

Agreed.

One thing that looks weird to me is: what is the purpose of calling parse(value)?
Is the intention if get_value_kind to raise an exception?

Yes. If I remember correctly personally I was fine with having strict_interpolation_validation=False everywhere and letting the code crash on access, but you preferred to have a check earlier (which led to the simple regex to avoid paying too much of a performance hit).

I think we have discussed it, but looking now it looks weird. My expectation is that if the string is not valid then ValueKind should be VALUE.

I think that would cause problems because badly formed interpolations wouldn't be detected anymore.

Second thing that jumps out:
why do we need the flag strict_interpolation_validation?

Same reason as above: to detect errors early.

What would be the cost of just running the regex for any string that passes the initial ${ sniff test?

Running the regex is pointless if we don't do anything when the regex doesn't return a match: the regex is only here to avoid the costly grammar parsing in common situations.

omry · 2021-04-08T03:13:32Z

You misunderstood me.
I meant this:

    if isinstance(value, str) and "${" in value:
        # First try the cheap regex matching that detects common interpolations.
        if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
            # If no match, do the more expensive grammar parsing to detect errors.
            parse(value)
        return ValueKind.INTERPOLATION

omry · 2021-04-08T03:19:24Z

The bandaid fix is to set strict_interpolation_validation=False at

Now I am confused.
I though the solution was to be more strict, not less strict.

If this is not an interpolation, why does the the strictest _get_value_kind says that it is?
You are already parsing the value. at this point you can check if it's an interpolation or not based on what it is, and not by the fact that it passed parsing.

odelalleau · 2021-04-08T12:15:48Z

You misunderstood me.
I meant this:

    if isinstance(value, str) and "${" in value:
        # First try the cheap regex matching that detects common interpolations.
        if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
            # If no match, do the more expensive grammar parsing to detect errors.
            parse(value)
        return ValueKind.INTERPOLATION

What's your objective here? If we remove the strict_interpolation_validation flag then we always validate interpolation syntax, which (i) doesn't help with this issue, and (ii) incurs a performance cost.

The bandaid fix is to set strict_interpolation_validation=False at

Now I am confused.
I though the solution was to be more strict, not less strict.

It is to be less strict so that we don't notice that the string is a badly formed interpolation.

If this is not an interpolation, why does the the strictest _get_value_kind says that it is?
You are already parsing the value. at this point you can check if it's an interpolation or not based on what it is, and not by the fact that it passed parsing.

I'm not sure I understand your last sentence, but I think the confusion comes from me not explaining well enough what is happening. Let me write in more details the steps that occur when we run this code:

cfg = OmegaConf.create({"x": r"\${not an interpolation}"})
cfg.x

When the cfg.x statement is executed, the x node is fetched and its _maybe_resolve_interpolation() method is called
In _maybe_resolve_interpolation() we detect that it is an interpolation (escaped interpolations are considered as a special kind of string interpolation, this is required so that they can be properly un-escaped)
Then we obtain the parse tree with the grammar and call _resolve_interpolation_from_parse_tree()
This calls resolve_parse_tree() which runs the grammar visitor on the string "\${not an interpolation}" and returns the resulting string resolved = "${not an interpolation}" (so far so good)
The next step in _resolve_interpolation_from_parse_tree() is to call _validate_and_convert_interpolation_result(). In that function, we detect that resolved is a string and thus has to be wrapped into a node, so we call _wrap_interpolation_result() on it, which calls _node_wrap()
Since the type passed to _node_wrap() is Any (the type of the x node), _node_wrap() attempts to create an AnyNode whose value is resolved, i.e. the string "${not an interpolation}"
That's where we have a problem: the constructor of AnyNode calls the constructor of ValueNode, which calls ValueNode._set_value(), which checks if the input is an interpolation with strict_interpolation_validation=True... which raises an exception because "${not an interpolation}" is invalid syntax for an interpolation.

omry · 2021-04-08T18:58:38Z

What's your objective here? If we remove the strict_interpolation_validation flag then we always validate interpolation syntax, which (i) doesn't help with this issue, and (ii) incurs a performance cost.

The objective here was to to always be strict, which would solve the wrong problem if this failing by not being strict enough.
The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.
In any case, it's not important.

Going to look at the second part of your answer now.

odelalleau · 2021-04-08T19:37:04Z

The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.

Just to be sure it's clear, the performance cost would come from the various calls to _get_value_kind() that occur across the codebase after assignment, that now would all trigger the regex (and possibly full grammar) checks, which are slower than just checking for the presence of ${. With the current implementation we only run those checks on assignment.

Also all strings that contain ${ must be considered interpolations (i.e, "very few" is actually "zero").

omry · 2021-04-08T19:40:15Z

In _maybe_resolve_interpolation() we detect that it is an interpolation (escaped interpolations are considered as a special kind of string interpolation, this is required so that they can be properly un-escaped)

This could be one way to attack this issue.
Can you give more examples of what considering escaped interpolations as special kind of interpolations buys us?

Went through the rest and ran it in the debugger.
I think this boils down to get_value_kind raising an exception when confronted with a malformed interpolation instead of treating it as a value.

This is a change from the behavior of OmegaConf 2.0, which allows malformed interpolations as values:
(In fact, this is an undocumented breaking change).

In [1]: from omegaconf import *

In [2]: __version__
Out[2]: '2.0.6'

In [3]: cfg = OmegaConf.create({"x": r"${n"})

In [4]: cfg.x
Out[4]: '${n'

We have discussed it before, and I suggested that get_value_kind() does not raise exceptions.

Changing _get_value_kind interpolation detection to this:

    if isinstance(value, str) and "${" in value:
        if strict_interpolation_validation:
            # First try the cheap regex matching that detects common interpolations.
            if SIMPLE_INTERPOLATION_PATTERN.match(value) is None:
                # If no match, do the more expensive grammar parsing to detect errors.
                try:
                    parse(value)
                except GrammarParseError:
                    return ValueKind.VALUE

Fixes this particular issue and is breaking a handful of tests because as expected some things are not considered errors now.

My preference is actually to retain the old behavior and consider broken interpolations as values.

omry · 2021-04-08T19:42:11Z

The performance cost should be negligible because very few strings would pass the ${ check without actually having an interpolation.

Just to be sure it's clear, the performance cost would come from the various calls to _get_value_kind() that occur across the codebase after assignment, that now would all trigger the regex (and possibly full grammar) checks, which are slower than just checking for the presence of ${. With the current implementation we only run those checks on assignment.

"that now would all trigger the regex" if the string contains ${.

odelalleau · 2021-04-08T22:22:59Z

Can you give more examples of what considering escaped interpolations as special kind of interpolations buys us?

It buys us the un-escaping of escaped interpolations. If we said "it's just a regular string" then the following code would return "\${y}" instead of "${y}":

cfg = OmegaConf.create({"x": r"\${y}"})
cfg.x

Note also that even if we make it so that escaped interpolations aren't considered interpolations anymore, it may help in the specific example I gave here, but the same problem would occur if for instance a resolver returns a string that looks like an interpolation:

OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError

I guess what might work is:

Delaying the un-escaping of \${ to the call of ._value() instead of doing it in the grammar visitor
Saying that a string that doesn't contain any regular (i.e, not escaped) interpolation is not an interpolation according to get_value_kind()
Saying that the resolver example I just wrote above is invalid, and one should instead return r"\${trigger crash}" in the resolver (in other words, if the result of resolving an interpolation is a string, it is not allowed to be an interpolation according to get_value_kind())

I'm not 100% sure I like it because it makes the nodes' internals a bit more complex and adds extra processing on each access, but at first glance it seems doable. What do you think?

(edit: actually the "extra processing on each access" isn't a problem I think: it's cheaper than resolving interpolations, the extra cost for regular strings should be minor, and we could cache the result => I'm starting to like this direction, though there remain design questions to be resolved)

(edit2: having a second look at it, one potential implementation concern is that currently get_value_kind() analyzes the string returned by ._value(), so if we un-escape in ._value() then it wouldn't be able to differentiate between escaped vs non-escaped interpolations. We can probably work around this e.g. by adding a flag to ._value() to disable the un-escaping, but it would start getting a big ugly...).

My preference is actually to retain the old behavior and consider broken interpolations as values.

Just to understand, are you saying that it should be the case only when strict_interpolation_validation is True? (as in your code snippet). I'm not comfortable having a different output depending on the value of strict_interpolation_validation -- it might happen to solve this specific problem but it could bite us later.

If we do it also when strict_interpolation_validation is False then I see at least the following issues:

strict_interpolation_validation is actually pointless now because we always need to do this strict validation => there is a performance cost
More importantly, there would be no explicit error raised anymore for badly formed interpolations, instead the program may behave in surprising ways because what was supposed to be an interpolated value suddenly turned into a weird string. Since in the large majority of cases such badly formed interpolations are user error, I think there is value in catching them (I think the OmegaConf 2.0 behavior was pretty bad in that regard -- I actually remember having a few "WTF" moments back when working on the new grammar and trying stuff with 2.0 because I wrote an interpolation that wasn't recognized as such and processed seamlessly as a string instead).
In the context of this issue, the "under-the-hood" behavior would be different depending on whether one writes {"x": "\\${not an interpolation}"} (the node obtained when dereferencing x would be considered as not being an interpolation) vs {"x": "\\${not_an_interpolation}"} (now it would be considered as being an interpolation, since ${not_an_interpolation} would pass the grammar parse step). I'm not sure if there would be any user-facing consequence to that, but it's definitely the kind of subtle difference that can cause problems down the road. In other words, IMO it remains another kind of "bandaid" fix and not a viable long term solution to this specific issue.

"that now would all trigger the regex" if the string contains ${.

If your point is that we only pay a price for interpolations, then yes I agree. But if we don't care why did we bother with the regex version in the first place? (it's only there to speed-up the processing of interpolations).

odelalleau · 2021-04-25T12:42:12Z

@omry just realized we hadn't converged on this one, can you please have a look when you get a chance?

omry · 2021-04-25T16:22:42Z

Yes, I postponed this because having multiple parallel design discussions was too time consuming and I felt it's best to focus on one at a time.
I am planning to get back to this one next week.

…lation Fixes omry#666

Fixes omry#666

omry · 2021-05-10T17:53:00Z

Note also that even if we make it so that escaped interpolations aren't considered interpolations anymore, it may help in the specific example I gave here, but the same problem would occur if for instance a resolver returns a string that looks like an interpolation:
OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError

What is the desired behavior here? that the interpolation is not resolved?
Generally this seems like an odd thing to do.

I guess what might work is:

Delaying the un-escaping of \${ to the call of ._value() instead of doing it in the grammar visitor

Saying that a string that doesn't contain any regular (i.e, not escaped) interpolation is not an interpolation according to get_value_kind()

Saying that the resolver example I just wrote above is invalid, and one should instead return r"\${trigger crash}" in the resolver (in other words, if the result of resolving an interpolation is a string, it is not allowed to be an interpolation according to get_value_kind())

I'm not 100% sure I like it because it makes the nodes' internals a bit more complex and adds extra processing on each access, but at first glance it seems doable. What do you think?

It seems like a reasonable solution.

(edit: actually the "extra processing on each access" isn't a problem I think: it's cheaper than resolving interpolations, the extra cost for regular strings should be minor, and we could cache the result => I'm starting to like this direction, though there remain design questions to be resolved)

(edit2: having a second look at it, one potential implementation concern is that currently get_value_kind() analyzes the string returned by ._value(), so if we un-escape in ._value() then it wouldn't be able to differentiate between escaped vs non-escaped interpolations. We can probably work around this e.g. by adding a flag to ._value() to disable the un-escaping, but it would start getting a big ugly...).

We can resolve it by providing an API to access the raw value to be used by lower level functions like get_value_kind() (it just have it access the field directly instead of through _value()).

My preference is actually to retain the old behavior and consider broken interpolations as values.

Just to understand, are you saying that it should be the case only when strict_interpolation_validation is True? (as in your code snippet). I'm not comfortable having a different output depending on the value of strict_interpolation_validation -- it might happen to solve this specific problem but it could bite us later.

If we do it also when strict_interpolation_validation is False then I see at least the following issues:

strict_interpolation_validation is actually pointless now because we always need to do this strict validation => there is a performance cost

As we have discussed several times, the computation result can potentially be cached to minimize the damage.

More importantly, there would be no explicit error raised anymore for badly formed interpolations, instead the program may behave in surprising ways because what was supposed to be an interpolated value suddenly turned into a weird string. Since in the large majority of cases such badly formed interpolations are user error, I think there is value in catching them (I think the OmegaConf 2.0 behavior was pretty bad in that regard -- I actually remember having a few "WTF" moments back when working on the new grammar and trying stuff with 2.0 because I wrote an interpolation that wasn't recognized as such and processed seamlessly as a string instead).

Yes, this is a bigger problem now that that grammar is significantly more complicated.

In the context of this issue, the "under-the-hood" behavior would be different depending on whether one writes {"x": "\\${not an interpolation}"} (the node obtained when dereferencing x would be considered as not being an interpolation) vs {"x": "\\${not_an_interpolation}"} (now it would be considered as being an interpolation, since ${not_an_interpolation} would pass the grammar parse step). I'm not sure if there would be any user-facing consequence to that, but it's definitely the kind of subtle difference that can cause problems down the road. In other words, IMO it remains another kind of "bandaid" fix and not a viable long term solution to this specific issue.

I am not too worried about this one.

Going to check your PR now.

odelalleau · 2021-05-11T16:55:38Z

OmegaConf.register_new_resolver("oops", lambda: "${trigger crash}")
cfg = OmegaConf.create({"x": "${oops:}"})
cfg.x  # GrammarParseError
What is the desired behavior here? that the interpolation is not resolved?

Correct,x should be set to the string returned by the resolver, whatever it is. We shouldn't care if it looks like an interpolation (if we cared about resolving interpolations we would probably use instead something like ${oc.decode:${oops:}})

We can resolve it by providing an API to access the raw value to be used by lower level functions like get_value_kind() (it just have it access the field directly instead of through _value()).

Yeah, whether it's a new flag, a new method, or just doing a direct access, all are pretty much equivalent (IMO), i.e., it should work but it looks a bit clumsy.

Fixes omry#666

Fix crash with "interpolation-like" strings from interpolations This commit introduces a new node type `InterpolationResultNode` that systematically wraps interpolation results that either (a) are not already nodes, or (b) need to be converted. Fixes #666

odelalleau added the bug Something isn't working label Apr 7, 2021

odelalleau added this to the OmegaConf 2.1 milestone Apr 7, 2021

odelalleau self-assigned this Apr 7, 2021

omry mentioned this issue Apr 29, 2021

New syntax for nested quoted values #695

Merged

odelalleau mentioned this issue May 5, 2021

Follow-up plan regarding changes to resolvers #535

Closed

22 tasks

odelalleau added a commit to odelalleau/omegaconf that referenced this issue May 7, 2021

Fix crash when an interpolation result looks like a malformed interpo…

6feb2d2

…lation Fixes omry#666

odelalleau added a commit to odelalleau/omegaconf that referenced this issue May 7, 2021

Fix crash with "interpolation-like" strings from interpolations

4e0b6dc

Fixes omry#666

odelalleau mentioned this issue May 7, 2021

Fix crash with "interpolation-like" strings from interpolations #709

Merged

odelalleau added a commit to odelalleau/omegaconf that referenced this issue May 11, 2021

Fix crash with "interpolation-like" strings from interpolations

1a5bbf9

Fixes omry#666

odelalleau closed this as completed in #709 May 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

odelalleau commented Apr 7, 2021

omry commented Apr 8, 2021 •

edited

Loading

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021 •

edited

Loading

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021 •

edited

Loading

odelalleau commented Apr 25, 2021

omry commented Apr 25, 2021

omry commented May 10, 2021

odelalleau commented May 11, 2021 •

edited

Loading

Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

Escaped interpolations trigger a GrammarParseError if they don't follow the expected interpolation grammar #666

Comments

odelalleau commented Apr 7, 2021

omry commented Apr 8, 2021 • edited Loading

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021 • edited Loading

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021

omry commented Apr 8, 2021

omry commented Apr 8, 2021

odelalleau commented Apr 8, 2021 • edited Loading

odelalleau commented Apr 25, 2021

omry commented Apr 25, 2021

omry commented May 10, 2021

odelalleau commented May 11, 2021 • edited Loading

omry commented Apr 8, 2021 •

edited

Loading

omry commented Apr 8, 2021 •

edited

Loading

odelalleau commented Apr 8, 2021 •

edited

Loading

odelalleau commented May 11, 2021 •

edited

Loading