
[OmegaConf 2.1] Grammar for parsing of interpolations #321

Closed
wants to merge 52 commits

Conversation

@odelalleau (Collaborator)

This is heavily WIP. Just putting it out there for visibility and to centralize discussions.

@odelalleau (Collaborator, Author)

A few comments about the main differences wrt Hydra's grammar (https://github.com/facebookresearch/hydra/blob/master/hydra/grammar/Override.g4):

  1. I changed the definition of INT a bit to (a) avoid parsing strings like 1__000 or 10_, which are not valid int representations in Python, and (b) add support for positive ints like +3 (see the quick check after this list)

  2. For FLOAT I added support for _ as in 1_0e1_0

  3. The primitive type is quite different. I should probably have found another name but it seemed like a good fit for, well, primitive types like ints / floats / strings / bools. I wasn't sure about the motivation for having all these operators found in Hydra's primitive type, so I went a different path and never felt like I needed them.

  4. The big change is related to interpolations. They can't just be handled through lexer rules since I need a proper parse tree. I actually started from Hydra's lexer rules but ran into some issues; for instance ${foo:1} is not allowed for some reason. I ended up rewriting everything in the parser based on the test strings I was coming up with.
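For reference, the intent is to match exactly what Python's own int() and float() accept on strings (standard Python behavior per PEP 515; a quick check):

int("1_000")      # -> 1000: single underscores between digits are fine
int("+3")         # -> 3: explicit positive sign
float("1_0e1_0")  # -> 100000000000.0: underscores also work in the exponent
int("1__000")     # raises ValueError (double underscore)
int("10_")        # raises ValueError (trailing underscore)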

A couple more points not related to differences with Hydra:

  1. While looking at Hydra's grammar I was wondering about the motivation for parsing everything: it seems like once you know that an override is foo=... with foo a config variable, you could just forward ... to OmegaConf and let it take care of it. What am I missing here?

  2. The grammar I wrote should support a new kind of interpolations, which I'm tentatively naming structured interpolations, of the kind [${foo}, ${bar}] or {foo: ${foo}, bar: ${bar}} (these are just examples, you get the idea). Any objection to introducing them? (or preference for another name)?

@omry (Owner) commented Jul 26, 2020

A few comments about the main differences wrt Hydra's grammar (https://github.com/facebookresearch/hydra/blob/master/hydra/grammar/Override.g4):

  1. I changed the definition of INT a bit to (a) avoid parsing strings like 1__000 or 10_, which are not valid int representations in Python, and (b) add support for positive ints like +3

Cool. I am a bit torn about +3 because it's not really useful.

  2. For FLOAT I added support for _ as in 1_0e1_0

Ok, although I am not sure this is too useful: those numbers are generally too small to justify _.

  3. The primitive type is quite different. I should probably have found another name but it seemed like a good fit for, well, primitive types like ints / floats / strings / bools. I wasn't sure about the motivation for having all these operators found in Hydra's primitive type, so I went a different path and never felt like I needed them.

I basically wanted to whitelist special characters there to ensure there is no ambiguity with the rest of the grammar.

  4. The big change is related to interpolations. They can't just be handled through lexer rules since I need a proper parse tree.

Yup, I believe I mentioned it in one of the messages when we discussed this.

I actually started from Hydra's lexer rules but ran into some issues; for instance ${foo:1} is not allowed for some reason. I ended up rewriting everything in the parser based on the test strings I was coming up with.
I will have a follow-up with a proposed simpler grammar for the interpolations.

A couple more points not related to differences with Hydra:

  1. While looking at Hydra's grammar I was wondering about the motivation for parsing everything: it seems like once you know that an override is foo=... with foo a config variable, you could just forward ... to OmegaConf and let it take care of it. What am I missing here?

You are asking specifically why I even have interpolation parsing in the Hydra grammar, right?
Yes, in principle it's not required, but I want to be 100% sure my grammar won't conflict with interpolations (for example, a dictionary looks very similar to an interpolation).
As an added benefit, it validates that the interpolations are well formed.

  2. The grammar I wrote should support a new kind of interpolations, which I'm tentatively naming structured interpolations, of the kind [${foo}, ${bar}] or {foo: ${foo}, bar: ${bar}} (these are just examples, you get the idea). Any objection to introducing them? (or preference for another name)?

Can you explain more how those are different than the current interpolations? I think I know what you mean but from those examples alone I can't be sure.

"${foo:{'a': 0}}",
"{${foo}: ${bar}}",
"${foo:{${bar}: 0, ${baz}: 1}, 2}_abc",
"{ab_${foo}: c_${baz}}",
omry (Owner):

I think this is too much at this point, I am not expecting people to register so many functions that this is needed.

odelalleau (Collaborator, Author):

Noted

odelalleau (Collaborator, Author):

I'm doing a pass through all unresolved conversations (I'll let you resolve them when you are good).
The syntax ${ab_${foo}:...} is not allowed anymore by the grammar, so this should be ok now.

@omry (Owner) commented Jul 26, 2020

https://gist.github.com/a76e9c4cd076bf10804d8287814883ef

overrides_file parses this input:

a=${a}
a=${a.b.c}
a=${env:foo}
a=foo_${bar}
a=foo_${bar.${zonk}}
a=foo_${bar.1}
a=foo_${func:a,b}
a=${func:1}
a=${func:true}
a=${func:3.14}
a=${func:'true'}
a=${func:{a:10}}
a=${func:{a:[1,2,3]}}
a=${func:{a:{a:10}}}
a=${env:${x},${y}}
a=${env:[1,2,3],[1,2,3]}
a=${env:{a:10},{b:20}}
a=${foo:10,str,3.14,true,false,nan,inf,[1,2,3],${foo:bar},'quoted',"quoted",'a,b,c'}

@omry (Owner) commented Jul 26, 2020

The core of the change is switching from the lexer token to a parser rule:

interpolation:
      '${' primitive '}'                            // simple
    | '${' ID ':' element  (',' element)* '}';  // custom resolver

@odelalleau (Collaborator, Author)

Cool. I am a bit torn about +3 because it's not really useful.

Probably not (same for _ in floats). I was just trying to match what worked when calling int() or float() on a string, for consistency with Python (it's not like these make the grammar more complex to read).

I basically wanted to whitelist special characters there to ensure there is no ambiguity with the rest of the grammar.

Ok I think I see what you mean, though I'd need to see specific examples of where it's useful to be sure.

You are asking specifically why I even have interpolation parsing in the Hydra grammar, right?
Yes, in principle it's not required, but I want to be 100% sure my grammar won't conflict with interpolations (for example, a dictionary looks very similar to an interpolation).
As an added benefit, it validates that the interpolations are well formed.

Not just interpolations. Why even parse dictionaries in the first place? Shouldn't it be handled by OmegaConf?

  2. The grammar I wrote should support a new kind of interpolations, which I'm tentatively naming structured interpolations, of the kind [${foo}, ${bar}] or {foo: ${foo}, bar: ${bar}} (these are just examples, you get the idea). Any objection to introducing them? (or preference for another name)?

Can you explain more how those are different than the current interpolations? I think I know what you mean but from those examples alone I can't be sure.

Currently there are two main types of interpolations, based on your previous explanation in #308 (comment):

  • Simple interpolations (${foo.bar}, i.e. anything that starts with ${ and ends with a matching })
  • String interpolations (foo_${bar}, i.e. concatenation of strings)

The new interpolations I'm proposing don't fit in either of these categories, which is why I'm suggesting to create a new category for them. They would be any list or dict containing an interpolation (possibly deeply nested, e.g. [[[[[[${foo}]]]]]]). I'm saying only "list or dict" because I don't see any other type that could contain interpolations (besides strings, which are already covered by string interpolations), but if there is one, it could be supported as well.

I'm expecting that these new interpolations should come "for free" once the grammar parsing code is complete (since we already need to be able to parse lists / dicts as arguments of custom resolvers for instance).
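To make the idea concrete, a minimal sketch of what a structured interpolation would do (illustrative only; it assumes the whole value is parsed by the new grammar, which is exactly what is being proposed here):

from omegaconf import OmegaConf

cfg = OmegaConf.create({"foo": 1, "bar": 2, "both": "[${foo}, ${bar}]"})
# Under this proposal, resolving cfg.both would yield the actual list [1, 2],
# rather than treating the value as a plain string concatenation.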

@odelalleau (Collaborator, Author)

https://gist.github.com/a76e9c4cd076bf10804d8287814883ef
overrides_file parses this input: (...)

I ran it against my test strings and ran into some failure cases. Note that I consider it a failure when it goes through the reportAttemptingFullContext() or reportContextSensitivity() warnings (locally I replaced these warnings with errors to be sure to catch them in my tests). Here are my failing strings:

a=[[1, 2], [3, 4]]
a={"a": 0, "b": 1}
a={null: [0, 3.14, false], true: {"a": [0, 1, 2], "b": {}}}
a=${foo:bar}
a=${${foo}:bar,3.14,${baz:null}}
a=${foo:{1:${bar}, 3: 4}}
a=${foo:'x='${x},'y='${y}}
a=${foo:10,str,3.14,true,false,nan,inf,[1,2,3], ${foo:bar},'quoted', \"quoted\", 'a,b,c'}
a={null: [0, 3.14, false]}
a=${foo:{'a': 0}}
a={${foo}: ${bar}}
a=${foo:{${bar}: 0, ${baz}: 1}, 2}_abc
a={ab_${foo}: c_${baz}}
a={foo: ${foo}, ${bar}: bar}
a={foo: ${foo}, ${bar}:bar}

Also among your strings there are actually some failures (related to the above-mentioned warnings):

a=${env:foo}
a=${func:1}
a=${func:true}
a=${func:3.14}
a=${foo:10,str,3.14,true,false,nan,inf,[1,2,3],${foo:bar},'quoted',\"quoted\",'a,b,c'}

On the other hand, all of them seem to run fine through my current grammar. At this point, since what I have now seems to provide better coverage of potential use cases, I'm going to go ahead with it so as to have something working; once we have a set of tests, it'll be easier to make things simpler (I'll keep #321 (comment) in mind though, as dropping support for this could help simplify parts of my current grammar).

@omry (Owner) commented Jul 28, 2020

Cool. I am a bit torn about +3 because it's not really useful.

Probably not (same for _ in floats). I was just trying to match what worked when calling int() or float() on a string, for consistency with Python (it's not like these make the grammar more complex to read).

It makes it inconsistent with Hydra, which is much more important to me than consistency with Python.
I am open to changing the grammar in both to support it.

I basically wanted to whitelist special characters there to ensure there is no ambiguity with the rest of the grammar.

Ok I think I see what you mean, though I'd need to see specific examples of where it's useful to be sure.

For example, a comma ',' is treated very differently by Hydra in the context of a simple sweep.
We absolutely cannot have an unquoted , in a string in Hydra. Other things like [] and {} are also treated specially, as well as quotes.

You are asking specifically why I even have interpolation parsing in the Hydra grammar, right?
Yes, in principle it's not required, but I want to be 100% sure my grammar won't conflict with interpolations (for example, a dictionary looks very similar to an interpolation).
As an added benefit, it validates that the interpolations are well formed.

Not just interpolations. Why even parse dictionaries in the first place? Shouldn't it be handled by OmegaConf?

In Hydra 0.11, Hydra was delegating to OmegaConf to perform the merge with the command line.
Even this was not great; for example, sweeping over lists was impossible before.
Hydra 1.0 is adding a ton of dedicated syntax to the command line that goes beyond the scope of OmegaConf.
There is no choice but to parse it there now.
Refer to this to see what is already implemented, and to this and the doc linked there to see what is coming soon.

Another answer is that OmegaConf currently uses yaml to parse the dicts and lists, and it's not nearly good enough.
For example, these will not parse:

# interpolations are not legal in yaml/json inside containers:
{a:${b}} 
[${a.b.c}]

# Simple sweep is not something yaml supports:
key=a,b,c
key=[1,2],[3,4]

The motivation to add a new grammar for Hydra started with these use cases.
Keep this in mind when asking "why is it not in OmegaConf".

  2. The grammar I wrote should support a new kind of interpolations, which I'm tentatively naming structured interpolations, of the kind [${foo}, ${bar}] or {foo: ${foo}, bar: ${bar}} (these are just examples, you get the idea). Any objection to introducing them? (or preference for another name)?
    Can you explain more how those are different than the current interpolations? I think I know what you mean but from those examples alone I can't be sure.

Currently there are two main types of interpolations, based on your previous explanation in #308 (comment):

  • Simple interpolations (${foo.bar}, i.e. anything that starts with ${ and ends with a matching })
  • String interpolations (foo_${bar}, i.e. concatenation of strings)

Well, there is also ${function:p1,p2,...} which is a custom resolver interpolation.

The new interpolations I'm proposing don't fit in either of these categories, which is why I'm suggesting to create a new category for them. They would be any list or dict containing an interpolation (possibly deeply nested, e.g. [[[[[[${foo}]]]]]]).

This fits perfectly fine if you consider that both interpolation types you outlined are just like any other primitives, and they can be elements of a dict or list.
See how I did it in my proposal.

@omry (Owner) commented Jul 28, 2020

https://gist.github.com/a76e9c4cd076bf10804d8287814883ef
overrides_file parses this input: (...)

I ran it against my test strings and ran into some failure cases. Note that I consider it a failure when it goes through the reportAttemptingFullContext() or reportContextSensitivity() warnings (locally I replaced these warnings with errors to be sure to catch them in my tests). Here are my failing strings:

I did not run it through actual Python code yet, so I am not surprised. I am not sure what those warnings actually mean; I was just being extra defensive.
We can see what it would take to clear this.

a=[[1, 2], [3, 4]]
a={"a": 0, "b": 1}
a={null: [0, 3.14, false], true: {"a": [0, 1, 2], "b": {}}}
a=${foo:bar}
a=${${foo}:bar,3.14,${baz:null}}
a=${foo:{1:${bar}, 3: 4}}
a=${foo:'x='${x},'y='${y}}
a=${foo:10,str,3.14,true,false,nan,inf,[1,2,3], ${foo:bar},'quoted', \"quoted\", 'a,b,c'}
a={null: [0, 3.14, false]}
a=${foo:{'a': 0}}
a={${foo}: ${bar}}
a=${foo:{${bar}: 0, ${baz}: 1}, 2}_abc
a={ab_${foo}: c_${baz}}
a={foo: ${foo}, ${bar}: bar}
a={foo: ${foo}, ${bar}:bar}

Also among your strings there are actually some failures (related to the above-mentioned warnings):

a=${env:foo}
a=${func:1}
a=${func:true}
a=${func:3.14}
a=${foo:10,str,3.14,true,false,nan,inf,[1,2,3],${foo:bar},'quoted',\"quoted\",'a,b,c'}

On the other hand, all of them seem to run fine through my current grammar. At this point, since what I have now seems to provide better coverage of potential use cases, I'm going to go ahead with it so as to have something working; once we have a set of tests, it'll be easier to make things simpler (I'll keep #321 (comment) in mind though, as dropping support for this could help simplify parts of my current grammar).

What you have now may work better than the grammar I wrote in 20 minutes, but it's also much more complicated and harder to reason about.
I will take a second look at the grammar I created to see what it will take to fix it.

Unfortunately I am in the middle of another diff working on the grammar, so it will have to wait. In the meantime I think you can make progress on the testing and build integration.

@odelalleau (Collaborator, Author)

The motivation to add a new grammar for Hydra started with these use cases.
Keep this in mind when asking "why is it not in OmegaConf".

Thanks for the additional context, it's much clearer now!

Well, there is also ${function:p1,p2,...} which is a custom resolver interpolation.

I had counted this one as a "simple interpolation" ("anything that starts with ${ and ends with a matching }"), but yes.

The new interpolations I'm proposing don't fit in either of these categories, which is why I'm suggesting to create a new category for them. They would be any list or dict containing an interpolation (possibly deeply nested, e.g. [[[[[[${foo}]]]]]]).

This fits perfectly fine if you consider that both interpolation types you outlined are just like any other primitives and they can be elements of dict or list.

Ok, so actually after looking at some tests it appears that these may already be supported, though as you said Yaml doesn't like them. I thought it wasn't possible because the doc doesn't show an example of this. I guess this point is moot then.

@odelalleau (Collaborator, Author)

Unfortunately I am in the middle of another diff working on the grammar so it will have to wait. In the mean time I think you can make progress on the testing and build integration.

No problem, that's what I've been doing. I just reached the point where all current pytest tests pass. I still have to do some major code cleanup and add more tests, then I'll update this PR with a first "working" version. Afterwards we can see how to make it simpler (it will be much easier to tweak the grammar once we have a clear set of test interpolation strings to support).

@omry (Owner) commented Jul 29, 2020

The new interpolations I'm proposing don't fit in either of these categories, which is why I'm suggesting to create a new category for them. They would be any list or dict containing an interpolation (possibly deeply nested, e.g. [[[[[[${foo}]]]]]]).

This fits perfectly fine if you consider that both interpolation types you outlined are just like any other primitives and they can be elements of dict or list.

Ok, so actually after looking at some tests it appears that these may already be supported, though as you said Yaml doesn't like them. I thought it wasn't possible because the doc doesn't show an example of this. I guess this point is moot then.

YAML is a complex thing: it actually has support for JSON as well! It's called flow style in YAML lingo.
On the command line, which is what the Hydra grammar is targeting, you can't use YAML because of the newlines.
Instead you could use flow style for the values, which kinda works until you hit more complicated use cases (like interpolation):

# this is json parsed by yaml
[1,2,3]

# this is also fine, even though it's not legal json (no quotes around strings):
[a,b,c] 

# But this is not legal flow style:
[${foo}]

# even though this is fine for regular YAML:
bar: ${foo}
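For the record, this is easy to check with PyYAML (standard yaml behavior, nothing OmegaConf-specific):

import yaml

yaml.safe_load("[1,2,3]")      # -> [1, 2, 3] (JSON-style flow sequence)
yaml.safe_load("[a,b,c]")      # -> ['a', 'b', 'c'] (unquoted strings are fine)
yaml.safe_load("bar: ${foo}")  # -> {'bar': '${foo}'} (fine in block style)
yaml.safe_load("[${foo}]")     # raises a parse error ('{' is a flow indicator)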

@odelalleau force-pushed the grammar_interpolations branch from 469a2ce to 0804230 on July 31, 2020 16:51

@odelalleau (Collaborator, Author)

Just pushed a major update -- not quite ready for review yet, I need to add more tests and I'll have a few questions as well (will post them later). Also going to make a few more updates to try and get the CircleCI build working.

@odelalleau (Collaborator, Author)

I suggest that you don't look closely at the interpolation parsing code (= the grammar, the _resolve_complex_interpolation() function, and interpolation_parser.py). I definitely want to simplify it further (I actually ended up adding some ugly hacks to work around some issues). I'm first going to finish the list of test cases for you to validate; once we agree on what should be supported, I'll clean things up.

In the meantime, here are already some questions (in no particular order):

  1. I copied a lot of stuff from Hydra in build_helpers.py. Should I add Hydra's license somewhere?
  2. Does adding antlr's .jar file also require a license? (https://www.antlr.org/license.html)
  3. Related to the build functions in build_helpers.py: I haven't added a test file yet as is done in Hydra. Before doing that, I was wondering about the longer term plan for these build functions that are duplicated between Hydra and OmegaConf: would you remove them from Hydra and call directly the OmegaConf code instead? (in which case I could make them a bit more generic)
  4. I'm curious about the choice of str: for the default resolver name (inter_type = ("str:" if inter_type is None else inter_type)[0:-1]). It makes it sound like we would apply the str() function to obtain the final result, which isn't the case. Also, someone might actually register an str resolver, which could conflict with this. I thought of just getting rid of it but wanted to ask first if there was a reason.
  5. Stack traces are formatted in a way that makes it hard to debug as a developer. Is there a way to get the "real" stack trace? (I ended up just hacking the exception handling code when I ran into such issues, but it's not very convenient as it breaks some other stuff)
  6. My new code makes it so that the inputs to resolvers are not necessarily strings anymore. This is overall a good thing, but it breaks the caching mechanism when such inputs are not hashable (e.g. lists, dicts). For now I just disabled caching in such situations, but would you prefer that I make it work?
  7. The env: resolver uses a decode_primitive() function (return decode_primitive(os.environ[key])) to replace strings with other values. That sounds like a bad idea to me: e.g., if your password stored in an env variable is "1234", it will get converted to an integer. You could get funny things with "true" and "false" as well... What about forcing env to return a string (and only accepting strings as defaults, now that resolvers can take non-str inputs), and letting the user do the cast explicitly? (either with another argument to env:, or by adding new default resolvers like bool:, int: and float:) See the repro sketch after this list.
  8. I said you shouldn't look at interpolation_parser.py just yet, but while building the grammar visitor I was wondering whether I should try and have each visitor function (resolving the value of a node in the parse tree) return an OmegaConf Node, or directly use raw data types (int, str, list, etc.). I ended up doing the latter as I never felt like I needed Nodes except for the root of the tree (which evaluates the full interpolation). I was just curious if you had any thoughts on whether using Nodes everywhere might be better.
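To make point 7 concrete, a small repro of the current decoding behavior described above (it assumes the stock env resolver):

import os
from omegaconf import OmegaConf

os.environ["PASSWORD"] = "1234"
cfg = OmegaConf.create({"password": "${env:PASSWORD}"})
print(type(cfg.password))  # <class 'int'>: the string "1234" was decoded to an int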

@odelalleau force-pushed the grammar_interpolations branch 2 times, most recently from 26d5a60 to 5374722 on July 31, 2020 19:41
@odelalleau (Collaborator, Author)

Just pushed more tests; still not complete, but this shows most of the special cases I've come across. Currently these tests are made to pass even if the behavior is debatable (or clearly wrong), just to show how the code currently behaves.

Next steps (next week):

  1. Finish adding all test cases I have in mind
  2. Identify which need fixing or are special cases we may want to stop supporting in order to simplify the code

@omry (Owner) commented Aug 1, 2020

I suggest that you don't look closely at the interpolation parsing code (= the grammar, the _resolve_complex_interpolation() function, and interpolation_parser.py). I definitely want to simplify it further (I actually ended up adding some ugly hacks to work around some issues). I'm first going to finish the list of test cases for you to validate; once we agree on what should be supported, I'll clean things up.

Sounds good.

In the meantime, here are already some questions (in no particular order):

  1. I copied a lot of stuff from Hydra in build_helpers.py. Should I add Hydra's license somewhere?

No, most of build_helpers from Hydra is copied from Antelope anyway.

  2. Does adding antlr's .jar file also require a license? (https://www.antlr.org/license.html)

It's a good question. Since we are not really distributing it to end users I think the answer is no.

  3. Related to the build functions in build_helpers.py: I haven't added a test file yet as is done in Hydra. Before doing that, I was wondering about the longer-term plan for these build functions that are duplicated between Hydra and OmegaConf: would you remove them from Hydra and call the OmegaConf code directly instead? (in which case I could make them a bit more generic)

I don't think so. For now I think each project should keep a copy of those to allow for maximum flexibility.

  4. I'm curious about the choice of str: for the default resolver name (inter_type = ("str:" if inter_type is None else inter_type)[0:-1]). It makes it sound like we would apply the str() function to obtain the final result, which isn't the case. Also, someone might actually register an str resolver, which could conflict with this. I thought of just getting rid of it but wanted to ask first if there was a reason.

Yeah, just a historical implementation detail. Feel free to change that; I don't think anything is depending on it.

  5. Stack traces are formatted in a way that makes it hard to debug as a developer. Is there a way to get the "real" stack trace? (I ended up just hacking the exception handling code when I ran into such issues, but it's not very convenient as it breaks some other stuff)

Did you look at the function that raises the exception?

  6. My new code makes it so that the inputs to resolvers are not necessarily strings anymore. This is overall a good thing, but it breaks the caching mechanism when such inputs are not hashable (e.g. lists, dicts). For now I just disabled caching in such situations, but would you prefer that I make it work?

Well, I prefer that it didn't break the caching.
Generally, this is also a breaking change and we need to think what it means for existing code and if it's possible to maintain backward compatibility somehow. (even if temporarily).

  7. The env: resolver uses a decode_primitive() function (return decode_primitive(os.environ[key])) to replace strings with other values. That sounds like a bad idea to me: e.g., if your password stored in an env variable is "1234", it will get converted to an integer. You could get funny things with "true" and "false" as well... What about forcing env to return a string (and only accepting strings as defaults, now that resolvers can take non-str inputs), and letting the user do the cast explicitly? (either with another argument to env:, or by adding new default resolvers like bool:, int: and float:)

The env resolver is used in the config:

foo: ${env:PASSWORD}

I can see value in allowing parsed values here (in fact, it should be parsed using the grammar, and decode_primitive should be deleted).
If you really want to enforce specific types here you can use Structured Configs.

@dataclass
class Config:
    foo: str = II("env:password")

  8. I said you shouldn't look at interpolation_parser.py just yet, but while building the grammar visitor I was wondering whether I should try and have each visitor function (resolving the value of a node in the parse tree) return an OmegaConf Node, or directly use raw data types (int, str, list, etc.). I ended up doing the latter as I never felt like I needed Nodes except for the root of the tree (which evaluates the full interpolation). I was just curious if you had any thoughts on whether using Nodes everywhere might be better.

Generally, try to keep things as simple as possible.
In this case, parsing directly to a node creates an unhealthy dependency from the grammar to the API of OmegaConf itself (I would like that dependency to go the other way).
It's unclear to me if returning the nodes directly would help in any way.
Generally speaking, with Hydra I have been creating simple data structures to be returned by the parse tree in case it does need to return something more sophisticated. Here are some new ones from my current PR.
I think it's fine to return the parsed primitives until you have an actual reason to do otherwise.

Comment on lines +14 to +16
[tool.pytest.ini_options]
addopts = "--import-mode=append"

omry (Owner):

I have never seen this. How come it's not needed for Hydra?

odelalleau (Collaborator, Author) commented Aug 2, 2020:

That surprised me as well. Just looked into it more closely and this is because of this line in setup.py:

entry_points={"pytest11": ["hydra_pytest = hydra.extra.pytest_plugin"]},

Because of this, the hydra module is imported from its pip install location before pytest prepends the hydra root folder to sys.path (this is pytest's default behavior, see https://docs.pytest.org/en/stable/pythonpath.html). So the tests run against the installed version as intended, but essentially "by luck".

omry (Owner):

Oh, nasty. I was wondering if the difference in behavior had anything to do with it.
That pytest plugin is fairly recent (maybe a couple of months old).

How did you initially notice it was running against the local version and not the installed version? Shouldn't the local version also have the generated parser code?

odelalleau (Collaborator, Author):

How did you initially notice it was running against the local version and not the installed version? shouldn't the local version also have the generated parser code?

No. When you run pip install ., the parser code is not generated locally, only in the installed version.

omry (Owner):

When running pip install ., the parser is generated while building the sdist/wheel, which in turn is installed.
Why do you need the generated code in this scenario? It's installed and available.

odelalleau (Collaborator, Author):

When running pip install ., the parser is generated while building the sdist/wheel, which in turn is installed.
Why do you need the generated code in this scenario? It's installed and available.

So, starting from a clean repo, running pip install . generates the grammar code and installs it somewhere in a path that looks like /Users/odelalleau/opt/miniconda3/envs/py38-omega/lib/python3.8/site-packages/omegaconf. However, if you look at the content of ./omegaconf/grammar/gen, this folder is still empty.

Without this change to the pytest config, pytest would use ./omegaconf instead of /Users/odelalleau/opt/miniconda3/envs/py38-omega/lib/python3.8/site-packages/omegaconf and thus wouldn't be able to find the generated files.

omry (Owner):

So, starting from a clean repo, running pip install . generates the grammar code and installs it somewhere in a path that looks like /Users/odelalleau/opt/miniconda3/envs/py38-omega/lib/python3.8/site-packages/omegaconf. However, if you look at the content of ./omegaconf/grammar/gen, this folder is still empty.

This is what I am expecting.
But for starters, the current logic in nox (master) just runs pytest in the dir.
In Hydra it also does pip install . (or pip install -e .).
Can you try that and see if it eliminates the need for this hack?
(Again, trying to control the creeping complexity here.)

odelalleau (Collaborator, Author):

Let's continue this thread in #321 (comment)

omry (Owner):

Actually it's not; this is a different issue.

odelalleau (Collaborator, Author):

Actually it's not; this is a different issue.

I'll follow up about this comment in the new PR.

@odelalleau (Collaborator, Author)

Thanks for all the answers! A few follow-ups:

  6. My new code makes it so that the inputs to resolvers are not necessarily strings anymore. This is overall a good thing, but it breaks the caching mechanism when such inputs are not hashable (e.g. lists, dicts). For now I just disabled caching in such situations, but would you prefer that I make it work?

Well, I prefer that it didn't break the caching.
Generally, this is also a breaking change and we need to think what it means for existing code and if it's possible to maintain backward compatibility somehow. (even if temporarily).

Ok I'll see how to do both (keep caching + maintain backward compatibility).

(in fact, it should be parsed using the grammar and decode_primitive should be deleted).

Ok will do that.

@odelalleau (Collaborator, Author)

Tests are still WIP, but one thing I'd like to clarify ASAP is the handling of top-level strings, because it has a significant impact on the design of the grammar; for now I hacked something together, but it definitely needs fixing. By "top-level strings" I mean strings that are not within an interpolation, as in:

  1. http://${server}:${port}
  2. this is the year ${year}
  3. my_list:${a}, ${b}, ${c}
  4. !@#$${foo} %^&*()${bar};,.:[]{}
  5. "I may like "${like}' but not '${dislike}
  6. "This is not an interpolation: ${fake}"

Initially I was parsing all strings the same way, regardless of whether they were top-level or within an interpolation. However, this meant only a subset of characters was allowed, and one had to escape commas (so for instance examples 3 and 4 didn't work without adding quotes as in example 5).

My main question is whether we should keep letting people use pretty much any kind of top-level string in string interpolations, or force them to use quotes (as in example 5) so as to have the same parsing logic both within and outside of interpolations. The former is more convenient for the user, but the latter makes parsing easier (since top-level is not a special case anymore).

@omry (Owner) commented Aug 2, 2020

  6. My new code makes it so that the inputs to resolvers are not necessarily strings anymore. This is overall a good thing, but it breaks the caching mechanism when such inputs are not hashable (e.g. lists, dicts). For now I just disabled caching in such situations, but would you prefer that I make it work?

Well, I prefer that it didn't break the caching.
Generally, this is also a breaking change and we need to think what it means for existing code and if it's possible to maintain backward compatibility somehow. (even if temporarily).

Ok I'll see how to do both (keep caching + maintain backward compatibility).

Once you have some ideas about possible backward compatibility paths, let's discuss them.
One idea is to introduce a new parameter to register_resolver, variables_as_strings, with a default value of None.
If the value at runtime is None, issue a deprecation warning and ask the user to make a choice, explaining that the default value will become False in a future version.

(in fact, it should be parsed using the grammar and decode_primitive should be deleted).

Ok will do that.

We can have a few follow-ups for the new grammar.
In particular, dot-path parsing can switch to it, and we can more easily support mixing dot notation and dict notation:

foo:
  bar:
    - zonk: [a,b,c]
foo.bar[0][zonk].0
-> a

There could also be a few other things that can benefit from the grammar.

@omry (Owner) commented Aug 2, 2020

Tests are still WIP, but one thing I'd like to clarify ASAP is the handling of top-level strings, because it has a significant impact on the design of the grammar; for now I hacked something together, but it definitely needs fixing. By "top-level strings" I mean strings that are not within an interpolation, as in:

  1. http://${server}:${port}
  2. this is the year ${year}
  3. my_list:${a}, ${b}, ${c}
  4. !@#$${foo} %^&*()${bar};,.:[]{}
  5. "I may like "${like}' but not '${dislike}
  6. "This is not an interpolation: ${fake}"

Initially I was parsing all strings the same way, regardless of whether they were top-level or within an interpolation. However, this meant only a subset of characters was allowed, and one had to escape commas (so for instance examples 3 and 4 didn't work without adding quotes as in example 5).

My main question is whether we should keep letting people use pretty much any kind of top-level string in string interpolations, or force them to use quotes (as in example 5) so as to have the same parsing logic both within and outside of interpolations. The former is more convenient for the user, but the latter makes parsing easier (since top-level is not a special case anymore).

The original grammar you based this on is the command-line grammar in Hydra, which has some constraints that are not present in yaml files (or strings in Python).
The primary usage of this grammar is to parse interpolations and strings containing interpolations: not to parse the command line or the full yaml file, but at most a single string value that may contain a simple interpolation, zero or more string interpolations, and zero or more custom resolvers.

There should generally be no limitations on anything outside of the ${interpolation} itself.
With context-aware grammars, you can easily have a more restrictive context when parsing the interpolation than what you have outside.
This means that when passing interpolations from the command line, some things may have to be escaped or quoted, but by the time they reach this parsing logic they should already be unquoted/unescaped.

Quoting interpolations is an interesting question. Thinking about it, I don't think it should be possible to bulk-escape an interpolation by quoting: quoting is used in many contexts.
For example:

"What is your name", he asked?
"My name is ${first_name}", I said.

I think escaping the $ should be enough to prevent interpolation expansion:

HOME: moonbase_alpha
cmd: python foo.py --home=\${HOME}

should be interpreted as

python foo.py --home=${HOME}

and not as

python foo.py --home=moonbase_alpha
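A sketch of what this would look like from Python, assuming backslash-escaping of $ (the exact escape syntax was still an open question at this point):

from omegaconf import OmegaConf

cfg = OmegaConf.create(
    {"HOME": "moonbase_alpha", "cmd": "python foo.py --home=\\${HOME}"}
)
# With $-escaping, cfg.cmd would resolve to:
#   python foo.py --home=${HOME}
# and not to:
#   python foo.py --home=moonbase_alpha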

@odelalleau (Collaborator, Author)

Once you have some ideas about possible backward compatibility paths, let's discuss them.
One idea is to introduce a new parameter to register_resolver, variables_as_strings, with a default value of None.
If the value at runtime is None, issue a deprecation warning and ask the user to make a choice, explaining that the default value will become False in a future version.

Sounds like a good plan!

Regarding the cache mechanism, I intend to convert lists and dicts to tuples to compute the key. I'm not planning to sort the dicts unless you think that'd be better (since some functions' output may depend on the ordering of keys).
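Something like this minimal sketch (illustrative only, not the PR's actual helper; dict keys are kept in insertion order, per the no-sorting choice above):

def make_hashable(x):
    # Recursively convert lists/dicts to tuples so the result can serve
    # as a cache key; scalars are already hashable.
    if isinstance(x, (list, tuple)):
        return tuple(make_hashable(v) for v in x)
    if isinstance(x, dict):
        return tuple((k, make_hashable(v)) for k, v in x.items())
    return x

make_hashable([1, {"a": [2, 3]}])  # -> (1, (('a', (2, 3)),))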

We can have a few followups for the new grammar.
In particular, parsing dot path can switch to it, and we can more easily support mixing dot notation and dict notation:

foo:
  bar:
    - zonk: [a,b,c]
foo.bar[0][zonk].0
-> a

Marking that for the future, but not planning to add it in the first iteration.

@omry (Owner) left a comment:

Halfway through [resolver] Add options to give resolvers access to the config

This matches Hydra's override grammar.
Comment on lines +13 to +16
def deps(session, local_install):
    session.install("--upgrade", "setuptools", "pip")
-   session.install("-r", "requirements/dev.txt", ".", silent=True)
+   extra_flags = ["-e"] if local_install else []
+   session.install("-r", "requirements/dev.txt", *extra_flags, ".", silent=True)
omry (Owner):

I would rather not make the noxfile more complicated if we can get away with it; this kind of creeping complexity is not good.
Is this also a problem with Hydra?
Generally I don't care about the coverage of the generated code at all, and I think I am excluding it from the coverage reporting in Hydra.

@@ -43,7 +47,7 @@ def coverage(session):

@nox.session(python=PYTHON_VERSIONS)
def lint(session):
-   deps(session)
+   deps(session, local_install=True)
omry (Owner):

See if you can eliminate the local_install mode.

Comment on lines -394 to +392

-def is_bool(st: str) -> bool:
-    st = str.lower(st)
-    return st == "true" or st == "false"
+    return ret(ValueKind.INTERPOLATION)
omry (Owner):

Okay, once I am done with the step-by-step review I will make a final review of the combined diff.
The change I am seeing here changes the meaning of the function, which is why I brought it up.

@@ -362,19 +362,23 @@ def resolve_simple_interpolation(
     self,
     key: Any,
     inter_type: str,
-    inter_key: str,
+    inter_key: Tuple[Any, ...],
omry (Owner):

Yeah, this is probably an existing confusion there.
See if you can clear it up. Can you describe how you would split the function, and what the responsibility of each of the two functions would be?

Comment on lines 401 to 419

# The `args_as_strings` warning is triggered when the resolver is
# called instead of when it is defined, so as to limit the amount of
# warnings (by skipping warnings when all inputs are strings).
if args_as_strings and any(not isinstance(k, str) for k in key):
    non_str_arg = [k for k in key if not isinstance(k, str)][0]
    warnings.warn(
        f"Resolvers that take non-string arguments should now be registered "
        f"with `args_as_strings=False`, and their code should be updated to "
        f"ensure it works as expected with non-string arguments. This "
        f"warning is raised because resolver '{name}' was registered with "
        f"the current default `args_as_strings=True` and received at least "
        f"one non-string argument (`{non_str_arg}`). Although we converted "
        f"such non-string arguments to strings to preserve backward "
        f"compatibility, this behavior is deprecated => please update "
        f"resolver '{name}' as described above. Alternatively, you may "
        f"ensure that all its arguments are strings, e.g., by enclosing "
        f"them within quotes.",
        category=UserWarning,
    )
omry (Owner):

For deprecation warnings I often create dedicated GitHub issues and link to them.
A recent example is issue 1089 in the Hydra repo; take a look.

Comment on lines 938 to 952

new_key: List[Any] = []  # will store the new key elements
hashable_item: Any
for idx, item in enumerate(key):
    if item is None or isinstance(item, (int, float, bool, str)):
        hashable_item = item
    elif isinstance(item, list):
        hashable_item = _make_hashable(tuple(item))
    elif isinstance(item, tuple):
        hashable_item = _make_hashable(item)
    elif isinstance(item, dict):
        # We sort the dictionary so that the order of keys does not matter.
        hashable_item = _make_hashable(tuple(sorted(item.items())))
    else:
        raise NotImplementedError(f"type {type(item)} cannot be made hashable")
    new_key.append(hashable_item)
omry (Owner):

This function is complex and bugs in it will result in very subtle behavior changes that will be very hard to identify.
Do you have dedicated tests for it?
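For instance, a dedicated test could pin down the exact key produced for nested inputs (a sketch; it assumes _make_hashable returns new_key as a tuple, which the hunk above doesn't show):

def test_make_hashable() -> None:
    key = _make_hashable((1, [2, 3], {"b": 1, "a": 0}))
    # lists become tuples, and dict items are sorted so key order doesn't matter
    assert key == (1, (2, 3), (("a", 0), ("b", 1)))
    hash(key)  # the whole key must now be hashable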


@odelalleau (Collaborator, Author)

Looks like some new CI checks were added but fail on this branch (https://results.pre-commit.ci/run/github/147219819/1603908169.w-VpUbDkREuPfGTp3sgZ-g). Any hint as to where I should look to get this fixed?

@omry (Owner) commented Oct 28, 2020

I disabled the pre-commit.ci, it should not appear in the next push.

@odelalleau (Collaborator, Author)

I disabled the pre-commit.ci, it should not appear in the next push.

Great, thanks!

I pushed a new version centered around our discussions related to register_resolver().

I think it might be easier for you to continue the review if I integrate these changes into the older commits. Either starting at d375552 (the commit you stopped halfway through), its parent, or its child. Any preference? (Or I can just leave it like that if you don't mind :) )

@omry (Owner) commented Nov 16, 2020

I missed your commit and your question, sorry.
I will try to review next week.

If you can reorganize the changes such that the updates to register_resolver are on top of (or in place of) the old register_resolver commit, it would be easier. If it proves difficult, let me know.

Since this will require a force push, you might as well also rebase on top of master (there have been numerous changes, hopefully nothing that will conflict with yours).

@omry (Owner) left a comment:

Okay, I reviewed the newest 4 diffs. No need to shift them around.

@odelalleau (Collaborator, Author)

Okay, I reviewed the newest 4 diffs. No need to shift them around.

Thanks. As a result I didn't do any rebase, but let me know when it's a good time to rebase on top of master (keeping in mind that it'll invalidate the commit hashes I've been referring to, so better to do it once everything else is resolved, IMO).

@omry (Owner) commented Nov 18, 2020

Okay, the ball is back in my court to resume reviewing where I stopped.

As a result, resolvers can no longer access the parent node, and the env resolver can no longer parse node interpolations.
@omry (Owner) commented Nov 26, 2020

Okay, there have been enough stacked changes that I feel like reviewing in order no longer makes sense.

Can you create a new PR rebased on master with all the changes in this one squashed?
I can review it from scratch as one unit.

@odelalleau (Collaborator, Author)

Can you create a new PR rebased on master with all the changes in this one squashed?
I can review it from scratch as one unit.

Will do, closing this one then!

@odelalleau (Collaborator, Author)

New PR with squashed commits is up in #445 (I believe the commits you hadn't reviewed in order yet were either related to docs, the build system, or trivial, so it was indeed a good time to squash).

I'll do a pass over current discussions in this PR to move them over to the new PR if they still seem relevant.

@omry (Owner) commented Nov 26, 2020

Sounds good, thanks!
