Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecating auto-expand of nested dataclasses without default value in OmegaConf 2.0 #412

Closed
omry opened this issue Oct 18, 2020 · 15 comments

Comments

@omry
Copy link
Owner

omry commented Oct 18, 2020

TLDR

You have a config with nested dataclass without a default value.
The behavior of this will change in OmegaConf 2.1 and you should give it a default value (either MISSING or an object of the appropriate type).

Read on to see why this is changing and what are the migration options.

Primitive fields in OmegaConf 2.0

Consider the following:

from omegaconf import MISSING
from dataclasses import dataclass

@dataclass
class User1:
    age: int = MISSING
    name: str = MISSING

The MISSING values are required in OmegaConf 2.0. Creating a Structured Config from User1 with fields that does not have a default value will result in this error:

ValueError: Missing default value for User1.age, to indicate default must be populated later use 
OmegaConf.MISSING

Primitive fields in OmegaConf 2.1

OmegaConf 2.1 will allow that and will automatically treat age and name as MISSING, making this equivalent to the definition above:

@dataclass
class User2:
    age: int
    name: str

Nested dataclasses in OmegaConf 2.0

In contrast to primitive values like the ones above, OmegaConf 2.0 will not complain about nested data classes without a default value:

class Config:
    user: User1

Instead, It will auto expand using the fields from the class User.
This behavior is surprising and is in conflict the behavior primitive fields in 2.1.

Nested dataclasses in OmegaConf 2.1

OmegaConf 2.1 will change this behavior to be in line with that for primitive fields:
The above will be equivalent to:

class Config:
    user: User1 = MISSING

Migration options

If you are relying on the deprecated behavior you will need to assign an explicit value to create the same config.

Recommended option

Assign an actual object of the appropriate type:

class Config:
    user: User1 = User1()

Note that if you omit the default values for age and name like in User2, you will have to manually provide default values for Python to instantiate the object. You can still use MISSING:

class Config:
    user: User2 = User2(age=MISSING, name=MISSING)

More compact but not recommended option:

An more compact alternative is to assign the type itself to the field:

class Config:
    user: User = User # type: ignore

Python type checkers will not be happy about it. use # type: ignore to allow it anyway.
For this reason it's recommended to assign an actual object here.

@romesco
Copy link

romesco commented Oct 18, 2020

This makes sense. In my current use case: pytorch/hydra-torch#21,
we are going to use the Recommended option 👍

@omry omry closed this as completed Jun 4, 2021
@raphCode
Copy link

raphCode commented Dec 27, 2022

Sorry for digging this up, but I spent my day struggling with the updated behavior of nesting dataclasses.

First, the difference between not assigning anything and an actual object instance is not documented in the manual. It is only implicitely hinted at by using factories in the examples.

Second, the situation that lead to my confusion:

  • I use a fairly complex nested structured config as schema / validation for a user-supplied hydra config
  • Even without assigning actual objects, the type information from nested dataclasses is taken into account correctly. Put differently: everything works for my use case so far, so I did not consider adding ugly factories or object assignments to my dataclasses.
  • I wanted to add some default values in my schema that the user config can't overwrite (_convert_ keys for hydra)
  • The resulting config had ??? in all the keys that should contain default values

I find this result highly confusing and would very much prefer the old behavior because:

  • The typing information of nested classes is considered even without assignments, so it is surprising this does not work for values as well
  • Even more confusing, default fields show up in the merged config, but their value is crippled sometimes (see example)
  • The proposed solutions litter my structured config with assignments that create visual noise and no obvious benefit (or don't typecheck) - in short, they are not very pythonic

Here is an example:

from dataclasses import dataclass
from omegaconf import OmegaConf, errors
from contextlib import suppress

@dataclass
class SubConfig:
    x: int
    y: str = "hi"

@dataclass
class Config:
    s: SubConfig

c = OmegaConf.structured(Config)

print(c)
# {'s': '???'}

# merging c first works as expected...
print(OmegaConf.merge(c, dict(s=dict(x=3))))
# {'s': {'x': 3, 'y': 'hi'}}

# ... but I can't add new keys:
with suppress(errors.ConfigKeyError):
    OmegaConf.merge(c, dict(s=dict(x=3), extra=5))

# when trying to merge the other way round, the nested field 'y' is
# carried into the resulting config, but the value is lost:
print(OmegaConf.merge(dict(s=dict(x=3)), c))
# {'s': {'x': 3, 'y': '???'}}

What is the advantage of the updated behavior? The first post mentions that is should create less surprises, but I think actually the opposite is true.

@raphCode
Copy link

Pinging @Jasha10 since he seems more active recently in the repo.

If you find time, please tell me your thoughts or what I did wrong.

@Jasha10
Copy link
Collaborator

Jasha10 commented Dec 28, 2022

Hi @raphCode,
My impression is that your confusion is somewhat orthogonal to the behavior change described by the original post above.

Let me start here:

# ... but I can't add new keys:
with suppress(errors.ConfigKeyError):
    OmegaConf.merge(c, dict(s=dict(x=3), extra=5))

Being unable to add new keys is a feature of using structured configs; this feature is active whether there are missing values present or not.
For example:

c_non_missing = OmegaConf.structured(Config(SubConfig(123)))
print(c_non_missing)  # {'s': {'x': 123, 'y': 'hi'}}
with suppress(errors.ConfigKeyError):
    OmegaConf.merge(c_non_missing, dict(s=dict(x=3), extra=5))

The same ConfigKeyError is raised regardless of whether c or c_non_missing is used:

omegaconf.errors.ConfigKeyError: Key 'extra' not in 'Config'
    full_key: extra
    object_type=Config

@raphCode
Copy link

Hi, thank you for your explanation.

I agree that the example about adding new keys was a bit out of context.
However, I still believe my perceived issue is related to the behavior change. Maybe I find better words below:

As a hydra user, I value structured configs because I can check the types and presence of values in a user specified config. For this validation to happen, hydra merges the structured config with the user config at some point.
In OmegaConf terms, this looks like one of these options:

OmegaConf.merge(structured_config, user_config)
OmegaConf.merge(user_config, structured_config)

I can control which one happens with the hydra defaults list.
The validation happens in both cases since that is always triggered when the merge involves a DictConfig which was created from dataclasses.
In my app, apart from validating user configs, I have two more requirements:

  1. the user config must be able to add new keys
  2. default values from the structured config should be carried into the merged config

I played around and noticed that this correctly merges nested default values, in other words fulfills requirement 2:

OmegaConf.merge(structured_config, user_config)

Please note this works even without default assignments for the nested dataclasses, see my first example here.

For requirement 1 I need to swap the merge partners.
I expected when I swap them, I would get the same resulting config. As I tried to demonstrate in my last example here, it does not work.
My confusion:

  • The merges have a different result although no keys are in conflict
  • The keys of the structured config items are always present in the config, but the default values are not

I now suspect that with the old pre-2.0 behavior, I would get the same results.
Is my understanding correct so far?

@Jasha10
Copy link
Collaborator

Jasha10 commented Dec 28, 2022

To make sure I understand correctly: the point is that the following two print statements give different results, right?

c = OmegaConf.structured(Config)
print(OmegaConf.merge(dict(s=dict(x=3)), c))  # {'s': {'x': 3, 'y': '???'}}
print(OmegaConf.merge(c, dict(s=dict(x=3))))  # {'s': {'x': 3, 'y': 'hi'}}

I now suspect that with the old pre-2.0 behavior, I would get the same results.

You might be right about this. We could also potentially regard this as an OmegaConf.merge bug. I'm not sure what the pros and cons of changing the merge behavior would be (and I'm not sure how breaking of a change it would be for downstream users).

default values from the structured config should be carried into the merged config

In my own work I tend to guard against missing values being passed to my app by using one of the following OmegaConf routines:

  • assert not OmegaConf.missing_keys(cfg)
  • OmegaConf.to_container(cfg, throw_on_missing=True)
  • OmegaConf.to_object(cfg)

@raphCode
Copy link

To make sure I understand correctly: the point is that the following two print statements give different results, right?

Yes, I don't understand why they differ.

My actual point is that I am complaining about the behavior change and don't see a benefit in the new way of handling nested dataclasses:

Merge order discrepancy

While I was experimenting, I discovered that omegaconf still seems to auto-expand structured configs sometimes, as you correctly summarized.
I don't know if that is considered a bug, but it hints that the new behavior is at least not implemented thoroughly.
At the very least I am annoyed that the functionality I want is still present, but I can't use it due to my requirement 1 as outlined in my previous post.

Bad migration options

The recommended options suggest a dataclass instance to be created and assign omegaconf.MISSING to the empty fields, creating a coupling between the dataclass and omegaconf. This dependency was the reason to remove the requirement for MISSING assignments.
The not recommended option suggests to assign the plain class, which requires ugly typechecking silencing.

Useless default assignments

It forces me to add default assignments to nested dataclasses, which I do not like because:

  • it adds visual noise to the class definitions - not very pythonic
  • it makes the dataclasses instantiable with nonsense default values

The dependency and instancing concern annoy me especially because I want use the dataclasses as a container for the config at runtime and for typechecking my code which accesses it.

Structured configs are all about nesting

This is implicitly acknowledged because even without default assignments, the types of nested dataclasses are validated - or should I say auto-expanded?
I don't see a single use case where one does not want the auto-expansion for values as well.

Default values in ListConfig

Auto-expand lets me carry default values into a ListConfig with varying length:

from attrs import define
from omegaconf import OmegaConf

@define
class SubConfig:
    x: int
    y: str = "hi"

@define
class Config:
    s: list[SubConfig]

c = OmegaConf.structured(Config)

print(OmegaConf.merge(c, dict(s=[dict(x=3)]*2)))
# {'s': [{'x': 3, 'y': 'hi'}, {'x': 3, 'y': 'hi'}]}

I found not way to perform this operation with the new behavior, since the list length must be given in the default assignment of the dataclass:

s: list[SubConfig] = [SubConfig]  # actual list instance with fixed length, does not adapt to user configs of differing lengths
s: list[SubConfig] = list[SubConfig]  # Invalid value assigned: GenericAlias is not a ListConfig, list or tuple.

@omry
Copy link
Owner Author

omry commented Dec 30, 2022

# when trying to merge the other way round, the nested field 'y' is
# carried into the resulting config, but the value is lost:
print(OmegaConf.merge(dict(s=dict(x=3)), c))
# {'s': {'x': 3, 'y': '???'}}

This does feel unexpected.
Probably worth trying to fix, not 100% sure it's related to this issue or what are the implications of changing this behavior.
Maybe file another issue and try to work on a fix. See what tests this breaks and why.

@omry
Copy link
Owner Author

omry commented Dec 30, 2022

cc @odelalleau

@odelalleau
Copy link
Collaborator

# when trying to merge the other way round, the nested field 'y' is
# carried into the resulting config, but the value is lost:
print(OmegaConf.merge(dict(s=dict(x=3)), c))
# {'s': {'x': 3, 'y': '???'}}

This does feel unexpected. Probably worth trying to fix, not 100% sure it's related to this issue or what are the implications of changing this behavior. Maybe file another issue and try to work on a fix. See what tests this breaks and why.

Yes I agree, I feel like this should probably be considered as a bug. The merge() code is somewhat complex and has a bunch of ad-hoc logic to handle the various scenarios it can encounter, and it's possible that this one is just not handled correctly right now.

@Jasha10
Copy link
Collaborator

Jasha10 commented Dec 30, 2022

One motivation for keeping the current behavior is the special treatment of MISSING with respect to overwriting other arguents: if the second argument to merge has missing values, those values will not overwrite the first argument to merge:

OmegaConf.merge({'y': 'bye'}, {'x': 3, 'y': '???'})  # {'y': 'bye', 'x': 3}
OmegaConf.merge({'y': 'bye'}, {'x': 3, 'y': 'hi'})  # {'y': 'hi', 'x': 3}

This means aggressively auto-expanding default arguments would have implications for multi-step merges.

# current behavior:
>>> merge({'s': {'y': 'bye'}}, merge(dict(s=dict(x=3)), c))
{'s': {'y': 'bye', 'x': 3}}
# behavior after proposed change:
>>> merge({'s': {'y': 'bye'}}, merge(dict(s=dict(x=3)), c))
{'s': {'y': 'hi', 'x': 3}}

Currently, 'bye' from the left-hand argument is not overwritten. I worry that changing this could have ripple effects for Hydra.

@odelalleau
Copy link
Collaborator

This means aggressively auto-expanding default arguments would have implications for multi-step merges.

That's a good point, but personally I would favor intuitive and useful behavior on single-step merges before worrying about multi-step merges (which, anyway, are just a sequence of single-step merges, so if we agree on what a single-step merge should do, hopefully that makes it clear what the multi-step should do as well).

Also, I haven't double checked, but normally if s is actually a structured SubConfig (and not just a dict), then nothing should happen when we merge a Config (with missing SubConfig field) into it, regardless of whether s's fields are missing or not. I think that should not change. Which means this subtle behavior shouldn't come up when working only with structured configs (but I haven't read the whole discussion in details, so if there's actually a problem with only structured configs, sorry I missed it!)

@raphCode
Copy link

raphCode commented Jan 1, 2023

Okay, I get the current behavior will probably stay.

Nonetheless I would really like to see auto-expand functionality during merges, not only because it feels nicer, also because there are some use cases where there is just no alternative:
When merging a structured configs contained in a list, the information how many list items should be expanded is only available at merge time. See here for a code example (at the bottom).

How about using dataclass field metadata to opt-in to auto-expand during merges?
Like the omegaconf_ignore functionality.
That would enable auto expansion on a per-field basis, and in the case of attrs, it can be implemented cleanly with a field transformer.

@raphCode
Copy link

raphCode commented Jan 5, 2023

This does the trick for me now:

def merge_structured_config_defaults(cfg: Any) -> None:
    """
    This function takes an OmegaConf Config and recursively merges the non-optional
    default values of the underlying structured config classes in-place.
    This is necessary because the user config may override important keys in the schema,
    like _partial_ that control instantiation. Merging the schema defaults ensures the
    correct values of these keys.
    Keys set to None in the schema are Optional and may be overridden by the user, so
    these are not replaced.

    This manual implementation is necessary because Omegaconf disabled auto-expanding of
    nested structured configs, otherwise merging the schema on top of the user config
    would do the trick:
    https://github.com/omry/omegaconf/issues/412
    The proposed solutions are unnecessary verbose (default assignments) and worse, they
    don't allow for default values to propagate into variable-length lists.
    """
    if isinstance(cfg, DictConfig):
        for key in cfg:
            if not OmegaConf.is_missing(cfg, key):
                merge_structured_config_defaults(cfg[key])

        t = OmegaConf.get_type(cfg)
        if omegaconf._utils.is_structured_config(t):
            defaults = OmegaConf.structured(t)
            for key in cfg:
                if key in defaults and defaults[key] is not None:
                    OmegaConf.update(cfg, key, defaults[key])  # type: ignore [arg-type]

    elif isinstance(cfg, ListConfig):
        for item in cfg:
            merge_structured_config_defaults(item)

I don't mind if this function would be added to Omegaconf itself.

@omry
Copy link
Owner Author

omry commented Jan 6, 2023

Currently, 'bye' from the left-hand argument is not overwritten. I worry that changing this could have ripple effects for Hydra.

Good catch.
I think the desired behavior is that only missing values are brought in from missing dataclasses on autoexpand.
Such values would carry the type and would not override an existing value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants