Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate keys are not handled properly #165

Open
horenmar opened this issue May 10, 2018 · 20 comments
Open

Duplicate keys are not handled properly #165

horenmar opened this issue May 10, 2018 · 20 comments
Labels

Comments

@horenmar
Copy link

YAML spec in version 1.2 says that in a valid YAML file, mapping keys are unique.

This is not the case in pyyaml, as can be seen by loading this sample file:

a:
  - b
  - c

a:
  q: a
  q: b

The correct result from loading this file is an error. pyyaml instead naively overwrites results with the last key, resulting in this dict: {'a': {'q': 'b'}}.

@anz-rfc
Copy link

anz-rfc commented Oct 16, 2018

some discussion of workarounds here: https://gist.github.com/pypt/94d747fe5180851196eb

@mady143
Copy link

mady143 commented Feb 25, 2019

I am also getting same issue but its slight different click here: gunthercox/ChatterBot#1640

could any one find the solution

openstack-gerrit pushed a commit to openstack/openstack that referenced this issue Feb 26, 2019
* Update placement from branch 'master'
  - Merge "Fix a bad granular gabbi test"
  - Fix a bad granular gabbi test
    
    A gabbi test for multiple member_of<N> was using the `query_parameters`
    keyword to construct the querystring; but it was trying to test what
    happens when member_of<N> is specified multiple times. In this test, N
    was the same for both, but the query parameters were being entered as
    separate dict keys; so the querystring was only being constructed with
    the latter value, because yaml [1]. The test was passing spuriously
    because it happens to be the case that the result would be the same if
    only the latter value is specified. (If you reversed the order of the
    two member_ofZ, the test failed.)
    
    This change re-YAMLs the query_parameters to use list syntax for the
    values of the member_of key, from which gabbi will dtrt in constructing
    the querystring.
    
    NB: It might be nice to find a test scenario where the false positive
    wouldn't have been possible; but that would be a bigger effort that
    could possibly entail reswizzling the GranularFixture and therefore the
    whole gabbit. Done this way, we're at least sure that both values are
    making it to the handler; and switching the order in the querystring has
    no effect (though the order is apparently not guaranteed/deterministic
    anyway [2]).
    
    [1] yaml/pyyaml#165
    [2] https://gabbi.readthedocs.io/en/latest/example.html (search for
    query_parameters)
    
    Change-Id: I10f28d8c21643be69b67f25dcc043cd9640eac42
openstack-gerrit pushed a commit to openstack/placement that referenced this issue Feb 26, 2019
A gabbi test for multiple member_of<N> was using the `query_parameters`
keyword to construct the querystring; but it was trying to test what
happens when member_of<N> is specified multiple times. In this test, N
was the same for both, but the query parameters were being entered as
separate dict keys; so the querystring was only being constructed with
the latter value, because yaml [1]. The test was passing spuriously
because it happens to be the case that the result would be the same if
only the latter value is specified. (If you reversed the order of the
two member_ofZ, the test failed.)

This change re-YAMLs the query_parameters to use list syntax for the
values of the member_of key, from which gabbi will dtrt in constructing
the querystring.

NB: It might be nice to find a test scenario where the false positive
wouldn't have been possible; but that would be a bigger effort that
could possibly entail reswizzling the GranularFixture and therefore the
whole gabbit. Done this way, we're at least sure that both values are
making it to the handler; and switching the order in the querystring has
no effect (though the order is apparently not guaranteed/deterministic
anyway [2]).

[1] yaml/pyyaml#165
[2] https://gabbi.readthedocs.io/en/latest/example.html (search for
query_parameters)

Change-Id: I10f28d8c21643be69b67f25dcc043cd9640eac42
@parrenin
Copy link

I just wanted to comment that in my case, the overwrite is a feature and not a bug.
In my model, I have several "sites", each of them with a yaml parameter file.
And I have also a parameter file which apply to all sites.
So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry.
So it would be helpful to at least keep an option for the overwrite feature.

@marvingreenberg
Copy link

This seems to duplicate #41.

@marvingreenberg
Copy link

marvingreenberg commented Aug 26, 2019

So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry.

@parrenin I think this is misguided. Since this violates the spec, you don't know whether the original or the overridden value would be active (without looking at the implementation, which could change at any time, since it is not part of the spec.) Simply reading the "base" and "overriding" settings as two separate files, and combining as you see fit would be more correct.

@marvingreenberg
Copy link

The documentation, specifically "Deviations from the specification" in https://pyyaml.org/wiki/PyYAMLDocumentation should be updated.

@marvingreenberg
Copy link

I looked in the code, and fixing this is pretty simple (famous last words). I was going to submit a pull request, but there are actually tests validating that duplicate keys ARE ALLOWED, so obviously the maintainers have some different opinions about this. The patch at https://gist.github.com/marvingreenberg/ba2f67f740400f722680fc7c538e94a0 does prevent duplicate keys, and seems correct. But, there are then 4 test failures. Three of them seem simple to simply disable (or change to verify that an exception is raised). The fourth I don't quite understand, and may have something to do with places where YAML becomes "not actually so simple" like references and other subtle things. If pyyaml simply intends to keep this behavior (since 2006) I'd say closing this case and updating the documentation is warranted.

@marvingreenberg
Copy link

And not surprisingly, the patch I include is incomplete since there is a separate constructor for a yaml "ordered mapping" which specifically has a comment that "# Note: we do not check for duplicate keys, because it's too CPU-expensive." (which again seems misguided to me).

@perlpunk
Copy link
Member

perlpunk commented Aug 26, 2019

One problem is that this logic would break the functionality of merge keys:
https://yaml.org/type/merge.html
See def flatten_mapping.

To fix this, we would need a seperate handling of the merge keys and the actual keys in the mapping.

I don't know which tests were failing (haven't tested your patch yet), but I guess the ones with merge keys were among them.

@marvingreenberg
Copy link

If you want I can take the trouble of making a pull request, and try my best at updating the tests to be complete. I am a little unclear about what the goals, according to the folks in a better position to decide (the maintainers), are with respect to this particular issue. I think it's important to follow the spec -- silently and without documentation ignoring the spec seems wrong to me. But, YAML is definitely more complicated than "oh, its just a simple little thing".

@perlpunk
Copy link
Member

@marvingreenberg yeah, I think forbidding duplicate keys should be the default.
Adding an option for disabling it might be nice.

But merge key behaviour should stay the same.

@ingydotnet is the one to decide about this.

@perlpunk
Copy link
Member

@parrenin

So I concatenate the general parameter file with each individual parameter file which thus overwrite the general parameter file when there is a duplicate entry

I think this sounds like a use case for merge keys, actually.

encukou added a commit to encukou/naucse_render that referenced this issue Nov 25, 2019
Part of the fix for: pyvec/naucse#2
Workaround for: yaml/pyyaml#165

The workaround is incomplete, but should be OK for our uses.
@perlpunk
Copy link
Member

perlpunk commented Jan 7, 2020

The YAML 1.1 spec also says that keys should be unique: https://yaml.org/spec/1.1/#id861060

The content of a mapping node is an unordered set of key: value node pairs, with the restriction that each of the keys is unique

encukou added a commit to encukou/naucse_render that referenced this issue Jan 10, 2020
Part of the fix for: pyvec/naucse#2
Workaround for: yaml/pyyaml#165

The workaround is incomplete, but should be OK for our uses.
@wimglenn
Copy link

@perlpunk Does that include the merge keys themselves, or only the keys after a resolved merge?

e.g.

- k0: v0
  vars: &splat0
    VAR0: potato  # vars from blob0

- k1: v1
  vars: &splat1
    VAR1: spud    # vars from blob1

- k: v
  extra:          # merged vars from blob0 + blob1
    <<: *splat0
    <<: *splat1

@perlpunk
Copy link
Member

perlpunk commented Feb 13, 2020

@wimglenn the correct way to use multiple merge keys is actually:

- k: v
  extra:          # merged vars from blob0 + blob1
    <<: [ *splat0, *splat1 ]

That way duplicate merge keys can be avoided.

Other than that, the spec is actually not very specific about equality of nodes.
For example, 0xa and 10 are different before resolving, but equal after.

@wimglenn
Copy link

Then what about

- k: v
  extra:
    VAR0: before
    <<: [ *splat0, *splat1 ]
    VAR1: after

I know merge spec says that splat0 wins in case of duplicate keys in splat0 & splat1, but what does it say about dupes outside of the merge caused by resolving the merge?

@perlpunk
Copy link
Member

@wimglenn Duplicates should only be considered in the actual mapping definition itself. Merge keys are explicitly there for merging, so they should override keys.

By the way, the spec https://yaml.org/type/merge.html is not very detailed about this, but from how I read the spec and the example, it does not matter where the merge key appers - as the first key, in the middle or at the end. The order is always the same:

---
map:
  <<: [ *alias2, *alias3, *alias4 ]
  key: value1
  # key: something # would be an error

value1 overrides *alias2, *alias2 overrides *alias3, etc.

bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 7, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 7, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 13, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 13, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 13, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
bdhirsh added a commit to pytorch/pytorch that referenced this issue Oct 13, 2021
The codegen should error if it sees two yaml entries with the same key. The default behavior of python's yaml loader is to overwrite duplicate keys with the new value (see yaml/pyyaml#165).

This would have caught a nasty bug that showed up in https://github.com/pytorch/pytorch/pull/66225/files#r723796194.

I tested it on that linked PR, to confirm that it errors correctly (and gives the line number containing the duplicate).


Differential Revision: [D31464585](https://our.internmc.facebook.com/intern/diff/D31464585)

[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

9 participants