Duplicates in selector output #23
As for me, I haven't seen any implementations that try to remove duplicate values from the result, and I don't think it's a good idea to introduce this behavior as the default. But maybe it's useful to provide some way to filter out duplicated values explicitly. Your second case (about selecting the same node twice) is more interesting, because we have to provide some default behavior; I will investigate how the implementations handle this case. |
I think it might be useful to throw a scenario in the consensus project to get actual results of what the various implementations do. |
A scenario has been added. Just waiting on the report to regenerate. It looks like the majority just don't support the example query. For those that do, it seems to be an even mix of all three approaches. No consensus implies no pressure to implement any particular behavior, but we need to discuss and agree on it. |
An option is to allow some latitude in the spec and say that "duplicates MAY be removed, either by location or by value". But I think in that case we'd also need to tie down when that applies. Should it just apply to unions? Could it also apply in other cases? FWIW, using "MAY" will considerably complicate the CTS, especially if the CTS is purely declarative (e.g. cts.json), as that would put the onus on each CTS implementation to allow duplicates to be removed if appropriate. Also, does allowing duplicates to be removed imply anything about ordering? Can non-duplicates change order, for example? |
I think it can apply to any case where the same value appears in the result more than once. I've opened a new issue for discussing result sequence. I think we need to answer that before we can discuss how it affects duplicates. |
To expand a little on the reason for combining duplicates, I offer a more extensive example and a somewhat informal proof of why eliminating duplicates would be a good thing (when not including location in the output). The instance is similar to Goessner's example, but I've added color to the books and moved the other items under a "toy" category so that I can search for array items and use an expression:
{
"store": {
"book": [
{
"category": "reference",
"author": "Nigel Rees",
"title": "Sayings of the Century",
"price": 8.95 },
{
"category": "fiction",
"author": "Evelyn Waugh",
"title": "Sword of Honour",
"price": 12.99,
"color": "blue"
},
{
"category": "fiction",
"author": "Herman Melville",
"title": "Moby Dick",
"isbn": "0-553-21311-3",
"price": 8.99,
"color": "red"
},
{
"category": "fiction",
"author": "J. R. R. Tolkien",
"title": "The Lord of the Rings",
"isbn": "0-395-19395-8",
"price": 22.99,
"color": "green"
}
],
"bicycle": [
{
"color": "red",
"price": 19.95
}
],
"tricycle": [
{
"color": "blue",
"price": 9.95
},
{
"color": "red",
"price": 19.95
}
]
}
}
Now if we want the prices of all the things that are red, we can use a filter on color. Let's ignore the price selection for now; the path selecting the red things returns:
// $.store.book[2]
{
"category": "fiction",
"author": "Herman Melville",
"title": "Moby Dick",
"isbn": "0-553-21311-3",
"price": 8.99,
"color": "red"
}
// $.bicycle[0]
{
"color": "red",
"price": 19.95
}
// $.tricycle[1]
{
"color": "red",
"price": 19.95
}
Notice that the results for the bicycle and the tricycle are identical. Duplicate removal resolves this problem. Even when we include locations, this can be used as an internal optimization: an implementation could identify that these are the same object and keep track of all the locations which contain it. |
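The location-tracking optimization described above might be sketched as follows. This is a hypothetical illustration, not code from any implementation in this thread: `dedupe_by_value` and the normalized path strings are assumptions, and canonical JSON serialization is one assumed way to compare unhashable compound values.

```python
# Hypothetical sketch: remove duplicate values from a nodelist while keeping
# track of every normalized path that produced each value.
import json

def dedupe_by_value(nodes):
    """nodes: iterable of (path, value) pairs.

    Returns (value, [paths]) entries in first-seen order, merging entries
    whose values are equal. Compound JSON values are unhashable in Python,
    so a canonical JSON dump is used as the dictionary key.
    """
    seen = {}   # canonical serialization -> index into out
    out = []
    for path, value in nodes:
        key = json.dumps(value, sort_keys=True)
        if key in seen:
            out[seen[key]][1].append(path)
        else:
            seen[key] = len(out)
            out.append((value, [path]))
    return out

# Assumed normalized paths, mirroring the red-things example above:
results = [
    ("$['store']['book'][2]", {"color": "red", "price": 8.99}),
    ("$['bicycle'][0]", {"color": "red", "price": 19.95}),
    ("$['tricycle'][1]", {"color": "red", "price": 19.95}),
]
merged = dedupe_by_value(results)
# The identical bicycle/tricycle values collapse into one entry with two paths.
```

The equal values at distinct locations merge into a single result that still remembers both locations, which is the internal optimization the comment describes.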
I'm confused. Why do |
Ugh... Oversight of editing while authoring. I'll fix it. Aaand now I have to rework it to get the duplicates again. |
@mkmik updated. |
I think this is trading off performance, in some fairly rare cases, against simplicity of the spec. I'd prefer to keep the spec simpler. |
Allowing latitude defeats interoperability, which I had thought was the raison d'être of the IETF. But I note there's no mention of interoperability as a design goal in the draft. From previous discussions, I think it is only necessary to consider removal of duplicates by location (not by value).
I don't think so. I think it should apply however duplicates arise, whether from unions, wildcards, or recursive descent. This also simplifies removal.
Or perhaps: either duplicates must be removed, or duplicates must not be removed, or implementations must default to one of these two and provide an option for the other.
I don't think it affects order at all. Given the data {
"books":
[
{
"title" : "A Wild Sheep Chase",
"author" : "Haruki Murakami",
"price" : 22.72
},
{
"title" : "The Night Watch",
"author" : "Sergei Lukyanenko",
"price" : 23.58
},
{
"title" : "The Comedians",
"author" : "Graham Greene",
"price" : 21.99
},
{
"title" : "The Night Watch",
"author" : "Phillips, David Atlee"
}
]
} and selector
The results with duplicates are:
The results without duplicates are:
Clearly the second set can easily be obtained from the first without rearranging the items; see "How do you remove duplicates from a list whilst preserving order?" Note that implementations have some latitude as to when to remove duplicates. As long as both values and normalized paths are collected, the implementation could delay removal until the last step, before returning results to the user. |
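The order-preserving removal referenced above (the classic "seen set" idiom) can be sketched as follows. The helper name is an assumption, and the title list stands in for the selector results that were lost from the comment:

```python
# Order-preserving duplicate removal: keep the first occurrence of each value
# and drop later repeats without rearranging anything else.

def distinct_preserving_order(items, key=repr):
    seen = set()
    out = []
    for item in items:
        k = key(item)     # repr() keys let unhashable values through as well
        if k not in seen:
            seen.add(k)
            out.append(item)
    return out

# Assumed result with duplicates, drawn from the books data above:
titles = ["A Wild Sheep Chase", "The Night Watch",
          "The Comedians", "The Night Watch"]
distinct = distinct_preserving_order(titles)
# -> ["A Wild Sheep Chase", "The Night Watch", "The Comedians"]
```

The second "The Night Watch" is dropped while the remaining items keep their original relative order, which is the point being made above.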
Following up on gregsdennis's motivation for wanting to remove duplicates, there's a related use case that's applicable when a user wants to modify the original data based on addressing information obtained from a query. Such addressing information could be in the form of normalized paths, or it could be provided by references or pointers into the original data, in languages that support that. In either case, the issue is the same. So suppose that a user queries for a collection of prices for all books in the store, and wishes to discount them all by twenty percent. If duplicates come back and the same price gets marked down twice, that's going to be a problem. This is not entirely hypothetical: there are jsonpath libraries that are used in this way. More generally, it's hard to think of use cases where duplicates would be beneficial. It's easy to think of use cases where they would be detrimental. For an implementation that collects normalized paths along with values, removal is easy. Of course, removal comes with a cost, and some users may prefer not to pay that cost, particularly as most straightforward queries don't result in duplicates anyway. |
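A minimal sketch of the double-discount hazard described above. Addressing information is simplified here to (key, index, key) tuples rather than real normalized paths; `discount_prices` and the sample data are illustrative assumptions, not any jsonpath library's API.

```python
# Hypothetical sketch: apply a discount once per *unique* location, even if
# a query handed back the same location more than once.
import copy

def discount_prices(doc, paths, pct):
    """Apply a percentage discount at each unique path exactly once."""
    doc = copy.deepcopy(doc)
    for path in set(paths):          # dedupe: never mark a node down twice
        target = doc
        for step in path[:-1]:
            target = target[step]
        target[path[-1]] = round(target[path[-1]] * (1 - pct), 2)
    return doc

store = {"books": [{"title": "A Wild Sheep Chase", "price": 22.72}]}
# A query that returns duplicates could hand back the same location twice:
hits = [("books", 0, "price"), ("books", 0, "price")]
discounted = discount_prices(store, hits, 0.20)
# One 20% markdown: 22.72 -> 18.18. Without the set(), the discount would be
# applied twice, yielding 14.54 instead.
```

Deduplicating by location before mutating is what prevents the same price from being marked down twice.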
I know the jury is still out on whether we remove duplicate nodes, but I've prepared draft PRs for the compliance test suite (jsonpath-standard/jsonpath-compliance-test-suite#2) and the reference implementation (jsonpath-standard/jsonpath-reference-implementation#27) to handle removal of duplicate nodes. Any comments gratefully received. |
I am preparing an issue (#88) about unions, where removing duplicate nodes is also a point |
I have a question about duplicates. Consider a root value
and suppose that a query returned two values
corresponding to paths
Would the node identified by |
No, we only consider duplicates at the top level of the returned nodelist. So the values to compare are an array and a string. That the array also happens to contain the string makes no difference. |
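The top-level-only comparison described in the answer above might be sketched like this. The helper is hypothetical, and canonical JSON serialization is an assumed way to compare unhashable compound values:

```python
# Nodes are compared as whole values at the top level of the nodelist; a
# value nested inside another result is not a duplicate of it.
import json

def top_level_duplicates(nodelist):
    seen, dups = set(), []
    for value in nodelist:
        key = json.dumps(value, sort_keys=True)
        if key in seen:
            dups.append(value)
        seen.add(key)
    return dups

# An array and a string it contains are different top-level values:
assert top_level_duplicates([["a"], "a"]) == []
# Only an identical top-level value counts as a duplicate:
assert top_level_duplicates([["a"], "a", "a"]) == ["a"]
```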
Consider: {
"x": {
"a": {
"a": {
"b": 1
}
}
}
} and this jsonpath: The first part of the jsonpath applying |
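The exact query in the comment above was lost in extraction. Assuming a query shape like $..a..b (an assumption, not the original), here is a sketch of how two descendant segments can deliver the same node twice against the nested-"a" instance:

```python
# Minimal descendant-segment sketch over {"x": {"a": {"a": {"b": 1}}}}.

def descendants(value):
    """Yield value and every structural descendant, in document order."""
    yield value
    children = (value.values() if isinstance(value, dict)
                else value if isinstance(value, list) else ())
    for child in children:
        yield from descendants(child)

def select_key(nodes, name):
    """Member selector: pick `name` from each object node."""
    return [n[name] for n in nodes if isinstance(n, dict) and name in n]

doc = {"x": {"a": {"a": {"b": 1}}}}

step1 = select_key(descendants(doc), "a")    # nodes at $.x.a and $.x.a.a
step2 = []
for node in step1:                           # second descendant segment
    step2.extend(select_key(descendants(node), "b"))
# The single node $.x.a.a.b is reachable from both step-1 nodes:
assert step2 == [1, 1]
```

The first descendant segment finds an "a" member at two nesting depths, and each of those nodes contains the same "b" node among its descendants, so the result contains it twice.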
Yes, and we need to agree whether: (1) the duplicate nodes are always delivered (which is what we decided today) (2) the duplicate nodes are never delivered (which is what we went into the discussion with) (3) it depends on the implementation (which IMHO would be a disaster) |
IIRC today we were wondering whether: (5) the behavior of the descendant selector may be a special case |
Maybe we can record the resolution that there is no removal, neither of duplicate values nor of duplicate nodes. |
I'd be interested in how this resolution came about because it seems like the discussion here was leaning the other way: removing duplicates to some degree. To what degree was still under discussion. |
Slides for 112 said this:
We didn't have time to discuss this at 112, so let's confirm the resolution on the 2022-01-18 interim. |
Each selector may return multiple nodes, or even the same node multiple times. This raises the question of whether (and if so, how) the resulting collection of nodes needs to be distinct, i.e. without duplicates.
An example:
The raw result set of this selector is ["string", null, false, "string"]. If we say that we expect only the distinct set of nodes, then the results would be ["string", null, false], removing the duplicate "string".
So the question up for discussion is, "Should selectors return distinct sets of nodes, or are duplicates preferred?"
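A sketch of the distinct-set option applied to that raw result set. The helper name is hypothetical, and canonical JSON serialization is one assumed way to key values; it also keeps Python's `False == 0` quirk from merging the distinct JSON values false and 0:

```python
# Value-based distinctness over a raw result set containing JSON scalars.
import json

def distinct_nodes(values):
    seen, out = set(), []
    for v in values:
        k = json.dumps(v, sort_keys=True)   # 'false', 'null', '"string"', ...
        if k not in seen:
            seen.add(k)
            out.append(v)
    return out

raw = ["string", None, False, "string"]     # null/false map to None/False
assert distinct_nodes(raw) == ["string", None, False]
assert distinct_nodes([0, False]) == [0, False]   # a plain set() would merge these
```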
I had been considering a location-based distinction (based on the location of the value in the original JSON structure) as a possibility, but I don't think it makes much sense given the definition of a selector in Section 2.5, taking multiple nodes and outputting multiple nodes. A selector knows nothing about the original JSON structure that is being evaluated.
However for posterity, consider the following (rather contrived) path evaluated against the above instance:
In this case, the "string" value which is the first element of the a property would be selected twice, and the "string" value in the second element of the b property would be selected once, resulting in three "string" values in the results. For location-based distinction, only the value from a would be considered a duplicate, leaving two instances of "string" in the result set.
Again, I don't think that this is a viable approach due to how a selector works and the information it's given. However, if we were to also require location as part of the output, then this would be a viable solution.
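A hedged sketch contrasting location-based and value-based distinction. The contrived path was lost from the text, so the instance and the (path, value) pairs below are assumptions reconstructed to match the description: the first element of "a" selected twice, the second element of "b" once.

```python
# Assumed instance consistent with the raw result ["string", null, false, "string"]:
doc = {"a": ["string", None], "b": [False, "string"]}

# Assumed (normalized path, value) pairs from a contrived path:
nodes = [
    ("$['a'][0]", "string"),
    ("$['a'][0]", "string"),   # same location selected again, e.g. via a union
    ("$['b'][1]", "string"),
]

# Location-based distinction: keep the first hit per normalized path.
by_location = {}
for path, value in nodes:
    by_location.setdefault(path, value)

# Value-based distinction: keep the first hit per value.
by_value = []
for _, value in nodes:
    if value not in by_value:
        by_value.append(value)

# Location-based removal leaves two "string" results (distinct locations);
# value-based removal leaves only one.
assert list(by_location.values()) == ["string", "string"]
assert by_value == ["string"]
```

This makes concrete why the two notions of "duplicate" yield different result counts, which is the distinction the issue body draws.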