Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicates in selector output #23

Closed
gregsdennis opened this issue Sep 25, 2020 · 22 comments
Closed

Duplicates in selector output #23

gregsdennis opened this issue Sep 25, 2020 · 22 comments

Comments

@gregsdennis
Copy link
Collaborator

gregsdennis commented Sep 25, 2020

Each selector may each return multiple nodes, or even the same node multiple times. This raises the question of whether (and if so, how) the resulting collection of nodes needs to be distinct, i.e. without duplicates.

An example:

JSON instance:
{
  "a": [
    "string",
    null,
    true
  ],
  "b": [
    false,
    "string",
    5.4
  ]
}

Path:
$.*[0,1]

The raw result set of this selector is ["string", null, false, "string"]. If we say that we expect only the distinct set of nodes, then the results would be ["string", null, false], removing the duplicate "string".

So the question up for discussion is, "Should selectors return distinct sets of nodes or are duplicates preferred?"


I had been considering a location-based distinction (based on the location of the value in the original JSON structure) as a possibility, but I don't think it makes much sense given the definition of a selector in Section 2.5, taking multiple nodes and outputing multiple nodes. A selector knows nothing about the original JSON structure that is being evaluated.

However for posterity, consider the following (rather contrived) path evaluated against the above instance:

$.*[0,:5]

In this case, the "string" value which is the first element of the a property would be selected twice, and the "string" value in the second element of the b property would be selected once, resulting in three "string" values in the results. For location-based distinction, only the value from a would be considered duplicate, leaving two instances of "string" in the result set.

Again, I don't think that this is a viable approach due to how a selector works and the information it's given. However, if we were to also require location as part of the output, then this would be a viable solution.

@remorhaz
Copy link
Contributor

remorhaz commented Sep 25, 2020

As for me I haven't see any implementations that tries to remove duplicate values from result, and I don't think it's a good idea to introduce this behavior as default one. But maybe it's useful to provide some way to filter off duplicated values explicitly, like $.*[0,1].unique().

But your second case (about selecting the same node twice) is more interesting, because we have to provide some default behavior; I will investigate the implementations on this case.

@gregsdennis
Copy link
Collaborator Author

I think it might be useful to throw a scenario in the consensus project to get actual results of what the various implementations do.

@gregsdennis
Copy link
Collaborator Author

gregsdennis commented Sep 28, 2020

A scenario has been added. Just waiting on the report to regenerate.

It looks like the majority just don't support the example query. For those that do, it seems to be an even mix of all three approaches. No concensus implies no pressure to implement any particular scenario, but we need to discuss and agree on it.

@glyn
Copy link
Collaborator

glyn commented Oct 1, 2020

An option is to allow some latitude in the spec and say that "duplicates MAY be removed, either by location and value". But I think in that case we'd also need to tie down when that applies. Should it just apply to unions? Could it also apply in other cases, e.g. when using [*] against an array containing duplicate values (e.g. because there are duplicate values in the input document)?

"MAY" will considerably complicate the CTS FWIW, especially if the CTS is purely declarative (e.g. cts.json) as that would put the onus on each CTS implementation to allow duplicates to be removed if appropriate.

Also, does allowing duplicates to be removed imply anything about ordering? Can non-duplicates change order, for example?

@gregsdennis
Copy link
Collaborator Author

I think it can apply to any case where the same value appears in the result more than once.

I've opened a new issue for discussing result sequence. I think we need to answer that before we can discuss how it affects duplicates.

@gregsdennis
Copy link
Collaborator Author

gregsdennis commented Jan 5, 2021

To expand on the reason for combining duplicates a little, I offer a more extensive example and a somewhat informal proof of why eliminating duplicates would be a good thing (when not including location in the output):

The instance is similar to Goessner's example, but I've added color to the books and moved the other items under a "toy" category so that I can search for array items and use an expression.

{
  "store": {
    "book": [ 
      {
        "category": "reference",
        "author": "Nigel Rees",
        "title": "Sayings of the Century",
        "price": 8.95      },
      {
        "category": "fiction",
        "author": "Evelyn Waugh",
        "title": "Sword of Honour",
        "price": 12.99,
        "color": "blue"
      },
      {
        "category": "fiction",
        "author": "Herman Melville",
        "title": "Moby Dick",
        "isbn": "0-553-21311-3",
        "price": 8.99,
        "color": "red"
      },
      {
        "category": "fiction",
        "author": "J. R. R. Tolkien",
        "title": "The Lord of the Rings",
        "isbn": "0-395-19395-8",
        "price": 22.99,
        "color": "green"
      }
    ],
    "bicycle": [
      {
        "color": "red",
        "price": 19.95
      },
    ],
    "tricycle": [
      {
        "color": "blue",
        "price": 9.95
      },
      {
        "color": "red",
        "price": 19.95
      }
    ]
  }
}

Now if we want the prices of all the things that are red, we can use $..[?(@.color == "red")].price. This results in [8.99, 19.95, 19.95] without removal and [8.99, 19.95] with removal.

Let's ignore the .price for now to see what how the duplicate values are produced.

The path $..[?(@.color == "red")] returns three objects:

// $.store.book[2]
{
  "category": "fiction",
  "author": "Herman Melville",
  "title": "Moby Dick",
  "isbn": "0-553-21311-3",
  "price": 8.99,
  "color": "red"
}

// $.bicycle[0]
{
  "color": "red",
  "price": 19.95
}

// $.tricycle[1]
{
  "color": "red",
  "price": 19.95
}

Notice that the results for $.bicycle[0] and $.tricycle[1] are the same. Any processing on further path segments will result in the same output. Therefore we should only need to perform further selection once on these values.

Duplicate removal resolves this problem.


Even when we include locations, this can be used as an internal optimization. An implementation could identify that these are the same object and keep track of all the locations which contain it.

@mkmik
Copy link
Collaborator

mkmik commented Jan 5, 2021

I'm confused. Why do $.toy[0] and $.toy[2] in your last example lack the name property?

@gregsdennis
Copy link
Collaborator Author

gregsdennis commented Jan 5, 2021

Ugh... Oversight of editing while authoring. I'll fix it.

Aaand now I have to rework it to get the duplicates again.

@gregsdennis
Copy link
Collaborator Author

@mkmik updated.

@glyn
Copy link
Collaborator

glyn commented Jan 5, 2021

Even when we include locations, this can be used as an internal optimization.

I think this is trading off performance, in some fairly rare cases, against simplicity of the spec. I'd prefer to keep the spec simpler.

@danielaparker
Copy link

danielaparker commented Feb 1, 2021

An option is to allow some latitude in the spec and say that "duplicates MAY be removed, either by location and value".

Allowing latitude defeats interoperability, which I had thought was the raison d'etre for the IETF. But I note there's no mention of interoperability as a design goal in the draft.

From previous discussions, I think it is only necessary to consider removal of duplicates by location (not value.)

Should it just apply to unions?

I don't think so. I think it should apply however duplicates arise, whether unions, wildcards, recursive descent. This also simplifies removal.

"MAY" will considerably complicate the CTS FWIW, especially if the CTS is purely declarative (e.g. cts.json) as that would put the > onus on each CTS implementation to allow duplicates to be removed if appropriate.

Or perhaps, either duplicates must be removed, or duplicates must not be removed, or implementations must default to one of these two, and provide an option for the other.

Also, does allowing duplicates to be removed imply anything about ordering? Can non-duplicates change order, for example?

I don't think it affects order at all. Given the data

{
    "books":
    [
        {
            "title" : "A Wild Sheep Chase",
            "author" : "Haruki Murakami",
            "price" : 22.72
        },
        {
            "title" : "The Night Watch",
            "author" : "Sergei Lukyanenko",
            "price" : 23.58
        },
        {
            "title" : "The Comedians",
            "author" : "Graham Greene",
            "price" : 21.99
        },
        {
            "title" : "The Night Watch",
            "author" : "Phillips, David Atlee"
        }
    ]
}

and selector

$.books[1,1,3].title

The results with duplicates are:

Values Paths
"The Night Watch" $['books'][1]['title']
"The Night Watch" $['books'][1]['title']
"The Night Watch" $['books'][3]['title']

The results without duplicates are:

Values Paths
"The Night Watch" $['books'][1]['title']
"The Night Watch" $['books'][3]['title']

Clearly the second set can easily be obtained from the first without rearranging the items, see How do you remove duplicates from a list whilst preserving order? Note that implementations have some latitude as to when to remove duplicates. As long as both values and normalized paths are collected, the implementation could delay removal until the last step, before returning results to the user.

@danielaparker
Copy link

danielaparker commented Feb 1, 2021

To expand on the reason for combining duplicates a little, I offer a more extensive example and a somewhat informal proof of
why eliminating duplicates would be a good thing (when not including location in the output):

Following up on gregsdennis's motivation for wanting to remove duplicates, there's a related use case that's applicable when a user wants to modify the original data based on addressing information obtained from a query. Such addressing information could be in the form of normalized paths, or it could be provided by references or pointers into the original data, in languages that support that. In either case, the issue is the same.

So suppose that a user queries for a collection of prices for all books in the store, and wishes to discount them all by twenty percent. If duplicates come back, if the same price gets marked down twice, that's going to be a problem. This is not entirely hypothetical, there are jsonpath libraries that are used in this way.

More generally, it's hard to think of use cases where duplicates would be beneficial. It's possible to think of use cases where they would be detrimental. For an implementation that collects normalized paths along with values, removal is easy. Of course, removal comes with a cost, and some users may prefer not to pay that cost, particularly as most straightforward queries don't result in duplicates anyway.

@ghost ghost added the processing-model label Mar 10, 2021
@glyn
Copy link
Collaborator

glyn commented Mar 26, 2021

I know the jury is still out on whether we remove duplicate nodes, but I've prepared draft PRs for the compliance test suite (jsonpath-standard/jsonpath-compliance-test-suite#2) and the reference implementation (jsonpath-standard/jsonpath-reference-implementation#27) to handle removal of duplicate nodes. Any comments gratefully received.

@goessner
Copy link
Collaborator

goessner commented Mar 26, 2021

I am preparing an issue (#88) about unions, where removing duplicate nodes is also a point

@danielaparker
Copy link

I have a question about duplicates.

Consider a root value

{
    "books": [
        {"title": "A Wild Sheep Chase"},
        {"title": "The Comedians"}
    ]
}

and suppose that a query returned two values

[
    [
        {"title": "A Wild Sheep Chase"},
        {"title": "The Comedians"}
    ],
    {"title": "A Wild Sheep Chase"}
]

corresponding to paths

[
    "$['books']",
    "$['books'][0]"
]

Would the node identified by "$['books'][0]" be considered a duplicate?

@gregsdennis
Copy link
Collaborator Author

Would the node identified by "$['books'][0]" be considered a duplicate?

No, we only consider duplicates at the top level of the returned nodelist. So the values to compare are an array and a string. That the array also happens to contain the string makes no difference.

@mkmik
Copy link
Collaborator

mkmik commented May 11, 2021

Consider:

{
  "x": {
    "a": {
      "a": {
        "b": 1
      }
    }
  }
}

and this jsonpath: $..a..b:

The first part of the jsonpath $..a selects nodes {"a": {"a": { "b": 1}}} and {"a": { "b": 1}}

applying ..b on that result set, produces the node 1 twice

@cabo
Copy link
Member

cabo commented May 11, 2021

Yes, and we need to agree whether:

(1) the duplicate nodes are always delivered (which is what we decided today)

(2) the duplicate nodes are never delivered (which is what we went into the discussion with)

(3) it depends on the implementation (which IMHO would be a disaster)

@mkmik
Copy link
Collaborator

mkmik commented May 11, 2021

IIRC today we were wondering whether:

(5) the behavior of the descendent selector may be a special case

@cabo
Copy link
Member

cabo commented Nov 10, 2021

Maybe we can record the resolution that there is no removal, neither of duplicate values nor of duplicate nodes.

@gregsdennis
Copy link
Collaborator Author

Maybe we can record the resolution that there is no removal

I'd be interested in how this resolution came about because it seems like the discussion here was leaning the other way: removing duplicates to some degree. To what degree was still under discussion.

@cabo
Copy link
Member

cabo commented Jan 17, 2022

Slides for 112 said this:

Discussion confused by the lack of distinction between duplicate values and duplicate nodes.
Pretty clear that we don't remove duplicate values.
PR #134 makes it less attractive to remove duplicate nodes.
Proposal: Don't.

We didn't have time to discuss this at 112, so let's confirm the resolution on the 2022-01-18 interim.

@cabo cabo added the editorial label Jun 14, 2022
@glyn glyn self-assigned this Jul 14, 2022
@glyn glyn closed this as completed in a6fb8f7 Jul 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants