Finalize how search 'fields' option returns nested data. #63709

Closed
jtibshirani opened this issue Oct 14, 2020 · 10 comments
Labels: >enhancement, :Search/Search (Search-related issues that do not fall into other categories), Team:Search (Meta label for search team)

Comments

@jtibshirani
Contributor

jtibshirani commented Oct 14, 2020

Currently the 'fields' option has no special handling for nested fields -- it just returns them in a flat list (as it does for a non-nested object array):

"fields": {
  "products.base_price": [43.99, 20.99],
  "products.manufacturer": ["Microlutions", "Elitelligence"],
  "products.product_id": [17426, 19288]
}

This drops the relationship between nested fields and could be misleading about the search behavior. Instead we could return an object array, where each entry contains the fields for a nested document:

"fields": {
  "products": [{
    "base_price": [43.99],
    "manufacturer": ["Microlutions"],
    "product_id": [17426]
  },
  {
    "base_price": [20.99],
    "manufacturer": ["Elitelligence"],
    "product_id": [19288]
  }]
}

Yet another option would be to not return any nested data when requesting 'fields' for the root doc, to match the behavior for docvalue_fields. Nested fields would only be available when using the inner_hits option. This is simple to implement but feels lacking: our users are familiar with _source filtering, which allows for accessing nested objects at the root level.
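
To make the grouped proposal concrete, here is a minimal Python sketch of how a client might read it. The helper name `first_values` is hypothetical, and the sample values are the ones from the example above; this is an illustration of the response shape, not part of any Elasticsearch client.

```python
from typing import Any, Dict, List

# Grouped response shape proposed above: one entry per nested document.
grouped_fields: Dict[str, List[Dict[str, List[Any]]]] = {
    "products": [
        {"base_price": [43.99], "manufacturer": ["Microlutions"], "product_id": [17426]},
        {"base_price": [20.99], "manufacturer": ["Elitelligence"], "product_id": [19288]},
    ]
}


def first_values(nested_doc: Dict[str, List[Any]]) -> Dict[str, Any]:
    """Hypothetical helper: keep the first value of each non-empty field list."""
    return {name: values[0] for name, values in nested_doc.items() if values}


# Each nested document keeps its own price/manufacturer/id together.
products = [first_values(doc) for doc in grouped_fields["products"]]
for product in products:
    print(product["manufacturer"], product["base_price"])
```

Because every nested document is its own entry, the client never has to guess which price belongs to which manufacturer.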

jtibshirani added the >enhancement and :Search/Search labels Oct 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@astefan
Contributor

astefan commented Oct 20, 2020

That's a difficult choice to make. I think I understand the reason for returning nested documents in a more hierarchical way: one nested field might be missing a value, for example, and the flat array of values for that field sheds no light on which nested document is missing it. At the same time, one could argue that the format in which nested documents are returned would be inconsistent with the format used for the rest of the fields.
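
The missing-value case can be sketched in a few lines of Python. The data is an assumption made for illustration: two nested products where the first one has no base_price.

```python
# Flat format: values from all nested documents are merged per field.
flat_fields = {
    "products.manufacturer": ["Microlutions", "Elitelligence"],
    "products.base_price": [20.99],  # assumed: the first product has no price
}

# A client that pairs values by position silently gets the wrong answer.
naive_pairs = list(
    zip(flat_fields["products.manufacturer"], flat_fields["products.base_price"])
)

# Grouped format: the gap stays attached to the document it belongs to.
grouped_fields = {
    "products": [
        {"manufacturer": ["Microlutions"]},
        {"manufacturer": ["Elitelligence"], "base_price": [20.99]},
    ]
}
grouped_pairs = [
    (doc["manufacturer"][0], doc.get("base_price", [None])[0])
    for doc in grouped_fields["products"]
]

print(naive_pairs)
print(grouped_pairs)
```

With the flat lists the price ends up next to the wrong manufacturer, while the grouped form reports the missing price on the correct document.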

I'd say we could treat nested documents as special and keep the hierarchy, since they are indeed special. With the current implementation, fields and _source extraction already differ, so we could say that the two are not exactly the same anyway.
For a nested field dep and the query

    "_source": "dep.dep*",
    "fields": [{"field": "dep.dep*"}]

the result is

                "_source": {
                    "dep": [
                        {
                            "dep_id": "d005",
                            "dep_name": "Development"
                        }
                    ]
                },
                "fields": {
                    "dep.dep_id": [
                        "d005"
                    ],
                    "dep.dep_name.keyword": [
                        "Development"
                    ],
                    "dep.dep_name": [
                        "Development"
                    ]
                }

In SQL I think we only return nested documents as part of inner_hits. Even if we ask for a nested field without any condition applied to it, we still build a nested query with a match_all in it and populate inner_hits accordingly.
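
For readers unfamiliar with that pattern, a request body of this kind might look roughly like the following. This is a sketch built as a plain Python dict; the function name is hypothetical and the body is not necessarily the exact request that SQL generates.

```python
import json
from typing import Any, Dict


def nested_match_all_with_inner_hits(path: str) -> Dict[str, Any]:
    """Hypothetical helper: match every nested doc under `path` and return
    the matching nested documents through inner_hits."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {"match_all": {}},
                "inner_hits": {},
            }
        }
    }


# "dep" is the nested field used in the example above.
body = nested_match_all_with_inner_hits("dep")
print(json.dumps(body, indent=2))
```

Note that the caller has to know that dep is a nested path before building the request.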

cbuescher self-assigned this Dec 10, 2020
@cbuescher
Member

With the asks of the Kibana team in "[Discover] Switch from _source to fields when fetching data" (elastic/kibana#80517) in mind, I’m currently considering the following options:

  • Fetch all nested object paths that we recognise from the mapping service, get them from _source and add them to the fields API output. This shouldn’t collide with any other current output since we don’t currently add non-leaf fields.
    • Option 1: just add the nested objects to the response, but leave current output of flattened subfields as is
      • advantages:
        • keeps information like e.g. keyword multi-fields of fields in the nested docs that are not part of _source
        • stays closer to the notion of the fields API to return a field centric view of the document
      • disadvantages: duplication of some fields content, might need pre-processing and filtering out on the client side
    • Option 2: remove all fields which have the nested path as prefix, e.g. if nested object is user, remove all additional user.* fields
      • advantages: potentially no additional filtering
      • disadvantages: might lose information like e.g. multi-fields under nested that are not in _source
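
The difference between the two options can be sketched as a post-processing step. The function name, the user field and the sample values are all hypothetical; Option 1 corresponds to the input, Option 2 to the output.

```python
from typing import Any, Dict, Iterable, List


def drop_nested_subfields(
    fields: Dict[str, List[Any]], nested_paths: Iterable[str]
) -> Dict[str, List[Any]]:
    """Sketch of Option 2: remove every flattened field that has a nested
    path as prefix, keeping the nested objects themselves."""
    prefixes = tuple(f"{path}." for path in nested_paths)
    return {
        name: values
        for name, values in fields.items()
        if not name.startswith(prefixes)
    }


# Option 1 output (assumed sample): nested object from _source plus flat subfields.
option_1 = {
    "user": [{"first": "John"}],
    "user.first": ["John"],
    "user.first.keyword": ["John"],  # multi-field, not part of _source
    "title": ["Engineer"],
}
option_2 = drop_nested_subfields(option_1, ["user"])
print(option_2)
```

The multi-field user.first.keyword disappears in the Option 2 output, which is the information loss listed under its disadvantages.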

Two unrelated decisions:

  • Should we only return the nested object closest to the document root, as asked for in [Discover] Switch from _source to fields when fetching data kibana#80517 (comment), or would it make sense to include all nested objects (e.g. alongside the nested user, also consider returning a separate user.address when address is another nested object under user)?
    • advantages: I feel this would be more coherent since we’d return all nested objects that are accessible
    • disadvantages: larger response size
  • Should all of this behaviour be hidden behind a flag?
    • this would have the advantage of keeping the “default” fields behaviour simpler and easier to explain; only clients that really need nested objects would need to switch on this flag
    • I’d currently opt for a “global” flag that is directly defined under the main fields parameter in the request, e.g.
     "fields" : {
       "include_nested" : true,  // would default to false
       "fields" : [ "*" ]  // maybe name this differently from the surrounding "fields", e.g. "pattern" or "paths"?
     }
    

@jtibshirani
Contributor Author

I’m currently considering the following options:

Are we also considering the option suggested in the issue description? I don't think this fits neatly into either option you listed. That option would still load all mapped content (including multi-fields), but make sure that the response captures the nested document structure.

@cbuescher
Member

Nested fields would only be available when using the inner_hits option.

@jtibshirani are you referring to this? I think this came up in a recent discussion with Kibana, but the problem we saw is that using inner_hits requires knowing the nested paths and which objects are nested. While this information should theoretically be accessible, I was under the impression that Kibana would rather query all fields (i.e. the wildcard "*") and not have to inspect mappings upfront to know which parts might map to nested fields. Happy to discuss this again if I'm missing something or if this wasn't what you were referring to.

@jtibshirani
Contributor Author

Sorry for the confusion, I'm referring to the first option in the description where we return a response like this:

"fields": {
  "products": [{
    "base_price": [43.99],
    "manufacturer": ["Microlutions"],
    "product_id": [17426]
  },
  {
    "base_price": [20.99],
    "manufacturer": ["Elitelligence"],
    "product_id": [19288]
  }]
}

We separate out each nested document in the response, so the structure is preserved. Each nested section contains the 'fields' output, so multi-fields are included, formatting is applied, etc. There is no duplication of data -- for nested documents we only return this structured format instead of the flattened and combined fields.

cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 5, 2021
At the moment, the ‘fields’ API handles nested fields the same way it handles
non-nested object arrays: it just returns them in a flat list. However, the
relationship between nested fields is something we should try to preserve, since
this is the main purpose of mapping something as “nested” instead of just using
an object.

This PR changes this by returning grouped field values that are inside a nested
object according to the nested object they initially appear in. Any further
object structures inside a nested object are again returned as a flattened list.
Fields inside nested fields don’t appear in the flattened response outside of
the nested path any more. The grouping of fields inside nested objects is
applied recursively if nested mappings are defined inside another nested
mapping.

Closes elastic#63709
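
The grouping rules in this commit message can be modelled in a few lines of Python. This is a simplified sketch, not the actual implementation: it works on a plain dict instead of real mappings, the function name and the sample document are hypothetical, and it ignores multi-fields and value formatting.

```python
from typing import Any, Dict, Set


def group_fields(
    source: Dict[str, Any], nested_paths: Set[str], prefix: str = "", rel: str = ""
) -> Dict[str, list]:
    """Group values under nested paths, flatten plain objects.

    `prefix` is the full path from the document root (used to look up nested
    paths); `rel` is the path relative to the enclosing nested object.
    """
    out: Dict[str, list] = {}
    for key, value in source.items():
        full, name = f"{prefix}{key}", f"{rel}{key}"
        values = value if isinstance(value, list) else [value]
        if full in nested_paths:
            # One grouped entry per nested document; recursion handles
            # nested mappings defined inside another nested mapping.
            out[name] = [group_fields(v, nested_paths, f"{full}.") for v in values]
            continue
        for v in values:
            if isinstance(v, dict):
                # Plain objects are flattened into dotted field names.
                sub = group_fields(v, nested_paths, f"{full}.", f"{name}.")
                for sub_name, sub_values in sub.items():
                    out.setdefault(sub_name, []).extend(sub_values)
            else:
                out.setdefault(name, []).append(v)
    return out


doc = {
    "title": "order",
    "user": [
        {"first": "John", "address": {"city": "Berlin"}},
        {"first": "Alice", "address": {"city": "Paris"}},
    ],
}
print(group_fields(doc, nested_paths={"user"}))
```

In this model the object address inside the nested user is flattened to address.city, but it stays within the entry of the nested document it came from.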
cbuescher pushed a commit to cbuescher/elasticsearch that referenced this issue Feb 8, 2021
This change adds a paragraph on the different response format for nested fields
in the fields API and adds an example snippet.

Related to elastic#63709
@dimitris-athanasiou
Contributor

A bit late to this discussion, but I would like to add another use case. For clients that are trying to access the value of a field that is a child of a nested field (and possibly could have another nested field in the path? not sure that's valid), it is quite hard to programmatically get those values from the SearchHit. You need to know which fields in the full field path are nested and break up the path accordingly. So, for a.b.keyword, where a is nested, you'd need to first fetch a and then b.keyword. In automatic field discovery use cases (like what we're trying to offer in ML), this is a bit tricky to implement (doable but complex, as we have to work from what the field caps API tells us).
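
To illustrate the kind of path handling involved, here is a rough Python sketch that resolves a full dotted path against a grouped fields response without knowing the mappings. The function name and the sample response are hypothetical, and the sketch assumes that a dict inside a value list always represents a nested document.

```python
from typing import Any, Dict, List


def resolve_path(fields: Dict[str, List[Any]], path: str) -> List[Any]:
    """Collect all values for a full dotted path, descending into grouped
    nested documents wherever a prefix of the path is a key."""
    if path in fields:
        return list(fields[path])
    parts = path.split(".")
    for i in range(1, len(parts)):
        head, rest = ".".join(parts[:i]), ".".join(parts[i:])
        nested_docs = fields.get(head)
        if nested_docs is None:
            continue
        values: List[Any] = []
        for doc in nested_docs:
            if isinstance(doc, dict):  # assumed to be a nested document
                values.extend(resolve_path(doc, rest))
        return values
    return []


# Sample response where "a" is nested and "b.keyword" lives inside it.
fields = {
    "a": [{"b.keyword": ["x"]}, {"b.keyword": ["y"]}],
    "c": [1],
}
print(resolve_path(fields, "a.b.keyword"))
```

The result merges values across nested documents, so it gives up the grouping again; that may or may not be acceptable depending on the use case.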

I wonder if it would be possible to change SearchHit.field(String) method to handle full paths without changing the format of the response of the search API.

@cbuescher
Member

I wonder if it would be possible to change SearchHit.field(String) method to handle full paths without changing the format of the response of the search API.

I understand the problem but I'm on the fence about this request. I would prefer the Java API to mirror the structure of the REST output in the way it presents retrievable keys etc.
Would it be helpful to add an alternative method to the SearchHit API, or some utility that would retrieve full paths even if they target fields inside nested structures? In any case, I think we should discuss this as a potential enhancement request in a new issue.

@shiroorg

shiroorg commented May 7, 2021

A very pressing question: how can the map be returned in the old format?

After upgrading from version 7.10 to 7.12, our generation broke completely, and an option for disabling the new format is either not provided or missing from the documentation.

@cbuescher
Member

After upgrading from version 7.10 to 7.12, our generation broke completely

I'm sorry to hear this is causing problems on your end. However, we reserve the right to make changes like this to "beta" features such as the "fields" option in search (see the status warning on the 7.10 documentation page). The decision to finalize the API and remove the "beta" status was made in #71130.
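
No switch for the old format is mentioned in this thread, but a client could approximately rebuild the flat map itself. The following Python sketch is one way to do that; the function name is hypothetical, and it assumes that every dict inside a value list is a grouped nested document, which may not hold for object-valued leaf values such as GeoJSON shapes.

```python
from typing import Any, Dict, List


def flatten_fields(fields: Dict[str, List[Any]], prefix: str = "") -> Dict[str, List[Any]]:
    """Merge grouped nested documents back into flat dotted field lists."""
    flat: Dict[str, List[Any]] = {}
    for name, values in fields.items():
        full = f"{prefix}{name}"
        for value in values:
            if isinstance(value, dict):  # assumed to be a nested document
                for key, sub_values in flatten_fields(value, f"{full}.").items():
                    flat.setdefault(key, []).extend(sub_values)
            else:
                flat.setdefault(full, []).append(value)
    return flat


# Grouped example from the issue description.
grouped = {
    "products": [
        {"base_price": [43.99], "manufacturer": ["Microlutions"], "product_id": [17426]},
        {"base_price": [20.99], "manufacturer": ["Elitelligence"], "product_id": [19288]},
    ]
}
print(flatten_fields(grouped))
```

For this example the output matches the flat response shown at the top of the issue.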
