A high level way of retrieving values for certain fields #49028

dimitris-athanasiou · 2019-11-13T09:23:33Z

Describe the feature:

More and more use cases treat Elasticsearch as a data store, yet retrieving fields today is complex and requires expertise in many different areas. One needs to understand mappings, doc_values, and stored fields. There are further complexities, such as learning about the max doc_value field limit and then working around it by detecting that a user has requested more fields than the limit allows and fetching them from _source instead.
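As a rough illustration of the workaround described above, here is a minimal Python sketch that builds a search body from doc values when the requested field count fits under the limit and falls back to _source otherwise. The function name `build_fetch_body` is hypothetical, and the sketch assumes every requested field actually has doc values; the default of 100 is the one reported for `index.max_docvalue_fields_search` in the error messages later in this thread.

```python
from typing import Iterable

# Default for index.max_docvalue_fields_search, as quoted in the error
# messages further down this thread.
DEFAULT_MAX_DOCVALUE_FIELDS = 100


def build_fetch_body(fields: Iterable[str],
                     max_docvalue_fields: int = DEFAULT_MAX_DOCVALUE_FIELDS) -> dict:
    """Hypothetical helper: pick doc values or _source based on field count.

    Assumes every requested field has doc values; a real client would also
    have to check the mapping for text fields and the like.
    """
    requested = list(fields)
    if len(requested) <= max_docvalue_fields:
        # Few enough fields: read them from doc values and skip _source.
        return {"_source": False, "docvalue_fields": requested}
    # Too many fields for the doc value limit: filter _source instead.
    return {"_source": requested}


if __name__ == "__main__":
    print(build_fetch_body(["bytes", "response.keyword"]))
```

Even this small helper has to know about an index-level setting, which is the kind of expertise the proposal wants to make unnecessary.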

Then, of course, there are multi-fields. Which variant should I pick? How do I even detect that a field has multi-fields, so that I avoid retrieving the same field multiple times? There is an answer to this, of course (check whether there is a parent field that is not an object), but hopefully this illustrates how complex it is.
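To make the multi-field check concrete, here is a minimal Python sketch that walks the `properties` section of a mapping and flags sub-fields declared under a non-object parent's `fields` key. The helper names and the sample mapping are hypothetical, and the sketch ignores aliases and runtime details.

```python
from typing import Iterator, List, Tuple


def iter_fields(properties: dict, prefix: str = "") -> Iterator[Tuple[str, bool]]:
    """Yield (dotted_path, is_multi_field) for every leaf field in a mapping."""
    for name, spec in properties.items():
        path = f"{prefix}{name}"
        if "properties" in spec:
            # Object (or nested) field: its children are ordinary fields.
            yield from iter_fields(spec["properties"], prefix=path + ".")
            continue
        yield path, False
        # Sub-fields declared under "fields" share the parent's source value,
        # so they are multi-fields: the parent exists and is not an object.
        for sub_name in spec.get("fields", {}):
            yield f"{path}.{sub_name}", True


def multi_fields(properties: dict) -> List[str]:
    """Return the sorted dotted paths of all multi-fields."""
    return sorted(path for path, is_multi in iter_fields(properties) if is_multi)


# Hypothetical mapping fragment, loosely modelled on the Kibana sample data.
SAMPLE_PROPERTIES = {
    "host": {"type": "text", "fields": {"keyword": {"type": "keyword"}}},
    "geo": {"properties": {"src": {"type": "keyword"}}},
}

if __name__ == "__main__":
    print(multi_fields(SAMPLE_PROPERTIES))
```

Here `geo.src` is a regular field because its parent `geo` is an object, while `host.keyword` is a multi-field because its parent `host` is a text field.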

Having written code to do this for ML, I have multiple stories about the complexities that arise. I think other users must have gone through a similar process.

I propose a new API that simply retrieves values given a list of fields. The API is not intended to do this in the most performant way. Rather, it is intended to do it in the most user-friendly way. It targets users who do not know the inner workings of Elasticsearch and who have not yet hit a performance issue that would start them on an optimization journey (see "is it faster to retrieve from _source or doc_values" types of questions).


elasticmachine · 2019-11-13T09:23:35Z

Pinging @elastic/es-search (:Search/Search)

jimczi · 2019-11-18T19:16:20Z

We discussed this issue in our search meeting and we've spotted two enhancements that could help to retrieve values more easily:

The field_caps API should expose the source path of the field if it's not present in the _source (alias, multi-fields, ...): Add source_path information to field_caps API #49264
The format of values when retrieving the _source should be customizable, so that a date, for instance, can be returned as a timestamp since epoch rather than as a string. This feature would be equivalent to the format option of docvalue_fields, but it would be applied to the original source directly.
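Without such a format option, a client has to convert source values itself. The following Python sketch shows the kind of conversion meant here, turning an ISO 8601 date string into milliseconds since the epoch. The function name `to_epoch_millis` is hypothetical, and the sketch assumes that dates without an offset are in UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(value: str) -> int:
    """Convert an ISO 8601 date string to milliseconds since the epoch.

    Assumption: a value without a UTC offset is interpreted as UTC.
    """
    if value.endswith("Z"):
        # Older versions of datetime.fromisoformat do not accept "Z".
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Integer division by a timedelta avoids floating-point rounding.
    return (parsed - EPOCH) // timedelta(milliseconds=1)


if __name__ == "__main__":
    print(to_epoch_millis("2019-11-13T09:23:33Z"))
```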

costin · 2020-02-10T16:14:49Z

Discussed in the meeting today. Adding team-discuss to clarify the remaining scope once @jimczi is back (are we okay with the current plan, or do we need a higher-level API to handle the retrieval?).

joshdevins · 2020-02-10T16:16:48Z

I can imagine this being necessary as well for feature extraction in our planned LTR work, both at training and inference time, to extract document-only features (i.e. features that are not query/context dependent).
/cc @davidkyle @jtibshirani

wylieconlon · 2020-02-20T20:15:33Z

We have run into this problem in Kibana, where we are primarily asking users to interact with dotted field names like system.cpu.user.pct or url.keyword in building their visualizations.
Because the dotted names are what we train users to see, we keep a cache of the dotted names from the field_caps API (the index pattern object), and use this when asking users to build queries or visualizations. Why don't the _search APIs construct dotted paths for us?
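For context, this is roughly what a client has to do today to get dotted paths out of a hit's _source. It is a minimal Python sketch with a hypothetical `flatten_source` helper; it wraps scalar values in lists to mirror the response shape shown further down, and it does not descend into arrays of objects.

```python
def flatten_source(source: dict, prefix: str = "") -> dict:
    """Flatten a nested _source object into {dotted_path: [values]}.

    Limitation of this sketch: arrays of objects are kept as-is rather than
    being flattened element by element.
    """
    flat = {}
    for key, value in source.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            # Nested object: recurse and extend the dotted prefix.
            flat.update(flatten_source(value, prefix=path + "."))
        else:
            # Always return a list so single and multi-valued fields look alike.
            flat[path] = value if isinstance(value, list) else [value]
    return flat


if __name__ == "__main__":
    hit_source = {"system": {"cpu": {"user": {"pct": 0.5}}}, "tags": ["success", "info"]}
    print(flatten_source(hit_source))
```

Note that flattening alone cannot produce multi-fields such as url.keyword, because they only exist in the mapping and never appear in _source.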

Proposal: Add a new parameter fields to the _search API which implements the high-level retrieval described here, combining the behavior of _source and docvalue_fields. It is important for use in Kibana to support unlimited wildcards. It is important for us to be able to display the entire document using a query like fields: '*' or fields: ['system.cpu.*'].

The kibana sample data contains both text and keyword mappings, and is a good illustration of the response shape that I would expect:

POST kibana_sample_data_logs/_search
{
  "query": { "match_all": {} },
  "_source": "",
  "fields": [{ "field": "*" }],
  "size": 10
}

"fields": {
  "bytes": [ 8679 ],
  "extension": "",
  "extension.keyword" : [ "" ],
  "geo.coordinates" : [ "32.69899999257177, -94.94886112399399" ],
  "geo.src" : [ "CN" ],
  "geo.dest" : [ "IT" ],
  "geo.srcdest" : [ "CN:IT" ],
  "host": "www.elastic.co",
  "host.keyword" : [ "www.elastic.co" ],
  "machine.os" : "win xp",
  "machine.os.keyword" : [ "win xp" ],
  "machine.ram" : 11811160064,
  "response": 200,
  "response.keyword": ["200"],
  "tags": ["success","info"],
  "tags.keyword": ["info", "success", "info", "success"]
}

The example request is easy to write for any user of Elasticsearch, and the response contains information that is from both doc_values and _source. This is a simple, high-level API that we could work with. Unfortunately, this isn't possible by combining any of the APIs that exist today for a few reasons.

Limitations of current APIs

I have been testing with ECS-based schemas like metricbeat, whose mapping on my cluster contains 3904 named paths. Not all of these fields are actively used, but because the mapping is so large it causes problems. Here are the limitations I've found:

_source: "*" does not include multi-mapped or alias fields

Making a _source request with a list of 3904 paths like _source: [...] causes the error:

{
  "type" : "too_complex_to_determinize_exception",
  "reason" : "Determinizing automaton with 235539 states and 239442 transitions would result in more than 10000 states."
}

It's not possible to get all docvalues with a wildcard on small indices. The query docvalue_fields: [{ field: "*" }] throws an error if there are any text fields at all:

Fielddata is disabled on text fields by default. Set fielddata=true on [request] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.

It's not possible to get all docvalues on a large mapping like metricbeat. The request docvalue_fields: [{ field: "*" }] causes the error:

Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [2588]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.

Listing too many paths in the request for docvalue_fields also causes the same error:

Trying to retrieve too many docvalue_fields. Must be less than or equal to: [100] but was [3900]. This limit can be set by changing the [index.max_docvalue_fields_search] index level setting.

All of these limitations make it hard to avoid using _source.

jtibshirani · 2020-03-02T06:26:09Z

I caught up with @jimczi offline to clarify our earlier discussion. Instead of immediately pushing ahead with the source_path (#49264) and formatters changes, we'd like to step back and consider the problem in a more end-to-end way. That way, we can consider a coordinated API change that addresses the use case in a more direct and user-friendly way.

We can continue the discussion about field retrieval on this issue, building on @wylieconlon's helpful analysis. I'll remove 'team discuss' for now, but we can add it back if there's a particular item we'd like to discuss in person.

jpountz · 2020-03-06T09:00:17Z

+1 to moving forward with something along the lines of @wylieconlon's proposal above.

jtibshirani · 2020-03-10T23:10:28Z

Great, I've assigned this to myself and am working on a design doc. Once the design is more settled I'll post it here or open a new meta-issue.

jtibshirani · 2020-04-17T17:16:08Z

I opened a meta-issue to track implementation details: #55363.

…60100) This feature adds a new `fields` parameter to the search request, which consults both the document `_source` and the mappings to fetch fields in a consistent way. The PR merges the `field-retrieval` feature branch. Addresses #49028 and #55363.

…60258) This feature adds a new `fields` parameter to the search request, which consults both the document `_source` and the mappings to fetch fields in a consistent way. The PR merges the `field-retrieval` feature branch. Addresses #49028 and #55363.

jtibshirani · 2020-07-28T20:58:42Z

Closing, since the feature branch was merged in #60100.
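For readers landing here, a request using the merged `fields` search option can be assembled roughly as follows. This is a minimal Python sketch: the helper name `build_fields_request` is hypothetical, the field names are taken from examples earlier in this thread (plus an illustrative `@timestamp`), and the per-field `format` object follows the documented shape of the `fields` option.

```python
import json
from typing import Dict, Iterable, Optional


def build_fields_request(patterns: Iterable[str],
                         formats: Optional[Dict[str, str]] = None,
                         size: int = 10) -> dict:
    """Hypothetical helper: build a _search body that uses the `fields` option.

    `formats` maps a field name to a format string, for example
    {"@timestamp": "epoch_millis"}.
    """
    formats = formats or {}
    fields = []
    for pattern in patterns:
        if pattern in formats:
            # Object form: a field plus the format to apply to its values.
            fields.append({"field": pattern, "format": formats[pattern]})
        else:
            # String form: an exact field name or a wildcard pattern.
            fields.append(pattern)
    return {
        "query": {"match_all": {}},
        "_source": False,  # rely on `fields` instead of the raw source
        "fields": fields,
        "size": size,
    }


if __name__ == "__main__":
    body = build_fields_request(
        ["system.cpu.*", "@timestamp"],
        formats={"@timestamp": "epoch_millis"},
    )
    print(json.dumps(body, indent=2))
```

The printed body can be sent as the payload of a `POST <index>/_search` request with any HTTP client.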

dimitris-athanasiou added >feature :Search/Search Search-related issues that do not fall into other categories labels Nov 13, 2019

dimitris-athanasiou added the discuss label Nov 13, 2019

jimczi added team-discuss and removed discuss labels Nov 14, 2019

jimczi removed the team-discuss label Nov 18, 2019

costin added the team-discuss label Feb 10, 2020

wylieconlon mentioned this issue Feb 20, 2020

[Lens] Field existence endpoint uses three APIs instead of one elastic/kibana#56902

Closed

jtibshirani removed the team-discuss label Mar 2, 2020

jpountz mentioned this issue Mar 6, 2020

Add source_path information to field_caps API #49264

Closed

davidkyle mentioned this issue Mar 6, 2020

Enrich documents with inference results at Fetch #53230

Merged

jtibshirani self-assigned this Mar 6, 2020

jtibshirani mentioned this issue Mar 11, 2020

Add support for source_path to the field caps API. #52345

Closed

wylieconlon mentioned this issue Mar 11, 2020

Improve handling of multi-fields in Discover elastic/kibana#7419

Closed

dimitris-athanasiou changed the title ~~A high level of retrieving values for certain fields~~ A high level way of retrieving values for certain fields Mar 13, 2020

javanna mentioned this issue Mar 27, 2020

Could field value access be simplified? #24036

Closed

wylieconlon mentioned this issue Apr 15, 2020

[Lens] Fields which contain periods in _source don't show up in field list elastic/kibana#63630

Closed

mattkime mentioned this issue Apr 16, 2020

Support for new field capabilities api elastic/kibana#63716

Closed

jtibshirani mentioned this issue Apr 16, 2020

Search 'fields' option design + implementation #55363

Closed

10 tasks

rjernst added the Team:Search Meta label for search team label May 4, 2020

mayya-sharipova mentioned this issue Jun 18, 2020

Multi-fields information in field caps API #58357

Closed

jtibshirani mentioned this issue Jul 23, 2020

Add search 'fields' option to support high-level field retrieval. #60100

Merged

jtibshirani closed this as completed Jul 28, 2020

Mpdreamz mentioned this issue Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this issue Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

nreese mentioned this issue Feb 25, 2021

[Maps] use _search.fields API to retrieve document field values elastic/kibana#92872

Closed
