Field key suggester API for flattened field #73968

Closed
mayya-sharipova opened this issue Jun 9, 2021 · 12 comments
Labels
>enhancement :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@mayya-sharipova

mayya-sharipova commented Jun 9, 2021

We need a new API that, for a given flattened field, lists the field keys starting with a given string.

For example, the request below lists the first 10 field keys of the x_pack_telemetry flattened field that start with "x_pack_telemetry.stack_stats".

GET my_index/_fields_enum
{
    "field" : "x_pack_telemetry",
    "string" : "x_pack_telemetry.stack_stats",
    "size" : 10
}

Implementation-wise this could be either:

  • Similar to the _terms_enum API.
  • Index all keys of the JSON object under a special subfield like _keys; a terms aggregation could then be run on my_flat_object._keys to provide a list of available keys.
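
A minimal sketch of the prefix lookup the request above describes, assuming the field keys are already available as a sorted list (how they are obtained is exactly the open implementation question). The function name `suggest_keys` and the sample keys are hypothetical, not part of any Elasticsearch API.

```python
from bisect import bisect_left


def suggest_keys(sorted_keys, prefix, size=10):
    """Return up to `size` keys from `sorted_keys` that start with `prefix`."""
    # Binary search to the first key that could match the prefix.
    i = bisect_left(sorted_keys, prefix)
    matches = []
    # Matching keys are contiguous in sorted order, so stop at the first miss.
    while i < len(sorted_keys) and len(matches) < size:
        if not sorted_keys[i].startswith(prefix):
            break
        matches.append(sorted_keys[i])
        i += 1
    return matches


# Hypothetical keys, for illustration only.
keys = sorted([
    "x_pack_telemetry.cluster_uuid",
    "x_pack_telemetry.stack_stats.kibana.count",
    "x_pack_telemetry.stack_stats.xpack.ml.jobs",
    "x_pack_telemetry.version",
])
print(suggest_keys(keys, "x_pack_telemetry.stack_stats", size=10))
```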
@elasticmachine

Pinging @elastic/es-search (Team:Search)

@jtibshirani

When I filed #43805, I initially envisioned a solution specific to flattened fields. But I think we should instead consider a general API to provide field name suggestions for all fields in the matching indices. This API would handle flattened subfields 'under the hood' and surface them as suggestions. A general API seems easier to work with for clients like Kibana, so they don't need to track which fields are non-standard like flattened and issue special requests.

@markharwood

markharwood commented Jun 16, 2021

I'd be happy to pick this up, if that's OK with you, because the implementation would be similar to the terms_enum API I just finished working on. I'm assuming it has the same auto-complete-as-you-type performance requirement?

@markharwood

Some questions on approach:

Prefix search or infix search?

Unlike the terms_enum API, I wonder if this API should offer infix search?
I see existing Kibana field-pickers offer infix matching:
[Screenshot: Kibana field picker]

That style of matching could be useful when there are many deeply nested dotted field names, e.g. you might not be sure which container object holds a host property. The downside is that a flattened field could theoretically yield millions of results, so an infix search over it could take a long time to run. The same performance questions apply to other forms of expensive matching like fuzzy or wildcard. Also, because prefix matching isn't being used to rule the flattened field's root name out of the search scope, a flattened field with millions of values may "pollute" the results and swamp attempts to find anything else.
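
To make the cost difference concrete, here is a rough sketch that counts how many names each style has to examine. It assumes the names sit in a sorted in-memory list, which only approximates a Lucene terms dictionary; `prefix_scan`, `infix_scan` and the sample names are hypothetical.

```python
from bisect import bisect_left


def prefix_scan(sorted_names, prefix):
    """Return (matches, names_examined) after seeking to the first candidate."""
    i = bisect_left(sorted_names, prefix)
    matches, examined = [], 0
    while i < len(sorted_names):
        examined += 1
        if not sorted_names[i].startswith(prefix):
            break  # sorted order: no later name can match
        matches.append(sorted_names[i])
        i += 1
    return matches, examined


def infix_scan(names, fragment):
    """Return (matches, names_examined); every name must be checked."""
    matches = [name for name in names if fragment in name]
    return matches, len(names)


# Three mapped fields plus 1000 keys from a hypothetical flattened field.
names = sorted(
    ["book.author", "book.title", "host.name"]
    + [f"labels.key_{i:04d}" for i in range(1000)]
)
print(prefix_scan(names, "book."))  # examines only the neighbourhood of the prefix
print(infix_scan(names, "aut")[1])  # examines every name
```

The prefix scan touches a number of names proportional to the matches, while the infix scan grows with the total number of keys, which is the concern raised above for flattened fields.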

Type filtering?

When Kibana users are selecting a field to perform an action, they might only be interested in matching field names that are also of a particular type, e.g. this Lens filter:
[Screenshot: Kibana Lens field filter]
This could be implemented in the elasticsearch API or as a post-filter applied by Kibana code.
Flattened fields (as far as I understand) are always just a simple keyword type, however, so maybe type filtering should just be a Kibana-side concern.

Results format

Do we simply return arrays of matching field names or do we have more detailed objects capable of holding other information like field types, isAlias, isRuntime, usage counts etc?

Implementation

Do we:
a) Copy most of the existing terms_enum classes and more or less substitute "fields" for "terms" names? Or
b) Create base classes for TimeCriticalRequests/Responses/Actions and make TermsEnumXXX and FieldsEnumXXX subclasses for this family of time-critical lookups? Or
c) Refactor existing TermsEnum classes to have a new mode flag which determines if we're looking at the left or right hand side of JSON's "field" : "term" structure?

Option C) would only work if some of the answers to my earlier questions meant that terms_enum and fields_enum requests were functionally the same - prefix matching only and only returns lists of strings.

@jpountz

jpountz commented Jun 17, 2021

> we would index all keys in the JSON object under a special subfield like _keys -- then a terms aggregation could be run on my_flat_object._keys to provide a list of available keys.

I don't think we need to resort to indexing more data. We can already derive field names from the data we have today: for common field types, these are the field names in the mappings, and for flattened fields, these are the prefixes of terms in the terms dictionary associated with the field?
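
A small sketch of that idea, assuming each term of a flattened field is stored as the key, a separator character, then the value, and that terms are iterated in sorted order. The `"\0"` separator and the helper names are assumptions made for illustration.

```python
from itertools import groupby

SEPARATOR = "\0"  # assumed separator between key and value in each term


def make_term(key, value):
    # Build a keyed term of the assumed form: key + SEPARATOR + value.
    return key + SEPARATOR + value


def keys_from_terms(sorted_terms):
    # In sorted order all terms sharing a key are adjacent,
    # so groupby yields each distinct key exactly once.
    return [
        key
        for key, _ in groupby(sorted_terms, key=lambda t: t.split(SEPARATOR, 1)[0])
    ]


terms = sorted([
    make_term("bar.bytes", "100"),
    make_term("bar.bytes", "200"),
    make_term("bar.price", "9.99"),
    make_term("baz", "x"),
])
print(keys_from_terms(terms))
```

No extra indexed data is needed in this sketch, but the scan still visits every term once, which is why the thread goes on to discuss bounding the work.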

> Unlike the terms_enum api I wonder if this API should offer infix search?
> you might not be sure which container object holds a host property

I agree with that argument: as a user I would really like the UI to suggest book.author if I start typing just aut. But like you said, flattened fields could make this run very slowly. Maybe we could bound the number of operations that we allow ourselves to run on terms enums, and have a complete flag like the _terms_enum API has in order to signal when we couldn't return an exhaustive answer?

> Type filtering

One benefit I can see of supporting this is that if we get a request for field name suggestions of numeric fields, then we could skip all flattened fields?

> Results format

I think it would be frustrating if we required an additional call to the _field_caps API in order to figure out important properties of the field, so I'd be +1 on returning additional metadata that is important to build applications on top of Elasticsearch.

> Implementation

I usually prefer copying code (option a), which would make it easier for field name suggestions and field value suggestions to evolve independently.

@markharwood

I wonder if Kibana should use a hybrid model and only rely on an elasticsearch API for looking up the flattened fields.
The reasons being:

  1. Kibana already caches browser-side indexPattern objects with all the non-flattened field names/types and features. Filtering this list with Javascript code can be simple and fast.
  2. Maybe it's useful to know an elasticsearch API only returns flattened fields if we want to cap the number of flattened field name suggestions mixed in with the local "real" field matches.

As I mentioned earlier, flattened fields are an uncontrolled part of the schema (docs can introduce millions of unique values) so rogue docs have the potential to flood the top suggestions with garbage, especially if we are going to rely on infix matching to surface results.
Ideally the top results, in order, might be:

  1. Matching physical fields from the controlled mapping followed by
  2. Any matching flattened field names generated by indexed docs.
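
The ordering above could be sketched as follows, assuming the two lists of matches have already been computed separately. `merge_suggestions` and the `max_flattened` cap are hypothetical names for illustration.

```python
def merge_suggestions(mapped_matches, flattened_matches, size=10, max_flattened=5):
    """Put mapped-field matches first, then a capped number of flattened keys."""
    top = list(mapped_matches[:size])
    # Flattened keys only fill the remaining room, and never exceed their cap.
    room = min(size - len(top), max_flattened)
    top.extend(flattened_matches[:room])
    return top


mapped = ["host.name", "host.ip"]
flattened = [f"labels.host_{i}" for i in range(100)]
print(merge_suggestions(mapped, flattened, size=5, max_flattened=2))
```

With a cap like this, a rogue document that introduces many flattened keys cannot push the mapped fields out of the top suggestions.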

@markharwood

markharwood commented Jun 21, 2021

@jpountz one aspect of this has been tricky to code so I wanted to check some assumptions:

The Lucene indexed terms for flattened fields only contain the part of the field name from the object onwards. So if the flattened object is called foo and the property is called bar, a user might do an infix search that straddles the elasticsearch mapping name and the indexed Lucene part, e.g. searching for the pattern oo.bar might be expected to match foo.bar.
This is the part that is proving messy to implement, so I wanted to confirm that matching across dot boundaries is a requirement before continuing.

Update

I got the matching across dot boundaries working OK and I assume this is the desired behaviour.
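
One way to think about the straddling case, as a sketch and not necessarily how the actual implementation works: split the pattern at every position, keep the splits whose left part is a suffix of the mapping name plus a dot, and use the right part as a prefix to look up among the indexed keys. The function name is hypothetical.

```python
def straddling_key_prefixes(root, pattern):
    """Return the key prefixes for which `pattern` straddles root + "." + key.

    If a key starts with one of the returned prefixes, then the logical name
    root + "." + key contains `pattern` across the dot boundary.
    """
    head = root + "."
    prefixes = []
    for i in range(1, len(pattern) + 1):
        # The left part must end the mapping name (including the dot);
        # the right part must begin the indexed key.
        if head.endswith(pattern[:i]):
            prefixes.append(pattern[i:])
    return prefixes


print(straddling_key_prefixes("foo", "oo.bar"))  # key prefix to seek among indexed keys
print(straddling_key_prefixes("foo", "bar"))     # no straddle: pattern lies inside the key
```

An empty-string prefix in the result means every key of the flattened field matches, as happens when the pattern ends exactly at the dot.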

@jtibshirani

> I wonder if Kibana should use a hybrid model and only rely on an elasticsearch API for looking up the flattened fields.

To me it still seems cleanest and easiest to handle if we had a unified field suggestion API. Even with this API, we have some flexibility to determine how suggestions are produced. So to ensure diverse suggestions, maybe we could directly incorporate some of your ideas (like listing mapped fields above flattened subfields) and just document how the API makes these trade-offs. I'd be really curious if our future users in Kibana have an opinion on this point too.

@markharwood

markharwood commented Jun 22, 2021

Currently I have an implementation where search string patterns will match any part of the logical field name regardless of how that is held physically i.e. there is an assumption that searching for oo.ba will match abc.def.foo.bar regardless of the field types of abc, def and foo (object or flattened).

There is the question of performance though.

> I agree with that argument: as a user I would really like that the UI suggest book.author if I start typing just aut. But like you said, flattened fields could make this run very slowly

@jimczi suggested the flattened cost would be too high to bear and we should restrict matching on flattened fields to be prefix-based.
This potentially raises several questions which would be good to clarify using an example mapping.
So given a mapping with these properties:

text_field_foo
object_field_foo
    bar_bytes
flattened_field_foo
    bar 
        bytes
        price
    ... many other field names
  1. Search foo should probably match text_field_foo and object_field_foo but should it also return
    a) object_field_foo.bar_bytes?
    b) flattened_field_foo.bar.bytes and flattened_field_foo.bar.price and ... many other field names?
  2. Should search bytes match
    a) object_field_foo.bar_bytes?
    b) flattened_field_foo.bar.bytes?
  3. Should search bar match
    a) object_field_foo.bar_bytes?
    b) flattened_field_foo.bar.bytes?
    c) flattened_field_foo.bar.price?

If we want to avoid the cost of things like 2b by requiring a prefix to match inside flattened fields then the user would have to type a full stop e.g. foo. to start to see some of what's inside the flattened field.

The difference between leaf and branch nodes.

One of the usability questions raised here is that this probably introduces two types of fields when it comes to suggestions:

  1. "Leaf" fields which you can go ahead and search or aggregate on and
  2. "Branch" fields like objects or flattened which you can't use for search/aggs but have to type a full-stop after to "step into" further suggestions for leaf fields.

This is similar to how shell command lines offer autocomplete of file names:

[Screenshot: shell autocomplete of file names in a terminal]
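
A rough sketch of that shell-style stepping, using the example mapping from this comment and prefix matching only. `complete_path` is a hypothetical helper; branches are returned with a trailing dot to signal that the user must step into them.

```python
def complete_path(leaf_paths, typed):
    """Suggest the next path segment for `typed`, shell-style."""
    suggestions = set()
    for path in leaf_paths:
        if not path.startswith(typed):
            continue
        # Find where the segment currently being typed ends.
        dot = path.find(".", len(typed))
        if dot == -1:
            suggestions.add(path)            # leaf: usable for search/aggs
        else:
            suggestions.add(path[:dot + 1])  # branch: trailing dot invites stepping in
    return sorted(suggestions)


paths = [
    "text_field_foo",
    "object_field_foo.bar_bytes",
    "flattened_field_foo.bar.bytes",
    "flattened_field_foo.bar.price",
]
print(complete_path(paths, "f"))
print(complete_path(paths, "flattened_field_foo.bar."))
```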

@jpountz

jpountz commented Jun 23, 2021

> @jimczi suggested the flattened cost would be too high to bear and we should restrict matching on flattened fields to be prefix-based.

Wouldn't this concern be addressed by having an upper limit on the number of field names that we look at for flattened fields like I suggested in my previous comment?

If possible, it would be nice if moving from object to flattened were guaranteed never to return fewer suggestions, to avoid giving users reasons not to move to flattened when it is a better option than many keyword fields. I suspect that in some cases users will want to move to flattened because the number of keys is theoretically unbounded though it is practically bounded to a reasonable number. Since mappings are not allowed to have more than 1000 fields, one way to do this would be to only check the first 1000 field names that we see on every flattened field?

@markharwood

> Wouldn't this concern be addressed by having an upper limit on the number of field names that we look at for flattened fields like I suggested in my previous comment?

We currently have time as a limiting factor and in the terms_enum PR I originally prototyped a number-of-scanned-terms limit.

Either way, @jimczi has always objected to limits like this, where the execution cost on large datasets always runs up to the worst case allowed by the largest tolerable setting. The alternative is to offer functionality whose look-up cost stays small and fixed and doesn't increase with data size. That fixed cost is achieved by limiting functionality, e.g. offering prefix search only.
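
For comparison, a time-limited scan might look like the sketch below. The function name and return shape are hypothetical, and this is not the terms_enum implementation.

```python
import time


def scan_with_deadline(names, fragment, timeout_s):
    """Infix-match `fragment` until `timeout_s` seconds have elapsed.

    Returns (matches, complete); complete is False if the deadline cut the scan short.
    """
    deadline = time.monotonic() + timeout_s
    matches = []
    for name in names:
        if time.monotonic() >= deadline:
            return matches, False
        if fragment in name:
            matches.append(name)
    return matches, True


names = ["book.author", "book.title", "host.name"]
print(scan_with_deadline(names, "aut", timeout_s=60))
```

This illustrates the objection: on a large enough dataset every request runs until the deadline, so the cost is always the configured worst case, whereas a prefix seek stays cheap regardless of how many keys exist.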

@markharwood

Closing in favour of #74816, which provides name suggestions for all field types.

5 participants