Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indicate why a field has been _ignored #101153

Open
felixbarny opened this issue Oct 19, 2023 · 12 comments
Open

Indicate why a field has been _ignored #101153

felixbarny opened this issue Oct 19, 2023 · 12 comments
Labels
:StorageEngine/Mapping The storage related side of mappings Team:StorageEngine

Comments

@felixbarny
Copy link
Member

At the moment, a field can be ignored and land in the _ignored metadata field either because of ignore_malformed or because of ignore_above.

@ckauf reported that he heard feedback from a user that really likes the new default setting for ignore_malformed but it's not trivial to find out what the reason was that a field has been ignored. It could either be due to ignore_malformed or ignore_above.

This ambiguity will get worse with #96235 where fields can also be _ignored if the field limit is hit.

In this issue, I'd like to discuss options on how we could add an indication for the reason a field ended up being _ignored.


A potential solution for the would be to store an additional _ignored_reason metadata field alongside the _ignored field. The two fields would both contain an array of strings. We can line up the indices/positions of the two arrays so that we exactly know the reason for why a field has been ignored.

For example, if field foo has been ignored because of ignore_malformed and bar has been ignored because of ignore_above, we can store something like this:

{
  "_ignored": ["foo", "bar"],
  "_ignored_reason": ["ignore_malformed", "ignore_above"]
}

You might think, doesn't Lucene de-duplicate and sort keyword doc_values? Yes, it does, but the _ignored field isn't stored in doc_values but in a stored field. While we'll want to add doc_values to _ignored in the future (see #59946), we don't necessarily need to remove the stored field. This would come at the expense of storage but it would greatly simplify these troubleshooting workflows.

@felixbarny felixbarny added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Oct 19, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Search Meta label for search team label Oct 19, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search (Team:Search)

@jpountz
Copy link
Contributor

jpountz commented Oct 27, 2023

I have mixed feelings about this proposal. One the one end we like enriching our indexes to be able to provide more context. On the other hand, it ends up making our indexes feel heavy, sometimes with more metadata than actual data.

I suspect that this field wouldn't add much footprint on its own, but then when you multiply with all the metadata fields that we enrich our data with, it ends up being significant. I'm also not sure where it should end, e.g. would we also need to include the length as Elasticsearch calculated it vs. the one configured in mappings?

Is this something that can be solved differently? E.g. maybe we could have a validation API that takes mappings and a documents and simulates indexing checks? (I don't like this idea especially, it's mostly to highlight that there might be different approaches.)

@felixbarny
Copy link
Member Author

I get your point about the bloat that these fields add and I agree that we need to be conservative when it comes to adding metadata fields, especially those that are hard to compress.

However, in contrast to other metadata fields, _ignored_reason would only be added in error cases and when _ignored is already added. I feel like adding debugging information that is essential to diagnose an ingestion issue is a very good use of disk space. Everything that can easily be inferred otherwise isn't.

I guess it comes down to the question if this can be solved another way, without adding additional metadata to the documents.

I'm also not sure where it should end, e.g. would we also need to include the length as Elasticsearch calculated it vs. the one configured in mappings?

I don't think that storing the size of the field that was ignored because of ignore_above would be a good use of disk space as it can easily be inferred by looking at the _source of the document. If you know that field bar has been ignored due to ignore_above, and you can look at the value of bar in the _source of the document, you have all that you need to troubleshoot the issue.

Is this something that can be solved differently? E.g. maybe we could have a validation API that takes mappings and a documents and simulates indexing checks?

Let me think out loud a bit on how such an API could work. It would need to be an algorithm that takes the _source, the _ignored fields, and the index mapping as an input and tries to determine for which reason one or multiple fields have been ignored. You can retrieve the ignored value by looking it up from _source. The possibilities are ignore_malformed, ignore_above, and field_limit.

  • If the field limit is reached and the ignored field is not in the mapping, it's has probably been ignored due to the field limit.
  • Else, if the field has ignore_above set and the value exceeds it, that's probably the reason for why it has been ignored.
  • Else, assume that the reason is due to ignore_malformed as the rules for when a value is malformed is a bit more complex (type conversions, date patterns, ...).

While not bullet-proof, this is probably a good enough heuristic. It's also something that may be implemented outside of Elasticsearch, for example in the document flyout in discover.

@ruflin WDYT?

@ruflin
Copy link
Member

ruflin commented Oct 30, 2023

Can you share some more details why you picked the example you have above over:

{
  "_ignore_malformed": ["foo", "bar"]
  "_ignore_above": ["goo", "gar"]
}

As I would assume the same/similar fields will hit the isssue often again and again, this would hopefully compress well.

I like the idea of investigating different approaches, but not too big of a fan of trying to guess what the error was. With the initial proposal, it is on each document very clear on what wrong and it can be indicated what needs to fixing. Existing doc can be reingested/simulated and it can be immediately checked if something went wrong without running heavy comparisons across the full index.

In an ideal scenario, all documents are healthy and that is where we should be heading. If there are errors, the goal is to get them resolved / cleaned up. This will save storage.

Thinking about different implementations: What if we would store the errors in a different place? Thinking of it like a lookup index. The reference would be the hash of the error. Like this, each error (which likely shows up many times) is only persisted once and no matter how many fields in the error doc, the document itself keeps only the hash. During query time, doc and error doc would be combined. It would also make it possible to wipe the error index separately for cleanup purpose.

@eyalkoren
Copy link
Contributor

eyalkoren commented Oct 31, 2023

I was thinking the same as @ruflin - storing a field for each failure type.
I guess the main question then is what we do with _ignored - dropping it is a breaking change.
Of course we can duplicate values, but that's not too elegant.
Maybe we can solve this with a runtime field, so that new indices created with this change will create only _ignore_<REASON> fields and in addition will have a runtime field mapping for _ignored that collects values from those. I don't know if it is feasible to have such default mapping for all indices in a declarative way, but maybe the IgnoredFieldMapper can encapsulate such change, or be replaced with runtime field mapping.

Note that if we do something like that, there is another storage gain, as @felixbarny mentioned - when adding doc_values to the _ignored field in order to make it aggregatable, we must keep it also stored, in order to maintain the original value. If we go with the _ignore_<REASON> approach, we don't have to have them stored.

Either way, I think #101373 should be blocked until we finalize this discussion, as it would greatly affect how we change this mapper.

@ruflin
Copy link
Member

ruflin commented Oct 31, 2023

I like the direction you are taking here @eyalkoren . Is my understanding correct that having specific _ignore_<REASON> fields with doc_values only will be more compact then _ignored field alone?

For the migration, could we introduce a setting on the data stream? For example having ignored_fields: true by default for now but we can overwrite it for the logs-* templates with ignored_fields: false. At the same time we could deprecate the ignored field to remove it eventually. A similar setting we could have for the new ignored fields and even have it off by default so we can get it in without having it applied everywhere directly but start using it for logs.

@jpountz WDYT of an approach like above?

@eyalkoren
Copy link
Contributor

Is my understanding correct that having specific _ignore_<REASON> fields with doc_values only will be more compact then _ignored field alone?

IIUC, in order to be able to aggregate, doc_values are a must either way. Assuming that _ignored must be stored as well and _ignore_<REASON> don't, I believe that this is the case indeed, but there may be other factors that I overlook.

For the migration, could we introduce a setting on the data stream?

I think that if there's a good way to do that generically within Elasticsearrch mapping logic, that would be preferable. Then it would be completely transparent and the functionality of the ignore reason would be available to all new indices. The _ignored functionality would be maintained for new indices and even if the performance of querying it is degraded, this would be acceptable, especially if we deprecate it.

@ruflin
Copy link
Member

ruflin commented Nov 12, 2023

Quick update: We had a discussion around priority of this with @jpountz and @javanna and the potential review time needed. The conclusion is to first make _ignored a doc value. This change is already in progress and should be a low hanging fruit: #101373 This moves us a big step forward of at least having the info available and being able to provide stats over larger datasets and unblocks us short term for the basic functionality.

We will see how far we can take it with this single field. The assumption is that this change will not hold us back from following the approach mentioned in #101153 (comment) later on if we need it.

@felixbarny
Copy link
Member Author

We've also discussed that on an aggregate level, it's more useful to find out which fields are ignored rather than why fields have been ignored. Dedicated fields like _ignored_reason or _ignored_<reason> are mainly useful for the latter.

Once you know that a specific field is commonly ignored on a given data stream, the investigation flow is to look at individual documents to find out the reason. We can and should help users understand why a field has been ignored on a per-document basis. But we don't need to store information for that in the documents. Looking at the document and the mappings contains all the information we need to do that.

That's not to say that a dedicated field wouldn't help with doing that, but the priority for introducing such a field is lower.

@felixbarny felixbarny changed the title Indicate why a a field has been _ignored Indicate why a field has been _ignored Nov 15, 2023
@salvatore-campagna
Copy link
Contributor

salvatore-campagna commented May 15, 2024

Just wanted to highlight that we decided not to keep the stored field for _ignored as a result of supporting aggregations (#101373 ). Using two "parallel" arrays might result in more confusion for the user because the field and the reason might sort differently as a result of using doc values (only).

I was wondering if this could be a feature that we can add to the failure store. I expect that if something ends up there we would like to know why...and I guess storing this kind of information only when a document ends up in the failure store is more acceptable as a trade-off in terms of additional storage.

@ruflin
Copy link
Member

ruflin commented May 16, 2024

As we have the lenient mappings in place, I consider why a some values were not indexed as important as a document ending up in the failure store. I would argue we need the info in both places. The question is how can we store this additional info as cheap as possible. One way out here is to have this as a config option. For example you can turn it on after pipeline / template changes to see if something shows up and for storage saving remove it again.

I wonder how much storage overhead of something like _ignore_<REASON> would actually be based on the assumption, this is not stored in source directly but somehow in Elasticsearch like _ignored. I'm following the assumption, the same fields will show up many times and the cardinality is low.

@martijnvg martijnvg added :StorageEngine/Mapping The storage related side of mappings and removed :Search Foundations/Mapping Index mappings, including merging and defining field types labels May 31, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine elasticsearchmachine added Team:StorageEngine and removed Team:Search Meta label for search team labels May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:StorageEngine/Mapping The storage related side of mappings Team:StorageEngine
Projects
None yet
Development

No branches or pull requests

7 participants