Indicate why a field has been _ignored
#101153
Comments
Pinging @elastic/es-search (Team:Search)
I have mixed feelings about this proposal. On the one hand, we like enriching our indexes to be able to provide more context. On the other hand, it ends up making our indexes feel heavy, sometimes with more metadata than actual data. I suspect that this field wouldn't add much footprint on its own, but when you multiply it with all the metadata fields that we enrich our data with, it ends up being significant. I'm also not sure where it should end, e.g. would we also need to include the length as Elasticsearch calculated it vs. the one configured in mappings? Is this something that can be solved differently? E.g. maybe we could have a validation API that takes mappings and a document and simulates the indexing checks? (I don't especially like this idea; it's mostly to highlight that there might be different approaches.)
I get your point about the bloat that these fields add, and I agree that we need to be conservative when it comes to adding metadata fields, especially those that are hard to compress. However, in contrast to other metadata fields, I guess it comes down to the question of whether this can be solved another way, without adding additional metadata to the documents.
I don't think that storing the size of the field that was ignored because of
Let me think out loud a bit on how such an API could work. It would need to be an algorithm that takes the _source, the _ignored fields, and the index mapping as inputs and tries to determine why one or more fields have been ignored. You can retrieve the ignored value by looking it up from
While not bullet-proof, this is probably a good enough heuristic. It's also something that may be implemented outside of Elasticsearch, for example in the document flyout in Discover. @ruflin WDYT?
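A minimal sketch of such a heuristic, assuming standard mapping JSON as input; the helper functions and the returned reason strings are illustrative only, not an existing or agreed-upon API:

```python
def get_path(obj, path):
    """Follow a dotted field path like 'foo.bar' through a nested _source dict."""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def get_field_mapping(mappings, path):
    """Resolve a field's mapping entry by walking the 'properties' tree."""
    node = mappings
    for key in path.split("."):
        node = node.get("properties", {}).get(key)
        if node is None:
            return None
    return node


def guess_ignore_reasons(source, ignored, mappings):
    """Guess why each field listed in _ignored was ignored."""
    reasons = {}
    for field in ignored:
        value = get_path(source, field)
        props = get_field_mapping(mappings, field)
        if props is None:
            reasons[field] = "unknown (field not present in mappings)"
        elif (
            "ignore_above" in props
            and isinstance(value, str)
            and len(value) > props["ignore_above"]
        ):
            reasons[field] = "ignore_above"
        else:
            # If the value doesn't exceed ignore_above, a type mismatch caught
            # by ignore_malformed is the most likely remaining explanation.
            reasons[field] = "ignore_malformed (probable)"
    return reasons
```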
Can you share some more details on why you picked the example you have above over:
As I would assume the same/similar fields will hit the issue again and again, this would hopefully compress well. I like the idea of investigating different approaches, but I'm not too big of a fan of trying to guess what the error was. With the initial proposal, it is very clear on each document what went wrong, and it can be indicated what needs fixing. Existing docs can be reingested/simulated, and it can be immediately checked if something went wrong without running heavy comparisons across the full index. In an ideal scenario, all documents are healthy, and that is where we should be heading. If there are errors, the goal is to get them resolved / cleaned up. This will save storage.

Thinking about different implementations: what if we stored the errors in a different place? Think of it like a lookup index. The reference would be the hash of the error. Like this, each error (which likely shows up many times) is only persisted once, and no matter how many fields are in the error doc, the document itself keeps only the hash. At query time, doc and error doc would be combined. It would also make it possible to wipe the error index separately for cleanup purposes.
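A rough sketch of that lookup-index idea, with an in-memory dict standing in for the separate error index; the `_ignored_error_ref` field name and the error payload shape are made up for illustration:

```python
import hashlib

error_store = {}  # stands in for a separate error lookup index


def error_hash(error):
    """Stable hash over the error description, e.g. field name plus reason."""
    key = f"{error['field']}:{error['reason']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def index_document(doc, errors):
    """Persist each distinct error once; the document only keeps the hashes."""
    refs = []
    for err in errors:
        h = error_hash(err)
        error_store.setdefault(h, err)
        refs.append(h)
    doc["_ignored_error_ref"] = refs
    return doc


def resolve_errors(doc):
    """At query time, join the document back with its error records."""
    return [error_store[h] for h in doc.get("_ignored_error_ref", [])]
```

Wiping `error_store` independently would then correspond to cleaning up the error index without touching the data index.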
I was thinking the same as @ruflin - storing a field for each failure type. Note that if we do something like that, there is another storage gain, as @felixbarny mentioned - when adding
Either way, I think #101373 should be blocked until we finalize this discussion, as it would greatly affect how we change this mapper.
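For illustration, a hypothetical shape of such per-failure-type fields (the field names are made up, not an agreed format); because the reason is encoded in the field name, repeated reason strings never need to be stored per entry:

```python
per_failure_type_doc = {
    "_ignored": ["foo", "bar"],      # existing metadata field, unchanged
    "_ignored_malformed": ["foo"],   # ignored because of ignore_malformed
    "_ignored_above": ["bar"],       # ignored because of ignore_above
}
```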
I like the direction you are taking here @eyalkoren. Is my understanding correct that having specific
For the migration, could we introduce a setting on the data stream? For example having
@jpountz WDYT of an approach like the above?
IIUC, in order to be able to aggregate,
I think that if there's a good way to do that generically within Elasticsearch mapping logic, that would be preferable. Then it would be completely transparent, and the functionality of the ignore reason would be available to all new indices. The
Quick update: we had a discussion around the priority of this with @jpountz and @javanna and the potential review time needed. The conclusion is to first make
We will see how far we can take it with this single field. The assumption is that this change will not hold us back from following the approach mentioned in #101153 (comment) later on if we need it.
We've also discussed that on an aggregate level, it's more useful to find out which fields are ignored rather than why fields have been ignored. Dedicated fields like
Once you know that a specific field is commonly ignored on a given data stream, the investigation flow is to look at individual documents to find out the reason. We can and should help users understand why a field has been ignored on a per-document basis, but we don't need to store information for that in the documents. The document and the mappings together contain all the information we need to do that. That's not to say that a dedicated field wouldn't help with doing that, but the priority for introducing such a field is lower.
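As a sketch of that aggregate-level view, a terms aggregation on `_ignored` would surface the most commonly ignored fields per data stream; this assumes `_ignored` is aggregatable (i.e. has doc_values, see #59946 below) and is not meant as a finalized query:

```python
# Search request body: count documents per ignored field name.
which_fields_are_ignored = {
    "size": 0,
    "query": {"exists": {"field": "_ignored"}},
    "aggs": {
        "ignored_fields": {
            "terms": {"field": "_ignored", "size": 50}
        }
    },
}
```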
Just wanted to highlight that we decided not to keep the stored field for `_ignored`. I was wondering if this could be a feature that we can add to the failure store. I expect that if something ends up there, we would like to know why... and I guess storing this kind of information only when a document ends up in the failure store is more acceptable as a trade-off in terms of additional storage.
As we have the lenient mappings in place, I consider knowing why some values were not indexed as important as a document ending up in the failure store. I would argue we need the info in both places. The question is how we can store this additional info as cheaply as possible. One way out here is to have this as a config option. For example, you can turn it on after pipeline / template changes to see if something shows up, and remove it again to save storage. I wonder how much storage overhead something like
Pinging @elastic/es-storage-engine (Team:StorageEngine)
At the moment, a field can be ignored and land in the `_ignored` metadata field either because of `ignore_malformed` or because of `ignore_above`.

@ckauf reported that he heard feedback from a user that really likes the new default setting for `ignore_malformed`, but that it's not trivial to find out what the reason was that a field has been ignored. It could either be due to `ignore_malformed` or `ignore_above`.

This ambiguity will get worse with #96235, where fields can also be `_ignored` if the field limit is hit.

In this issue, I'd like to discuss options on how we could add an indication of the reason a field ended up being `_ignored`.

A potential solution for that would be to store an additional `_ignored_reason` metadata field alongside the `_ignored` field. The two fields would both contain an array of strings. We can line up the indices/positions of the two arrays so that we know exactly why a field has been ignored. For example, if field `foo` has been ignored because of `ignore_malformed` and `bar` has been ignored because of `ignore_above`, we can store something like this:
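A minimal illustration of the two aligned arrays for that example; the exact reason strings are an assumption:

```python
example_metadata = {
    "_ignored": ["foo", "bar"],
    # _ignored_reason[i] gives the reason why _ignored[i] was ignored
    "_ignored_reason": ["ignore_malformed", "ignore_above"],
}
```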
You might think: doesn't Lucene de-duplicate and sort keyword doc_values? Yes, it does, but the `_ignored` field isn't stored in doc_values but in a stored field. While we'll want to add doc_values to `_ignored` in the future (see #59946), we don't necessarily need to remove the stored field. This would come at the expense of storage, but it would greatly simplify these troubleshooting workflows.