[ML] Optimize source extraction for categorize_text aggregation #79099
Conversation
Doing some testing on some more realistic Filebeat data indicates that this improvement is not very big. For 100k+ documents (with 10k+ having fields larger than 4 kB in size), the improvement in categorization was small. I will leave this in draft, but it seems that this isn't as big an improvement given my current data set tests.
Running more tests on a much larger dataset (some Filebeat data) provided more conclusive results. Categorizing 27,104,873 messages across 8 shards, the optimized version executed about 40 seconds faster, which amounted to roughly an 11% runtime improvement. There were about 62 fields in `_source` and none of them were exceptionally large. If we see about an 11% improvement when the number of fields is that small and `_source` is not large, then I expect to see a more drastic difference with larger `_source` documents.
@elasticmachine update branch
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Pinging @elastic/ml-core (Team:ML) |
I was curious how decompression/deserialization is avoided. I think this is a misunderstanding; parts are just moved to a lower level.
This is still a great change!
* The major difference with {@link SourceLookup#extractRawValues(String)} is that this version will:
*
* - not cache source if it's not already parsed
* - will only extract the desired values from the compressed source instead of deserializing the whole object
deserializing the whole object
If I read the source correctly it still does; it just filters them early, so the whole object is still parsed. This saves cycles, which is great, but I think this documentation claim is not correct.
It does pull the source object into memory (the compressed bytes), but it only parses out the specific included fields.
Yeah - the _source is still decompressed, but we can skip good chunks of the JSON parsing. The biggest saving, so far as I can tell, is in not allocating memory to hold the parsed results. There are smaller advantages in that we can skip some cycles around parsing too, but the copying seems like the biggest thing to me.
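For readers following along, here is a minimal sketch of the mechanism being described: the raw (already decompressed) `_source` bytes are handed to a parser built with a `FilterPath` include, so only the requested field is materialized into Java objects. This assumes the `createParser` overload that accepts `FilterPath` includes/excludes added in this PR and the 7.x package layout; the class and method names below are illustrative, not the PR's code.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.common.xcontent.support.filtering.FilterPath;

class FilteredSourceExtractionSketch {

    /**
     * Parse only {@code path} out of decompressed source bytes. The whole document
     * is still tokenized, but everything outside the filter is skipped instead of
     * being materialized into maps, lists, and strings.
     */
    static List<Object> extractField(byte[] sourceBytes, String path) throws IOException {
        FilterPath[] include = FilterPath.compile(Set.of(path));
        try (XContentParser parser = XContentType.JSON.xContent().createParser(
                NamedXContentRegistry.EMPTY,
                DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
                sourceBytes, 0, sourceBytes.length,
                include,
                null)) {
            // The filtered parse yields a map containing just the requested field.
            Map<String, Object> filtered = parser.map();
            return XContentMapValues.extractRawValues(path, filtered);
        }
    }
}
```

As noted above, the win is mostly in avoiding the allocations for the rest of the document, not in avoiding decompression.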
* @param path The path from which to extract the values from source
* @return The list of found values or an empty list if none are found
*/
public List<Object> extractRawValuesOptimized(String path) {
Optimized makes me think extractRawValues2 or something. Maybe WithoutCaching or something.
Maybe WithoutCaching or something.
That seems good to me. Makes it clear we are not caching once we get it.
if (source != null) {
    return XContentMapValues.extractRawValues(path, source);
}
FilterPath[] filterPaths = FilterPath.compile(Set.of(path));
These are fast to compile now but I think they'll get slower. Right now they just sort of shift the lists around and do some string manipulation.
I wonder if in the future we should make a SourcePathLookup lookup(String path) or something. That'd be neat. It can wait.
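Purely to illustrate the SourcePathLookup idea floated here (nothing like this exists in the codebase), the point would be to compile the FilterPath once per path and reuse it across documents, so the per-document cost is only the filtered parse. All names below are hypothetical.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.common.xcontent.support.filtering.FilterPath;

/** Hypothetical per-path lookup: compile the filter once, reuse it for every document. */
class SourcePathLookup {
    private final String path;
    private final FilterPath[] include; // compiled once, not once per document

    SourcePathLookup(String path) {
        this.path = path;
        this.include = FilterPath.compile(Set.of(path));
    }

    List<Object> extractRawValues(byte[] sourceBytes) throws IOException {
        try (XContentParser parser = XContentType.JSON.xContent().createParser(
                NamedXContentRegistry.EMPTY,
                DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
                sourceBytes, 0, sourceBytes.length,
                include,
                null)) {
            return XContentMapValues.extractRawValues(path, parser.map());
        }
    }
}
```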
XContentParser createParser(NamedXContentRegistry xContentRegistry,
    DeprecationHandler deprecationHandler, byte[] data, int offset, int length, FilterPath[] includes,
    FilterPath[] excludes) throws IOException;
When I was last here I thought "we really should do something about all these overrides." That's still true. But I don't want to block this change.
There do seem to be a ton of unnecessary overrides that are just there to allow other overrides. I didn't want to tackle that :D
Yeah. I'd keep them for now. I'll still grumble about them, but I'll have a think about whether we can do something.
    length,
    include,
    exclude)
) {
Wow, this is a lot. There has to be a better way... I'm sure there is, but we're in a local minimum of effort here. This is a fine change, but I think I need to add some refactoring here to my personal backlog.
…tic#79099)

This optimizes the text value extraction from source in the categorize_text aggregation. Early measurements indicate that the bulk of the time spent in this aggregation is inflating and deserializing the source. We can optimize this a bit (for larger sources) by only extracting the text field we care about.

The main downside here is that if there is a sub-agg that requires the source, that agg will need to extract the entire source again. This should be a rare case.

NOTE: opened as draft as measurements need to be done on some realistic data to see if this actually saves us time.

This takes advantage of the work done here: elastic#77154
…es from _source (#79651)

Previously created optimizations:
- Enabling XContentParser filtering for specific fields: #77154
- Retrieving only specific fields from _source: #79099

Allow significant_text to only parse out the specific field it requires from `_source`. This allows for a minor optimization when `_source` is small, but can be significant (pun intended) when `_source` contains many fields, or other larger fields that `significant_text` doesn't care about. Since `significant_text` does not allow sub-aggs, not caching the parsed `_source` is reasonable.
This optimizes the text value extraction from source in categorize_text aggregation.
Early measurements indicate that the bulk of the time spent in this aggregation is inflating and deserializing the source. We can optimize this a bit (for larger sources) by only extracting the text field we care about.
The main downside here is that if there is a sub-agg that requires the source, that agg will need to extract the entire source again. This should be a rare case.
NOTE: opening as draft as measurements need to be done on some realistic data to see if this actually saves us time.
This takes advantage of the work done here: #77154
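To make the flow concrete, here is a rough sketch of the consuming side: for each collected document, position the source lookup and pull only the categorized field out of `_source`. Only extractRawValuesOptimized(String) comes from the diff above (the review suggests renaming it); SourceLookup.setSegmentAndDocument is assumed to be the existing positioning method, and everything else, including the categorize(String) hook, is illustrative.

```java
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.elasticsearch.search.lookup.SourceLookup;

/** Illustrative consumer: extract one field per document without caching the parsed source. */
class CategorizeTextSourceReader {
    private final SourceLookup sourceLookup;
    private final String sourceFieldName;

    CategorizeTextSourceReader(SourceLookup sourceLookup, String sourceFieldName) {
        this.sourceLookup = sourceLookup;
        this.sourceFieldName = sourceFieldName;
    }

    void collect(LeafReaderContext ctx, int docId) {
        // Point the lookup at the current document, then parse only the one field we need.
        sourceLookup.setSegmentAndDocument(ctx, docId);
        List<Object> values = sourceLookup.extractRawValuesOptimized(sourceFieldName);
        for (Object value : values) {
            if (value != null) {
                categorize(value.toString());
            }
        }
    }

    private void categorize(String text) {
        // placeholder for handing the text to the categorization logic
    }
}
```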