[ML] Optimize source extraction for categorize_text aggregation #79099
Conversation
Doing some testing on some more realistic Filebeat data indicates that this improvement is not very big. For 100k+ documents (with 10k+ having fields larger than 4 kB in size), the improvement in categorization was small. I will leave this in draft, but it seems that this isn't as big an improvement given my current data set tests.
Running more tests on a much larger dataset (some Filebeat data) provided more conclusive results. Categorizing 27,104,873 messages across 8 shards, the optimized version executed about 40 seconds faster, which amounted to roughly an 11% runtime improvement. There were about 62 fields in `_source` and none of them were exceptionally large. If we see about an 11% improvement when the number of fields is that small and `_source` is not large, then I expect to see a more drastic difference with larger `_source` documents.
@elasticmachine update branch
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Pinging @elastic/ml-core (Team:ML) |
I was curious how decompression/deserialization is avoided. I think this is a misunderstanding; parts are just moved to a lower level.
This is still a great change!
* The major difference with {@link SourceLookup#extractRawValues(String)} is that this version will:
*
* - not cache source if it's not already parsed
* - will only extract the desired values from the compressed source instead of deserializing the whole object
deserializing the whole object
If I read the source correctly it still does; it just filters them early, so the whole object is still parsed. This saves cycles, which is great, but I think this documentation claim is not correct.
It does pull the source object into memory (the compressed bytes), but it only parses out the specific included fields.
Yeah - the _source is still decompressed, but we can skip good chunks of the JSON parsing. The biggest saving, so far as I can tell, is in not allocating memory to hold the parsed results. There are smaller advantages in that we can skip some cycles around parsing too, but the copying seems like the biggest thing to me.
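For readers following along, here is a minimal sketch of the mechanism being described: the raw (already decompressed) `_source` bytes are handed to a parser built with a `FilterPath` include, so only the requested field is materialized into Java objects. This assumes the `createParser` overload that accepts `FilterPath` includes/excludes added in this PR and the 7.x package layout; the class and method names below are illustrative, not the PR's code.

```java
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.common.xcontent.support.filtering.FilterPath;

class FilteredSourceExtractionSketch {

    /**
     * Parse only {@code path} out of decompressed source bytes. The whole document
     * is still tokenized, but everything outside the filter is skipped instead of
     * being materialized into maps, lists, and strings.
     */
    static List<Object> extractField(byte[] sourceBytes, String path) throws IOException {
        FilterPath[] include = FilterPath.compile(Set.of(path));
        try (XContentParser parser = XContentType.JSON.xContent().createParser(
                NamedXContentRegistry.EMPTY,
                DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
                sourceBytes, 0, sourceBytes.length,
                include,
                null)) {
            // The filtered parse yields a map containing just the requested field.
            Map<String, Object> filtered = parser.map();
            return XContentMapValues.extractRawValues(path, filtered);
        }
    }
}
```

As noted above, the win is mostly in avoiding the allocations for the rest of the document, not in avoiding decompression.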
* @param path The path from which to extract the values from source
* @return The list of found values or an empty list if none are found
*/
public List<Object> extractRawValuesOptimized(String path) {
Optimized makes me think extractRawValues2 or something. Maybe WithoutCaching or something.
Maybe WithoutCaching or something.
That seems good to me. Makes it clear we are not caching once we get it.
if (source != null) {
    return XContentMapValues.extractRawValues(path, source);
}
FilterPath[] filterPaths = FilterPath.compile(Set.of(path));
These are fast to compile now but I think they'll get slower. Right now they just sort of shift the lists around and do some string manipulation.
I wonder if in the future we should make a SourcePathLookup lookup(String path) or something. That'd be neat. It can wait.
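Purely to illustrate the SourcePathLookup idea floated here (nothing like this exists in the codebase), the point would be to compile the FilterPath once per path and reuse it across documents, so the per-document cost is only the filtered parse. All names below are hypothetical.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

import org.elasticsearch.common.xcontent.DeprecationHandler;
import org.elasticsearch.common.xcontent.NamedXContentRegistry;
import org.elasticsearch.common.xcontent.XContentParser;
import org.elasticsearch.common.xcontent.XContentType;
import org.elasticsearch.common.xcontent.support.XContentMapValues;
import org.elasticsearch.common.xcontent.support.filtering.FilterPath;

/** Hypothetical per-path lookup: compile the filter once, reuse it for every document. */
class SourcePathLookup {
    private final String path;
    private final FilterPath[] include; // compiled once, not once per document

    SourcePathLookup(String path) {
        this.path = path;
        this.include = FilterPath.compile(Set.of(path));
    }

    List<Object> extractRawValues(byte[] sourceBytes) throws IOException {
        try (XContentParser parser = XContentType.JSON.xContent().createParser(
                NamedXContentRegistry.EMPTY,
                DeprecationHandler.THROW_UNSUPPORTED_OPERATION,
                sourceBytes, 0, sourceBytes.length,
                include,
                null)) {
            return XContentMapValues.extractRawValues(path, parser.map());
        }
    }
}
```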
XContentParser createParser(NamedXContentRegistry xContentRegistry,
    DeprecationHandler deprecationHandler, byte[] data, int offset, int length, FilterPath[] includes,
    FilterPath[] excludes) throws IOException;
When I was last here I thought "we really should do something about all these overrides." That's still true. But I don't want to block this change.
There do seem to be a ton of unnecessary overrides that are just there to allow other overrides. I didn't want to tackle that :D
Yeah. I'd keep them for now. I'll still grumble about them, but I'll have a think about whether we can do something.
    length,
    include,
    exclude)
) {
Wow, this is a lot. There has to be a better way... I'm sure there is, but we're in a local minimum of effort here. This is a fine change, but I think I need to add some refactoring here to my personal backlog.
…tic#79099)

This optimizes the text value extraction from source in the categorize_text aggregation. Early measurements indicate that the bulk of the time spent in this aggregation is inflating and deserializing the source. We can optimize this a bit (for larger sources) by only extracting the text field we care about.

The main downside here is that if there is a sub-agg that requires the source, that agg will need to extract the entire source again. This should be a rare case.

NOTE: opened as draft as measurements need to be done on some realistic data to see if this actually saves us time.

This takes advantage of the work done here: elastic#77154
…es from _source (#79651)

Previously created optimizations:
- Enabling XContentParser filtering for specific fields: #77154
- Retrieving only specific fields from _source: #79099

Allow significant_text to only parse out the specific field it requires from `_source`. This allows for a minor optimization when `_source` is small, but can be significant (pun intended) when `_source` contains many fields, or other larger fields that `significant_text` doesn't care about. Since `significant_text` does not allow sub-aggs, not caching the parsed `_source` is reasonable.
This optimizes the text value extraction from source in categorize_text aggregation.
Early measurements indicate that the bulk of the time spent in this aggregation is inflating and deserializing the source. We can optimize this a bit (for larger sources) by only extracting the text field we care about.
The main downside here is that if there is a sub-agg that requires the source, that agg will need to extract the entire source again. This should be a rare case.
NOTE: opening as draft as measurements need to be done on some realistic data to see if this actually saves us time.
This takes advantage of the work done here: #77154
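To make the flow concrete, here is a rough sketch of the consuming side: for each collected document, position the source lookup and pull only the categorized field out of `_source`. Only extractRawValuesOptimized(String) comes from the diff above (the review suggests renaming it); SourceLookup.setSegmentAndDocument is assumed to be the existing positioning method, and everything else, including the categorize(String) hook, is illustrative.

```java
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.elasticsearch.search.lookup.SourceLookup;

/** Illustrative consumer: extract one field per document without caching the parsed source. */
class CategorizeTextSourceReader {
    private final SourceLookup sourceLookup;
    private final String sourceFieldName;

    CategorizeTextSourceReader(SourceLookup sourceLookup, String sourceFieldName) {
        this.sourceLookup = sourceLookup;
        this.sourceFieldName = sourceFieldName;
    }

    void collect(LeafReaderContext ctx, int docId) {
        // Point the lookup at the current document, then parse only the one field we need.
        sourceLookup.setSegmentAndDocument(ctx, docId);
        List<Object> values = sourceLookup.extractRawValuesOptimized(sourceFieldName);
        for (Object value : values) {
            if (value != null) {
                categorize(value.toString());
            }
        }
    }

    private void categorize(String text) {
        // placeholder for handing the text to the categorization logic
    }
}
```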