Add segment sorter for data streams #75195

mayya-sharipova · 2021-07-09T19:25:17Z

It is beneficial to sort the segments within a data stream's index
in descending order of their max @timestamp value, so
that the most recent segments (in terms of timestamp)
come first.

This speeds up sort queries on @timestamp desc,
which are the most common type of query for data streams,
as we are mostly concerned with recent data.
This patch addresses this for writable indices.

The segment sorter is different from index sorting.
An index sorter by itself is only concerned with the order of docs
within an individual segment (and not with how the segments are organized),
while the segment sorter is only used during search and allows
doc collection to start with the "right" segment,
so we can terminate the collection faster.
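
To illustrate the idea, here is a minimal, self-contained sketch in plain Java. It is not the code in this PR and does not use Lucene: `Segment`, `BY_MAX_TIMESTAMP_DESC` and `segmentsVisitedForTopN` are hypothetical names, and the "skip a segment whose max value cannot be competitive" rule is a simplified stand-in for early termination during collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class SegmentSorterSketch {

    /** Hypothetical stand-in for a segment: only the @timestamp values of its docs. */
    static final class Segment {
        final String name;
        final long[] timestamps;

        Segment(String name, long... timestamps) {
            this.name = name;
            this.timestamps = timestamps;
        }

        long maxTimestamp() {
            long max = Long.MIN_VALUE;
            for (long t : timestamps) {
                max = Math.max(max, t);
            }
            return max;
        }
    }

    /** Orders segments so that the one holding the most recent timestamp comes first. */
    static final Comparator<Segment> BY_MAX_TIMESTAMP_DESC =
        Comparator.comparingLong(Segment::maxTimestamp).reversed();

    /** Names of the segments that had to be opened to find the n most recent timestamps. */
    static List<String> segmentsVisitedForTopN(List<Segment> segments, int n) {
        PriorityQueue<Long> top = new PriorityQueue<>(); // min-heap of the n largest values seen so far
        List<String> visited = new ArrayList<>();
        for (Segment s : segments) {
            // A segment whose max cannot beat the current n-th best value is skipped entirely.
            if (top.size() == n && s.maxTimestamp() <= top.peek()) {
                continue;
            }
            visited.add(s.name);
            for (long t : s.timestamps) {
                top.add(t);
                if (top.size() > n) {
                    top.poll();
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<Segment> segments = new ArrayList<>(Arrays.asList(
            new Segment("old", 1, 2, 3),
            new Segment("mid", 4, 5, 6),
            new Segment("new", 7, 8, 9)));
        System.out.println(segmentsVisitedForTopN(segments, 2)); // prints [old, mid, new]
        segments.sort(BY_MAX_TIMESTAMP_DESC);
        System.out.println(segmentsVisitedForTopN(segments, 2)); // prints [new]
    }
}
```

In this toy model, the order of docs inside each segment never changes; only the order in which segments are visited does, which is the distinction from index sorting made above.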

TODO:

Should we also do this for frozen or read-only indices?
Or are they always supposed to have 1 segment?

elasticmachine · 2021-07-09T19:25:34Z

Pinging @elastic/es-search (Team:Search)

mayya-sharipova · 2021-07-14T13:13:08Z

@elasticmachine run elasticsearch-ci/part-1

@timestamp

It is beneficial to add a leaf sorter for data streams that orders segments in descending order of their max timestamp field, so that the most recent segments (in terms of timestamp) come first. This speeds up sort queries on @timestamp desc, which are the most common type of query for data streams, as we are mostly concerned with recent data. This patch addresses this for writable indices. This patch also adds a new flag isDataStreamIndex to IndexMetadata that shows whether an index is part of a data stream. TODO: - Should we also do this for frozen or read-only indices?

dakrone · 2021-07-15T16:01:56Z

@mayya-sharipova I have a question because I am unfamiliar with this part of the code :) Is this leaf sorter the same as regular index sorting, or is this something that only takes effect on query or merging? Does this affect the indexing throughput the way the index-time sorting can?

mayya-sharipova · 2021-07-16T19:48:18Z

@dakrone Thank you for taking a look. Answering your questions:

> Is this leaf sorter the same as regular index sorting, or is this something that only takes effect on query or merging?

The segment sorter is different from index sorting, and is only used during querying. An index sorter by itself is only concerned with the order of docs within an individual segment (and not with how the segments are organized), while the segment sorter is used during querying to start doc collection with the "right" segment, so we can terminate the collection faster once we have collected the top docs (for data streams, these would be the most recent docs).

> Does this affect the indexing throughput the way the index-time sorting can?

No, it doesn't. The segment sorter is only used during querying.

I've also clarified the PR comment with this information.

hendrikmuhs · 2021-07-21T07:26:23Z

Does sorting get always applied?

In the case of rollup and transform, a composite agg is used and both read data in ascending order (date histogram). In addition, neither can benefit from an early exit; both actually disable it explicitly.

That's why I am concerned about this change if it changes the behavior for all cases.

jimczi

I left a comment regarding the index metadata change. It should be possible to detect if an index is part of a datastream dynamically.

jimczi · 2021-07-21T07:35:52Z

server/src/main/java/org/elasticsearch/cluster/metadata/IndexMetadata.java

@@ -387,6 +389,7 @@ public static APIBlock readFrom(StreamInput input) throws IOException {
    private final ActiveShardCount waitForActiveShards;
    private final ImmutableOpenMap<String, RolloverInfo> rolloverInfos;
    private final boolean isSystem;
+    private final boolean isDataStreamIndex;


I don't think we should change the index metadata for this information. That would only work on data streams created after 7.15, unless I missed something?
Checking whether a concrete index is part of a data stream can be done dynamically when the IndexShard or IndexService is created. You can access the index abstraction lookup with clusterService.state().metadata().getIndicesLookup().

danhermann · 2021-07-21

+1 to what @jimczi said here. Individual indices can be detached from and added to data streams in a variety of scenarios, including snapshot restores and the relatively new Migrate to Data Stream API. If such a flag were included in index metadata, code would need to be added in all such places to correct the index metadata, and we'd prefer to avoid that.

mayya-sharipova · 2021-07-21

@jimczi Thank you for the comment. I started with a dynamic lookup via clusterService.state().metadata().getIndicesLookup() as you suggested, but it is not allowed to access the cluster state while the cluster state is being modified (e.g. during index creation). There is a function ClusterApplierService::assertNotCalledFromClusterStateApplier that checks for this and fails with the error: "the applied cluster state is not yet available". I can investigate whether it is possible to do this check somewhere earlier.

@danhermann Thanks for checking, and for highlighting the Migrate to Data Stream API. I would think that this API is used quite rarely, and that most data stream indices are created in the usual way. But indeed, if it is used, we would need to update the index metadata.

I will see if it is possible to do this check dynamically, but as we build more code around data streams it would be nice to have an easy way to check whether a shard is part of a data stream, with APIs like IndexMetadata::isDataStreamIndex or IndexShard::isDataStreamIndex.

mayya-sharipova · 2021-07-21T17:01:30Z

@hendrikmuhs Thanks for looking and raising your concerns. Answering your questions:

> Does sorting get always applied?
>
> In the case of rollup and transform a composite agg is used and both read data in ascending order (date histogram), in addition both can not benefit from an early exit, but actually disable this explicitly.

Yes, it will always be applied to data stream indices. But this PR is about sorting segments, that is, choosing which segment is best to start with when collecting documents. It adds a tiny overhead, and for aggregations that collect all documents from all segments it doesn't really matter which segment we start from, so I don't think there will be any impact on the performance of date histogram or other aggs (but I will make sure to benchmark this).

I think a bigger concern for your use case could be #71186, which proposes adding a sort on @timestamp by default to all searches on data stream indices. Please raise your very valid concerns on that issue.

hendrikmuhs · 2021-07-22T16:13:17Z

@mayya-sharipova

Thanks for the response. It would be good to benchmark it. If you need help setting this up, let me know. It seems tricky to benchmark: I might be wrong, but a lot of benchmarks force-merge to 1 segment before testing, and we obviously don't want that behavior in this case.

I have a transform track that hasn't been PR'd yet and might be useful, although we don't need transform for this. The aggregation tracks might work as well.

@timestamp

It is beneficial to sort the segments within a data stream's index in descending order of their max timestamp field, so that the most recent segments (in terms of timestamp) come first. This speeds up sort queries on @timestamp desc, which are the most common type of query for data streams, as we are mostly concerned with recent data. This patch addresses this for writable indices. The segment sorter is different from index sorting. An index sorter by itself is only concerned with the order of docs within an individual segment (and not with how the segments are organized), while the segment sorter is only used during search and allows doc collection to start with the "right" segment, so we can terminate the collection faster. This PR adds a property `isDataStreamIndex` to IndexShard that shows whether a shard is part of a data stream.

mayya-sharipova · 2021-07-26T19:44:57Z

@jimczi Thanks for your review so far. In a9d44d9, I've added a data-stream check as you suggested; the check is done in IndicesClusterStateService::createShard, and it is now a property of the IndexShard object instead of being part of the stored metadata.

There is an unresolved issue with the Migrate to Data Stream API: as shards are not closed and reopened during migration, a migration to a data stream will not be reflected on the already existing migrated indices. We have two options to address it:

1. Decide that it doesn't matter. Migration happens rarely, this only affects speedups in search, and on new indices it will be reflected anyway.
2. @dnhatn suggested applying the leaf sorter not only to data stream indices, but to all indices that have a @timestamp field. For this, we can check MapperService in a similar way to how it is done in TimestampFieldMapperService.

@jimczi WDYT?

jpountz · 2021-07-27T14:38:21Z

@hendrikmuhs Do I understand you correctly that benchmarking date histograms on @timestamp should be a good enough proxy to make sure that transforms are unaffected by this change? Mayya and I just discussed doing this before merging this change.

jimczi · 2021-07-27T14:42:14Z

> @dnhatn suggested applying the leaf sorter not only to data stream indices, but to all indices that have a @timestamp field. For this, we can check MapperService in a similar way to how it is done in TimestampFieldMapperService.

I like the idea. That'll probably require moving the timestamp metadata field into core, but that makes sense to me.

hendrikmuhs · 2021-07-28T08:33:10Z

> @hendrikmuhs Do I understand you correctly that benchmarking date histograms on @timestamp should be a good enough proxy to make sure that transforms are unaffected by this change? Mayya and I just discussed doing this before merging this change.

Yes, that sounds like a good plan to me. Thanks for looking into it!

dimitris-athanasiou · 2021-07-28T10:24:56Z

Another use case that might be affected is scrolling through all docs in ascending order of a date field. This is what ML anomaly detection does when aggregations can't be used (which is many use cases). It might be worth also benchmarking if segment sorting slows this use case down significantly.
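
As a rough illustration of why this case is worth benchmarking, here is a toy model in plain Java. It is only a sketch under simplifying assumptions: segments are plain `long[]` arrays with non-overlapping timestamp ranges, `segmentsOpenedForOldestN` is a hypothetical helper, and the skip rule is a stand-in for how a real collector prunes non-competitive segments. It says nothing about the actual cost in Lucene or Elasticsearch.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class AscendingScanSketch {

    /** Counts the segments opened to find the n smallest timestamps, visiting segments in list order. */
    static int segmentsOpenedForOldestN(List<long[]> segments, int n) {
        // Max-heap holding the n smallest values seen so far; peek() is the current n-th smallest.
        PriorityQueue<Long> oldest = new PriorityQueue<>(Collections.reverseOrder());
        int opened = 0;
        for (long[] segment : segments) {
            long min = Arrays.stream(segment).min().orElse(Long.MAX_VALUE);
            // A segment whose min is not below the current n-th smallest value cannot contribute.
            if (oldest.size() == n && min >= oldest.peek()) {
                continue;
            }
            opened++;
            for (long t : segment) {
                oldest.add(t);
                if (oldest.size() > n) {
                    oldest.poll();
                }
            }
        }
        return opened;
    }

    public static void main(String[] args) {
        List<long[]> newestFirst = Arrays.asList(
            new long[] {7, 8, 9}, new long[] {4, 5, 6}, new long[] {1, 2, 3});
        List<long[]> oldestFirst = Arrays.asList(
            new long[] {1, 2, 3}, new long[] {4, 5, 6}, new long[] {7, 8, 9});
        System.out.println(segmentsOpenedForOldestN(newestFirst, 2)); // prints 3
        System.out.println(segmentsOpenedForOldestN(oldestFirst, 2)); // prints 1
    }
}
```

In this model, an ascending scan over segments ordered newest-first cannot skip anything, while the same scan over segments ordered oldest-first stops after the first segment. Whether that translates into a measurable slowdown in practice is what the benchmark would have to show.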

mayya-sharipova · 2021-08-04T18:46:19Z

@elasticmachine run elasticsearch-ci/part-1

mayya-sharipova · 2021-09-02T18:40:22Z

@elasticmachine run elasticsearch-ci/part-2

mayya-sharipova · 2021-09-03T11:54:31Z

@elasticmachine run elasticsearch-ci/bwc

@timestamp

It is beneficial to sort the segments within a data stream's index in descending order of their max timestamp field, so that the most recent segments (in terms of timestamp) come first. This speeds up sort queries on @timestamp desc, which are the most common type of query for data streams, as we are mostly concerned with recent data. This patch addresses this for writable indices. The segment sorter is different from index sorting. An index sorter by itself is concerned with the order of docs within an individual segment (and not with how the segments are organized), while the segment sorter is only used during search and allows doc collection to start with the "right" segment, so we can terminate the collection faster. This PR adds a property `isDataStreamIndex` to IndexShard that shows whether a shard is part of a data stream. Backport for #75195

* master: (128 commits)
  Mute DieWithDignityIT (elastic#77283)
  Fix randomization in MlNodeShutdownIT (elastic#77281)
  Add target_node_name for REPLACE shutdown type (elastic#77151)
  [DOCS] Adds information about version compatibility headers (elastic#77096)
  Fix template equals when mappings are wrapped (elastic#77008)
  Fix TextFieldMapper Retaining a Reference to its Builder (elastic#77251)
  Move die with dignity to be a test module (elastic#77136)
  Update task names for rest compatiblity (elastic#75267)
  [ML] adjusting bwc serialization for elastic#77256 (elastic#77257)
  Move `index.hidden` from Static to Dynamic settings (elastic#77218)
  Handle cgroups v2 in `OsProbe` (elastic#77128)
  Choose postings format from FieldMapper instead of MappedFieldType (elastic#77234)
  Add segment sorter for data streams (elastic#75195)
  Update skip after backport (elastic#77212)
  [ML] adding new defer_definition_decompression parameter to put trained model API (elastic#77189)
  [ML] Fix bug in inference stats persister for when feature reset is called
  Only check replicas in cancelling existing recoveries. (elastic#60564)
  Format `AbstractFilteringTestCase` (elastic#77217)
  [DOCS] Fixes line breaks. (elastic#77248)
  Convert 'routing' values in REST API tests to strings
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java

@timestamp

PR elastic#75195 added a segment sorter on @timestamp desc for data stream indices. This PR applies the segment sorter to all indices that have a @timestamp field. The presence of a @timestamp field can serve as a strong indication that we are dealing with time series indices. The most common type of query for time series indices is to get the latest data, that is, data sorted by @timestamp desc. This PR sorts segments by @timestamp desc, which speeds up this kind of query. Relates to elastic#75195

@timestamp

PR elastic#75195 added a segment sorter on @timestamp desc for data stream indices. This PR applies the segment sorter to all indices that have a @timestamp field. The presence of a @timestamp field can serve as a strong indication that we are dealing with time series indices. The most common type of query for time series indices is to get the latest data, that is, data sorted by @timestamp desc. This PR sorts segments by @timestamp desc, which speeds up this kind of query. Backport for elastic#78639 Relates to elastic#75195

Ordinary indices support sorting their segments on the timestamp when they have such a field defined in their mappings (see elastic#75195). This was not initially supported, as a directory reader could not be created with a leaf sorter; that is now possible in Lucene, and we can make use of it.

mayya-sharipova added the :Search/Search Search-related issues that do not fall into other categories label Jul 9, 2021

elasticmachine added the Team:Search Meta label for search team label Jul 9, 2021

mayya-sharipova added >enhancement v8.0.0 v7.15.0 labels Jul 9, 2021

mayya-sharipova force-pushed the datastreams-leaf-sorter branch from 3106e3a to 4661286 Compare July 12, 2021 15:30

mayya-sharipova force-pushed the datastreams-leaf-sorter branch from 4661286 to 3b25eed Compare July 14, 2021 15:57

mayya-sharipova force-pushed the datastreams-leaf-sorter branch from 3b25eed to 5360987 Compare July 14, 2021 18:11

mayya-sharipova changed the title ~~Add default leaf sorter for data streams~~ Add segment sorter for data streams Jul 16, 2021

jimczi reviewed Jul 21, 2021

Revert "Add default leaf sorter for data streams"

97b9f72

This reverts commit 5360987, as we want to dynamically determine whether an index belongs to a data stream.

mayya-sharipova added 2 commits August 4, 2021 11:02

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

8845d8d

IndexShard to use mappingLookup().isDataStreamTimestampFieldEnabled()

e73bb4c

after merging of PR#75906

mayya-sharipova added 2 commits August 4, 2021 15:37

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

72fa961

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

711a52b

mayya-sharipova added 5 commits September 2, 2021 10:56

Throw an exception when datastream doesn't contain @timestamp field.

6a04a57

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

730d40b

Remove temp debugging

8724cde

Change IllegalArgument to IllegatState exception

fbb937b

Return Long.MIN_VALUE for datastream sorter if the segment doesn't contain documents.

77941bc
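
A minimal sketch of what this sentinel achieves, assuming (as in the PR description) that segments are ordered by max timestamp descending. This is plain Java rather than the PR's Lucene-based comparator; `sortKey` and `NEWEST_FIRST` are hypothetical names, and a segment is modeled as a bare `long[]` of timestamps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class EmptySegmentSortKeySketch {

    /** Sort key of a segment: its max timestamp, or Long.MIN_VALUE when it holds no documents. */
    static long sortKey(long[] timestamps) {
        long max = Long.MIN_VALUE; // stays at the sentinel for an empty segment
        for (long t : timestamps) {
            max = Math.max(max, t);
        }
        return max;
    }

    /** Descending order by sort key, so empty segments end up last. */
    static final Comparator<long[]> NEWEST_FIRST =
        Comparator.comparingLong(EmptySegmentSortKeySketch::sortKey).reversed();

    public static void main(String[] args) {
        List<long[]> segments = new ArrayList<>(Arrays.asList(
            new long[] {5, 6}, new long[] {}, new long[] {9}));
        segments.sort(NEWEST_FIRST);
        for (long[] s : segments) {
            System.out.println(Arrays.toString(s));
        }
        // prints [9], then [5, 6], then []
    }
}
```

`Comparator.comparingLong` compares keys with `Long.compare`, so the `Long.MIN_VALUE` sentinel does not overflow the way a subtraction-based comparison could.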

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

4e0e2a2

mayya-sharipova commented Sep 2, 2021

server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java (outdated, resolved)

Change assertion to if statement

4c5bddc

Merge remote-tracking branch 'upstream/master' into datastreams-leaf-sorter

6d8eefc

mayya-sharipova merged commit f18b9d5 into elastic:master Sep 3, 2021

mayya-sharipova deleted the datastreams-leaf-sorter branch September 3, 2021 13:43

mayya-sharipova mentioned this pull request Sep 3, 2021

Add segment sorter for data streams #77261

Merged

jpountz added the release highlight label Sep 8, 2021

danhermann mentioned this pull request Sep 9, 2021

Update data stream test skip versions #77517

Merged

jakelandis added v8.0.0-alpha2 and removed v8.0.0 labels Sep 15, 2021

mayya-sharipova mentioned this pull request Oct 4, 2021

Expand segment sorter for all timeseries indices #78639

Merged

mayya-sharipova mentioned this pull request Oct 5, 2021

Expand segment sorter for all timeseries indices #78713

Merged

javanna mentioned this pull request Feb 7, 2023

Sort segments on timestamp in read only engine #93576

Merged
