Lazy data stream rollover is not triggered when using reroute #112781

axw · 2024-09-12T02:47:46Z

Elasticsearch Version

8.15.1

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

Lazy rollover on a data stream is not triggered when writing a document that is rerouted to another data stream. This affects the apm-data plugin, where we perform a lazy rollover of matching data stream patterns when installing or updating index templates. The data stream never rolls over. See elastic/apm-server#14060 (comment)

Should a write that leads to a reroute also trigger the lazy rollover? I think so, otherwise the default pipeline will not change.

Steps to Reproduce

Create an index template which sets a default ingest pipeline with reroute

PUT /_ingest/pipeline/demo-reroute
{
  "processors": [
    {
      "reroute": {"namespace": "foo"}
    }
  ]
}

PUT /_index_template/demo_1
{
  "index_patterns" : ["demo*"],
  "data_stream": {}, 
  "priority" : 1,
  "template": {
    "settings" : {
      "number_of_shards": 1,
      "index.default_pipeline": "demo-reroute"
    }
  }
}

Create a data stream matching the index template

PUT /_data_stream/demo-dataset-default

Send a document to the data stream; it will be rerouted

POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-foo-2024.09.12-000001",
  "_id": "z2Ab5JEBCHevSrCVP7aG",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Create another index template with higher priority with the same index pattern, with no default ingest pipeline

PUT /_index_template/demo_2
{
  "index_patterns" : ["demo*"],
  "data_stream": {}, 
  "priority" : 2
}

Rollover the source data stream with "lazy=true"

POST /demo-dataset-default/_rollover?lazy=true

Send a document to the data stream; it will still be rerouted

POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-foo-2024.09.12-000001",
  "_id": "x2gc5JEBfAEizTaQVStE",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}

Rollover the source data stream with "lazy=false"

POST /demo-dataset-default/_rollover?lazy=false

Send a document to the data stream; it will not be rerouted

POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-default-2024.09.12-000002",
  "_id": "1mAf5JEBCHevSrCVc7YV",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Logs (if relevant)

No response


elasticsearchmachine · 2024-09-13T15:21:35Z

Pinging @elastic/es-data-management (Team:Data Management)

axw · 2024-09-25T04:51:34Z

Is this a more general case of lazy rollover only being triggered post ingest pipeline, and not specific to rerouting? We're also seeing issues related to upgrading from older versions of APM (e.g. 8.12.1) to 8.15.1, without any reroute processor involved.

gmarouli · 2024-10-01T06:27:36Z

Hi @axw, we did not know you were doing version checks on the pipelines, so yes, that is definitely a side effect of the lazy rollover happening only upon a write to the index. The timing of the rollover is important though because if we rollover earlier we risk creating empty indices.

We discussed possible approaches to solve this in a way that does not produce extra indices and we have the following proposal:

When a data stream is marked for a lazy rollover (and only then), we would resolve the template and retrieve the most up-to-date pipeline to be executed.

This way we have the following benefits:

We are using the latest pipeline
We do not rollover if not necessary (aka if there is a reroute processor)

The drawbacks:

If the data stream marked for a lazy rollover has a pipeline with the reroute processor, we risk resolving the templates once per request "forever" since no writes will reach that data stream. We should check the overhead this adds to the indexing process.
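
To make the proposal concrete, here is a toy model of the idea, not Elasticsearch code. The `DataStream` class, the template dictionaries, and both function names are hypothetical; index patterns are matched with `fnmatch` purely for illustration. It only shows the decision being discussed: use the pipeline stored in metadata unless the lazy rollover flag is set, in which case re-resolve from the highest-priority matching template.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Optional


@dataclass
class DataStream:
    """Hypothetical stand-in for a data stream's state."""
    name: str
    # Default pipeline recorded when the current write index was created.
    metadata_pipeline: Optional[str]
    # Set by POST /<stream>/_rollover?lazy=true in this toy model.
    rollover_on_write: bool = False


def resolve_from_templates(stream_name, templates):
    """Return the default pipeline of the highest-priority matching template.

    Assumes at least one template matches the stream name.
    """
    matching = [
        t for t in templates
        if any(fnmatch(stream_name, p) for p in t["index_patterns"])
    ]
    best = max(matching, key=lambda t: t["priority"])
    return best.get("default_pipeline")


def pipeline_for_write(stream, templates):
    """Pick the pipeline to run for a write to `stream`."""
    # Only pay for template resolution while a lazy rollover is pending.
    if stream.rollover_on_write:
        return resolve_from_templates(stream.name, templates)
    return stream.metadata_pipeline


# Mirrors the reproduction steps: demo_1 sets a reroute pipeline, demo_2 does not.
templates = [
    {"index_patterns": ["demo*"], "priority": 1, "default_pipeline": "demo-reroute"},
    {"index_patterns": ["demo*"], "priority": 2},
]
stream = DataStream("demo-dataset-default", metadata_pipeline="demo-reroute")
before = pipeline_for_write(stream, templates)
stream.rollover_on_write = True
after = pipeline_for_write(stream, templates)
```

In this model `before` is the stale reroute pipeline and `after` is `None`, so the write would reach the data stream itself. The drawback above also shows up here: while the flag stays set and the resolved pipeline still reroutes, every call goes through `resolve_from_templates`.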

axw · 2024-10-01T06:38:11Z

@gmarouli thanks, sounds reasonable. Just to clarify, we don't do version checks in recent versions of our ingest pipeline - that only applies to versions before 8.13.0.

If the data stream marked for a lazy rollover has a pipeline with the reroute processor, we risk resolving the templates once per request "forever" since no writes will reach that data stream. We should check the overhead this adds to the indexing process.

+1 that was also my first thought.

Would it make sense to extend this approach to also update the marked data stream after executing the pipeline if there were no writes?

gmarouli · 2024-10-02T11:01:42Z

Would it make sense to extend this approach to also update the marked data stream after executing the pipeline if there were no writes?

What do you mean with this?

axw · 2024-10-03T02:16:43Z

@gmarouli sorry, that was very unclear, let me try again.

If the data stream is marked for lazy rollover, do what you described where we resolve any settings (e.g. ingest pipeline) that may affect ingestion from the matching index template; then if there was a change in template, execute the rollover even if there were no writes to the data stream's backing index. That way we wouldn't need to do the template resolution on every write to the data stream, only once per lazy rollover.

gmarouli · 2024-10-03T06:53:31Z

@axw thank you for the explanation, I get it now.

You are right, that would address the potential latency, but we would be creating empty indices, which is something we want to avoid. Let's see what the impact is and whether it can be sustained until we have a more structural solution available.

simitt · 2024-10-28T13:12:18Z

@gmarouli is this something that can be fixed for 8.16.x, or 8.17.0 at the latest? APM customers are having a bad time: indices aren't rolled over if they are upgrading from versions <= 8.13.0 or if they are making use of custom ingest pipelines with reroute processors.

mattc58 · 2024-10-29T13:58:15Z

@simitt we're discussing some options to address this. Will post back here later today with our suggested approach and timeline.

mattc58 · 2024-10-29T15:46:39Z

Ok @parkertimmins is going to work on this. Our initial thought is that we can get this done in a week or so, and we'll target 8.16.1 and 8.17.0.

parkertimmins · 2024-10-30T20:41:31Z

I've been working on this ticket today, and have added a prototype change that re-resolves default pipeline from templates if lazy rollover is set. This appears to work fine.

Currently, it does the pipeline resolution for every index request within a bulk request. This will need to be optimized to only do resolution once per index written to within bulk request, which will add some complexity. I think finishing the feature itself will take another 2 days. So, including the time for functional and performance tests, I think 1 week is a decent estimate.

If datastream rollover on write flag is set in cluster state, resolve pipelines from templates rather than from metadata. This fixes the following bug: when a pipeline reroutes every document to another index, and rollover is called with lazy=true (setting the rollover on write flag), changes to the pipeline do not go into effect, because the lack of writes means the data stream never rolls over and pipelines in metadata are not updated. The fix is to resolve pipelines from templates if the lazy rollover flag is set. To improve efficiency we only resolve pipelines once per index in the bulk request, caching the value, and reusing for other requests to the same index. Fixes: elastic#112781
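
The "once per index in the bulk request" caching described above can be sketched roughly as follows. This is a simplified illustration, not the actual change in `TransportAbstractBulkAction` or `IngestService`; `resolve_pipelines_for_bulk`, `fake_resolve`, and the tuple layout of the bulk items are all made up for the example.

```python
def resolve_pipelines_for_bulk(bulk_items, resolve_pipeline):
    """Attach a pipeline to each bulk item, resolving once per target index.

    bulk_items: iterable of (index_name, document) pairs.
    resolve_pipeline: callable doing the (expensive) template resolution.
    """
    cache = {}  # index name -> resolved pipeline, scoped to one bulk request
    resolved = []
    for index_name, doc in bulk_items:
        if index_name not in cache:
            cache[index_name] = resolve_pipeline(index_name)
        resolved.append((index_name, doc, cache[index_name]))
    return resolved


# Hypothetical resolver that records how often it is called.
calls = []


def fake_resolve(index_name):
    calls.append(index_name)
    return "pipeline-for-" + index_name


items = [("demo-a", {"n": 1}), ("demo-b", {"n": 2}), ("demo-a", {"n": 3})]
result = resolve_pipelines_for_bulk(items, fake_resolve)
```

Three items target two distinct indices, so `fake_resolve` runs twice and the third item reuses the cached value. The cache lives only for the duration of one bulk request, so a template change is picked up by the next request.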


parkertimmins · 2024-11-08T18:02:26Z

As there are some concerns about a performance regression on this ticket, I've run the following benchmarks.

Benchmark Structure

Two separate benchmark tests were run, one with 3 data streams and one with 2 data streams, which we call 3-layer and 2-layer. The 2-layer test attempts to insert into data-stream-1, which reroutes to data-stream-2, where all docs are inserted. The 3-layer test attempts to insert into data-stream-1, which reroutes to data-stream-2, which reroutes to data-stream-3, where the docs are inserted.

The fix made in the ticket will behave differently on the 2-layer and 3-layer tests. In both cases, on the initial data stream, the pipeline will only be resolved from templates once per bulk request. This is because the resolved template is cached per data stream being inserted into, and in this test there is only one per request. On the other hand, in the 3-layer test, in the second reroute, every doc in a bulk request requires a separate pipeline resolution. In both tests, the data streams being rerouted away from have the lazy rollover flag set.

In both tests, the data streams with reroute pipelines each had 10 matching index templates. Each index template was composed of 10 component templates. The final data stream, which received the documents, only had a single matching index template.

The bulk request batch size was varied between 10, 100, 1000, and 10000 docs. The inserted dataset contains 3.2 million documents. All data stream indices had 1 primary and 0 replicas. This was run on a single-node cluster, on a single machine with 64 GB RAM and 20 CPUs. The tests were run in Rally. Though this configuration is not typical of a production cluster, we expected a single-node cluster with 0 replicas to have the worst-case behavior for the tested change.
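
As a rough back-of-the-envelope check of how much extra work each configuration implies, the snippet below counts template resolutions under the behavior described above: one per bulk request at the first hop, plus one per document at the second hop in the 3-layer test. The function name is hypothetical and the counts are derived from that description, not measured.

```python
import math

TOTAL_DOCS = 3_200_000  # dataset size stated above


def template_resolutions(layers, batch_size, total_docs=TOTAL_DOCS):
    """Estimate template resolutions for a whole benchmark run."""
    bulk_requests = math.ceil(total_docs / batch_size)
    first_hop = bulk_requests  # cached: once per bulk request
    second_hop = total_docs if layers == 3 else 0  # uncached: once per doc
    return first_hop + second_hop


for batch_size in (10, 100, 1000, 10000):
    print(batch_size,
          template_resolutions(2, batch_size),
          template_resolutions(3, batch_size))
```

Under these assumptions the 2-layer count shrinks as the batch size grows, while the 3-layer count is dominated by the fixed per-document term at every batch size.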

Results

The following plots show the throughput in docs/second of the test vs baseline code on the 2-layer and 3-layer benchmarks.

These are combined in the following plot, which shows the symmetric percent difference between test and baseline for both the 2-layer and 3-layer benchmarks: percent diff = 100 * (test - base) / min(test, base). A positive value is the percentage by which test throughput exceeds baseline; a negative value is the percentage by which baseline exceeds test.
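
For reference, the metric is simple to compute. The throughput numbers below are made up to show the symmetry and are not benchmark results.

```python
def symmetric_percent_diff(test, base):
    """100 * (test - base) / min(test, base), as defined above."""
    return 100.0 * (test - base) / min(test, base)


# Hypothetical throughputs in docs/second, for illustration only.
slower = symmetric_percent_diff(test=90_000, base=100_000)
faster = symmetric_percent_diff(test=100_000, base=90_000)
```

Dividing by the smaller of the two values means swapping test and baseline only flips the sign, so `slower` and `faster` have the same magnitude.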

As expected, the test version performs worse in most cases. On average we see a 2% decrease in throughput across all tests, and a maximum decrease of 10.7% for the 3-layer test with a batch size of 1000. Notably, the 3-layer test with a batch size of 10k only has a decrease of 2.9%. This is likely a result of the pipeline caching making up for the slowdown caused by template resolution. For this same reason, we see a 10% improvement over the baseline in the 2-layer test with 10k docs per batch.

Conclusion

This test was designed to show the worst-case scenario for the new feature. In most cases, overhead from other operations will dwarf the slowdown caused by this change. Given this, the average decrease in performance of 2% seems an acceptable trade-off for a necessary bug fix.

axw added the >bug and needs:triage labels Sep 12, 2024

axw mentioned this issue Sep 12, 2024

"failed to parse field [error.grouping_name] of type [keyword] in document" after upgrading to 8.15 elastic/apm-server#14060

Open

javanna added the :Data Management/Data streams label and removed the needs:triage label Sep 13, 2024

elasticsearchmachine added the Team:Data Management label Sep 13, 2024

axw mentioned this issue Oct 15, 2024

Automatically migrate default ILM policy to Data Stream Lifecycle Management elastic/apm-server#14128

Open

mattc58 assigned parkertimmins Oct 29, 2024

parkertimmins mentioned this issue Oct 30, 2024

Resolve pipeline on lazy rollover write #115987

Closed

simitt mentioned this issue Oct 31, 2024

APM: add known issue of lazy rollover bug elastic/observability-docs#4459

Merged

10 tasks

parkertimmins mentioned this issue Oct 31, 2024

Resolve pipeline on lazy rollover write #116031

Merged

parkertimmins closed this as completed in #116031 Nov 2, 2024

parkertimmins closed this as completed in 6db39d1 Nov 2, 2024
