Lazy data stream rollover is not triggered when using reroute #112781

Closed
axw opened this issue Sep 12, 2024 · 12 comments · Fixed by #116031
Assignees
Labels
>bug :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team

Comments

@axw
Member

axw commented Sep 12, 2024

Elasticsearch Version

8.15.1

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

Lazy rollover on a data stream is not triggered when writing a document that is rerouted to another data stream. This affects the apm-data plugin, where we perform a lazy rollover of matching data stream patterns when installing or updating index templates. The data stream never rolls over. See elastic/apm-server#14060 (comment)

Should a write that leads to a reroute also trigger the lazy rollover? I think so, otherwise the default pipeline will not change.

Steps to Reproduce

  1. Create an ingest pipeline with a reroute processor, and an index template which sets it as the default ingest pipeline
PUT /_ingest/pipeline/demo-reroute
{
  "processors": [
    {
      "reroute": {"namespace": "foo"}
    }
  ]
}

PUT /_index_template/demo_1
{
  "index_patterns" : ["demo*"],
  "data_stream": {},
  "priority" : 1,
  "template": {
    "settings" : {
      "number_of_shards": 1,
      "index.default_pipeline": "demo-reroute"
    }
  }
}
  2. Create a data stream matching the index template
PUT /_data_stream/demo-dataset-default
  3. Send a document to the data stream; it will be rerouted
POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-foo-2024.09.12-000001",
  "_id": "z2Ab5JEBCHevSrCVP7aG",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
  4. Create another index template with higher priority with the same index pattern, with no default ingest pipeline
PUT /_index_template/demo_2
{
  "index_patterns" : ["demo*"],
  "data_stream": {},
  "priority" : 2
}
  5. Roll over the source data stream with "lazy=true"
POST /demo-dataset-default/_rollover?lazy=true
  6. Send a document to the data stream; it will still be rerouted
POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-foo-2024.09.12-000001",
  "_id": "x2gc5JEBfAEizTaQVStE",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 1,
  "_primary_term": 1
}
  7. Roll over the source data stream with "lazy=false"
POST /demo-dataset-default/_rollover?lazy=false
  8. Send a document to the data stream; it will not be rerouted
POST /demo-dataset-default/_doc
{
  "@timestamp": "2024-09-12"
}

{
  "_index": ".ds-demo-dataset-default-2024.09.12-000002",
  "_id": "1mAf5JEBCHevSrCVc7YV",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}
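
The behaviour in the steps above can be mimicked with a toy model. The following Python sketch only illustrates the reported (pre-fix) flow and is not Elasticsearch code: the `Stream` class, its fields, `index_doc`, and `run_repro` are hypothetical names, backing index names omit the date, and the reroute target is assumed to have no pipeline of its own.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REROUTE_PIPELINE = "demo-reroute"      # pipeline name from the steps above
REROUTE_TARGET = "demo-dataset-foo"    # where that pipeline sends documents


@dataclass
class Stream:
    """Hypothetical stand-in for a data stream and its current write index."""
    name: str
    index_pipeline: Optional[str]      # default pipeline baked into the write index
    template_pipeline: Optional[str]   # what the winning index template says now
    generation: int = 1
    rollover_on_write: bool = False    # set by _rollover?lazy=true

    def rollover(self) -> None:
        # A rollover creates a new write index from the current template.
        self.generation += 1
        self.index_pipeline = self.template_pipeline
        self.rollover_on_write = False


def index_doc(streams: Dict[str, Stream], name: str) -> str:
    """Pre-fix flow: run the stored pipeline first, check the lazy flag on write."""
    stream = streams[name]
    if stream.index_pipeline == REROUTE_PIPELINE:
        # The document leaves the source stream, so the source never sees a write.
        stream = streams[REROUTE_TARGET]
    if stream.rollover_on_write:
        stream.rollover()
    return f".ds-{stream.name}-{stream.generation:06d}"


def run_repro() -> List[str]:
    source = Stream("demo-dataset-default", REROUTE_PIPELINE, REROUTE_PIPELINE)
    target = Stream(REROUTE_TARGET, None, None)  # simplification: no pipeline
    streams = {source.name: source, target.name: target}

    written = [index_doc(streams, source.name)]      # step 3: rerouted
    source.template_pipeline = None                  # step 4: demo_2 wins
    source.rollover_on_write = True                  # step 5: lazy=true
    written.append(index_doc(streams, source.name))  # step 6: still rerouted
    source.rollover()                                # step 7: lazy=false
    written.append(index_doc(streams, source.name))  # step 8: not rerouted
    return written
```

In this model the lazy flag on the source stream is never consumed in step 6, because the stale pipeline moves the document away before any write to the source happens.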

Logs (if relevant)

No response

@axw axw added >bug needs:triage Requires assignment of a team area label labels Sep 12, 2024
@javanna javanna added :Data Management/Data streams Data streams and their lifecycles and removed needs:triage Requires assignment of a team area label labels Sep 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Sep 13, 2024
@axw
Member Author

axw commented Sep 25, 2024

Is this a more general case of lazy rollover only being triggered post ingest pipeline, and not specific to rerouting? We're also seeing issues related to upgrading from older versions of APM (e.g. 8.12.1) to 8.15.1, without any reroute processor involved.

@gmarouli
Contributor

gmarouli commented Oct 1, 2024

Hi @axw, we did not know you were doing version checks on the pipelines, so yes, that is definitely a side effect of the lazy rollover happening only upon a write to the index. The timing of the rollover is important though because if we rollover earlier we risk creating empty indices.

We discussed possible approaches to solve this in a way that does not produce extra indices and we have the following proposal:

  1. When a data stream is marked for a lazy rollover (and only then)
  2. We would resolve the template and retrieve the most up-to-date pipeline to be executed.

This way we have the following benefits:

  • We are using the latest pipeline
  • We do not rollover if not necessary (aka if there is a reroute processor)

The drawbacks:

  • If the data stream marked for a lazy rollover has a pipeline with the reroute processor, we risk resolving the templates once per request "forever" since no writes will reach that data stream. We should check the overhead this adds to the indexing process.
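
The proposal above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Elasticsearch implementation: `resolve_default_pipeline` and `pipeline_to_run` are hypothetical helpers, templates are plain dicts shaped like the request bodies in the reproduction steps, and index patterns are assumed to use only `*` wildcards.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List, Optional


def resolve_default_pipeline(templates: List[Dict[str, Any]],
                             stream_name: str) -> Optional[str]:
    """Return the default pipeline of the highest-priority matching template."""
    matching = [
        t for t in templates
        if any(fnmatchcase(stream_name, p) for p in t["index_patterns"])
    ]
    if not matching:
        return None
    winner = max(matching, key=lambda t: t.get("priority", 0))
    settings = winner.get("template", {}).get("settings", {})
    return settings.get("index.default_pipeline")


def pipeline_to_run(stream: Dict[str, Any],
                    templates: List[Dict[str, Any]]) -> Optional[str]:
    """Only re-resolve from templates when the stream is marked for lazy rollover."""
    if stream["rollover_on_write"]:
        return resolve_default_pipeline(templates, stream["name"])
    # Otherwise keep using the pipeline stored with the current write index.
    return stream["index_default_pipeline"]


demo_1 = {
    "index_patterns": ["demo*"],
    "priority": 1,
    "template": {"settings": {"index.default_pipeline": "demo-reroute"}},
}
demo_2 = {"index_patterns": ["demo*"], "priority": 2}

stream = {
    "name": "demo-dataset-default",
    "index_default_pipeline": "demo-reroute",
    "rollover_on_write": True,
}
# With demo_2 installed and the lazy flag set, no reroute pipeline is selected.
selected = pipeline_to_run(stream, [demo_1, demo_2])
```

The drawback mentioned above shows up here as well: while the flag stays set, `resolve_default_pipeline` runs on every call.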

@axw
Member Author

axw commented Oct 1, 2024

@gmarouli thanks, sounds reasonable. Just to clarify, we don't do version checks in recent versions of our ingest pipeline - that only applies to versions before 8.13.0.

> If the data stream marked for a lazy rollover has a pipeline with the reroute processor, we risk resolving the templates once per request "forever" since no writes will reach that data stream. We should check the overhead this adds to the indexing process.

+1 that was also my first thought.

Would it make sense to extend this approach to also update the marked data stream after executing the pipeline if there were no writes?

@gmarouli
Contributor

gmarouli commented Oct 2, 2024

> Would it make sense to extend this approach to also update the marked data stream after executing the pipeline if there were no writes?

What do you mean with this?

@axw
Member Author

axw commented Oct 3, 2024

@gmarouli sorry, that was very unclear, let me try again.

If the data stream is marked for lazy rollover, do what you described where we resolve any settings (e.g. ingest pipeline) that may affect ingestion from the matching index template; then if there was a change in template, execute the rollover even if there were no writes to the data stream's backing index. That way we wouldn't need to do the template resolution on every write to the data stream, only once per lazy rollover.

@gmarouli
Contributor

gmarouli commented Oct 3, 2024

@axw thank you for the explanation, I get it now.

You are right, that would address the potential latency, but we would be creating empty indices, which is something we want to avoid. Let's see what the impact is and whether it can be sustained until we have a more structural solution available.

@simitt
Contributor

simitt commented Oct 28, 2024

@gmarouli is this something that can be fixed for 8.16.x, or 8.17.0 at the latest? APM customers are having a bad time: indices aren't rolled over if they are upgrading from versions <= 8.13.0 or if they are making use of custom ingest pipelines with reroute processors.

@mattc58
Contributor

mattc58 commented Oct 29, 2024

@simitt we're discussing some options to address this. Will post back here later today with our suggested approach and timeline.

@mattc58
Contributor

mattc58 commented Oct 29, 2024

Ok @parkertimmins is going to work on this. Our initial thought is that we can get this done in a week or so, and we'll target 8.16.1 and 8.17.0.

@parkertimmins
Contributor

I've been working on this ticket today, and have added a prototype change that re-resolves the default pipeline from templates if lazy rollover is set. This appears to work fine.

Currently, it does the pipeline resolution for every index request within a bulk request. This will need to be optimized to only do resolution once per index written to within a bulk request, which will add some complexity. I think finishing the feature itself will take another 2 days. So, including the time for functional and performance tests, I think 1 week is a decent estimate.
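
The optimization described above amounts to memoizing the resolution per target index for the lifetime of one bulk request. A minimal Python sketch, assuming a hypothetical `resolve` callback that stands in for the expensive template resolution (none of these names come from the Elasticsearch code base):

```python
from typing import Callable, Dict, Iterable, List, Optional


def resolve_pipelines_for_bulk(
    targets: Iterable[str],
    resolve: Callable[[str], Optional[str]],
) -> List[Optional[str]]:
    """Resolve the pipeline for each item of one bulk request.

    `resolve` is called at most once per distinct target; the cache lives
    only for this call, i.e. for a single bulk request.
    """
    cache: Dict[str, Optional[str]] = {}
    resolved: List[Optional[str]] = []
    for target in targets:
        if target not in cache:
            cache[target] = resolve(target)
        resolved.append(cache[target])
    return resolved


# Example: count how often the (stand-in) resolver is invoked.
calls: List[str] = []


def counting_resolve(target: str) -> Optional[str]:
    calls.append(target)
    return "demo-reroute" if target.startswith("demo") else None


bulk_targets = ["demo-a", "demo-a", "logs-b", "demo-a", "logs-b"]
pipelines = resolve_pipelines_for_bulk(bulk_targets, counting_resolve)
```

Scoping the cache to one request keeps it from going stale across template updates, at the cost of one resolution per distinct target per request.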

parkertimmins added a commit to parkertimmins/elasticsearch that referenced this issue Nov 2, 2024
If datastream rollover on write flag is set in cluster state, resolve pipelines from templates rather than from metadata. This fixes the following bug: when a pipeline reroutes every document to another index, and rollover is called with lazy=true (setting the rollover on write flag), changes to the pipeline do not go into effect, because the lack of writes means the data stream never rolls over and pipelines in metadata are not updated. The fix is to resolve pipelines from templates if the lazy rollover flag is set. To improve efficiency we only resolve pipelines once per index in the bulk request, caching the value, and reusing for other requests to the same index.

Fixes: elastic#112781
parkertimmins added a commit to parkertimmins/elasticsearch that referenced this issue Nov 2, 2024
Fixes: elastic#112781
(cherry picked from commit 6db39d1)

# Conflicts:
#	server/src/main/java/org/elasticsearch/action/bulk/TransportAbstractBulkAction.java
#	server/src/main/java/org/elasticsearch/ingest/IngestService.java
elasticsearchmachine pushed a commit that referenced this issue Nov 2, 2024
… (#116131)

* Resolve pipelines from template if lazy rollover write  (#116031)

Fixes: #112781

* Remute tests blocking merge

* Remute tests blocking merge
elasticsearchmachine pushed a commit that referenced this issue Nov 2, 2024
#116132)

* Resolve pipelines from template if lazy rollover write  (#116031)

Fixes: #112781

* Remute tests block merge

* Remute tests block merge
parkertimmins added a commit that referenced this issue Nov 2, 2024
…6137)

Fixes: #112781
(cherry picked from commit 6db39d1)
jfreden pushed a commit to jfreden/elasticsearch that referenced this issue Nov 4, 2024
Fixes: elastic#112781
@parkertimmins
Contributor

parkertimmins commented Nov 8, 2024

As there are some concerns about a performance regression on this ticket, I've run the following benchmarks.

Benchmark Structure

Two separate benchmark tests were run, one with 2 data streams and one with 3 data streams, which we call 2-layer and 3-layer. The 2-layer test attempts to insert into data-stream-1, which reroutes to data-stream-2, where all docs are inserted. The 3-layer test attempts to insert into data-stream-1, which reroutes to data-stream-2, which reroutes to data-stream-3, where the docs are inserted.

The fix made in the ticket will behave differently on the 2-layer and 3-layer tests. In both cases, on the initial data stream, the pipeline will only be resolved from templates once per bulk request. This is because the resolved template is cached per data stream being inserted into, and in this test there is only one per request. On the other hand, in the 3-layer test, in the second reroute, every doc in a bulk request requires a separate pipeline resolution. In both tests, the data streams being rerouted away from have the lazy rollover flag set.

In both tests, the data streams with reroute pipelines each had 10 matching index templates. Each index template was composed of 10 component templates. The final data stream, which received the documents, had only a single matching index template.

The bulk request batch size was varied between 10, 100, 1000, and 10000 docs. The inserted dataset contains 3.2 million documents. All data stream indices had 1 primary and 0 replicas. This was run on a single-node cluster, on a single machine with 64 GB of RAM and 20 CPUs. The tests were run in Rally. Though this configuration is not typical of a production cluster, we expected a single-node cluster with 0 replicas to have the worst-case behavior for the tested change.
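
To make the difference between the two tests concrete, the number of template resolutions can be estimated from the caching behaviour described above. The Python sketch below is back-of-the-envelope arithmetic, not a measurement: `template_resolutions` is a hypothetical helper, and it assumes the dataset holds exactly 3,200,000 documents, one resolution per bulk request on the first data stream, and one extra resolution per document at the second reroute of the 3-layer test.

```python
import math

TOTAL_DOCS = 3_200_000  # assumption: "3.2 million" taken as an exact count


def template_resolutions(batch_size: int, layers: int) -> int:
    """Estimate template resolutions over the whole benchmark run."""
    if layers not in (2, 3):
        raise ValueError("only the 2-layer and 3-layer tests are modelled")
    bulk_requests = math.ceil(TOTAL_DOCS / batch_size)
    # First data stream: resolved once per bulk request thanks to the cache.
    total = bulk_requests
    if layers == 3:
        # Second reroute: every document needs its own resolution.
        total += TOTAL_DOCS
    return total


for batch_size in (10, 100, 1000, 10000):
    print(batch_size,
          template_resolutions(batch_size, 2),
          template_resolutions(batch_size, 3))
```

Under these assumptions the 2-layer count shrinks as the batch size grows, while the 3-layer count is dominated by the per-document term regardless of batch size.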

Results

The following plots show the throughput in docs/second of the test vs baseline code on the 2-layer and 3-layer benchmarks.

[Two plots: throughput in docs/second, test vs. baseline, for the 2-layer and 3-layer benchmarks]

These are combined in the following plot, which shows the symmetric percent difference between test and baseline for both the 2-layer and 3-layer benchmarks: percent diff = 100 * (test - base) / min(test, base). A positive value is the percentage by which test throughput is better than baseline; a negative value is the percentage by which baseline is better.
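
As a quick illustration of the formula, here is a small Python helper. The function name and the sample throughput values are made up for the example and are not benchmark results.

```python
def symmetric_percent_diff(test: float, base: float) -> float:
    """100 * (test - base) / min(test, base), for positive throughputs."""
    if test <= 0 or base <= 0:
        raise ValueError("throughput values must be positive")
    return 100.0 * (test - base) / min(test, base)


# Hypothetical throughputs in docs/second:
faster = symmetric_percent_diff(test=110.0, base=100.0)  # test is ahead
slower = symmetric_percent_diff(test=100.0, base=110.0)  # baseline is ahead
```

Dividing by the smaller of the two values makes the measure symmetric: swapping test and baseline flips the sign but keeps the magnitude.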

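[Plot: symmetric percent difference in throughput between test and baseline, 2-layer and 3-layer benchmarks]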

As expected, the test version performs worse in most cases. On average we see a 2% decrease in throughput across all tests, and a maximum decrease in throughput of 10.7% for the 3-layer test with a batch size of 1000. Notably, the 3-layer test with a batch size of 10k has a decrease of only 2.9%. This is likely a result of the pipeline caching making up for the slowdown caused by template resolution. For this same reason, we see a 10% improvement over the baseline in the 2-layer test with 10k docs per batch.

Conclusion

This test was designed to show the worst-case scenario for the new feature. In most cases, overhead from other operations will outweigh the slowdown caused by this change. Given this, the average decrease in performance of 2% seems an acceptable trade-off for a necessary bug fix.
