Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sampling/pubsub: fix subscriber #7211

Merged
merged 3 commits into from
Feb 8, 2022
Merged

Conversation

axw
Copy link
Member

@axw axw commented Feb 7, 2022

Motivation/summary

The tail-based sampling subscriber can enter an infinite loop when one of the sampled traces indices contains more than 1000 documents. In this case, we never update the _seq_no filter and continuously issue the same search in an infinite loop.

We fix this by by setting the minimum bound on subsequent searches to the _seq_no returned by the previous search.

Checklist

How to test these changes

See #6642 (comment)

Related issues

Closes #6639
Closes #6642

@axw axw added v8.1.0 backport-8.0 Automated backport with mergify backport-8.1 Automated backport with mergify backport-7.17 Automated backport with mergify to the 7.17 branch labels Feb 7, 2022
Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.
@axw axw force-pushed the pubsub-subscriber-minseqno branch from a79cefa to d4fcd1d Compare February 7, 2022 05:52
@apmmachine
Copy link
Contributor

apmmachine commented Feb 7, 2022

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2022-02-08T01:42:34.804+0000

  • Duration: 86 min 42 sec

Test stats 🧪

Test Results
Failed 0
Passed 5374
Skipped 18
Total 5392

🤖 GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /hey-apm : Run the hey-apm benchmark.

  • /package : Generate and publish the docker images.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@axw axw marked this pull request as ready for review February 7, 2022 06:04
@axw axw requested a review from a team February 7, 2022 06:05
@axw axw enabled auto-merge (squash) February 7, 2022 06:44
@axw
Copy link
Member Author

axw commented Feb 7, 2022

/test

@axw
Copy link
Member Author

axw commented Feb 8, 2022

/test

@axw axw merged commit 796fda2 into elastic:main Feb 8, 2022
mergify bot pushed a commit that referenced this pull request Feb 8, 2022
Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)
mergify bot pushed a commit that referenced this pull request Feb 8, 2022
Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)

# Conflicts:
#	changelogs/head.asciidoc
mergify bot pushed a commit that referenced this pull request Feb 8, 2022
Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)

# Conflicts:
#	changelogs/head.asciidoc
@axw axw deleted the pubsub-subscriber-minseqno branch February 8, 2022 03:09
axw added a commit that referenced this pull request Feb 8, 2022
* sampling/pubsub: fix subscriber (#7211)

Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)

# Conflicts:
#	changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <[email protected]>
axw added a commit that referenced this pull request Feb 8, 2022
* sampling/pubsub: fix subscriber (#7211)

Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)

# Conflicts:
#	changelogs/head.asciidoc

* Delete head.asciidoc

Co-authored-by: Andrew Wilkins <[email protected]>
@axw axw added the test-plan label Feb 8, 2022
axw added a commit that referenced this pull request Feb 9, 2022
Break infinite loop in subscriber by setting a minimum bound
on subsequent searches to the _seq_no returned by the previous
search.

(cherry picked from commit 796fda2)

Co-authored-by: Andrew Wilkins <[email protected]>
@stuartnelson3
Copy link
Contributor

Confirmed with bc2.

Steps:

  1. Started stack (docker-compose up -d), enabled internal stack monitoring
  2. Added default tail-sampling policy to apm-server.yml:
    apm-server:
      sampling:
        tail:
          enabled: true
          interval: 10s
          policies:
          - default:
            sample_rate: 0.5
  3. Ran two APM Servers:
    1. ./apm-server -c apm-server.dev.yml -e
    2. ./apm-server -c apm-server.dev.yml -e -E apm-server.host=localhost:8201 -E path.data=/tmp
  4. Ran the below program 5 times; waited for docs to be indexed (watching Elasticsearch stack monitoring); repeat once more
  5. Search rate and CPU do not increase; graph indicates documents are being indexed
package main

import (
        "os"

        "go.elastic.co/apm"
        "go.elastic.co/apm/transport"
)

func main() {
	os.Setenv("ELASTIC_APM_SERVER_URL", "http://localhost:8200")
	transport1, _ := transport.NewHTTPTransport()
	tracer1, _ := apm.NewTracerOptions(apm.TracerOptions{
		ServiceName: "svc1",
		Transport:   transport1,
	})
	defer tracer1.Flush(nil)

	os.Setenv("ELASTIC_APM_SERVER_URL", "http://localhost:8201")
	transport2, _ := transport.NewHTTPTransport()
	tracer2, _ := apm.NewTracerOptions(apm.TracerOptions{
		ServiceName: "svc2",
		Transport:   transport2,
	})
	defer tracer2.Flush(nil)

	for i := 0; i < 500; i++ {
		tx1 := tracer1.StartTransaction("tx1", "type")
		span := tx1.StartSpan("span", "type", nil)
		tx2 := tracer2.StartTransactionOptions("tx2", "type", apm.TransactionOptions{
			TraceContext: span.TraceContext(),
		})
		tx2.End()
		span.End()
		tx1.End()
	}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport with mergify to the 7.17 branch backport-8.0 Automated backport with mergify backport-8.1 Automated backport with mergify test-plan test-plan-ok v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

APM Server hung in high CPU utilization ES search rate very high when TBS enabled
4 participants