APM Server hung in high CPU utilization #6642

Closed
Tracked by #6894
bryce-b opened this issue Nov 16, 2021 · 4 comments · Fixed by #7211

bryce-b (Contributor) commented Nov 16, 2021

APM Server version (apm-server version): 7.16.0-SNAPSHOT

Description of the problem including expected versus actual behavior:
When testing tail-based sampling, the APM Server got into a state of high CPU utilization that couldn't be mitigated. This happened while editing and updating the APM Server configs. I thought it might have been associated with the policy configs, but it occurred a second time, on a new deployment, while editing different config values.

To bring the CPU usage back down I tried restarting the containers, updating the config, and disabling TBS, several times each; the CPU usage did not change.

The two deployments are still available for debugging:
7f3939f
6de56fc

bryce-b added the bug label Nov 16, 2021
simitt mentioned this issue Dec 17, 2021
simitt added this to the 8.1 milestone Dec 17, 2021

axw (Member) commented Jan 24, 2022

I think we should wait for elastic/kibana#121534 to test this in ESS again.

axw (Member) commented Feb 4, 2022

I think I have reproduced this locally, and I suspect this and #6639 are closely related. In one test run I noticed an increased search rate which hasn't dropped back to baseline, and a high level of CPU consumption.

I'll have to instrument the server to find out what's going on.
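
(For context: one generic way to see where a Go process such as apm-server is spending CPU is the standard pprof tooling. The sketch below is a minimal, self-contained example of exposing net/http/pprof in a Go program, not a description of how apm-server itself exposes profiling; a 30-second CPU profile could then be captured with go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30.)

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
        // Serve the pprof endpoints on a local port; CPU, heap, goroutine and
        // other profiles become available under /debug/pprof/.
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
}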

axw (Member) commented Feb 4, 2022

Also, I reproduced the issue like this:

  1. Started the stack (docker-compose up -d) and enabled internal stack monitoring
  2. Using runapm (#7197, "runapm: make policy name and reinstall flags"), ran two APM Servers:
    1. go run ./systemtest/cmd/runapm/main.go -f -var tail_sampling_enabled=true -var tail_sampling_policies='[{"sample_rate":0.5}]'
    2. go run ./cmd/runapm/main.go -policy=runapm2 -f -reinstall=false -var tail_sampling_enabled=true -var tail_sampling_policies='[{"sample_rate":0.5}]'
  3. Ran the program below 5 times, waited for the docs to be indexed (watching Elasticsearch stack monitoring), then repeated once more
  4. Observed the search rate increase and never drop back to baseline, and observed increased CPU (see the sketch after the program for one way to track the search rate)

package main

import (
        "os"

        "go.elastic.co/apm"
        "go.elastic.co/apm/transport"
)

func main() {
        // First tracer reports to the first APM Server (port 49160). The
        // transport reads ELASTIC_APM_SERVER_URL when it is created, so the
        // env var must be set before NewHTTPTransport is called.
        os.Setenv("ELASTIC_APM_SERVER_URL", "http://localhost:49160")
        transport1, _ := transport.NewHTTPTransport()
        tracer1, _ := apm.NewTracerOptions(apm.TracerOptions{
                ServiceName: "svc1",
                Transport:   transport1,
        })
        defer tracer1.Flush(nil)

        // Second tracer reports to the second APM Server (port 49162).
        os.Setenv("ELASTIC_APM_SERVER_URL", "http://localhost:49162")
        transport2, _ := transport.NewHTTPTransport()
        tracer2, _ := apm.NewTracerOptions(apm.TracerOptions{
                ServiceName: "svc2",
                Transport:   transport2,
        })
        defer tracer2.Flush(nil)

        // Generate 500 distributed traces, each split across the two
        // services/servers: tx2 is a child of tx1's span, so both
        // tail-sampling servers participate in the same trace.
        for i := 0; i < 500; i++ {
                tx1 := tracer1.StartTransaction("tx1", "type")
                span := tx1.StartSpan("span", "type", nil)
                tx2 := tracer2.StartTransactionOptions("tx2", "type", apm.TransactionOptions{
                        TraceContext: span.TraceContext(),
                })
                tx2.End()
                span.End()
                tx1.End()
        }
}
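
(To make the search-rate observation in step 4 less dependent on the stack monitoring UI, a small helper can poll Elasticsearch's index-stats API and print the query rate directly. This is a minimal sketch under assumptions: it expects Elasticsearch at http://localhost:9200 without authentication, so adjust the URL and credentials for the local docker-compose stack. The _stats/search endpoint and the _all.total.search.query_total field are standard Elasticsearch index-stats output.)

package main

import (
        "encoding/json"
        "fmt"
        "net/http"
        "time"
)

// queryTotal returns the cluster-wide cumulative count of search queries,
// taken from GET /_stats/search (_all.total.search.query_total).
func queryTotal() (int64, error) {
        resp, err := http.Get("http://localhost:9200/_stats/search") // assumed local, unauthenticated Elasticsearch
        if err != nil {
                return 0, err
        }
        defer resp.Body.Close()
        var body struct {
                All struct {
                        Total struct {
                                Search struct {
                                        QueryTotal int64 `json:"query_total"`
                                } `json:"search"`
                        } `json:"total"`
                } `json:"_all"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
                return 0, err
        }
        return body.All.Total.Search.QueryTotal, nil
}

func main() {
        prev, err := queryTotal()
        if err != nil {
                fmt.Println("stats request failed:", err)
                return
        }
        // Print the number of search queries executed in each 10s window.
        // Per the observation above, after the reproduction the rate stayed
        // elevated rather than dropping back to the pre-test baseline.
        for range time.Tick(10 * time.Second) {
                cur, err := queryTotal()
                if err != nil {
                        fmt.Println("stats request failed:", err)
                        continue
                }
                fmt.Printf("search queries in the last 10s: %d\n", cur-prev)
                prev = cur
        }
}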

simitt (Contributor) commented Feb 4, 2022

@axw assigned you since you started looking into it.
