Defining apm-server.sampling.tail.interval causes throughput to cease #6638

Closed
Tracked by #6894
bryce-b opened this issue Nov 16, 2021 · 2 comments
bryce-b (Contributor) commented Nov 16, 2021

APM Server version (apm-server version): 7.16.0-SNAPSHOT


Description of the problem including expected versus actual behavior:
The issue appears when tail.interval is set in the configuration YAML. The documentation recommends using a flush interval no greater than half the duration of tail.ttl. Following that guidance, or setting tail.interval to any other value, causes all APM Server throughput to cease.

Steps to reproduce:

  1. Run APM Server on cloud.
  2. Send data with apm-integration-testing.
  3. Apply the following configuration:
apm-server:
  data_streams:
    enabled: true
  sampling:
    keep_unsampled: false
    tail:
      enabled: true
      ttl: 30s
      interval: 15s
      policies: 
        - sample_rate: 0.1 
@bryce-b bryce-b added the bug label Nov 16, 2021
@simitt simitt mentioned this issue Dec 17, 2021
@simitt simitt added this to the 8.1 milestone Dec 17, 2021
@simitt simitt changed the title from "Tail Based Sampling: defining apm-server.sampling.tail.interval causes throughput to cease" to "Defining apm-server.sampling.tail.interval causes throughput to cease" Dec 17, 2021
@stuartnelson3 stuartnelson3 self-assigned this Jan 11, 2022
stuartnelson3 (Contributor) commented:

Using the latest 8.0 snapshots of Kibana and Elasticsearch plus apm-server (6a45a89), I was able to ingest events using the config provided in the issue description. I sent 1000 events and confirmed that 100 events (corresponding to sample_rate: 0.1) were present in traces-apm.sampled-default.
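
For anyone retracing that check, a rough way to confirm the count is an _count request against the sampled data stream; the local URL and credentials below are assumptions, not something from the original thread:

# assumes Elasticsearch is reachable locally with test credentials
curl -s -u elastic:changeme "http://localhost:9200/traces-apm.sampled-default/_count"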

@bryce-b do you remember which opbeans you used? Or, do you still have the command line invocation that started apm-integration-testing?

program used:

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/gorilla/mux"
	"go.elastic.co/apm"
	"go.elastic.co/apm/module/apmgorilla"
)

// helloHandler responds with a greeting built from the {name} path variable.
func helloHandler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, "Hello, %s!\n", mux.Vars(req)["name"])
}

func main() {
	port := flag.Int("p", 8000, "port to listen on")
	flag.Parse()

	// Create an APM tracer for the example service.
	tracer, err := apm.NewTracer("example-app", "abc123")
	if err != nil {
		log.Fatal(err)
	}

	// Instrument the router so every request produces a transaction.
	r := mux.NewRouter()
	r.HandleFunc("/hello/{name}", helloHandler)
	r.Use(apmgorilla.Middleware(apmgorilla.WithTracer(tracer)))

	p := fmt.Sprintf(":%d", *port)
	log.Println("listening on port", p)
	log.Fatal(http.ListenAndServe(p, r))
}
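
To drive traffic through the example app, a simple loop like the following would do; the hostname, path parameter, and request count are illustrative, not taken from the thread:

# send 1000 requests to the app's default port so the agent reports 1000 transactions
for i in $(seq 1 1000); do curl -s http://localhost:8000/hello/world > /dev/null; done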

axw (Member) commented Jan 18, 2022

I've also given it a shot with apm-integration-testing, using ./scripts/compose.py start 8.1.0 --with-opbeans-python. I modified docker-compose.yml with the config specified in the description (excluding data_streams & keep_unsampled, which are now the defaults). I ran that for a while, and then changed sample_rate to 0.5 and ran that for a while.
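
For context, a sketch of what that override could look like expressed as apm-server command-line flags (the -E settings mirror the issue's YAML config; how exactly apm-integration-testing wires this into docker-compose.yml is an assumption on my part):

# hypothetical apm-server invocation equivalent to the tail sampling config above
apm-server -e \
  -E apm-server.sampling.tail.enabled=true \
  -E apm-server.sampling.tail.ttl=30s \
  -E apm-server.sampling.tail.interval=15s \
  -E 'apm-server.sampling.tail.policies=[{"sample_rate": 0.1}]'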

Here's a screenshot of the number of sampled transaction docs in Discover.

[screenshot: sampled transaction document counts in Discover]

With sample_rate=0.1, the number of docs is approximately 10% of the original. With sample_rate=0.5, it's approximately 50%.

Jumping over to the APM app, we can see the throughput is fairly steady regardless of the sampling rate:

[screenshot: APM app throughput chart across both sampling rates]

There's a drop in the throughput chart at the end, because the final (i.e. current) bucket is incomplete.

All seems to be working as expected. Seeing as neither @stuartnelson3 nor I could reproduce it, I'm going to close this.

@bryce-b if you are still able to reproduce the issue, or provide more details that can enable us to do so, please reopen.

@axw axw closed this as completed Jan 18, 2022