Defining apm-server.sampling.tail.interval causes throughput to cease #6638

Closed
Tracked by #6894
bryce-b opened this issue Nov 16, 2021 · 2 comments
bryce-b (Contributor) commented Nov 16, 2021

APM Server version (apm-server version): 7.16.0-SNAPSHOT


Description of the problem including expected versus actual behavior:
The issue appears when tail.interval is set in the configuration YAML. The documentation recommends using a flush interval no greater than half the duration of tail.ttl. Following that guidance, or setting tail.interval to any other value, causes all APM Server throughput to cease.

Steps to reproduce:

  1. Run APM Server on cloud.
  2. Send data with apm-integration-testing.
  3. Apply the following configuration:
apm-server:
  data_streams:
    enabled: true
  sampling:
    keep_unsampled: false
    tail:
      enabled: true
      ttl: 30s
      interval: 15s
      policies: 
        - sample_rate: 0.1 
@bryce-b bryce-b added the bug label Nov 16, 2021
@simitt simitt mentioned this issue Dec 17, 2021
@simitt simitt added this to the 8.1 milestone Dec 17, 2021
@simitt simitt changed the title from "Tail Based Sampling: defining apm-server.sampling.tail.interval causes throughput to cease" to "Defining apm-server.sampling.tail.interval causes throughput to cease" Dec 17, 2021
@stuartnelson3 stuartnelson3 self-assigned this Jan 11, 2022
stuartnelson3 (Contributor) commented:

Using the latest 8.0 snapshots of Kibana and Elasticsearch plus apm-server (6a45a89), I was able to ingest events using the config provided in the issue description. I sent 1000 events and confirmed that 100 events (corresponding to sample_rate: 0.1) were present in traces-apm.sampled-default.
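
For anyone retracing that check, a rough way to confirm the count is an _count request against the sampled data stream; the local URL and credentials below are assumptions, not something from the original thread:

# assumes Elasticsearch is reachable locally with test credentials
curl -s -u elastic:changeme "http://localhost:9200/traces-apm.sampled-default/_count"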

@bryce-b do you remember which opbeans you used? Or, do you still have the command line invocation that started apm-integration-testing?

program used:

package main

import (
	"flag"
	"fmt"
	"log"
	"net/http"

	"github.com/gorilla/mux"
	"go.elastic.co/apm"
	"go.elastic.co/apm/module/apmgorilla"
)

// helloHandler responds with a greeting built from the {name} path variable.
func helloHandler(w http.ResponseWriter, req *http.Request) {
	fmt.Fprintf(w, "Hello, %s!\n", mux.Vars(req)["name"])
}

func main() {
	port := flag.Int("p", 8000, "port to listen on")
	flag.Parse()

	// Create an APM tracer for the example service.
	tracer, err := apm.NewTracer("example-app", "abc123")
	if err != nil {
		log.Fatal(err)
	}

	// Instrument the router so every request produces a transaction.
	r := mux.NewRouter()
	r.HandleFunc("/hello/{name}", helloHandler)
	r.Use(apmgorilla.Middleware(apmgorilla.WithTracer(tracer)))

	p := fmt.Sprintf(":%d", *port)
	log.Println("listening on port", p)
	log.Fatal(http.ListenAndServe(p, r))
}
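
To drive traffic through the example app, a simple loop like the following would do; the hostname, path parameter, and request count are illustrative, not taken from the thread:

# send 1000 requests to the app's default port so the agent reports 1000 transactions
for i in $(seq 1 1000); do curl -s http://localhost:8000/hello/world > /dev/null; done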

axw (Member) commented Jan 18, 2022

I've also given it a shot with apm-integration-testing, using ./scripts/compose.py start 8.1.0 --with-opbeans-python. I modified docker-compose.yml with the config specified in the description (excluding data_streams & keep_unsampled, which are now the defaults). I ran that for a while, and then changed sample_rate to 0.5 and ran that for a while.
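
For context, a sketch of what that override could look like expressed as apm-server command-line flags (the -E settings mirror the issue's YAML config; how exactly apm-integration-testing wires this into docker-compose.yml is an assumption on my part):

# hypothetical apm-server invocation equivalent to the tail sampling config above
apm-server -e \
  -E apm-server.sampling.tail.enabled=true \
  -E apm-server.sampling.tail.ttl=30s \
  -E apm-server.sampling.tail.interval=15s \
  -E 'apm-server.sampling.tail.policies=[{"sample_rate": 0.1}]'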

Here's a screenshot of the number of sampled transaction docs in Discover.

[screenshot: sampled transaction document counts in Discover]

With sample_rate=0.1, the number of docs is approximately 10% of the original. With sample_rate=0.5, it's approximately 50%.

Jumping over to the APM app, we can see the throughput is fairly steady regardless of the sampling rate:

[screenshot: APM app throughput chart across both sampling rates]

There's a drop in the throughput chart at the end, because the final (i.e. current) bucket is incomplete.

All seems to be working as expected. Seeing as neither @stuartnelson3 nor I could reproduce it, I'm going to close this.

@bryce-b if you are still able to reproduce the issue, or provide more details that can enable us to do so, please reopen.

@axw axw closed this as completed Jan 18, 2022