Stop sending single-event metrics if max groups limit is exceeded #9648
Comments
Note to the implementer: due to the interplay of transaction and service metric limits, it probably makes sense to combine the txmetrics and servicemetrics code into a single processor that emits both transaction and service metrics.
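As a rough sketch of that idea (the `BatchProcessor` interface, `Event` type, and aggregator fields below are hypothetical stand-ins, not the actual apm-server model types), a combined processor could aggregate both metric families in one pass over each batch:

```go
package metrics

import "context"

// Hypothetical stand-ins for the real model types and for the existing
// txmetrics/servicemetrics aggregators; names are illustrative only.
type Event struct {
	ServiceName     string
	TransactionName string
}

type Batch []Event

type BatchProcessor interface {
	ProcessBatch(ctx context.Context, b *Batch) error
}

// combinedAggregator accumulates both transaction and service metrics
// from a single pass over each batch, as suggested in the note above.
type combinedAggregator struct {
	txGroups  map[string]int64 // keyed by service + transaction name
	svcGroups map[string]int64 // keyed by service name
}

func (a *combinedAggregator) ProcessBatch(ctx context.Context, b *Batch) error {
	for _, e := range *b {
		a.txGroups[e.ServiceName+"|"+e.TransactionName]++
		a.svcGroups[e.ServiceName]++
	}
	return nil
}
```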
@felixbarny @axw Do we need to accurately track the cardinality metrics for the overflow buckets, or is an approximate count acceptable? (Adding a bit more detail) Since we are aggregating overflow data into the "other" buckets, the number of unique names grouped there can become very large, and tracking an exact count may be expensive.
@lahsivjar I think it would be reasonable for it to be probabilistic. Would HyperLogLog++ be a suitable choice here? That is what Elasticsearch's cardinality aggregation is based upon.
HyperLogLog++ looks like a good choice. @axw Do we have a preferred Go implementation for this algorithm? If not, then based on my brief read of the logic, I think it should not be too difficult to implement. UPDATE: I can also find a few promising open-source implementations.
@lahsivjar if there's something preexisting that is suitable (and which we can enhance if needed), I think that would be ideal.
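For illustration, the overflow bucket's unique-count metric could be backed by an HLL sketch along these lines; this assumes github.com/axiomhq/hyperloglog as the open-source implementation, and the surrounding `overflowGroup` type is a hypothetical stand-in:

```go
package metrics

import "github.com/axiomhq/hyperloglog"

// overflowGroup accumulates metrics for a per-service "other" bucket and
// tracks the approximate number of distinct transaction names folded into it.
type overflowGroup struct {
	count       int64               // number of overflowed transactions
	uniqueNames *hyperloglog.Sketch // probabilistic cardinality of transaction names
}

func newOverflowGroup() *overflowGroup {
	return &overflowGroup{uniqueNames: hyperloglog.New14()}
}

// add records one transaction that overflowed into this bucket.
func (g *overflowGroup) add(txName string) {
	g.count++
	g.uniqueNames.Insert([]byte(txName))
}

// cardinality returns the estimated number of distinct transaction names
// aggregated into this bucket.
func (g *overflowGroup) cardinality() uint64 {
	return g.uniqueNames.Estimate()
}
```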
How to test these changes

Scenario 1: Per service transaction limit overflow:
Scenario 2: Max transaction group overflow:

Testing this requires simulating at least 11 services for the max transaction group limit to overflow, while keeping the values such that the per-service transaction group limit is not breached. One way to test would be to use a bash script to generate random service names and run a program with each service name to generate and send traces. The number of services and the number of transactions should be constrained accordingly.

Sample bash script for 1GB server:

#!/bin/bash
export ELASTIC_APM_SECRET_TOKEN=<token>
export ELASTIC_APM_SERVER_URL=<url>
export ELASTIC_APM_LOG_FILE=stderr
for i in {1..13}
do
ELASTIC_APM_SERVICE_NAME="random$i" go run main.go &
done
wait

Sample load generator for 1GB server:

package main
import (
"fmt"
"time"
"go.elastic.co/apm/v2"
)
func main() {
tracer := apm.DefaultTracer()
for i := 400; i >= 0; i-- {
once(tracer, fmt.Sprintf("test%d", i))
time.Sleep(time.Millisecond)
}
tracer.Flush(nil)
}
func once(tracer *apm.Tracer, name string) {
tx := tracer.StartTransaction(name, "type")
defer tx.End()
span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
time.Sleep(time.Millisecond * 1)
span.Outcome = "success"
span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
Resource: "dest_resource",
})
span.End()
}

After sending the load, assert that metric documents are published with the overflow "other" transaction group.

Scenario 3: Max services limit reached:

Testing this scenario is a bit tricky on cloud since it will require about 1000 services for the limit to be breached for a 1GB server. The easiest way would be to use the config
Found an issue where the per service txn group limit is not reset after publish: #10349
Testing notes (transaction metrics)

Loadgen script

Sample load generator for 1GB server:

package main
import (
"fmt"
"os"
"strconv"
"time"
"go.elastic.co/apm/v2"
)
func main() {
tracer := apm.DefaultTracer()
g, err := strconv.Atoi(os.Getenv("TXGROUPS"))
if err != nil {
panic(err)
}
for i := g; i >= 1; i-- {
once(tracer, fmt.Sprintf("test%d", i))
time.Sleep(time.Millisecond)
}
tracer.Flush(nil)
fmt.Println("ok finished publishing ", g)
}
func once(tracer *apm.Tracer, name string) {
tx := tracer.StartTransaction(name, "type")
defer tx.End()
span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
time.Sleep(time.Millisecond * 1)
span.Outcome = "success"
span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
Resource: "dest_resource",
})
span.End()
}

How to test these changes
Scenario 1: Per service transaction limit overflow:
Bash script to call load generator:

#!/bin/bash
export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr
ELASTIC_APM_SERVICE_NAME="fixed" TXGROUPS="600" ./main &
wait

Scenario 2: Max transaction group overflow:
Bash script to call load generator:

#!/bin/bash
export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr
for i in {1..13}
do
ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="400" ./main &
done
wait

Scenario 3: Max services limit reached:
Bash script to call load generator:

#!/bin/bash
export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr
for i in {1..2000}
do
ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="1" ./main &
done
wait
Testing notes (service transaction metrics)

Loadgen script

Sample load generator for 1GB server:

package main
import (
"fmt"
"os"
"strconv"
"time"
"go.elastic.co/apm/v2"
)
func main() {
tracer := apm.DefaultTracer()
g, err := strconv.Atoi(os.Getenv("TXTYPES"))
if err != nil {
panic(err)
}
for i := g; i >= 1; i-- {
once(tracer, fmt.Sprintf("type%d", i))
time.Sleep(time.Millisecond)
}
tracer.Flush(nil)
fmt.Println("ok finished publishing ", g)
}
func once(tracer *apm.Tracer, name string) {
tx := tracer.StartTransaction("txname", name)
defer tx.End()
span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
time.Sleep(time.Millisecond * 1)
span.Outcome = "success"
span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
Resource: "dest_resource",
})
span.End()
}

How to test these changes
Scenario 1: Max groups limit reached:
Bash script to call load generator:

#!/bin/bash
export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr
ELASTIC_APM_SERVICE_NAME="fixed" TXTYPES="2000" ./main &
wait
Context
To prevent OOMs, we place grouping cardinality limits on metrics aggregation. By default we allow at most 10000 groups of `transaction`, `service`, and `service_destination` metrics per reporting interval (1 minute). After each report, the groups are reset. Once the limits are reached, we start emitting single-event metrics.

Producing these single-event metrics means that, in the face of misconfigured or misbehaving agents that produce high-cardinality transaction names (or other dimensions), the ingest performance, query performance, and storage cost are unbounded. The impact on ingest performance will be exacerbated as we move to TSDS, which improves query performance but at the cost of ingest performance.
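As a rough illustration of the current behaviour described above (not the actual apm-server code), the aggregator effectively does something like this today, with the group map cleared at every one-minute reporting interval:

```go
package metrics

// maxGroups is the per-interval limit on distinct metric groups
// (10000 by default).
const maxGroups = 10000

type aggregator struct {
	groups map[string]int64 // aggregated counts keyed by grouping dimensions
}

func newAggregator() *aggregator {
	return &aggregator{groups: make(map[string]int64)}
}

// record returns false when the event could not be aggregated because the
// group limit was reached; today the caller then emits a single-event
// metric document for it, which is what this issue proposes to stop doing.
func (a *aggregator) record(key string) bool {
	if _, ok := a.groups[key]; !ok && len(a.groups) >= maxGroups {
		return false
	}
	a.groups[key]++
	return true
}

// reset is called after each reporting interval (1 minute), once the
// aggregated groups have been published.
func (a *aggregator) reset() {
	a.groups = make(map[string]int64)
}
```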
Proposed change
We will stop emitting single-event metrics to set some bounds on performance and storage cost. At the same time, we must ensure that users who do have high cardinality transactions/etc. do not have a worse experience than today. To that end we will ensure that users can observe that limits have been reached, and that the UI can inform users of how they might address this (e.g. fix instrumentation/agent configuration, or scale up APM Server).
Initially we will address transaction and service metrics.
We will have the following limits:

- Transaction groups per service limit
- Total transaction groups limit (`apm-server.aggregation.transaction.max_groups`)
- Service metrics limit

Transaction groups per service limit
Some services (e.g. RUM) may produce high cardinality transaction names, while others are well behaved. To prevent one misbehaving service from eating up the limit for all other services, we'll impose a per-service transaction group limit.
Once the limit has been reached, we will aggregate transactions in a dedicated "other" transaction group per service, e.g. `transaction.name: "other"`. This special metricset has an additional metric that tracks the unique count (aka cardinality) of all transaction names that were grouped in this bucket.

Total transaction groups limit
When the `apm-server.aggregation.transaction.max_groups` limit is reached, we will increment the "other" transaction name bucket of the corresponding services. We'll ensure that even if the total transaction groups limit is reached, we can collect the "other" transaction name until the service metrics limit is reached.
This ensures that even if a small number of services use up all the aggregation buckets (10 services with 1000 transaction groups can exhaust the default limit on a small instance), we can still track the metrics that are needed for the service overview page.
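For concreteness, a minimal sketch of the proposed transaction overflow behaviour under the two limits described above; the constants, keys, and "other" handling are illustrative assumptions, not the final implementation (the separate service metrics limit is omitted):

```go
package metrics

// Illustrative limits; the real values derive from available memory and
// apm-server.aggregation.transaction.max_groups.
const (
	maxTotalGroups      = 10000
	maxGroupsPerService = maxTotalGroups / 10 // 10% of the total limit
)

type txAggregator struct {
	groups          map[string]map[string]int64 // service -> transaction name -> count
	totalGroupCount int
}

func newTxAggregator() *txAggregator {
	return &txAggregator{groups: make(map[string]map[string]int64)}
}

// record aggregates one transaction. When either the per-service or the
// total group limit is hit, the transaction is folded into the service's
// dedicated "other" group instead of being emitted as a single-event metric.
func (a *txAggregator) record(service, txName string) {
	svc, ok := a.groups[service]
	if !ok {
		svc = make(map[string]int64)
		a.groups[service] = svc
	}
	if _, exists := svc[txName]; !exists {
		perServiceFull := len(svc) >= maxGroupsPerService
		totalFull := a.totalGroupCount >= maxTotalGroups
		if perServiceFull || totalFull {
			// Overflow: count it under the per-service "other" bucket, which
			// would also track the cardinality of overflowed names (e.g. via
			// HyperLogLog++, omitted here for brevity).
			txName = "other"
		} else {
			a.totalGroupCount++
		}
	}
	svc[txName]++
}
```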
Service metrics limit
We'll introduce a dedicated limit for the number of services for which we're collecting service metrics. Once the service limit has been reached, we will aggregate service metrics under an "other" service name. The metrics for this special "other" service name have an additional metric that tracks the unique count (aka cardinality) of all service names that were grouped in this bucket.
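Analogously, a sketch of how the service-level overflow could behave; the `maxServices` constant and field names are assumptions, and in practice the unique service-name count would likely use a probabilistic sketch (e.g. HyperLogLog++) rather than the plain set used here to keep the example self-contained:

```go
package metrics

// Illustrative limit on the number of services tracked for service metrics.
const maxServices = 10000

type serviceAggregator struct {
	services             map[string]int64    // service name -> aggregated event count
	overflowServiceNames map[string]struct{} // stand-in for an HLL sketch
	overflowCount        int64               // events aggregated under the "other" service
}

func newServiceAggregator() *serviceAggregator {
	return &serviceAggregator{
		services:             make(map[string]int64),
		overflowServiceNames: make(map[string]struct{}),
	}
}

// record aggregates an event for a service, folding it into the "other"
// service once the service limit is reached while still tracking how many
// distinct service names were grouped there.
func (a *serviceAggregator) record(service string) {
	if _, ok := a.services[service]; ok || len(a.services) < maxServices {
		a.services[service]++
		return
	}
	a.overflowServiceNames[service] = struct{}{}
	a.overflowCount++
}
```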
Configuration
The transaction group limit is currently configurable in standalone APM Server, and defaults to 10000; it is not configurable for the integration. Similarly, the service limit defaults to 10000. We will maintain these going forward, but will change the defaults to be based on available memory.
The transaction groups per service limit will be 10% of the transaction group limit.