Stop sending single-event metrics if max groups limit is exceeded #9648

Closed
axw opened this issue Nov 23, 2022 · 9 comments · Fixed by #9911

Comments


axw commented Nov 23, 2022

Context

To prevent OOMs, we place grouping cardinality limits on metrics aggregation. By default we allow at most 10000 groups of transaction, service, and service_destination metrics per reporting interval (1 minute). After each report, the groups are reset. Once the limits are reached, we start emitting single-event metrics.

Producing these single-event metrics means that, in the face of misconfigured or misbehaving agents that produce high-cardinality transaction names (or other dimensions), the ingest performance, query performance, and storage cost are unbounded. The impact on ingest performance will be exacerbated as we move to TSDS, which improves query performance but at the cost of ingest performance.

Proposed change

We will stop emitting single-event metrics to set some bounds on performance and storage cost. At the same time, we must ensure that users who do have high cardinality transactions/etc. do not have a worse experience than today. To that end we will ensure that users can observe that limits have been reached, and that the UI can inform users of how they might address this (e.g. fix instrumentation/agent configuration, or scale up APM Server).

Initially we will address transaction and service metrics.

We will have the following limits:

  1. max total transaction groups (apm-server.aggregation.transaction.max_groups)
  2. max transaction groups per service to avoid a single misbehaving service consuming all the buckets
  3. max services

Transaction groups per service limit

Some services (e.g. RUM) may produce high-cardinality transaction names, while others are well behaved. To prevent one misbehaving service from eating up the limit for all other services, we'll impose a per-service transaction group limit.

Once the limit has been reached, we will aggregate transactions in a dedicated "other" transaction group per service, e.g. transaction.name: "other". This special metricset has an additional metric that tracks the unique count (aka cardinality) of all transaction names that were grouped in this bucket.

Total transaction groups limit

When the apm-server.aggregation.transaction.max_groups limit is reached, we will increment the "other" transaction name bucket of the corresponding services. We'll ensure that even if the total transaction groups limit is reached, we can collect the "other" transaction name until the service metrics limit is reached.

This ensures that even if a small number of services use up all the aggregation buckets (10 services with 1000 transaction groups can exhaust the default limit on a small instance), we can still track the metrics that are needed for the service overview page.

Service metrics limit

We'll introduce a dedicated limit for the number of services for which we're collecting service metrics. Once the service limit has been reached, we will aggregate service metrics in an "other" service name. The service metrics for this special "other" service name has an additional metric that tracks the unique count (aka cardinality) of all service names that were grouped in this bucket.

Configuration

The transaction group limit is currently configurable in standalone APM Server, and defaults to 10000; it is not configurable for the integration. Similarly, the service limit defaults to 10000. We will maintain these going forward, but will change the defaults to be based on available memory.

The transaction groups per service limit will be 10% of the transaction group limit.

| Memory | tx groups | tx groups per service | services |
|--------|-----------|-----------------------|----------|
| 1GB    | 10000     | 1000                  | 10000    |
| 8GB    | 80000     | 8000                  | 80000    |
| 32GB   | 320000    | 32000                 | 320000   |
@axw axw added the v8.7.0 label Nov 23, 2022

axw commented Nov 23, 2022

Note to the implementer: due to the interplay of transaction and service metric limits, it probably makes sense to combine the txmetrics and servicemetrics code into a single processor that emits both transaction and service metrics.

@axw axw added this to the 8.7 milestone Nov 23, 2022

lahsivjar commented Dec 19, 2022

> This special metricset has an additional metric that tracks the unique count (aka cardinality) of all transaction names that were grouped in this bucket.
>
> The service metrics for this special "other" service name has an additional metric that tracks the unique count (aka cardinality) of all service names that were grouped in this bucket.

@felixbarny @axw Do we need to accurately track the cardinality metrics for other buckets? I am thinking we can use a probabilistic approach (like bloom filters) if we don't need to be very accurate.

(Adding a bit more detail) Since we are aggregating overflow data into the other bucket we would need some way to answer if we have seen a particular service.name or transaction.name in the past. If we can work with lower accuracy we can save some memory.


axw commented Dec 19, 2022

@lahsivjar I think it would be reasonable for it to be probabilistic. Would HyperLogLog++ be a suitable choice here? That is what Elasticsearch's cardinality aggregation is based upon.


lahsivjar commented Dec 19, 2022

> Would HyperLogLog++ be a suitable choice here?

HyperLogLog++ looks like a good choice. @axw Do we have a preferred Go implementation of this algorithm? If not, based on my brief read of the logic, I think it should not be too difficult to implement.

UPDATE: I can also find a few promising open-source implementations.


axw commented Dec 20, 2022

> UPDATE: I can also find a few promising open-source implementations.

@lahsivjar if there's something preexisting that is suitable (and which we enhance if needed), I think that would be ideal.


lahsivjar commented Feb 9, 2023

How to test these changes

Scenario 1: Per service transaction limit overflow:

  1. Run APM-Server with a known memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxTransactionGroups and MaxServices. The logs are in the format `Transactions.MaxTransactionGroups set to %d based on %0.1fgb of memory` and `Transactions.MaxServices set to %d based on %0.1fgb of memory` respectively. Assert that these two limits are approximately equal to gb_available*5_000.
  3. Send a lot of transactions with 1 service name such that the number of transaction groups for a period of 1 minute is > 10% of MaxTransactionGroups.
  4. Assert that metric documents are published with transaction.name: _other.

Scenario 2: Max transaction group overflow:

Testing this requires simulating at least 11 services so that the max transaction group limit overflows, while keeping the values such that the per-service transaction group limit is not breached. One way to test is to use a bash script that generates random service names and runs a program with each service name to generate and send traces. The number of services and the number of transactions should be constrained with #_of_services * #_of_transactions > max_transaction_groups && #_of_transactions < 10% of max_transaction_groups.

Sample bash script for 1GB server

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<token>
export ELASTIC_APM_SERVER_URL=<url>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..13}
do
    ELASTIC_APM_SERVICE_NAME="random$i" go run main.go &
done

wait
```
Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	for i := 400; i >= 0; i-- {
		once(tracer, fmt.Sprintf("test%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction(name, "type")
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

After sending the load assert that metric documents are published with transaction.name: _other.

Scenario 3: Max services limit reached:

Testing this scenario is a bit tricky on cloud since it will require about 1000 services for the limit to be breached for a 1GB server. The easiest way would be to use the config max_services exposed in standalone versions:

  1. Update the config file, setting aggregation.transactions.max_services to 1, and run APM-Server.
  2. Send any number of transactions with 2 different services.
  3. Assert that metric documents are published with transaction.name: _other and service.name: _other.

@lahsivjar

Found an issue where the per service txn group limit is not reset after publish: #10349


carsonip commented Mar 28, 2023

Testing notes (transaction metrics)

Loadgen script

Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	g, err := strconv.Atoi(os.Getenv("TXGROUPS"))
	if err != nil {
		panic(err)
	}
	for i := g; i >= 1; i-- {
		once(tracer, fmt.Sprintf("test%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
	fmt.Println("ok finished publishing ", g)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction(name, "type")
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

How to test these changes

  1. Run APM-Server with a known memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxTransactionGroups and MaxServices. The logs are in the format `creating transaction metrics aggregation with config: {MaxTransactionGroups:5000 MaxServices:1000 HDRHistogramSignificantFigures:2}`. Assert that MaxTransactionGroups is equal to gb_available*5_000 and MaxServices is gb_available*1_000.
  3. For example, with a 1GB server, MaxTransactionGroups is 5000 and MaxServices is 1000. The per-service max transaction groups limit is hardcoded as 10% of max transaction groups, i.e. 500.

Scenario 1: Per service transaction limit overflow:

  1. Send transactions with 1 service name and 600 different tx names, such that 600 > 500 (tx group limit per service).
  2. Assert that metric documents are published with transaction.name: _other.
  3. ✔️ (In interval=1m index) Confirm that there are 501 hits and record count (sum of doc_count) = 600. 1 document with transaction.name: _other with doc_count=100. 500 documents with different transaction names.
  4. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

ELASTIC_APM_SERVICE_NAME="fixed" TXGROUPS="600" ./main &

wait
```

Scenario 2: Max transaction group overflow:

  1. Test with 13 services, each with 400 tx groups, such that 13 * 400 = 5200 > 5000 (max transaction groups).
  2. ✔️ (In interval=1m index) Confirm that "Count of records (sum of doc_count) = 5200"
  3. ✔️ There are 5 hits of transaction.name="_other" under 5 different services with doc_count summing up to 200.
  4. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..13}
do
    ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="400" ./main &
done

wait
```

Scenario 3: Max services limit reached:

  1. Test with 2000 services, each with 1 tx group.
  2. ✔️ (In interval=1m index) Confirm that there are total 1001 hits and record count 2000. 1 metric document with count 1000 is published with transaction.name: _other and service.name: _other and 1000 documents with count 1 with different service names.
  3. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..2000}
do
    ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="1" ./main &
done

wait
```

@carsonip carsonip self-assigned this Mar 28, 2023
@carsonip

Testing notes (service transaction metrics)

Loadgen script

Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	g, err := strconv.Atoi(os.Getenv("TXTYPES"))
	if err != nil {
		panic(err)
	}
	for i := g; i >= 1; i-- {
		once(tracer, fmt.Sprintf("type%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
	fmt.Println("ok finished publishing ", g)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction("txname", name)
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

How to test these changes

  1. Run APM-Server with 1GB memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxGroups. The logs are in the format `creating service transaction metrics aggregation with config: {MaxGroups:1000 HDRHistogramSignificantFigures:2}`. Assert that MaxGroups is gb_available*1_000.
  3. ✔️ Confirm that MaxGroups is 1000.

Scenario 1: Max groups limit reached:

  1. Test with 2000 different tx types.
  2. ✔️ (In interval=1m index) Confirm that there are total 1001 hits and record count 2000. 1 metric document with count 1000 is published with service.name: _other and 1000 documents with count 1 with different tx types.
  3. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

ELASTIC_APM_SERVICE_NAME="fixed" TXTYPES="2000" ./main &

wait
```
