Stop sending single-event metrics if max groups limit is exceeded #9648

Closed
axw opened this issue Nov 23, 2022 · 9 comments · Fixed by #9911

Comments


axw commented Nov 23, 2022

Context

To prevent OOMs, we place grouping cardinality limits on metrics aggregation. By default we allow at most 10000 groups of transaction, service, and service_destination metrics per reporting interval (1 minute). After each report, the groups are reset. Once the limits are reached, we start emitting single-event metrics.

Producing these single-event metrics means that, in the face of misconfigured or misbehaving agents that produce high-cardinality transaction names (or other dimensions), the ingest performance, query performance, and storage cost are unbounded. The impact on ingest performance will be exacerbated as we move to TSDS, which improves query performance but at the cost of ingest performance.

Proposed change

We will stop emitting single-event metrics to set some bounds on performance and storage cost. At the same time, we must ensure that users who do have high cardinality transactions/etc. do not have a worse experience than today. To that end we will ensure that users can observe that limits have been reached, and that the UI can inform users of how they might address this (e.g. fix instrumentation/agent configuration, or scale up APM Server).

Initially we will address transaction and service metrics.

We will have the following limits:

  1. max total transaction groups (apm-server.aggregation.transaction.max_groups)
  2. max transaction groups per service to avoid a single misbehaving service consuming all the buckets
  3. max services

Transaction groups per service limit

Some services (e.g. RUM) may produce high-cardinality transaction names, while others are well behaved. To prevent one misbehaving service from eating up the limit for all other services, we'll impose a per-service transaction group limit.

Once the limit has been reached, we will aggregate transactions in a dedicated "other" transaction group per service, e.g. transaction.name: "other". This special metricset has an additional metric that tracks the unique count (aka cardinality) of all transaction names that were grouped in this bucket.

Total transaction groups limit

When the apm-server.aggregation.transaction.max_groups limit is reached, we will increment the "other" transaction name bucket of the corresponding services. We'll ensure that even if the total transaction groups limit is reached, we can collect the "other" transaction name until the service metrics limit is reached.

This ensures that even if a small number of services use up all the aggregation buckets (10 services with 1000 transaction groups can exhaust the default limit on a small instance), we can still track the metrics that are needed for the service overview page.

Service metrics limit

We'll introduce a dedicated limit for the number of services for which we're collecting service metrics. Once the service limit has been reached, we will aggregate service metrics in an "other" service name. The service metrics for this special "other" service name has an additional metric that tracks the unique count (aka cardinality) of all service names that were grouped in this bucket.

Configuration

The transaction group limit is currently configurable in standalone APM Server, and defaults to 10000; it is not configurable for the integration. Similarly, the service limit defaults to 10000. We will maintain these going forward, but will change the defaults to be based on available memory.

The transaction groups per service limit will be 10% of the transaction group limit.

| Memory | tx groups | tx groups per service | services |
|--------|-----------|-----------------------|----------|
| 1GB    | 10000     | 1000                  | 10000    |
| 8GB    | 80000     | 8000                  | 80000    |
| 32GB   | 320000    | 32000                 | 320000   |
@axw axw added the v8.7.0 label Nov 23, 2022

axw commented Nov 23, 2022

Note to the implementer: due to the interplay of transaction and service metric limits, it probably makes sense to combine the txmetrics and servicemetrics code into a single processor that emits both transaction and service metrics.

@axw axw added this to the 8.7 milestone Nov 23, 2022

lahsivjar commented Dec 19, 2022

> This special metricset has an additional metric that tracks the unique count (aka cardinality) of all transaction names that were grouped in this bucket.
>
> The service metrics for this special "other" service name has an additional metric that tracks the unique count (aka cardinality) of all service names that were grouped in this bucket.

@felixbarny @axw Do we need to accurately track the cardinality metrics for other buckets? I am thinking we can use a probabilistic approach (like bloom filters) if we don't need to be very accurate.

(Adding a bit more detail) Since we are aggregating overflow data into the other bucket we would need some way to answer if we have seen a particular service.name or transaction.name in the past. If we can work with lower accuracy we can save some memory.


axw commented Dec 19, 2022

@lahsivjar I think it would be reasonable for it to be probabilistic. Would HyperLogLog++ be a suitable choice here? That is what Elasticsearch's cardinality aggregation is based upon.


lahsivjar commented Dec 19, 2022

> Would HyperLogLog++ be a suitable choice here?

HyperLogLog++ looks like a good choice. @axw Do we have a preferred Go implementation of this algorithm? If not, based on my brief read of the logic, I think it should not be too difficult to implement.

UPDATE: I can also find a few promising open-source implementations.


axw commented Dec 20, 2022

> UPDATE: I can also find a few promising open-source implementations.

@lahsivjar if there's something preexisting that is suitable (and which we enhance if needed), I think that would be ideal.


lahsivjar commented Feb 9, 2023

How to test these changes

Scenario 1: Per service transaction limit overflow:

  1. Run APM-Server with a known memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxTransactionGroups and MaxServices. The logs are in the format `Transactions.MaxTransactionGroups set to %d based on %0.1fgb of memory` and `Transactions.MaxServices set to %d based on %0.1fgb of memory` respectively. Assert that these two limits are approximately equal to gb_available*5_000.
  3. Send a lot of transactions with 1 service name such that the number of transaction groups for a period of 1 minute is > 10% of MaxTransactionGroups.
  4. Assert that metric documents are published with transaction.name: _other.

Scenario 2: Max transaction group overflow:

Testing this requires simulating at least 11 services so that the max transaction group limit overflows, while keeping the values such that the per-service transaction group limit is not breached. One way to test is to use a bash script that generates random service names and runs a program with each service name to generate and send traces. The number of services and the number of transactions should be constrained with #_of_services * #_of_transactions > max_transaction_groups && #_of_transactions < 10% of max_transaction_groups.

Sample bash script for 1GB server

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<token>
export ELASTIC_APM_SERVER_URL=<url>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..13}
do
    ELASTIC_APM_SERVICE_NAME="random$i" go run main.go &
done

wait
```
Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	for i := 400; i >= 0; i-- {
		once(tracer, fmt.Sprintf("test%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction(name, "type")
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

After sending the load assert that metric documents are published with transaction.name: _other.

Scenario 3: Max services limit reached:

Testing this scenario is a bit tricky on cloud since it will require about 1000 services for the limit to be breached for a 1GB server. The easiest way would be to use the config max_services exposed in standalone versions:

  1. Update the config file, setting aggregation.transactions.max_services to 1, and run APM-Server.
  2. Send any number of transactions with 2 different services.
  3. Assert that metric documents are published with transaction.name: _other and service.name: _other.

@lahsivjar

Found an issue where the per service txn group limit is not reset after publish: #10349


carsonip commented Mar 28, 2023

Testing notes (transaction metrics)

Loadgen script

Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	g, err := strconv.Atoi(os.Getenv("TXGROUPS"))
	if err != nil {
		panic(err)
	}
	for i := g; i >= 1; i-- {
		once(tracer, fmt.Sprintf("test%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
	fmt.Println("ok finished publishing ", g)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction(name, "type")
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

How to test these changes

  1. Run APM-Server with a known memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxTransactionGroups and MaxServices. The logs are in the format `creating transaction metrics aggregation with config: {MaxTransactionGroups:5000 MaxServices:1000 HDRHistogramSignificantFigures:2}`. Assert that MaxTransactionGroups is equal to gb_available*5_000 and MaxServices is gb_available*1_000.
  3. For example, with a 1GB server, MaxTransactionGroups is 5000 and MaxServices is 1000. The per-service max transaction groups limit is hardcoded as 10% of max transaction groups, i.e. 500.

Scenario 1: Per service transaction limit overflow:

  1. Send transactions with 1 service name and 600 different tx names, such that 600 > 500 (tx group limit per service).
  2. Assert that metric documents are published with transaction.name: _other.
  3. ✔️ (In interval=1m index) Confirm that there are 501 hits and record count (sum of doc_count) = 600. 1 document with transaction.name: _other with doc_count=100. 500 documents with different transaction names.
  4. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

ELASTIC_APM_SERVICE_NAME="fixed" TXGROUPS="600" ./main &

wait
```

Scenario 2: Max transaction group overflow:

  1. Test with 13 services, each with 400 tx groups, such that 13 * 400 = 5200 > 5000 (max transaction groups).
  2. ✔️ (In interval=1m index) Confirm that "Count of records (sum of doc_count) = 5200"
  3. ✔️ There are 5 hits of transaction.name="_other" under 5 different services with doc_count summing up to 200.
  4. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..13}
do
    ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="400" ./main &
done

wait
```

Scenario 3: Max services limit reached:

  1. Test with 2000 services, each with 1 tx group.
  2. ✔️ (In interval=1m index) Confirm that there are total 1001 hits and record count 2000. 1 metric document with count 1000 is published with transaction.name: _other and service.name: _other and 1000 documents with count 1 with different service names.
  3. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

for i in {1..2000}
do
    ELASTIC_APM_SERVICE_NAME="random$i" TXGROUPS="1" ./main &
done

wait
```

@carsonip carsonip self-assigned this Mar 28, 2023
@carsonip

Testing notes (service transaction metrics)

Loadgen script

Sample load generator for 1GB server

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"

	"go.elastic.co/apm/v2"
)

func main() {
	tracer := apm.DefaultTracer()
	g, err := strconv.Atoi(os.Getenv("TXTYPES"))
	if err != nil {
		panic(err)
	}
	for i := g; i >= 1; i-- {
		once(tracer, fmt.Sprintf("type%d", i))
		time.Sleep(time.Millisecond)
	}
	tracer.Flush(nil)
	fmt.Println("ok finished publishing ", g)
}

func once(tracer *apm.Tracer, name string) {
	tx := tracer.StartTransaction("txname", name)
	defer tx.End()

	span := tx.StartSpanOptions(name, "type", apm.SpanOptions{})
	time.Sleep(time.Millisecond)

	span.Outcome = "success"
	span.Context.SetDestinationService(apm.DestinationServiceSpanContext{
		Resource: "dest_resource",
	})
	span.End()
}
```

How to test these changes

  1. Run APM-Server with 1GB memory limit.
  2. Check the logs of APM-Server to validate the configuration for MaxGroups. The logs are in the format `creating service transaction metrics aggregation with config: {MaxGroups:1000 HDRHistogramSignificantFigures:2}`. Assert that MaxGroups is gb_available*1_000.
  3. ✔️ Confirm that MaxGroups is 1000.

Scenario 1: Max groups limit reached:

  1. Test with 2000 different tx types.
  2. ✔️ (In interval=1m index) Confirm that there are total 1001 hits and record count 2000. 1 metric document with count 1000 is published with service.name: _other and 1000 documents with count 1 with different tx types.
  3. ✔️ ✔️ Above steps tested twice
Bash script to call load generator

```bash
#!/bin/bash

export ELASTIC_APM_SECRET_TOKEN=<fixme>
export ELASTIC_APM_SERVER_URL=<fixme>
export ELASTIC_APM_LOG_FILE=stderr

ELASTIC_APM_SERVICE_NAME="fixed" TXTYPES="2000" ./main &

wait
```
