
Smart Agent receiver panics because of a regression in SignalFx exporter #2508

Closed
atoulme opened this issue Jan 23, 2023 · 2 comments

atoulme commented Jan 23, 2023

The Smart Agent Receiver uses the SignalFx exporter to send data points to our backend.
It does so by finding the instance of the exporter registered as part of the collector's configuration, here:

sfxExporter := getLoneSFxExporter(host, component.DataTypeMetrics)
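
For context, here is a minimal sketch of what such a lookup can look like against the collector's component.Host API as it stood around v0.69; the body below is illustrative only, not the receiver's actual implementation:

import (
	"fmt"

	"go.opentelemetry.io/collector/component"
)

// getLoneSFxExporter (illustrative): find the single SignalFx exporter of
// the given data type among the exporters the host has registered.
func getLoneSFxExporter(host component.Host, dataType component.DataType) (component.Component, error) {
	var found component.Component
	for id, exp := range host.GetExporters()[dataType] {
		if id.Type() == "signalfx" {
			if found != nil {
				return nil, fmt.Errorf("multiple signalfx exporters found for %s", dataType)
			}
			found = exp
		}
	}
	return found, nil
}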

However, we have since changed the way the SignalFx exporter functions; notably, it now obeys the Start/Shutdown lifecycle.
This change is effective in 0.69.0 with this PR: open-telemetry/opentelemetry-collector-contrib#16837

The PR changes how we initialize the HTTP clients the exporter uses to communicate with our backend, so that they better adhere to the collector's component lifecycle and so that we can configure them with the bundled confighttp.HTTPClientSettings config library.
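
To illustrate the lifecycle change, here is a hedged sketch (the struct and field names are assumptions, not the exporter's actual code) of client construction moving out of the factory and into Start, where the host is available:

import (
	"context"
	"net/http"

	"go.opentelemetry.io/collector/component"
	"go.opentelemetry.io/collector/config/confighttp"
)

type sfxExporter struct {
	clientSettings confighttp.HTTPClientSettings
	telemetry      component.TelemetrySettings
	client         *http.Client
}

// Start builds the HTTP client here rather than in the factory: ToClient
// needs the host, e.g. to resolve auth extensions referenced by the settings.
func (e *sfxExporter) Start(_ context.Context, host component.Host) error {
	client, err := e.clientSettings.ToClient(host, e.telemetry)
	if err != nil {
		return err
	}
	e.client = client
	return nil
}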

This means that getting the exporter and calling methods on it may panic if the exporter has not started.

This was reported here:
#2465 (comment)

The stack trace, with the preceding startup log line, is reproduced below:

{"line":"\[tgithub.com/signalfx/[email protected]/pkg/utils/time.go:46](https://tgithub.com/signalfx/[email protected]/pkg/utils/time.go:46) +0xb8","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"created by [github.com/signalfx/signalfx-agent/pkg/utils.RunOnInterval](https://github.com/signalfx/signalfx-agent/pkg/utils.RunOnInterval)","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"\[tgithub.com/signalfx/[email protected]/pkg/utils/time.go:47](https://tgithub.com/signalfx/[email protected]/pkg/utils/time.go:47) +0x63","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[github.com/signalfx/signalfx-agent/pkg/utils.RunOnInterval.func1()](https://github.com/signalfx/signalfx-agent/pkg/utils.RunOnInterval.func1())","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"\[tgithub.com/signalfx/[email protected]/pkg/monitors/ecs/ecs.go:111](https://tgithub.com/signalfx/[email protected]/pkg/monitors/ecs/ecs.go:111) +0x32b","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[github.com/signalfx/signalfx-agent/pkg/monitors](https://github.com/signalfx/signalfx-agent/pkg/monitors)/ecs.(*Monitor).Configure.func1()","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"\[tgithub.com/signalfx/[email protected]/pkg/monitors/ecs/ecs.go:211](https://tgithub.com/signalfx/[email protected]/pkg/monitors/ecs/ecs.go:211) +0x9c8","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[github.com/signalfx/signalfx-agent/pkg/monitors](https://github.com/signalfx/signalfx-agent/pkg/monitors)/ecs.(*Monitor).fetchStatsForAll(0xc0009cac00, {0x42?, 0x0?, 0x0?, 0x0?})","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"\[tgithub.com/signalfx/splunk-otel-collector/receiver/[email protected]/output.go:275](https://tgithub.com/signalfx/splunk-otel-collector/receiver/[email protected]/output.go:275) +0x46d","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[github.com/signalfx/splunk-otel-collector/receiver](https://github.com/signalfx/splunk-otel-collector/receiver)/smartagentreceiver.(*output).SendDimensionUpdate(0xc000f4bf40, 0xc0014155c0)","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"\[tgithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/exporter.go:61](https://tgithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/[email protected]/exporter.go:61) +0x1d","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[github.com/open-telemetry/opentelemetry-collector-contrib/exporter](https://github.com/open-telemetry/opentelemetry-collector-contrib/exporter)/signalfxexporter.(*signalfMetadataExporter).ConsumeMetadata(0xc000d7b4a8?, {0xc0000126f8?, 0x523ea66?, 0xc?})","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"goroutine 144 [running]:","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10a595d]","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"panic: runtime error: invalid memory address or nil pointer dereference","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}
{"line":"2023-01-23T20:14:01.610Z\tinfo\tservice/pipelines.go:109\tReceiver started.\t{\"kind\": \"receiver\", \"name\": \"smartagent/ecs-metadata\", \"pipeline\": \"metrics\"}","source":"stderr","tag":"splunk-otel-collector/633f510a267d"}

The panic originates specifically on this line:

return sme.pushMetadata(metadata)

pushMetadata here is a field of the struct, set when the struct is constructed in the factory (see https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/647ce4f91e0c010d7937ae481cc90eaceb870d65/exporter/signalfxexporter/factory.go#L153):

	return &signalfMetadataExporter{
		Metrics:      me,
		pushMetadata: exp.pushMetadata,
	}, nil

With the change to adopt Start/Shutdown, exp.pushMetadata is only set during Start. So an exporter that has not been started carries a nil field, which panics when we try to export data points: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/647ce4f91e0c010d7937ae481cc90eaceb870d65/exporter/signalfxexporter/exporter.go#L175
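
A minimal, self-contained reproduction of this failure mode (the type and method names below are illustrative, not the exporter's actual code):

package main

// metadataExporter mimics the shape of signalfMetadataExporter: a function
// field that is only assigned once Start has run.
type metadataExporter struct {
	pushMetadata func(updates []string) error
}

func (m *metadataExporter) ConsumeMetadata(updates []string) error {
	// If Start never ran, pushMetadata is nil and this call panics with
	// "runtime error: invalid memory address or nil pointer dereference".
	return m.pushMetadata(updates)
}

func main() {
	exp := &metadataExporter{} // constructed, but never started
	_ = exp.ConsumeMetadata([]string{"dimension-update"})
}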

This regression is present in the upstream collector. It should be relatively easy to create a unit test to reproduce.
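
For instance, something along these lines (a sketch only: it assumes the contrib factory and test helpers as they existed around v0.69, and the exact interface used to reach ConsumeMetadata is an assumption):

func TestConsumeMetadataBeforeStart(t *testing.T) {
	factory := signalfxexporter.NewFactory()
	cfg := factory.CreateDefaultConfig()
	// A real test would first populate required fields (e.g. realm and
	// access token) so that exporter creation succeeds.

	exp, err := factory.CreateMetricsExporter(
		context.Background(), exportertest.NewNopCreateSettings(), cfg)
	require.NoError(t, err)

	// Deliberately skip exp.Start(...), mimicking the smartagent receiver
	// grabbing the exporter before the collector has started it.
	me, ok := exp.(metadata.MetadataExporter) // assumed interface name
	require.True(t, ok)
	require.NotPanics(t, func() {
		_ = me.ConsumeMetadata(nil)
	})
}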


atoulme commented Feb 1, 2023

This is fixed in 0.70.0. @jamie1911 please try out this new release and let us know if you run into any issues, thanks!

atoulme closed this as completed Feb 1, 2023
jamie1911 commented

> This is fixed in 0.70.0. @jamie1911 please try out this new release and let us know if you run into any issues, thanks!

Thanks, I set the task definition back to latest a few days ago and it was able to start monitoring again!
