Add rule_psr and limit_psr metrics to improve trace ingestion rate #45

Merged: 2 commits, Dec 11, 2021

@mrz (Contributor) commented Nov 5, 2021

No description provided.

@GregMefford (Member)

Hey! Thanks for taking the time to research this and make a pull request! ❤️
I started to look into these attributes, and it's not clear to me what they mean or how they should be used. From looking at some of the other first-party Datadog libraries, it seems they tell Datadog what sampling rate and rate limiting the application applies to its traces, so that Datadog can estimate metrics that account for the traces we did not send. Since Spandex doesn't implement percent sampling or rate-limiting, I think not including these metrics might behave the same as including them with static values of 1.0. Do you know of any documentation I can refer to, or a test case I can set up to demonstrate the current vs. expected behavior?

@mrz (Contributor, Author) commented Nov 19, 2021

Hey Greg,

The only "documentation" I have is the feedback we received when we contacted Datadog support for help figuring out this issue. This is the most relevant snippet on the topic:

We've heard back from our engineering team regarding this case and they confirmed that the best path forward would be to set the tags in the appropriate places.

_dd.rule_psr and _dd.limit_psr are what tell the backend that a root span has been processed by a tracer with a sampling rule for its service. They’re needed for a service to appear as configured in the ingestion page.

“Default” means that none of the traces that we received (for this service) had those metrics
“Partially configured” means that some of the traces had those metrics
“Configured” means that all the traces had those metrics
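
To make the three statuses concrete, here is a minimal Python sketch of the classification described above. It is only an illustration of the wording in the support reply, not Datadog's actual backend logic: the function names are hypothetical, each trace is represented by the metrics dict of its root span, and it assumes that "had those metrics" means both keys are present.

```python
from typing import Iterable, Mapping

# Metric names taken from the support reply above.
REQUIRED_METRICS = ("_dd.rule_psr", "_dd.limit_psr")


def has_sampling_metrics(root_span_metrics: Mapping[str, float]) -> bool:
    """Return True if a root span carries both sampling metrics (assumption: both are required)."""
    return all(key in root_span_metrics for key in REQUIRED_METRICS)


def ingestion_status(root_spans: Iterable[Mapping[str, float]]) -> str:
    """Classify a service from the metrics of its received root spans (hypothetical helper)."""
    flags = [has_sampling_metrics(metrics) for metrics in root_spans]
    if not any(flags):
        return "Default"  # none of the traces had the metrics
    if all(flags):
        return "Configured"  # all of the traces had the metrics
    return "Partially configured"  # only some of the traces had the metrics
```

Under these assumptions, a service whose root spans never carry the two metrics would be classified as "Default", which matches what we saw before the patch.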

Therefore you would want to set _dd.rule_psr and _dd.limit_psr for the root spans to have the ingestion page report the correct configuration numbers.

_sampling_priority_v1 is what tells the Datadog agent to sample the span that’s coming in. This should be set on all of the spans if you want all of them to be ingested.
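
The advice above could be sketched roughly as follows. This is a Python illustration, not the code in this pull request: the span shape (a dict with "parent_id" and "metrics" keys), the helper name, and the rule that a root span is one without a parent are all assumptions made for the example. The static values of 1.0 reflect that no percent sampling or rate limiting is applied, and the priority of 1 is an assumed default meaning "keep".

```python
from typing import Any, Dict, List


def add_sampling_metrics(spans: List[Dict[str, Any]], priority: int = 1) -> List[Dict[str, Any]]:
    """Return copies of the spans with the sampling metrics set (hypothetical helper)."""
    tagged = []
    for span in spans:
        # Copy the metrics so the caller's spans are not mutated.
        metrics = dict(span.get("metrics", {}))

        # Set the sampling priority on every span so that all of them are ingested.
        metrics["_sampling_priority_v1"] = priority

        # Assumption: a span without a parent is the root span of its trace.
        if span.get("parent_id") is None:
            # Static 1.0: nothing is sampled out or rate limited by the tracer.
            metrics["_dd.rule_psr"] = 1.0
            metrics["_dd.limit_psr"] = 1.0

        tagged.append({**span, "metrics": metrics})
    return tagged
```

For a trace with one root span and one child, only the root would receive _dd.rule_psr and _dd.limit_psr, while both spans would receive _sampling_priority_v1.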

And as you said,

Since Spandex doesn't implement percent sampling or rate-limiting, I think not including these metrics might behave the same as including them with static values of 1.0

I believe this fix only lets Datadog know that Spandex is sending everything, so that it can report this correctly in the Ingestion Control page (APM -> Setup & Configuration -> Ingestion Control). Before this change, the service was reporting a very low ingestion rate. After the patch, nothing changed in terms of the actual spans/traces in the service, but the Ingestion Control page now shows a 100% ingestion rate and a "Fully Configured" tracer configuration.

@GregMefford (Member)

Ah! Thank you, that’s very helpful! I guess since we always set a sampling priority, DD is confused because we are telling them to sample all of them, but aren’t telling them that this represents 100% of the actual traces. 👍

@GregMefford GregMefford merged commit eaf93d3 into spandex-project:master Dec 11, 2021