Datadog traces support in the `datadog_agent` source was initially documented in this RFC. However, the Datadog trace-agent submits more data than just traces: it also sends statistics ("APM stats") about the running time of each instrumented resource (i.e. a given piece of code), aggregated over time based on 100% of the traces received. Those APM stats are very important as they highlight code hot spots and ease aggregation, so they should be handled by Vector.
Traces are new in Vector and initial support focuses on traces coming from the Datadog Agent. The RFC discussing traces describes how those traces will be handled in Vector and highlights the need to support APM stats.
APM stats are encoded using Protobuf; the current schema is available in the datadog-agent repository. APM stats may occasionally be computed by the tracing library, but most of the time they are computed by the trace-agent. Once they are emitted by the trace-agent there is no difference between the two, apart from a boolean value that indicates where they were computed. They are sent by the trace-agent to the same endpoint as trace payloads, only the paths differ, which allows easy discrimination between trace and APM stats payloads.
Those stats are computed by a component named concentrator in the trace-agent. There is a dedicated path for APM stats coming directly from tracing libraries, but ultimately they flow to Datadog through the same sending code.
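To make that path-based discrimination concrete, here is a minimal sketch of how the source could split the two payload kinds on the request path. The exact intake paths used below are assumptions and would need to be checked against the trace-agent sending code:

```rust
// Minimal sketch of path-based discrimination between trace and APM stats
// payloads hitting the same source. The paths `/api/v0.2/traces` and
// `/api/v0.2/stats` are assumptions for illustration.
enum ApmPayload {
    Traces(Vec<u8>),
    Stats(Vec<u8>),
}

fn classify_request(path: &str, body: Vec<u8>) -> Option<ApmPayload> {
    match path {
        "/api/v0.2/traces" => Some(ApmPayload::Traces(body)),
        "/api/v0.2/stats" => Some(ApmPayload::Stats(body)),
        _ => None, // unknown path, leave it to the existing handlers
    }
}
```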
Below is an example of a stats payload; it is an aggregate of `ClientGroupedStats`:

```proto
string service = 1;
string name = 2;
string resource = 3;
uint32 HTTP_status_code = 4;
string type = 5;
string DB_type = 6;       // db_type might be used in the future to help in the obfuscation step
uint64 hits = 7;          // count of all spans aggregated in the groupedstats
uint64 errors = 8;        // count of error spans aggregated in the groupedstats
uint64 duration = 9;      // total duration in nanoseconds of spans aggregated in the bucket
bytes okSummary = 10;     // ddsketch summary of ok spans latencies encoded in protobuf
bytes errorSummary = 11;  // ddsketch summary of error spans latencies encoded in protobuf
bool synthetics = 12;     // set to true on spans generated by synthetics traffic
uint64 topLevelHits = 13; // count of top level spans aggregated in the groupedstats
```
As you can see, it is a group of various metrics that can be represented in Vector as such (sketches have been supported since this PR). In the proto definition sketches are stored as unstructured byte slices, but those fields are filled with a protobuf-encoded ddsketch. Given that sketches in Vector are also heavily based on ddsketch, APM stats sketches can be converted to/from the Vector internal representation without incurring too much accuracy loss, but implementing that conversion would require significant work.
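To make that conversion work more concrete, here is a hedged sketch of decoding the `okSummary`/`errorSummary` bytes with prost. The hand-written message below mirrors only part of the ddsketch proto definition; a real implementation should generate these types from the actual `.proto` files, and the conversion into Vector's internal sketch type is left as a hypothetical step:

```rust
// Hedged sketch: the summaries are plain protobuf-encoded ddsketches, so they
// can be decoded before being converted into Vector's internal sketch type.
use prost::Message;

// Simplified, hand-written mirror of (part of) the ddsketch proto message.
// In practice this should be generated from the real .proto files.
#[derive(Clone, PartialEq, prost::Message)]
pub struct DdSketchPb {
    #[prost(double, tag = "4")]
    pub zero_count: f64,
    // mapping, positive/negative stores, etc. omitted for brevity.
}

fn decode_summary(bytes: &[u8]) -> Result<DdSketchPb, prost::DecodeError> {
    // Decoding is the easy part; mapping the decoded bins onto Vector's
    // internal sketch representation is where the significant work lies.
    DdSketchPb::decode(bytes)
}
```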
This opens two very different paths for APM stats in Vector:

- Either APM stats are emitted as logs: each `ClientGroupedStats` would be mapped to one log event.
- Or they are emitted as metrics: each value from every `ClientGroupedStats` would be emitted as a metric with all upper-level information stored as tags. Each metric would flow independently from the others, which would require significant re-aggregation logic in the `datadog_traces` sink.

Those two approaches can be mixed if we introduce one of the following abilities to bundle multiple metrics into a single event (see the illustrative sketch after this list):

- Allow multiple metric samples in a single metric event
- Allow log events to embed metrics, which would mean adding a `Metric` type to the `Value` enum
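As an illustration of the metric path, here is a minimal sketch of fanning one `ClientGroupedStats` out into tagged counters. The struct, tag names, and metric names are assumptions for the sake of the example, not a proposed schema:

```rust
// Illustrative fan-out of one ClientGroupedStats into tagged counters.
// Field, tag, and metric names here are assumptions, not a final schema.
use std::collections::BTreeMap;

struct ClientGroupedStats {
    service: String,
    name: String,
    resource: String,
    http_status_code: u32,
    hits: u64,
    errors: u64,
    duration: u64,
}

struct CounterSample {
    name: &'static str,
    value: f64,
    tags: BTreeMap<String, String>,
}

fn fan_out(stats: &ClientGroupedStats) -> Vec<CounterSample> {
    // Upper-level information becomes tags shared by every emitted metric.
    let mut tags = BTreeMap::new();
    tags.insert("service".into(), stats.service.clone());
    tags.insert("name".into(), stats.name.clone());
    tags.insert("resource".into(), stats.resource.clone());
    tags.insert("http_status_code".into(), stats.http_status_code.to_string());

    vec![
        CounterSample { name: "apm_stats.hits", value: stats.hits as f64, tags: tags.clone() },
        CounterSample { name: "apm_stats.errors", value: stats.errors as f64, tags: tags.clone() },
        CounterSample { name: "apm_stats.duration", value: stats.duration as f64, tags },
    ]
}
```

The log path is simpler (one event with numerical fields per `ClientGroupedStats`), but the metric path is what makes re-aggregation in the sink necessary.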
This also raises the question of having a second, unrelated metric stream coming out of the `datadog_agent` source (assuming that the `datadog_agent` source accepts Datadog Agent metrics - RFC). The Vector events representing APM stats will have to be routed along with traces, and most often they will follow a different path than the plain metrics/logs received from a core-agent. It is therefore suggested to re-arrange the `datadog_agent` source; several options are available (additional details on what exactly "Datadog Agents" are can be found in the trace support RFC and may provide relevant context for the points below):
- Keep a single `datadog_agent` source:
  - and add a setting to switch between agent kinds: `agent: <TYPE>`, where `<TYPE>` could be `core` (which would support metrics & logs - we could optionally add `logs` & `metrics` to only allow logs or metrics, along with `core` that would allow both), `trace`, and could be extended to support `process`, `security` and so on
  - or introduce the ability for a source to have multiple outputs for the `datadog_agent` source, e.g. `<SRC_ID>.metrics`, `<SRC_ID>.logs`, `<SRC_ID>.traces`, `<SRC_ID>.apm_stats`, etc.
- Another solution would be to keep one Vector source per kind of Datadog Agent; we would then have the following Vector sources:
  - `datadog_core` (or the current `datadog_agent` source) that would receive the data sent by the Datadog "core" Agent (the agent that collects logs & metrics)
  - `datadog_trace` that would support all data sent by the trace-agent
  - And so on as the support list grows:
    - a `datadog_process` source for the Datadog process-agent
    - `datadog_security` for the Datadog security Agent
    - Etc.
- Ongoing work on transforms to add `named_outputs` is laying the groundwork for the same feature on sources; one PR has already been merged while scheduled work is tracked here.
- Ongoing work on schemas will ultimately offer a programmatic way of validating required fields and expressing constraints on incoming events for a given sink. Traces & APM stats are a good fit for that because they will be represented as standard Vector events, while the sinks handling them will expect some mandatory information.
- An official crate for dd-sketches is being worked on.
- Decode APM stats requests (protobuf) received from the trace-agent
- Convert those into a Vector internal representation
- Enable the passthrough use case (trace-agent -> Vector -> Datadog) in a lossless fashion
- Compute APM stats in Vector; this should nonetheless be kept in mind as a valuable feature for third-party traces
- Any support for other kinds of traces
- Vector has no trace support
- Datadog traces support without APM stats makes the whole APM product much less powerful.
The Vector `datadog_agent` source would accept all supported data types, including APM stats (along with traces), and emit Vector events (logs or metrics depending on the implementation) carrying all metadata as tags/fields, so filtering could be done later in the topology on both APM stats and traces.
In order to avoid complex and unreliable `route` transforms to properly differentiate logs from traces (as the latter will be represented as logs inside Vector), and plain metrics (received from the core Agent) from APM stats metrics (received from the trace-agent), we can plan to extend the behaviour that was added to the `remap` transform. This would translate to the following kind of config, easy to read and easy to adapt:
```toml
[sources.dd_agents]
type = "datadog_agent"
address = "[::]:8081"

[sinks.dd_traces]
type = "datadog_traces"
inputs = ["dd_agents.traces", "dd_agents.apm_stats"]

[sinks.dd_logs]
type = "datadog_logs"
inputs = ["dd_agents.logs"]

[sinks.dd_metrics]
type = "datadog_metrics"
inputs = ["dd_agents.metrics"]

[sinks.debug]
type = "console"
# Optionally the non-suffixed name could receive everything, this will be configurable
inputs = ["dd_agents"]
encoding.codec = "json"
```
The `datadog_traces` sink will receive those events (metrics and/or logs depending on the implementation) and do the opposite conversion, provided the expected tags are there.
Regarding the Datadog trace-agent config, the APM stats endpoint is the same as the trace one (`apm_config.apm_dd_url` config key), so there will be nothing else to configure.
And finally, API key management will be the same as it is for other Datadog sources/sinks.
Each group is relatively independent:
- Reorganise the `datadog_agent` source:
  - Extend the `named_outputs` feature, which is available for transforms, to sources so they can expose multiple named outputs (`<SRC_ID>.<OUTPUT_NAME>`). The feature for transforms was initially added in this PR, and subsequent work on it is tracked [here][named-outputs-improvements].
  - Add the following `named_outputs` in the `datadog_agent` source: `<SRC_ID>.traces`, `<SRC_ID>.apm_stats`, `<SRC_ID>.metrics`, `<SRC_ID>.logs`. Note that the non-suffixed output should have a predictable behaviour, so we could add a knob named `top_level_output` that would allow the user to choose which data to get out of the suffix-less output. A hedged sketch of the output dispatch follows this list.
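As a rough illustration of that dispatch (not the actual source API), here is a minimal sketch; `DecodedPayload`, `Event`, and `send_named` are hypothetical stand-ins for whatever the real source internals end up being:

```rust
// Hypothetical dispatch of decoded Datadog Agent payloads to named outputs.
// All types and helpers here are illustrative stand-ins.
enum DecodedPayload {
    Logs(Vec<Event>),
    Metrics(Vec<Event>),
    Traces(Vec<Event>),
    ApmStats(Vec<Event>),
}

struct Event; // placeholder for a Vector event

fn send_named(output: &str, events: Vec<Event>) {
    // In the real source this would forward events to the `<SRC_ID>.<output>`
    // named output; here it is just a stub.
    let _ = (output, events);
}

fn dispatch(payload: DecodedPayload) {
    match payload {
        DecodedPayload::Logs(events) => send_named("logs", events),
        DecodedPayload::Metrics(events) => send_named("metrics", events),
        DecodedPayload::Traces(events) => send_named("traces", events),
        DecodedPayload::ApmStats(events) => send_named("apm_stats", events),
    }
}
```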
APM stats support would be done according to the following plan:

- Import all APM stats as standard Vector metrics:
  - Turn each `ClientGroupedStats` into relevant metrics carrying all possible metadata, to allow the lossless pass-through scenario and the same level of filtering/routing we can achieve for traces. APM stats sketches would then be converted to the Vector internal sketches. Vector internal sketches would then get parameterized `gamma` and `maxbin` values that would still default to the agent sketch values.
  - The upcoming `datadog_traces` sink would then receive the incoming APM stats metrics along with traces. It will re-aggregate those metrics according to the relevant dimensions, using the `Partitioner` trait to rebuild APM stats payloads (see the sketch after this list):
    - Incoming metrics will be buffered and will populate a struct matching the APM stats base object; those structs will be stored in a map keyed by the very same kind of [keys][trace-stats-agg-key] used by the trace-agent.
    - Every 10 seconds (the sending interval of the trace-agent), those structs will be serialized and flushed to Datadog. To account for late metrics the sink would have to keep 2 or 3 buckets in the past and delay flushing accordingly. This would rely on the [bucket timestamp][btime] kept by the trace-agent and [stored in the APM stats payload][csb-start].
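A minimal sketch of that re-aggregation step follows; the `AggregationKey` fields mirror the dimensions discussed above, while the type names, bucket granularity, and flush logic are assumptions rather than the final sink design:

```rust
// Hedged sketch of re-aggregating incoming APM stats metrics into buckets
// keyed the same way the trace-agent keys its stats.
use std::collections::HashMap;

const BUCKET_NANOS: u64 = 10_000_000_000; // 10 s, the trace-agent sending interval

#[derive(Clone, PartialEq, Eq, Hash)]
struct AggregationKey {
    env: String,
    service: String,
    name: String,
    resource: String,
    http_status_code: u32,
    synthetics: bool,
}

#[derive(Default)]
struct GroupedStats {
    hits: u64,
    errors: u64,
    duration: u64,
    // ok/error latency sketches would be merged here as well
}

#[derive(Default)]
struct StatsBuckets {
    // bucket start timestamp (aligned to BUCKET_NANOS) -> per-key aggregates
    buckets: HashMap<u64, HashMap<AggregationKey, GroupedStats>>,
}

impl StatsBuckets {
    fn add(&mut self, ts_nanos: u64, key: AggregationKey, hits: u64, errors: u64, duration: u64) {
        let start = ts_nanos - (ts_nanos % BUCKET_NANOS);
        let entry = self.buckets.entry(start).or_default().entry(key).or_default();
        entry.hits += hits;
        entry.errors += errors;
        entry.duration += duration;
    }

    // Flush buckets older than `now - keep` (e.g. 2-3 bucket widths) so late
    // metrics still land in the right bucket before it is serialized.
    fn flush_older_than(
        &mut self,
        now_nanos: u64,
        keep: u64,
    ) -> Vec<(u64, HashMap<AggregationKey, GroupedStats>)> {
        let cutoff = now_nanos.saturating_sub(keep);
        let ready: Vec<u64> = self.buckets.keys().copied().filter(|&s| s < cutoff).collect();
        ready.into_iter().map(|s| (s, self.buckets.remove(&s).unwrap())).collect()
    }
}
```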
At a certain point in time, when sketches-rs is production ready:

- Switch Vector sketches to use the sketches-rs crate instead of the Agent-based implementation
- Then relocate the conversion logic: sketch conversion will become effectively useless in traces handling, but the `datadog_agent` source and the `datadog_metrics` sink will then have to handle conversion between Vector sketches (plain ddsketches) and the agent variation (note: it's on the crate roadmap to offer that kind of conversion)
[btime]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/stats/concentrator.go#L148-L159
[csb-start]: https://github.com/DataDog/datadog-agent/blob/dc2f202/pkg/trace/pb/stats.proto#L47
- We should keep valuable metrics relevant to end users
- Dropping APM stats would cause current users to lose some insight into execution time.
None identified so far.
N/A.
Regarding the possibility of ignoring/dropping incoming APM stats:

- We could completely drop APM stats, but this is not really an option as it would degrade the user experience
- Or we could disable sampling on the trace-agent side and compute APM stats in the `datadog_traces` sink. This could work, but it is a lot for an initial implementation (it would require plain ddsketch support on top of the computation logic), and to match the accuracy of current APM stats Vector would have to receive 100% of traces, which may not always be possible. However, this would pave the way for generic APM stats computation wherever the traces come from.

Regarding the internal representation, APM stats could alternatively be represented by a log event with some numerical fields. As stated above, a hybrid approach such as allowing a log event to have metric fields, or introducing a metric event that could hold multiple values, could also be a solution.
Regarding sketches, those from APM stats are not exactly the same as the internal representation we have in Vector, thus converting them to the internal representation will require some plumbing. This could be avoided by not decoding those sketches at all and keeping them as opaque data/raw byte slices inside Vector.
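If the opaque-bytes route were taken, the event could simply carry the encoded summaries untouched; a trivial sketch under that assumption (field names are illustrative):

```rust
// Sketch of the "opaque sketches" alternative: keep the encoded summaries as
// raw bytes on the event instead of decoding them. Field names are illustrative.
struct ApmStatsEvent {
    hits: u64,
    errors: u64,
    duration: u64,
    ok_summary: Vec<u8>,    // protobuf-encoded ddsketch, passed through untouched
    error_summary: Vec<u8>, // protobuf-encoded ddsketch, passed through untouched
}
```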
About the source(s) reorganisation, an alternative that avoids the work to implement the `<source_id>.<suffix>` outputs would be to handle each kind of Datadog Agent in a dedicated source:

- Either the `datadog_agent` source is adjusted to be configurable with a `type` setting (that could be set to `logs`, `metrics` or `traces`)
- Or source types are mapped to Datadog types: `datadog_logs`, `datadog_metrics` & `datadog_traces` (`datadog_agent` would probably become an alias for `datadog_logs` or `datadog_metrics` before being deprecated)
This would lead to the following config, functionally identical to the snippet above, a bit longer but still very straightforward and easily readable (note that having multiple binding addresses may translate to more parameters in later work around Helm charts):
```toml
[sources.dd_in_logs]
type = "datadog_logs"
address = "[::]:8081"

[sources.dd_in_metrics]
type = "datadog_metrics"
address = "[::]:8082"

[sources.dd_in_traces]
type = "datadog_traces"
address = "[::]:8083"

[sinks.dd_traces]
type = "datadog_traces"
inputs = ["dd_in_traces"]

[sinks.dd_out_logs]
type = "datadog_logs"
inputs = ["dd_in_logs"]

[sinks.dd_out_metrics]
type = "datadog_metrics"
inputs = ["dd_in_metrics"]

[sinks.debug]
type = "console"
inputs = ["dd_in_*"]
encoding.codec = "json"
```
None.
- Implement the multiple-outputs-per-source option
- Implement APM stats decoding to Vector metrics
- Add APM stats support to the `datadog_traces` sink

Depending on the timeline of sketches-rs, this point sits at the edge between this section and the next one:

- Switch Vector to use sketches-rs internally instead of the Agent variant

- Compute APM stats in the `datadog_traces` sink for any trace format.
- Overall, most improvements suggested in the Datadog trace RFC apply here, but having constraints on the schema used (in the case where we represent APM stats as log events) would be very useful.