Support metrics using System.Diagnostics.Metrics #785

Open
jviau opened this issue Aug 17, 2022 · 5 comments
Labels
dt.core DurableTask.Core

Comments

@jviau
Collaborator

jviau commented Aug 17, 2022

With the release of .NET 6 last year, a new metrics API was introduced. It is available in the System.Diagnostics.DiagnosticSource 6.0 package, which is backwards compatible with older .NET runtimes (so we do not need to target .NET 6).

https://docs.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation

We should use this API to emit metrics for select DTFx scenarios. Customers can then listen to these metrics themselves and export them out of process appropriately, or use an existing SDK like OpenTelemetry to export them.
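As a rough illustration of the proposal above, here is a minimal sketch of emitting a measurement through a `Meter` and observing it in-process with a `MeterListener`. The meter name `DurableTask.Core` and the version string are assumptions made for illustration; the issue does not fix them. The instrument name and attribute name are taken from the tables later in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Assumption: the meter name and version are illustrative, not decided in this issue.
using var meter = new Meter("DurableTask.Core", "1.0.0");

Counter<long> taskCount = meter.CreateCounter<long>(
    "durabletask.task.count",
    unit: "{task_count}",
    description: "The number of tasks that have been processed.");

// A consumer can subscribe in-process with MeterListener,
// or let an SDK such as OpenTelemetry do the equivalent.
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Only subscribe to instruments from the meter we care about.
    if (instrument.Meter.Name == "DurableTask.Core")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, measurement, tags, state) =>
{
    Console.WriteLine($"{instrument.Name}: {measurement}");
});
listener.Start();

// Record one processed task, tagged with its type.
taskCount.Add(1, new KeyValuePair<string, object?>("durabletask.task.type", "activity"));
```

The listener here only stands in for whatever exporter a customer chooses; nothing in the emitting code depends on it.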

We can start with building a list of metrics we want to collect, their names, value significance, and any dimensions.

Relies on #698

cgillum added the dt.core DurableTask.Core label Aug 17, 2022
@cgillum
Member

cgillum commented Aug 17, 2022

One thing customers have asked us for is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend, but it would be useful if the different backends could add their own metrics as part of this work.

@jviau
Collaborator Author

jviau commented Aug 17, 2022

Metrics

Core

| Name | Instrument Type | Unit | Unit (UCUM) | Description |
| --- | --- | --- | --- | --- |
| durabletask.task.limit | Async UpDownCounter | default unit | {concurrent_task_limit} | The configured limit of concurrent tasks for this worker. Attributes will define orchestration vs activity. |
| durabletask.task.current | Async UpDownCounter | default unit | {concurrent_task_current} | The current concurrent tasks (activity or orchestration) running on the worker. Attributes will define orchestration vs activity. |
| durabletask.task.duration | Histogram | milliseconds | ms | Measures the duration of a task. |
| durabletask.task.count | Counter | default unit | {task_count} | The number of tasks that have been processed. |
| durabletask.errors | Counter | default unit | {errors} | Number of task invocation errors. |

Azure Storage

| Name | Instrument Type | Unit | Unit (UCUM) | Description |
| --- | --- | --- | --- | --- |
| durabletask.azure_storage.partition.delay | Histogram | milliseconds | ms | Measures the delay of the messages as they are dequeued. |
| durabletask.azure_storage.partition.length | Async UpDownCounter | default unit | {item_count} | The count of messages in a partition. |
| durabletask.azure_storage.errors | Counter | default unit | {errors} | Number of task invocation errors. |

note: should we include or exclude azure_storage section?

Attributes

Core

| Name | Requirement level | Description | Examples |
| --- | --- | --- | --- |
| durabletask.task.type | Required | The type of task being run. SHOULD be one of: activity, orchestration | |
| durabletask.task.name | Required | The name of the task being run. | MyOrchestration, MyActivity |
| durabletask.task.version | Conditionally Required | The version of the task being run. Omitted when version is null. | 0, 1, v1 |
| durabletask.task.status_code | Required | The status code of a completed task. This will be the terminal state of the task. | succeeded, failed, terminated, canceled |
| durabletask.task.sub_status_code | Optional | This is a consumer-supplied string [1]; think of an open-ended HTTP status code. | my_failure_reason, other_failure_reason |

[1]: May need to think about this more. But I see value in having a more granular code for failure reason. It is valuable to differentiate in monitors between expected/transient and unexpected/important failures.

Azure Storage

| Name | Requirement level | Description | Examples |
| --- | --- | --- | --- |
| durabletask.azure_storage.partition.name | Required | The name of the partition represented in this metric. | {hubname}-workitems, {hubname}-control-01 |

note: DTFx orchestration service packages SHOULD still include Core attributes when possible.
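To make the Core tables more concrete, the following is a minimal sketch of how the duration histogram and task counter might be recorded with the proposed attributes. The class name `TaskMetricsSketch`, the method `RecordTask`, and the meter name are hypothetical; only the instrument and attribute names come from the tables above.

```csharp
using System;
using System.Diagnostics;
using System.Diagnostics.Metrics;

// Hypothetical helper, for illustration only; not part of DurableTask.Core.
public static class TaskMetricsSketch
{
    // Assumption: the meter name is illustrative.
    static readonly Meter Meter = new("DurableTask.Core");

    static readonly Histogram<double> Duration = Meter.CreateHistogram<double>(
        "durabletask.task.duration", unit: "ms", description: "Measures the duration of a task.");

    static readonly Counter<long> Count = Meter.CreateCounter<long>(
        "durabletask.task.count", unit: "{task_count}", description: "The number of tasks that have been processed.");

    public static void RecordTask(
        string type, string name, string? version, string statusCode, TimeSpan elapsed)
    {
        var tags = new TagList
        {
            { "durabletask.task.type", type },            // activity or orchestration
            { "durabletask.task.name", name },
            { "durabletask.task.status_code", statusCode },
        };

        // Conditionally required: omitted when the version is null.
        if (version is not null)
        {
            tags.Add("durabletask.task.version", version);
        }

        Duration.Record(elapsed.TotalMilliseconds, tags);
        Count.Add(1, tags);
    }
}
```

A call such as `TaskMetricsSketch.RecordTask("activity", "MyActivity", null, "succeeded", TimeSpan.FromMilliseconds(42))` would record one histogram sample and increment the counter, both carrying the same attribute set.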

@jviau
Collaborator Author

jviau commented Aug 17, 2022

> One thing we've received asks for by customers is metrics for Azure Storage, like queue length. It's obviously specific to the DurableTask.AzureStorage backend but it would be useful if the different backends could add their own metrics as part of this work.

Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

edit: added dtfx.partition.length above.

@cgillum
Member

cgillum commented Aug 19, 2022

> Yeah that is definitely important. But I do wonder if that is something DTFx should implement? Or should Azure Storage be responsible for that? I guess DTFx could add one for now, but have it opt-in only via some startup value.

The problem with reporting it from DTFx is that DTFx doesn't have any concept of queues, partitions, or even work-item latency today. If we want DTFx to be able to report this, then we'll probably need to add some optional interface that the backends can implement to surface this information to DTFx.Core.

I see you added dtfx.partition.name and dtfx.partition.length. It's a little strange since not all backends have the concept of partitions (MSSQL doesn't - less sure about Service Bus). I suppose for those kinds of orchestration services, they could just report having one "default" partition, which is the full backlog size?

@jviau
Collaborator Author

jviau commented Aug 19, 2022

@cgillum each DTFx orchestration service library can emit its own metrics. In this case, DurableTask.AzureStorage should be emitting those metrics under its own Meter. I will update my table to make that more clear.
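A minimal sketch of what a backend-owned meter could look like, assuming a meter named `DurableTask.AzureStorage` and a hypothetical `getPartitionLengths` delegate supplied by the backend. The table lists the instrument as an Async UpDownCounter; as far as I know, `ObservableUpDownCounter<T>` only ships in the 7.0 version of the System.Diagnostics.DiagnosticSource package, so this sketch substitutes an `ObservableGauge<T>`, which is available in 6.0.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical type, for illustration only; not part of DurableTask.AzureStorage.
public sealed class AzureStorageMetricsSketch : IDisposable
{
    // Assumption: each backend library owns a meter named after itself.
    readonly Meter meter = new("DurableTask.AzureStorage");

    // getPartitionLengths is a hypothetical callback returning the current
    // message count per partition, keyed by partition name.
    public AzureStorageMetricsSketch(Func<IReadOnlyDictionary<string, long>> getPartitionLengths)
    {
        // Stand-in for the "Async UpDownCounter" in the table (see the note above).
        this.meter.CreateObservableGauge<long>(
            "durabletask.azure_storage.partition.length",
            () => Observe(getPartitionLengths()),
            unit: "{item_count}",
            description: "The count of messages in a partition.");
    }

    // Produces one measurement per partition, tagged with the partition name.
    static IEnumerable<Measurement<long>> Observe(IReadOnlyDictionary<string, long> lengths)
    {
        foreach (KeyValuePair<string, long> pair in lengths)
        {
            yield return new Measurement<long>(
                pair.Value,
                new KeyValuePair<string, object?>("durabletask.azure_storage.partition.name", pair.Key));
        }
    }

    public void Dispose() => this.meter.Dispose();
}
```

Because the meter lives in the backend library, consumers can subscribe to it by name independently of the Core meter, and backends without partitions simply do not publish this instrument.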

Edit: I have separated example metrics and attributes between Core and AzureStorage concerns.
