-
Notifications
You must be signed in to change notification settings - Fork 295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metric aggregation public surface changes #323
Metric aggregation public surface changes #323
Conversation
this.Sum = Utils.SanitizeNanAndInfinity(this.Sum); | ||
this.Min = Utils.SanitizeNanAndInfinity(this.Min); | ||
this.Max = Utils.SanitizeNanAndInfinity(this.Max); | ||
this.StandardDeviation = Utils.SanitizeNanAndInfinity(this.StandardDeviation); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not sanitize them on input (set) instead of as a separate step?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied the way it was done in other types. Are you guys proposing to change it for all types (that would make sense)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have context on the history behind it but if there isn't a strong reason to do it like this I would propose changing it. Let's ask around in the morning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking more into this and I think I'd have to leave it the was it was coded originally. Sanitization is considered a part of the serialization, actually, and if we change it only for aggregated metric telemetry type, it will look odd and different to all other types. I'd prefer, if you like, to document (via a story) a future effort to clean that up across all telemetry types.
/// <returns>Value aggregator for the metric specified.</returns> | ||
public MetricAggregator CreateMetricAggregator(string metricName, IDictionary<string, string> dimensions = null) | ||
{ | ||
throw new NotImplementedException(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I assume that the various "not implemented" methods will be implemented in the future?
Should I also assume that at some point the now marked as obsolete class MetricTelemetry will be deleted in favor of the new AggregatedMetricTelemetry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a PR for public surface only (from one private branch to the other, not to develop yet). Naturally, "not implemented" methods will be implemented by the time we go to "develop".
As far as obsolete things - we can totally remove them in the next major version. For now they're marked obsolete and hidden from intellisense...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, thank-you for confirming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left couple of interesting comments
this.Name = this.Name.SanitizeName(); | ||
this.Name = Utils.PopulateRequiredStringValue(this.Name, "name", typeof(AggregatedMetricTelemetry).FullName); | ||
this.Properties.SanitizeProperties(); | ||
this.Count = this.Count < 0 ? 0 : this.Count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should it be 1, not zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comment as above - why do you think so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SET COUNT HERE TO "1"?
/// </summary> | ||
public int Count | ||
{ | ||
get { return this.Metric.count ?? 0; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should the default be 1, not zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should be zero. Why do you think it should be 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that Count
represents the "base" for the value. It should not be zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. the Sanitize
method will change it from 0 to 1 so the default value examined through the object will be different than the value sent on the wire..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sanitize will set Count to 1 only in case when AggregatedMetricTelemetry was constructed w/o setting "Count" property (or explicitly setting it to <= 0). We consider this a mistake and I initially was setting it to "0" but that may present some problems in the pipeline...
The statement about different value on the wire and through property getter is not unique to this particular case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's why @SergeyKanzhelev and I are saying that Count
should return this.Metric.count ?? 1
. Otherwise Count
returns incorrect value, 0
, when it hasn't been set explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. Also added unit test.
@@ -236,9 +236,11 @@ private static void SerializeTelemetryItem(ITelemetry telemetryItem, JsonWriter | |||
ExceptionTelemetry exceptionTelemetry = telemetryItem as ExceptionTelemetry; | |||
SerializeExceptionTelemetry(exceptionTelemetry, jsonWriter); | |||
} | |||
#pragma warning disable CS0618 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should you create a serializer for the AggregatedMetricTelemetry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did. I was omitted in this commit but will be there for the next one (will be available very shortly)
/// <param name="min">Minimum value taken during aggregation interval.</param> | ||
/// <param name="max">Maximum of values taken during aggregation interval.</param> | ||
/// <param name="standardDeviation">Standard deviation of values taken during aggregation interval.</param> | ||
public AggregatedMetricTelemetry( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please expose the constructor without sum, min, max, etc. Mostly to report performance counter values, but also for the back compat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expose parameter-less ctor that can be used along with property setting to populate whatever properties I like. I long considered adding others or making, say, min/max optional in some way on the ctor that takes all properties, but ruled that out at the end. I want user to explicitly state what they want to do and not omit anything. I think the schema I have right now is best, but we can discuss offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this constructor? It's unreadable at the point of call.
new AggregatedMetricTelemetry("foo", 1, 2, 3, 4, 5);
You could make it readable by supplying parameter names, but you can achieve the same effect with property syntax.
new AggregatedMetricTelemetry("foo", count: 1, sum: 2, min: 3, max: 4, standardDeviation: 5);
new AggregatedMetricTelemetry { Name = "Foo", Count = 1, Sum = 2, Min = 3, Max = 4, StandardDeviation = 5);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ctor with parameters exists for the case when people aggregated entirely outside our Sdk and just want to send the results to AI. It states explicitly what is expected. If you use parameter-less ctor and property setters, it is not entirely clear what to set to get consistent results. This is somewhat against the purity of the design, but this is where we hit cold reality. We need to capture most common cases of providing metric data and make sure we do not create a large set of concepts. I thought having two concepts combined into one: a) metric that is just count (sum and count) and I do not care about the rest of the properties and b) metric where all properties are interested; would be Ok.
Therefore, people using parameter-less ctor may choose properties they want to set based on what they are going to look at. People using the other ctor know all things that need to be set if you want to go "comprehensive".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we need a constructor that communicates intent but is unreadable when used. I guess I could argue this point both ways 😏. Let's keep it as is.
this.Properties.SanitizeProperties(); | ||
this.Count = this.Count < 0 ? 0 : this.Count; | ||
this.Sum = Utils.SanitizeNanAndInfinity(this.Sum); | ||
this.Min = Utils.SanitizeNanAndInfinity(this.Min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this call will initialize nullable field. Which is bad as we will need to pass the value over the wire even if we do not want to. Typical example is performance counter value. I do not want to spent bytes over the wire to send the min
and max
which are equal to the current value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add a unit test that will ensure that not setting min
and max
will result to not serializing of those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think those fields should be optional and we do not need to send them if they are not available. Let's discuss
Please note an update on this PR. This is a "nearly" final version. I need to think about a couple of things here and I have tracing not yet done. Otherwise, unless you point out holes and/or problems I propose we consider this as a "we're done" stage and try use it in internal efforts such as Sergey's exception stats feature. |
…end. Added tracing
…same aggregator instance for the same metric name and dimensions and does not create new aggregator instance on each call. Also changed method name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't find where you limit the number of dimensions as we discussed on the previous sync up, But it may be a separate PR
/// <summary> | ||
/// Name of the property added to aggregation results to indicate duration of the aggregation interval. | ||
/// </summary> | ||
private static string intervalDurationPropertyName = "IntervalDurationMs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need it? I thought we agreed that we assume 1 minute as a default interval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed, will keep to address the shortened intervals at the start and shutdown times for the application.
[Event( | ||
27, | ||
Message = "Failed to invoke metric processor. Exception: {0}.", | ||
Level = EventLevel.Error)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Level "Error" means that this trace will be sent to the portal as SDK "self-diagnostics" event. To make it more understandable for the user I'd recommend to mark it "User Actionable", trace the processor name and write instruction that "if issue persists - disable this processor"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ERROR IS CUSTOMER FACING. Get type of the processor. User-actionable. Suggest turning off processor if persists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
this.cancellationSource = new CancellationTokenSource(); | ||
|
||
this.snapshotTask = new Task(this.SnapshotRunner, TaskCreationOptions.LongRunning); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes the third thread SDK runs in background. QP configuration, Performance Counters collection and this aggregator. Can you think of any options how to combine this thread with Performance Counters collection thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is lower priority comment, we can convert it to "enhancement" issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will need to address later if these keep breading...
this.Properties.SanitizeProperties(); | ||
this.Count = this.Count < 0 ? 0 : this.Count; | ||
this.Sum = Utils.SanitizeNanAndInfinity(this.Sum); | ||
this.Min = Utils.SanitizeNanAndInfinity(this.Min); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think those fields should be optional and we do not need to send them if they are not available. Let's discuss
/// Gets the list of <see cref="IMetricProcessor"/> objects used for custom metric data processing | ||
/// before client-side metric aggregation process. | ||
/// </summary> | ||
public IList<IMetricProcessor> MetricProcessors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name processor
makes it inconsistent with the TelemetryProcessors
which combines the chain rather than sequential processing. Do you have any other names in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was changed to "processor" from "metric stream listener"to make it more consistent :) I'll leave it as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Sergey. Suggestion:
type name: IMetricValueTracker
property name: IList RawMetricValueTrackers
/// <summary> | ||
/// Aggregator manager for the aggregator. | ||
/// </summary> | ||
private MetricAggregatorManager manager; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make readonly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
/// <summary> | ||
/// Metric aggregator id to look for in the aggregator dictionary. | ||
/// </summary> | ||
private string aggregatorId; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
readonly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
/// <summary> | ||
/// Aggregator hash code. | ||
/// </summary> | ||
private int hashCode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
|
||
// note: we set count to 1 if it isn't a postitive integer | ||
// thinking that if it is zero (negative case is clearly broken) | ||
// that most likely menas somebody set the sum but forgot to set count |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mistype "menas" instead of "means"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) Fixed
/// <summary> | ||
/// Gets or sets the min value of this metric. | ||
/// </summary> | ||
public double Min |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the non-nullable double
type indicate that Count
, Min
, Max
and StandardDeviation
properties are required to have value? With the current implementation of Sanitize
it is possible to supply Sum
and Count
values only, and have the Min
, Max
and StandardDeviation
values "default" to 0. Is that indentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was intentional.I do not see a value in making these nullable. Nullable things are more difficult to work with and AI UX does not treat these any different than 0s. Effectively, we again getting into case where we're trying to represent multiple concepts via one class and therefore > 50% of it is nullable...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Thanks for the explanation.
/// <summary> | ||
/// Metric aggregator snapshotting task. | ||
/// </summary> | ||
private Task snapshotTask; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to reuse the TaskTimer
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't aware of TaskTimer existence. The answer is "almost". TaskTimer fires at regular intervals while perf counter reporting tasks changes interval after each invocation if needed. We can change the TaskTimer, of course, of create a separate class for this one usage...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think TaskTimer
fires once, once the TimeSpan
specified by its Delay
property elapses. You can adjust the Delay
property from the elapsed
function and start again with the new delay. With the benefit of perfect hindsight, this is not the most elegant design, but I think it should get the job done. I do think that timer logic is tricky enough to invest in a single robust implementation that's easy to use for aggregation as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at TaskTimer. I understand what it is doing and it would [almost] work for me if I were to have functionality (in it) to do something if wait is terminated (I need to try to do Flush() to attempt not to lose any telemetry I can salvage). I considered the problem and I think it is too large to be coupled with the change I'm doing.
I suggest separating the area that covers two questions: A) How many timer implementations does one library need? and B) How many utility threads does one library need to run? and tackling it separately. We also have a perfect guy to propose the best solution - Greg. I chatted with him and once he is done with his current thing (should be just a few days) he'll look into this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take that back. Don't know what I was thinking. I can do it a lot simpler. Will use TaskTimer
/// <summary> | ||
/// Cancellation token source to allow cancellation of the snapshotting task. | ||
/// </summary> | ||
private CancellationTokenSource cancellationSource; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can any of these fields be readonly
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, token and the task itself. Added readonly attributes
{ | ||
// create a local reference to metric processor collection | ||
// if collection changes after that - it will be copied not affecting local reference | ||
IList<IMetricProcessor> metricProcessors = this.manager.Client.TelemetryConfiguration.MetricProcessors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ouch. This is a pretty bad violation of the law of Demeter. From customer viewpoint, this means that in order to test the use of MetricAggregator
in my code, I'll need to understand a chain of objects 4 levels deep and create it for every test. In practice, I think few people will go through the effort successfully and, like me, will decide to wrap the MetricAggregator
and the MetricAggregatorManager
in application abstractions as "untestable external dependencies" instead.
Aside from the testability aspect, this is a pretty strong design smell of MetricAggregator
performing responsibilities that belong somewhere else, perhaps in the SimpleMetricStatisticAggregator
or in a class that doesn't exist yet, like a MessageProcessorForwarder: IMetricProcessor
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same pattern as with regular .Net events. May not be necessary. Lets chat about it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a fix for this. Technically I'm shifting responsibilities by encapsulating a bit more. Do you see this as acceptable fix (or at least a step in the right direction) or more of a hack?
/// <summary> | ||
/// Represents aggregator for a single time series of a given metric. | ||
/// </summary> | ||
public class MetricAggregator : IEquatable<MetricAggregator> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we implementing IEquatable
in MetricAggregator
only for the sake of the MetricAggregatorManager.GetStatisticsAggregator
? It's a lot more (untested 😱) code in the MetricAggregator
and harder to understand in the MetricAggregatorManager
. Why not simply use MetricAggregator.aggregatorId
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did use aggregator id directly initially. Then I decided and can cache the hash code speeding up lookups. That idea led to current design...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'll settle for unit tests then 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests added.
public class MetricAggregatorManagerTest | ||
{ | ||
[TestMethod] | ||
public void MetricAggregatorMayBeCreatedForMetricHavingNoDimensions() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt these integration tests were hard to understand for someone who is not already familiar with all 3 classes implementing metric aggregation. Arrange, Act, Assert comments would help. If it's possible at this point, factoring the classes to allow unit testing them separately would help more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not entirely understand. We'll need to chat offline.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. For readability, I'd be happy if the tests looked like this.
public void TestName()
{
// Arrange
...
// Act
...
// Assert
...
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments where I thought it was not obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @vitalyf007! I feel these tests are much easier to read with the arrange/act/assert comments.
using Assert = Xunit.Assert; | ||
|
||
[TestClass] | ||
public class MetricAggregatorTest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we missing tests for the IEquatable
implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a set of tests in MetricAggregatorManager that tests these indirectly, however, I can (and likely should) add MA equality tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests added and existing ones are refactored. Overall looks a lot cleaner.
/// <summary> | ||
/// Represents aggregator for a single time series of a given metric. | ||
/// </summary> | ||
internal class SimpleMetricStatisticsAggregator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I was reading this code, I felt that two aggregator classes, MetricAggregator
and SimpleMetricStatisticAggregator
are confusing. What if they were called Metric
and MetricAggregator
respectively? As a potential benefit, this naming would allow to simplify the IMetricProcessor
definition to Track(Metric, double value)
, which would also be more resilient to future changes in the Metric
class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to think about Metric/MetricAggregator naming...
You're almost right. There are two aggregators, so to speak. MetricAggregator is really a "handle" or a proxy to te actual aggregator. You can hold on to it aslong as you like and operating on it will be proxied to actual aggregator which is SimpleMetricStatisticsAggregator class.
Note that metric data may be aggregated in a number of ways. Extracting/computing simple stats is just a way. You can also do t-digest and some other fun stuff too. Lets chat more about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so how about calling them Metric
and SimpleMetricStatisticAggregator
then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say we have to settle on the names of three things. Currently I have MetricAggregator
, MetricAggregatorManager
(this one with CreateMetricAggregator
method) and SimpleMetricStatisticsAggregator
.
I guess you're suggesting the following naming schema: Metric
, MetricManager
(with CreateMetric
method) and SimpleMerticStatisticsAggregator
(no change on the last one)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vitalyf007 Thanks for the pull request. It's great to see the metric aggregation making its way to the core SDK. I have a couple of concerns about implementation of the new AggregatedMetricTelemetry
class and an alternative naming for the MetricAggregator
I would like you to consider. The remaining comments are small implementation concerns and suggestions. Could you take a look at my comments?
cc: @SergeyKanzhelev
|
||
if (aggergatedMetricTelemetry != null) | ||
{ | ||
this.Client.Track(aggergatedMetricTelemetry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just thought about it - I'd suggest to set an sdkVersion here different from just dotnet: XX.XX.XX
. Something like agg-net: XX.XXX.XX
so you'll now how many metrics were collected using this aggregation mechanism.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting setting a special tag for this specific way of aggregating such that we can tell how many people use it? I do not think version of the Sdk is the right way though...
/// <returns>Metric telemetry object resulting from aggregation.</returns> | ||
private static AggregatedMetricTelemetry CreateAggergatedMetricTelemetry(MetricAggregator aggregator, SimpleMetricStatisticsAggregator aggregatorStats) | ||
{ | ||
if ((aggregator == null) || (aggregatorStats == null) || (aggregatorStats.Count <= 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the aggregator
or aggregatorStats
be ever null
here when they come from this.aggregatorDictionary
protected by the logic in GetStatisticsAggregator
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They cannot be, correct. If favor of failing fast, I'll remove the check.
{ | ||
AggregatedMetricTelemetry aggergatedMetricTelemetry = CreateAggergatedMetricTelemetry(aggregatorWithStats.Key, aggregatorWithStats.Value); | ||
|
||
aggergatedMetricTelemetry.Properties.Add(intervalDurationPropertyName, ((long)aggregationIntervalDuation.TotalMilliseconds).ToString(CultureInfo.InvariantCulture)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aggregatedMetricTelemetry
can be null
here if the aggregator's Count
is 0
. We should add the intervalDurationPropertyName
inside the if
statement below. Ideally, I think the CreateAggregateMetricTelemetry
should always return a valid object and Snapshot
should check for Count == 0
instead of null
to make it clear why the AggregatedMetricTelemetry
is not created and not tracked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:( Good catch. Fixed
} | ||
catch (Exception ex) | ||
{ | ||
CoreEventSource.Log.FailedToSnapshotMetricAggregators(ex.ToString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TaskTimer
already logs unhandled exceptions. Do we really need the additional logging of FailedToSnapshotMetricAggregators
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have a specific message logged here in addition to more generic "timed task failed" message logged by timed task.
throw new ObjectDisposedException(nameof(MetricAggregatorManager)); | ||
} | ||
|
||
if (aggregator == null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The GetStatisticsAggregator
is only called by MetricAggregator.Track
. Can aggregator
be ever null
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Will remove the check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vitalyf007 Thanks for addressing my previous comments. I've completed a second review pass, unfortunately only for product code this time as I need to cut this short today. I think we have to address the potential NullReferenceException
in the Snapshot
method of the MetricAggregatorManager
. Other than that, I left a few comments in places where I found what looks like defensive code and would prefer it to fail fast instead. Plus some questions about tests and style (it's my OCD, I know, I just can't help it!) I feel strongly only about the NullReferenceException
problem. Please ping me if/when you want me to do another review pass.
/// Represents mechanism to calculate basic statistical parameters of a series of numeric values. | ||
/// </summary> | ||
internal class SimpleMetricStatisticsAggregator | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have note yet read the whole CR so I may learn this shortly (sorry if that's the case), but what do you use these getters for and how important is it that they give consistent values? You may want to put them under lock too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So having read the other bits, the only way this can be a problem is when Track was called right before a Snapshot is made and then lost the race. We would expose the spinlock internally and lock the aggregator while making the snapshot. Because the race is very rare, the lock will be contended very rarely and so should not result in any significant perf decrease.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, given the read pattern (once right after snapshot) I do not think this is a problem. I do not want to complicate getters with honoring the lock especially considering the fact that they'll be called sequentially.
return false; | ||
} | ||
|
||
return string.Compare(this.aggregatorId, other.aggregatorId, StringComparison.Ordinal) == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Implement Equals using Equals (of the aggregatorId)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
private static string GetAggregatorId(string name, IDictionary<string, string> dimensions = null) | ||
{ | ||
StringBuilder aggregatorIdBuilder = new StringBuilder(name ?? string.Empty); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider a different string constant than "" here. This might help debugging. Can name actually be empty? If yes, we will confuse it with a null name. Maybe "MetricNameIsNull" or somthing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will use "n/a" as that is the way Sdk represents other names (such as metric name, event name) during serialization if the name is null.
|
||
foreach (KeyValuePair<string, string> pair in sortedDimensions) | ||
{ | ||
aggregatorIdBuilder.AppendFormat(CultureInfo.InvariantCulture, "\n{0}\t{1}", pair.Key ?? string.Empty, pair.Value ?? string.Empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are using /n and /t as separators. Are we sure that they cannot be part of the key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They can appear in the name but I do not expect to ever see two real cases of that. Metric names, dimension names and values appear in the UX and generally speaking expected to be symbols and spaces only...
|
||
if (metricProcessors != null) | ||
{ | ||
foreach (IMetricProcessor processor in metricProcessors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hot path, right?
If so, maybe a For Loop is more performant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good catch. Thanks.
{ | ||
throw new ObjectDisposedException(nameof(MetricManager)); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
race on disposed. if you are sure it is benign, please add a comment so that the readers know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment.
|
||
// calculate aggregation interval duration interval | ||
TimeSpan aggregationIntervalDuation = DateTimeOffset.UtcNow - this.lastSnapshotStartDateTime; | ||
this.lastSnapshotStartDateTime = DateTimeOffset.UtcNow; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call DateTimeOffset.UtcNow once and store in a local. Otherwise you have the situation where lastSnapshotTime does not directly correspond to the end of the previous period.
|
||
// adjust interval duration to exactly snapshot frequency if it is close (within 1%) | ||
double difference = Math.Abs(aggregationIntervalDuation.TotalMilliseconds - snapshotFrequency.TotalMilliseconds); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How did you pick 1%?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it the period is 1 minute, 1% is 0.6 secs. For 5 mins aggregations, it is 3 secs. I am not sure that this is what you want. Maybe 1% for under period < 10 secs; 0.5 sec for 10 sec < period <= 30 sec; 1 sec for 30 sec < period <= 3 min; and 3 secs otherwise. This is also random, bu i think it comes closer to what you want. Whatever you choose, please explain it in a comment.
{ | ||
AggregatedMetricTelemetry aggergatedMetricTelemetry = CreateAggergatedMetricTelemetry(aggregatorWithStats.Key, aggregatorWithStats.Value); | ||
|
||
if (aggergatedMetricTelemetry.Count > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use (aggregatorWithStats.Value.Count > 0) and mode the check above the telemetry ctor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if (aggregationIntervalDuation.TotalMilliseconds < 1) | ||
{ | ||
aggregationIntervalDuation = TimeSpan.FromMilliseconds(1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is not thread safe, because you are first resetting a snapshot and then resetting the start time value. So it can be that a subsequent call incorrectly snapshots a few values (more recent dictionary) with a long time period.
Now, if you know that this is called "not too frequently" than this is not a problem. But here you are handling the case where this can be called more often than once a millisec, in which case it is a problem.
You could group the dictionary and the time start value into one object and just swap out that.
/// Metric factory and controller. | ||
/// </summary> | ||
public sealed class MetricManager : IDisposable | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I am still reading this, but I really do not like that this guy is disposable.
What are the circumstances where you create more than one MetricManager?
It seems to only be disposable because of the timer. I suggest to look for ways around that. Maybe I can suggest something after I finish reading everything...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked at TaskTimer. It looks like just Task.Delay plus some really weird cancellation logic that looks really suspicious to me. I would not use it. Can we chat to see wht was intended and why directly using Task.Delay is not an option (or is it?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking further, we throw in disposed state for methods, but not properties. Is this right. Most methods, actually, would work fine in the disposed state, the only thing we cannot do is schedule future updates. Can we allow those calls after Dispose()? Can we get rid of IDisposable? I think we should chat. :)
/// <summary> | ||
/// Reporting frequency. | ||
/// </summary> | ||
private static TimeSpan snapshotFrequency = TimeSpan.FromMinutes(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aggregationPeriod would be a better name. frequency is a rate, but you are using a time span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
TimeSpan sleepTime = nextWakeTime - DateTimeOffset.UtcNow; | ||
|
||
// adjust wait time to a bit longer than a minute if the wake up time is within few seconds from now | ||
return sleepTime < TimeSpan.FromSeconds(3) ? sleepTime.Add(snapshotFrequency) : sleepTime; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are not adding a minute, but the period. Not sure what's intended, but please make the comment and code match.
/// Calculates wait time until next snapshot of the aggregators. | ||
/// </summary> | ||
/// <returns>Wait time.</returns> | ||
private static TimeSpan GetWaitTime() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is really hard to read, and I think it is also not correct. if your aggregation period is >= 1 minute, your 3 seconds rule will never apply. And is it is small, then you may still end up in the past.
Consider this as an alternative:
private static TimeSpan GetWaitTime()
{
// const:
TimeSpan aggregationPeriod = TimeSpan.FromMinutes(1);
DateTimeOffset now = DateTimeOffset.UtcNow;
TimeSpan waitPeriod = GetNextSnapshitTime(now, aggregationPeriod) - now;
return waitPeriod;
}
private static DateTimeOffset GetNextSnapshitTime(DateTimeOffset prevTime, TimeSpan aggregationPeriod)
{
DateTimeOffset exactWaitUntilTime = prevTime + aggregationPeriod;
// Below we have logic that rounds the next snapshot event to a sertain number of seconds past a minute.
// This would break down if the aggregation period was smaller than a minute.
// So, for small aggegation periods we are ot rounding.
if (aggregationPeriod < TimeSpan.FromMinutes(1))
{
return exactWaitUntilTime;
}
// We want to snapshot at 1 sec past full minute.
// Why??
const int pastMinSecs = 1;
DateTimeOffset roundedWaitUntilTime = new DateTimeOffset(exactWaitUntilTime.Year, exactWaitUntilTime.Month, exactWaitUntilTime.Day,
exactWaitUntilTime.Hour, exactWaitUntilTime.Minute, pastMinSecs,
exactWaitUntilTime.Offset);
// We round to the closest minute boundry (Attention, adjust this condition if NOT (0 <= pastMinSecs < 30) !! )
if (exactWaitUntilTime.Second > (30 + pastMinSecs))
{
roundedWaitUntilTime = roundedWaitUntilTime.AddMinutes(1);
}
return roundedWaitUntilTime;
}
this.Flush(true); | ||
} | ||
|
||
internal SimpleMetricStatisticsAggregator GetStatisticsAggregator(Metric metric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Under what circumstances is the can the manager be disposed. GetStatisticsAggregator is called from the public Track in metric. Should Track ever throw? And if it may, then is this relationship obvious to the user?
Note that this method does not include any logic that would stop working once the manager is disposed...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. I do not think Track() should throw... In this particular case it is somewhat bad because one can create manager, create metric, dispose Manager, continue using Metric.Track() expecting things to happen...
On the other had throwing exceptions from monitoring code is considered bad...
Based on further consideration I think I will remove places where I throw "object disposed" exceptions. This will break scenario I spoke of just above, but I think not throwing is better...
/// Takes a snapshot of aggregators collected by this instance of the manager | ||
/// and schedules the next snapshot. | ||
/// </summary> | ||
private Task SnapshotAndReschedule() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having examined the DelayTask (I am not sure whether I agree with Oleg about using it, but he makes some good points), I think that running is underStartNew is no required. DelayTask calls Task.Delay and then runs the delegate synchronously in respect to the completion of the Delay. Since we are not scheduling the timer again until we finsih the snapshot, there is no reason for us to just threads here.
I would just do
return () =>
{
// all the shapshot stuff
return Task.FromResult(1);
};
@@ -26,6 +26,7 @@ public sealed class TelemetryConfiguration : IDisposable | |||
private string instrumentationKey = string.Empty; | |||
private bool disableTelemetry = false; | |||
private TelemetryProcessorChainBuilder builder; | |||
private SnapshottingList<IMetricProcessor> metricProcessors = new SnapshottingList<IMetricProcessor>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at this SnapshottingList thing. What invention is that. It seems to be intended to provide thread safety in cases where reads are frequent and writes are rare. It does it at the expense of 2x memory and still locking for writes (plus subsequent complete copy). Why not use concurrent collections or a reader/writer lock?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reusing what's already available. I do not like it very much either but do not intend to fix that particular problem in this effort.
this.Track(telemetry); | ||
} | ||
|
||
/// <summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to see a flat API here:
TrackAggregatedMetric(string metricName, double value);
TrackAggregatedMetric(string metricName, double value, params string[] dimensions);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was discussed multiple times and we decided not to have these apis in order to treat TelemetryClient as context.
This PR contains all proposed changes for AI SDK public surface to support metric aggregation.