Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to TransactionManager #3677

Merged
merged 4 commits into from
Nov 29, 2017
Merged

Conversation

xiazen
Copy link
Contributor

@xiazen xiazen commented Nov 17, 2017

Added some metrics which I think will help us understand transaction more.
Below is some thinking on how we can use those metrics:

  • TransactionManager.StartTransaction.TPS combined with TransactionManager.CommitTransaction.TPS can give us some insights on how slow is prepare phase, and also whether TM is at its peak capacity ATM or not.

  • AbortedTransactionTPS , PendingTransactionTPS , ValidatedTransactionTPS and AbortedTransactionDueToDependencyTPS, AbortedTransactionDueToMissingInfoInTransactionTableTPS , can give us some insights on how TM handles complicated, harsh situation, such as lots of dependent transaction in the system. And also, what's the reason for transaction to be aborted.

@xiazen xiazen requested a review from ReubenBond November 17, 2017 22:02

namespace Orleans.Transactions
{
internal class TransactionManagerMetrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sense some duplicated code with TransactionAgentMetrics. Any reason to not have a common implementation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The structure is similar, but metrics are different. They can share a common base class. I can improve that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I look into this direction, and have some more thoughts . These two metrics are similar in structure because they all use PeriodicAction to trigger metric recording, which is already made modular to increase code reuse. Logic other than PeriodicAction class are only similar in terms of function name, the actual logic in the function is different for each metric class. So not much upside on making a common implementation.

And to make a common implementation, we need to have a public class in Orleans runtime. I'm wary on this one, since I don't want to signal our users this is the recommended way to produce metrics. There's many many ways, and it is up to them on which way they choose to use.


namespace Orleans.Transactions
{
internal class TransactionManagerMetrics
{
private const string StartTransactionRequestTPS = "TransactionManager.StartTransaction.TPS";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you followed what was done in TransactionAgentMetrics, could these string values be useful for the developer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agent is one per silo, this TransactionManager is a grain. Agent send batch request to TM to start transactions, and commit transactions. So even the metric name maybe similar as the one in AgentMetrics, I think they are measuring different metrics. It isn't based on AgentMetrics.

I think it will be useful to developer. For example, when we see transaction are not being processed fast enough, we can look at TM metrics to see what's the TPS handling here, if at peak, maybe the bottle neck is TM. Otherwise, the bottle neck is else where.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it they are useful for the dev, should we put these as public somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see, this is similar issue with all of our internal metrics, there's no easy way to figure out what metric string are reporting. We can have a Metrics.cs file listing all the metrics in each assembly.

//TPS for validated transaction in the monitor window
private const string ValidatedTransactionTPS = "Transaction.Validated.TPS";

internal long StartTransactionRequestCounter { get; set; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are you accessing long values directly instead of exposing (safer?) increment method for performance reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TransactionManager class is used only in TransactionManagerGrain, so it is single threaded. hence I wasn't concerned by threading issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These either need to be fields which are incremented via Interlocked.Increment or we need to guard them via a lock. The TransactionManager class might be misleading: it sits in a folder called InClusterTM, but it's actually used by both grain & external TM. ++ wont give us accurate results - it might in some cases spike the result by skipping the reset to 0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not concerned about threading issues here, it just feel not natural to deal directly with long here. I think having IncrementXXX method would be a better design. This way someone cannot decrement the values...

@@ -9,6 +9,8 @@ namespace Orleans.TestingHost.Utils
/// </summary>
public class NullTelemetryProducer : ITelemetryProducer
{
public static NullTelemetryProducer Instance = new NullTelemetryProducer();
Copy link
Member

@ReubenBond ReubenBond Nov 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this a get-only property

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And by extension, make the constructor private :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I thought of that. But then I looked at example at NullLoggerFactory, it has an Instance property, but it also has a public constructor. I guess it will be nice to have the option to construct one.

I don't have strong option either way, just mention my thinking behind this. And if anyone know the benefits of have a public constructor in this case.


namespace Orleans.Transactions
{
internal class TransactionManagerMetrics
{
private const string StartTransactionRequestTPS = "TransactionManager.StartTransaction.TPS";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind-of prefer PerSecond instead of TPS. TPS might be a bit confusing


namespace Orleans.Transactions
{
internal class TransactionManagerMetrics
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer this has its own file

return;
var now = DateTime.Now;
var timeSinceLastReportInSeconds = Math.Max(1, (now - this.lastReportTime).TotalSeconds);
var startTransactionTps = StartTransactionRequestCounter / timeSinceLastReportInSeconds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no way to do this without all the code? I would have thought that we have some kind of 'per time' metric. Not pushing back on it, but it seems easy to mess up with all the duplicated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our internal per time metrics/statistic system has bugs in it. I prefer not use it if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this class is in Orleans.Transaction project. And those metric/statictis system are internal to Orleans core, if I recall correctly.

var validatedTransactionTPS = ValidatedTransactionCounter / timeSinceLastReportInSeconds;
this.telemetryProducer.TrackMetric(ValidatedTransactionTPS, validatedTransactionTPS);

var abortedTransactionTPS = AbortedTransactionCounter / timeSinceLastReportInSeconds;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To do this accurately (I'm not sure if that's a concern for us), we need to read the field using Interlocked.Exchange to simultaneously read the value and reset it to 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this class is only used in grain based TM, so it is single threaded. I assume I don't need to handle threading issue there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this only used in grain based TM because the external TM isn't merged? I think we have only one TransactionManager class which is wrapped either in a grain or not depending on where it's being hosted. Is that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's only one grain based TM in Orleans master I think. The external TM is not merged in, and I think it may/will not be in the future either AFAIK.

Yes it is only used in the grain based TM.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only one grain-based TM, but this is not a grain-based TM. This component is shared between both implementations - that's why all of the collections are concurrent collections

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, there's only one TM, which is the grain based one. And this component is only used the the grain based one. If it will be used in another TM which is not single threaded in the future, let's take care of that in the future. But I don't see why we should take that unknown future into account now.

All the collection are concurrent collection probably because of the old external TM. But that TM is not merged in, and I doubt it will ever be. Hence my conclusion on it is safe to not use Interlocked.Exchange in this component.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem - as long as @jason-bragg is ok with it. We can eventually migrate to non-concurrent collections and improve perf (a large part of the memory overhead was in ConcurrentDictionary/ConcurrentQueue maintenance)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jason-bragg this maybe lost in diff updates. But here is the place we need your confirmation too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no plan at this time to pull in the external TM. The next major architectural change will be to move to a distributed TM, not an external TM. As we clean up this TM, the design decisions made due to multi-threaded access will be removed. Interlocked.Exchange should not be necessary.

this.lastReportTime = lastReportTime;
this.StartTransactionRequestCounter = 0;
this.AbortTransactionRequestCounter = 0;
this.CommitTransactionRequestCounter = 0;
Copy link
Member

@ReubenBond ReubenBond Nov 18, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite understand why these 3 properties aren't past tense while the properties below are past tense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because for example, CommitTransactionRequest is a word, it is the counter for requests.

AbortedTransaction is a word, it is the counter for transaction which are aborted. Does this make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't AbortTransactionRequestCounter always the same as AbortedTransactionCounter, since they're always at the same time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, they are counting the same thing. It is clear to you because you read the code. But think in the aspect of people who use this metric. They would categorize them into two category, request counter related metric and transaction status related metrics. It may not be clear to them in the first sight, without reading the code, that AbortedTransactionRequestCOunter, and AbortedTransactionCounter is the same thing.

I maybe over-thinking on this, but just share my thought on this.

public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, TimeSpan? logMaintenanceInterval = null)
private TransactionManagerMetrics metrics;
public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, ITelemetryProducer telemetryProducer,
Factory<NodeConfiguration> getNodeConfig, TimeSpan? logMaintenanceInterval = null)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably out of scope for this PR, but should logMaintenanceInterval be a part of TransactionsConfiguration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah you are probably right, not sure why it isn't now. @jason-bragg for more opinion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding logMaintenanceInterval, unclear. I don't, at this time, think it should be configurable.

@xiazen
Copy link
Contributor Author

xiazen commented Nov 21, 2017

@dotnet-bot test functional please

public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, TimeSpan? logMaintenanceInterval = null)
private TransactionManagerMetrics metrics;
public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, ITelemetryProducer telemetryProducer,
Factory<NodeConfiguration> getNodeConfig, TimeSpan? logMaintenanceInterval = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding logMaintenanceInterval, unclear. I don't, at this time, think it should be configurable.

public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, TimeSpan? logMaintenanceInterval = null)
private TransactionManagerMetrics metrics;
public TransactionManager(TransactionLog transactionLog, IOptions<TransactionsConfiguration> configOption, ILoggerFactory loggerFactory, ITelemetryProducer telemetryProducer,
Factory<NodeConfiguration> getNodeConfig, TimeSpan? logMaintenanceInterval = null)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest getting IStatisticsConfiguration rather than NodeConfig. We only need the statistics configuration. This may require registering the NodeConfiguration as an IStatisticsConfiguration in DI. Also, not clear why getNodeConfig is a factory function?

Copy link
Contributor Author

@xiazen xiazen Nov 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would agree if we were not moving toward to option configuration. In my recollection, there should be a StatisticsOption , or some sort , as the new recommended way to configure statistics. And when that happens, this factory func will be replaced by an IOption<'StatisticOption'> and injected from DI.

I don't think we should add more reference of old legacy config interfaces to Orleans core now.

@sergeybykov sergeybykov added this to the 2.0.0 milestone Nov 28, 2017
@jason-bragg jason-bragg merged commit 7d00d5b into dotnet:master Nov 29, 2017
@xiazen xiazen deleted the tm-metrics branch March 1, 2018 01:44
@github-actions github-actions bot locked and limited conversation to collaborators Dec 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants