Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instrumentation hooks and Prometheus metrics #299

Closed
wants to merge 1 commit into from

Conversation

mwitkow
Copy link
Contributor

@mwitkow mwitkow commented Aug 18, 2015

Implements a simple callback-based instrumentation hooks for gRPC. The choice of instrumentation is made through server.options and clinetconn.DialOption, leaving the user in full control. The default implementation is a No-Op, incurring no overhead.

The Prometheus implementation counts:

  • number of started RPCs (to be able to see RPCs in flight)
  • number of fully handled RPCs (to the app layer) with breakdown by statusCode, including latency measurements
  • number of erred RPCs in the RPC-layer
  • number of sent/received messages for streaming RPCS

The serverside screenshot of the metrics page of prometheus showign both streaming and unary rpcs is in the related bug: #240

Input needed:

  • naming for prometheus
  • testing?

@mwitkow mwitkow mentioned this pull request Aug 18, 2015
@@ -262,6 +276,14 @@ func (s *Server) processUnaryRPC(t transport.ServerTransport, stream *transport.
}
}()
}
monitor := s.opts.serverMonitor.NewServerMonitor(monitoring.Unary, stream.Method())
defer func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious: Have you benchmarked for throughput the overhead of the additional defer call on this procedure and elsewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I haven't. But the gRPC codebase seems to be full of defers. Should I put a if monitor.(type) == monitoring.NoOpMonitor?

@mwitkow
Copy link
Contributor Author

mwitkow commented Aug 19, 2015

Added the client side stream monitoring, but I am not entirely correct if I got all the edge cases. @iamqizhao can you take a look? Even if this is not gonna make it upstream, we still intend to use it internally :)

@pires
Copy link

pires commented Aug 20, 2015

This is awesome! Looking for it to get merged so I can implement a sink to InfluxDB.

"google.golang.org/grpc/codes"
)

type RpcType string
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be exported?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, see later ocmment.

@miekg
Copy link

miekg commented Aug 20, 2015

On the whole this looks good and something worth having IMHO.

@@ -111,6 +112,7 @@ func Invoke(ctx context.Context, method string, args, reply interface{}, cc *Cli
return toRPCErr(err)
}
}
monitor := cc.dopts.clientMonitor.NewClientMonitor(monitoring.Unary, method)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if cc.dopts.clientMonitor != nil {
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed below.

@iamqizhao
Copy link
Contributor

The general design looks good to me.

@mwitkow
Copy link
Contributor Author

mwitkow commented Sep 21, 2015

Glad to hear that right after coming back from vacation :)

Two things to discuss:

  • would you prefer the NoOpMonitor to be removed and substituted with if monitor != nil? There's hardly any allocation done per-RPC so the performance impact should be negligible, but the code is clearer.
  • would you prefer a collapsed ClientMonitor and ServerMonitor interfaces? Their signatures would be the same.

@mwitkow mwitkow force-pushed the monitoring_take_i branch 2 times, most recently from 5c83df7 to 3598371 Compare September 28, 2015 16:57
@mwitkow
Copy link
Contributor Author

mwitkow commented Sep 28, 2015

@iamqizhao I have refactored the PR to:

  • remove NewClientMonitor and NewServerMonitor to be unified and have the same signature
  • rebase on top of current master, which includes the Picker changes that broke

Could you PTAL? :)

@mwitkow
Copy link
Contributor Author

mwitkow commented Oct 2, 2015

@iamqizhao, I understand that you guys have tons of other work. It would be at least useful to know whether this PR is considered for acceptance?

We're thinking about relying on it for our SLO monitoring (using different error.Codes to differentiated between user faults and our system faults), and knowing whether this PR has a chance of being upstreamed would be incredibly useful.

@iamqizhao
Copy link
Contributor

yep, this PR can be accepted. But I need to check all the points you inserted the code and have not got time to do that. Sorry about the delay.

BTW, I know it is painful but can you sync your code to the latest?

@mwitkow mwitkow force-pushed the monitoring_take_i branch from 3598371 to 2bef982 Compare October 5, 2015 07:44
@mwitkow
Copy link
Contributor Author

mwitkow commented Oct 5, 2015

@iamqizhao Great to hear that. I have rebase over the latest master.

I'm wondering how to add unit tests that will make the monitoring still work correctly across major refactors such as the ones I'm currently rebasing on. Maybe retrofit some integration tests with monitoring counters?

@iamqizhao
Copy link
Contributor

On Mon, Oct 5, 2015 at 12:46 AM, Michal Witkowski [email protected]
wrote:

@iamqizhao https://github.com/iamqizhao Great to hear that. I have
rebase over the latest master.

FYI, I will consider merging this change after the WIP naming and load
balancing change is done. Sorry about that because this probably means you
need to do a couple of extra rebasing in the next a couple of weeks.

I'm wondering how to add unit tests that will make the monitoring still
work correctly across major refactors such as the ones I'm currently
rebasing on. Maybe retrofit some integration tests with monitoring counters?

I would suggest creating a dummy monitoring impl in the end2end test and
verify it works properly.


Reply to this email directly or view it on GitHub
#299 (comment).

@mwitkow
Copy link
Contributor Author

mwitkow commented Oct 5, 2015

That's ok :) LB and naming is something we're incredibly interested in (and willing to put some manpower behind DNS SRV implementation).

Can you provide some pointers regarding how to implement the things in the end2end test?

@mwitkow
Copy link
Contributor Author

mwitkow commented Nov 10, 2015

@iamqizhao, any updates here? we've been happily using this monitoring for the last 2 months in prod and are willing to help out getting it upstreamed :)

@mwitkow
Copy link
Contributor Author

mwitkow commented Feb 12, 2016

@iamqizhao, any updates on the metrics API? We've been happily using this in prod for a while now :)

@raliste
Copy link

raliste commented Mar 18, 2016

+1 for updates here

Actively using grpc at preyproject.com

@rolandshoemaker
Copy link
Contributor

@iamqizhao has this been depreciated by the server side interceptor currently being reviewed internally + proposals for client side interceptors?

The Boulder team at Let's Encrypt is currently working on moving away from our current RPC implementation in favor of gRPC but the lack of exposed metrics hooks is slowing us down somewhat (one of the reasons for moving to gRPC was to get rid of a bunch of non-CA code we had to maintain, including various hacks to collect client and server side metrics which we'd rather not re-implement).

We'd prefer to use something native to grpc-go rather than writing another set of wrappers/using additional code generators and are wondering if there is a prospect of this being merged or if the other proposals have superseded it (and in either case if you have a ETA, even if it is likely to shift).

@iamqizhao
Copy link
Contributor

The ETA is by the end of this week or early next week. Sorry about the
delay. The internal review took much longer than I expected.

On Wed, Apr 13, 2016 at 3:33 PM, Roland Bracewell Shoemaker <
[email protected]> wrote:

@iamqizhao https://github.com/iamqizhao has this been depreciated by
the server side interceptor currently being reviewed internally + proposals
for client side interceptors?

The Boulder team at Let's Encrypt is currently working on moving away from
our current RPC implementation in favor of gRPC but the lack of exposed
metrics hooks is slowing us down somewhat (one of the reasons for moving to
gRPC was to get rid of a bunch of non-CA code we had to maintain, including
various hacks to collect client and server side metrics which we'd rather
not re-implement).

We'd prefer to use something native to grpc-go rather than writing
another set of wrappers/using additional code generators and are wondering
if there is a prospect of this being merged or if the other proposals have
superseded it (and in either case if you have a ETA, even if it is likely
to shift).


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#299 (comment)

@mwitkow
Copy link
Contributor Author

mwitkow commented May 13, 2016

I moved this implementation into a server-side interceptor under:
https://github.com/mwitkow/go-grpc-prometheus

@mwitkow mwitkow closed this May 13, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants