Instrumentation hooks #240

Closed
mwitkow opened this issue Jul 8, 2015 · 45 comments · Fixed by #867

Comments

@mwitkow
Contributor

mwitkow commented Jul 8, 2015

We're currently experimenting with GRPC and wondering how we'll monitor the client code/server code dispatch using Prometheus metrics (should look familiar ;)

I've been looking for a place in grpc-go to be able to hook up gathering of ServiceName, MethodName, bytes, and latency data, and found none.

Reading the thread in #131 about RPC interceptors, the suggestion is to add the instrumentation in our Application Code (a.k.a. the code implementing the auto-generated Proto interfaces). I see the point about not cluttering the grpc-go implementation and being implementation agnostic.

However, adding instrumentation into Application Code means that either we need to:
a) add a lot of repeatable code inside Application Code to handle instrumentation
b) use the callFoo pattern proposed in #131 [only applicable to Client]
c) add a thin implementation of each Proto-generated interface that wraps the "real" Proto-generated method calls with metrics [only applicable to Client]

There are downsides to each solution though:
a) leads to a lot of clutter and errors related to copy pasting, and some of these will be omitted or badly done
b) means that we lose the best (IMHO) feature of Proto-generated interfaces: the "natural" syntax that allows for easy mocking in unit tests (through injection of the Proto-generated Interface), and is only applicable on the Client-side
c) is very tedious because each time we re-generate the Proto (add a method or a service) we need to go and manually copy paste some boilerplate. This would be a huge drag on our coding workflow, since we really want to rely on Proto-generated code as much as possible. And it is also only applicable on the Client-side.

I think the cleanest solution would be a pluggable set of callbacks on pre-call/post-call on client/server that would grant access to ServiceName, MethodName and RpcContext (provided the latter exposes stats about bytes transferred/start time of the call). This would allow people to plug in an instrumentation mechanism of their choice (statsd, grafana, Prometheus), and shouldn't have the performance impact that the interceptors described in #131 could have had (the double serialization/deserialization).
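
A rough sketch of what such pre-call/post-call callbacks could look like; none of these names exist in grpc-go, they only illustrate the proposal:

    package instrumentation

    import "time"

    // RPCStats is the data a hook would receive for each call.
    type RPCStats struct {
        ServiceName   string
        MethodName    string
        BytesSent     int
        BytesReceived int
        StartTime     time.Time
    }

    // Hooks would be registered once per client or server via an option.
    type Hooks interface {
        // PreCall runs before the RPC is dispatched.
        PreCall(stats *RPCStats)
        // PostCall runs after the RPC completes, with its final error and duration.
        PostCall(stats *RPCStats, err error, elapsed time.Duration)
    }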

Having seen how amazingly useful RPC instrumentation was inside Google, I'm sure you've been thinking about solving this in gRPC, and I'm curious to know what you're planning :)

@iamqizhao
Contributor

We use https://godoc.org/golang.org/x/net/trace to gather these stats. We already have the minimal instrumentation for both unary and streaming RPCs on the client side. Can you take a look at whether it meets your requirements? If not, we can probably wrap it to allow a custom implementation.
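
A minimal, hedged example of the golang.org/x/net/trace package mentioned above; the family/title strings and port are placeholders:

    package main

    import (
        "log"
        "net/http"
        "time"

        "golang.org/x/net/trace"
    )

    func handleRPC() {
        // family groups related traces; title identifies this one.
        tr := trace.New("Sent.MyService", "/my.Service/Method")
        defer tr.Finish()

        tr.LazyPrintf("sending request, %d bytes", 128)
        time.Sleep(10 * time.Millisecond) // stand-in for the actual RPC
        tr.LazyPrintf("received response")
    }

    func main() {
        handleRPC()
        // The trace package registers /debug/requests and /debug/events on
        // http.DefaultServeMux, so the trace above becomes visible there.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }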

@mwitkow
Contributor Author

mwitkow commented Jul 9, 2015

Trace looks amazing! I'm really happy that grpc-go will have RPC debug pages straight out of the box :)

However, the trace.Trace interface doesn't expose access to the Family, Title, Start, or Elapsed fields of the private trace.trace struct. Nor does it have the ability to add callbacks on NewTrace and Trace.Finish() to increment/decrement relevant metric counters.

gRPC already has a traceInfo struct, which is used in call.go (google.golang.org/grpc/call.go:121):

    if EnableTracing {
        c.traceInfo.tr = trace.New("Sent."+methodFamily(method), method)
        defer c.traceInfo.tr.Finish()

However, if you were wrapping trace, it would mean that the actual trace.trace data would be duplicated.

I think it may be better to extend trace package to have such callbacks.

@mwitkow
Contributor Author

mwitkow commented Jul 9, 2015

Also, it seems that traceInfo is currently only rolled out on the Client side (call.go, stream.go) but not on the Server. Any idea where that could be plugged in? :)

@dsymonds
Contributor

The trace package is not the place for hooks. It is the reporting mechanism.

I think #131 hashed out the stance on hooks/interceptors in grpc-go. We are very nervous about permitting hooks because a badly behaved hook has so much ability to drag things down. I would advocate you start with your option (a) for now.

(If anyone's at GopherCon, I'm happy to have a sit down and chat about the options tomorrow.)

@juliusv

juliusv commented Jul 10, 2015

@dsymonds I'm at GopherCon and I'm going to be at the Prometheus hackday meetup at 14:00 in the CoreOS room along with a bunch of other Prometheans and friends (or happy to chat elsewhere). I haven't used grpc yet, but it could be interesting to talk about this more. I wonder if @peterbourgon from https://github.com/go-kit/kit has thoughts about this as well.

I guess this is mostly a matter of judgement. I would also really like to see instrumentation options that are less cumbersome than option a). For hooks I think it's a fair expectation that users have to write well-behaved hooks or suffer the consequences. At least that seems like pretty straightforward, non-surprising behavior. But I also appreciate being conservative about what features to add, and being really careful about not giving users any rope to hang themselves with.

@peterbourgon

Please ping me if you sit down together. I'd like to at least listen :)

@xiang90
Contributor

xiang90 commented Jul 10, 2015

@mwitkow-io @juliusv @dsymonds

Here is the result of #131: we use Invoke directly, bypassing the generated wrapper on the client side, in order to add our own hooks and intercept the RPC calls.
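
A hedged sketch of that "call Invoke directly" pattern; the method path, message types, and logging hook are placeholders, and grpc.Invoke is the package-level call grpc-go exposed for this:

    package client

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc"
    )

    // callFoo bypasses the generated stub so the RPC can be wrapped with
    // custom instrumentation or other hooks.
    func callFoo(ctx context.Context, cc *grpc.ClientConn, req, reply interface{}) error {
        start := time.Now()
        err := grpc.Invoke(ctx, "/pkg.FooService/Foo", req, reply, cc)
        log.Printf("Foo took %s, err=%v", time.Since(start), err) // hook point
        return err
    }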

@dsymonds
Contributor

I'm a little tied to the Google room, but maybe we can catch up later in the afternoon?

@juliusv

juliusv commented Jul 10, 2015

@dsymonds @peterbourgon Sounds good - best channel to reach you (I know Twitter works for Peter)? Want to stop spamming the issue here :)

@dsymonds
Contributor

Email me at dsymonds at golang.org.

@juliusv

juliusv commented Jul 10, 2015

@dsymonds ack, thanks!

@juliusv

juliusv commented Jul 12, 2015

So we talked about this in an illustrious round of people (@dsymonds, @peterbourgon, multiple Prometheus people). A brief summary:

  • @dsymonds' concerns about hooks still hold. His worry is that misbehaving hooks will cause bad and surprising behavior. Given a mix of wrappers and frameworks, one also cannot expect that a hook would always be supplied by the final grpc user, making the problem worse. Further, such a hook would require quite a broad interface (offering all kinds of metadata) so that it would cover many use cases, not just metrics.
  • Another alternative mentioned was to use an interceptor pattern (https://en.wikipedia.org/wiki/Interceptor_pattern), but that would suffer some of the same downsides.
  • It seemed @dsymonds was most open to an approach where metrics would be counted in a very primitive fashion by grpc itself, and then asynchronously (every couple of milliseconds or so) sent to a hook, in contrast to calling a hook in a blocking fashion whenever a request is made. This would require grpc to at least track request counters and duration sums, split along relevant dimensions. The upside would be that this way grpc could guarantee that the critical path only does very simple calculations. The downside is that everyone agreed that implementing anything more sophisticated than counters (like histograms or summaries) in grpc itself would be too complex. Further, this approach is pretty specific to metrics, and not really usable as a more general hook. The question arose whether such an approach would be useful enough at all, or whether the average grpc user would then still want to wrap each call to get e.g. histograms.

If someone desires the latter approach, they should write a design document detailing the exact counters and dimensions needed, as well as what the asynchronous hook interface and backoff behavior should look like.

Overall, it seemed like nobody was really happy with any of the approaches, so it looks like this is going to be a situation where you can just focus on choosing the least bad solution...
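
For illustration, a hypothetical sketch of the "count internally, publish asynchronously" idea from the third bullet above; none of these names exist in grpc-go:

    package asyncstats

    import (
        "sync"
        "time"
    )

    type key struct{ Service, Method string }

    type counters struct {
        Calls       uint64
        DurationSum time.Duration
    }

    // Collector keeps only counters and duration sums on the hot path and
    // hands snapshots to a user hook on a fixed interval.
    type Collector struct {
        mu   sync.Mutex
        data map[key]counters
    }

    func NewCollector(flushEvery time.Duration, publish func(map[key]counters)) *Collector {
        c := &Collector{data: make(map[key]counters)}
        go func() {
            for range time.Tick(flushEvery) {
                c.mu.Lock()
                snapshot := c.data
                c.data = make(map[key]counters)
                c.mu.Unlock()
                publish(snapshot) // the user hook runs off the critical path
            }
        }()
        return c
    }

    // Record is the only call that would sit on the RPC critical path.
    func (c *Collector) Record(service, method string, d time.Duration) {
        c.mu.Lock()
        k := key{service, method}
        cnt := c.data[k]
        cnt.Calls++
        cnt.DurationSum += d
        c.data[k] = cnt
        c.mu.Unlock()
    }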

@xiang90
Contributor

xiang90 commented Jul 12, 2015

@juliusv

His worry is that misbehaving hooks will cause bad and surprising behavior.

Why can't we assume the user will do the correct thing?

It seemed @dsymonds was most open to an approach where metrics would be counted in a very primitive fashion by grpc itself

Won't this approach actually complicate gRPC the most? Wouldn't gRPC itself then have to implement and maintain all of the metrics? This sounds even more difficult than agreeing on a hook interface.

@juliusv

juliusv commented Jul 13, 2015

@xiang90 As I understood it, the worry was that the end user of grpc might not even be the one supplying the hook, such as when using grpc as part of a framework or similar. While the approach where grpc itself maintains counters is more complex, it would ensure that only predictable, fast calculations would happen in the hot path.

FWIW, personally I agree that a normal hook interface would be the best (least bad) solution. I can see the above-mentioned downsides of it, but also think the other approaches are even nastier.

@Sajmani
Contributor

Sajmani commented Aug 14, 2015

Perhaps we should consider a lossy subscription interface; something like:
    package grpc

    // Subscribe arranges for f to be called with RPC trace records.
    // Records are dropped when the total size of outstanding TraceRecords
    // for this subscription exceeds bufSize. The total number of dropped
    // TraceRecords is reported in f's dropped parameter.
    func Subscribe(bufSize int, f func(rec *TraceRecord, dropped int))

    type TraceRecord struct {
        Service, Method string
        ID              string        // unique RPC identifier
        Bytes           int           // size of this message
        Request         bool          // Request is client->server, Response is server->client
        Index           int           // zero for unary RPCs, sequence number for streaming RPCs
        Elapsed         time.Duration // since start of RPC
    }

The implementation of grpc.Subscribe can avoid blocking on slow subscribers by buffering up to bufSize and then dropping records. How the queuing is implemented is up to grpc, which keeps the critical path under the package's control (in particular, we will eventually want to be able to mitigate lock contention on multiprocessors under very high load). This externalizes all the metric aggregation and reporting; we may even want to move the net/trace integration behind this API.
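
A hedged sketch of how that non-blocking, bounded-buffer behavior could be implemented (drop-newest policy shown); this is illustrative only, not a grpc API:

    package pubsub

    import "sync/atomic"

    type TraceRecord struct {
        Service, Method string
        // ... remaining fields as in the proposal above
    }

    // Subscription fans records out to one subscriber without ever blocking
    // the publisher; when the buffer is full, new records are dropped and counted.
    type Subscription struct {
        ch      chan *TraceRecord
        dropped int64
    }

    func Subscribe(bufSize int, f func(rec *TraceRecord, dropped int)) *Subscription {
        s := &Subscription{ch: make(chan *TraceRecord, bufSize)}
        go func() {
            for rec := range s.ch {
                f(rec, int(atomic.SwapInt64(&s.dropped, 0)))
            }
        }()
        return s
    }

    // Publish is what the RPC critical path would call.
    func (s *Subscription) Publish(rec *TraceRecord) {
        select {
        case s.ch <- rec:
        default: // buffer full: drop the newest record and remember the drop
            atomic.AddInt64(&s.dropped, 1)
        }
    }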

The main question I have is whether lossy reporting is OK for the kinds of metric aggregation and reporting that applications require. This API is not suitable for critical accounting (e.g., for billing), but it should be sufficient for monitoring and debugging.

A secondary question is what policy to use when dropping records: drop oldest, drop newest, drop random, something else. We don't want to get into the business of pushing filters into Subscribe; all filtering and aggregation should happen outside grpc.

Also: the only piece of this that's grpc-specific is TraceRecord. The rest could be provided by a separate Go package for building in-process publish-subscribe networks.

@mwitkow
Contributor Author

mwitkow commented Aug 18, 2015

So, I decided to take a first stab at this in #299

gRPC seems to allow users to override the marshalling/unmarshalling codec with pretty much whatever they want. As such, users can already significantly impact the performance of gRPC by deciding to use something wonky.

In the PR, I use a simple callback-based approach that leaves a lot of flexibility to implement the monitoring as the user wishes. The choice of which implementation is used is made through server.options or client.DialOpts.

An example server-side implementation is provided for Prometheus, modeled after the level of instrumentation one can find in Google servers by default. Here's an example of a varz-style metrics page with streaming and unary server-side instrumentation:

[screenshot: grpc_monitoring_server_side metrics page]

@matttproud
Member

@Sajmani, speaking from the perspective of internal, first-party use, I would want the fidelity of a lossless API (I am sure plenty of others would agree if you put the question to them). Most of the discussions heretofore around service level indicators (SLIs) have not had the luxury of speaking in terms of reporting error; they just assume perfect measurement, for better or worse.

@iamqizhao
Contributor

@Sajmani @matttproud As long as bufSize is configurable, I do not see a problem here. The user can configure a large bufSize if they need lossless data and the machine can afford it.

@mwitkow
Contributor Author

mwitkow commented Aug 19, 2015

@iamqizhao even if bufSize is configurable, as far as I understand the proposal here, it would control publication for both Tracing and Monitoring. As such, if I wanted full fidelity for Monitoring (which should be lightweight: incrementing a counter), I would also get lossless Tracing (which can be fairly costly: writes to disk or similar). Alternatively, I could disable Tracing altogether to retain Monitoring.

I tend to agree with @matttproud regarding lossless monitoring. As an SRE or a systems operator, I want to have full confidence that the counter I'm observing has been incremented on every RPC. If it were lossy, I would probably want the monitoring to export a counter saying how many values were dropped... but that breeds a chicken-and-egg problem.

@iamqizhao
Contributor

My understanding is that tracing and monitoring can be completely separated, with different bufSize values configured. Sameer just proposed the API; I do not see why tracing and monitoring have to be bound together.

+1 for lossless monitoring. But keep in mind that there is a physical limit (the memory size of the machine) anyway. You can set bufSize for monitoring to infinite, but it will OOM if the monitoring data exceeds that limit.

@mwitkow
Contributor Author

mwitkow commented Sep 5, 2015

@iamqizhao, what's your take on the change proposed in #299?

I know @matttproud commented on it, and @pires said that API would be great for adding InfluxDB support as well (another popular monitoring framework).

Given that https://groups.google.com/d/msg/grpc-io/K8Ar3wom5CM/ushngUvb5GYJ suggests that third party monitoring support is on the books, can we get some traction on the subject?

Or if the PR doesn't fit your needs, can you at least comment on whether the edge cases of client-side monitoring are covered in it correctly? We're already using it, and I would sleep better knowing that it's correct.

@iamqizhao
Contributor

@mwitkow-io

Sorry for the delay. I got caught up in some tight deadlines on other work these past two weeks. I will take a close look early next week.

@therc
Contributor

therc commented Nov 11, 2015

For the record, the C++ tree has early code for a Census implementation: https://github.com/grpc/grpc/tree/master/src/core/census

@inconshreveable

Just an additional perspective here: grpc-go needs an interceptor/hook interface. Its absence makes grpc-go a complete non-starter for production applications.

Exposing metrics is not a reasonable compromise. The metrics important to me will be different from others', and the implementation will undoubtedly be unable to satisfy everyone. There are other hooks unrelated to metrics that would be important to add, like logging, authentication, panic reporting, etc.

Of those I know using grpc-go in production, one of the following is true:

  1. They have been forced to fork GRPC to work around this limitation, ultimately fracturing development effort and making them less likely to upgrade.
  2. They have been forced into code generation, polluting their code base with a wrapper for every service interface that injects logic into every call. This adds an additional build step, plus the complexity and sophistication of writing such a tool.

While I sympathize with @dsymonds's worry about misbehaving middleware, it is unfounded. It is akin to arguing that there should be no http.Handler interface in net/http because a misbehaving middleware could cause bad and surprising behavior. They certainly can, but most don't, and they add a tremendous amount of value and an important ecosystem.

@purpleKarrot

Well written, @inconshreveable. I agree with every single point. I need an interceptor/hook interface for logging, authentication, and panic reporting. I already thought about writing a custom code generator...

@alecthomas
Contributor

Yes, we are currently going down the second path (code generation) and it's really not ideal.

@iamqizhao
Contributor

Allow HTTP-level server-side authentication (e.g., oauth2) and some sort of Before/After hooks.

@zellyn
Contributor

zellyn commented Feb 2, 2016

Context: at Square, we are converting from our in-house Stubby-alike ("Sake") to GRPC. We're starting with just Server-side, since it takes little effort to also serve GRPC. In fact, Sake and GRPC method signatures are alike enough that I was able to refactor Sake into the same interface as GRPC in one big, ugly, but mostly mechanical set of commits. Our methods now take context.Context as the first arg, instead of a SakeContext, but they still expect a Sake context embedded in the context.Context.

We need the following abilities in server-side per-call interceptors:

  • the ability to add things to the context, so interceptors need to return a context.
    • our existing code (for now) still expects a SakeContext embedded in the Context. We plan to remove this once we switch completely to GRPC.
    • traceID etc. for our Dapper-alike
  • the ability to access the certs used for authentication: we do per-method ACLs, storing usernames in the OrganizationalUnit in our TLS certs.
  • the ability to retrieve the ServiceName/MethodName of the current call (mostly used in logging code, but a few code paths depend on it, e.g. for having two versions of the same functionality, one of which can skip some permission checks because it's only available internally)
  • the ability to abort the call and return
    • we do validation of some proto fields based on custom proto properties, using ServiceName/MethodName to look up descriptors. But we need to be able to return an error message if validation fails.

In server-side per-connection interceptors, we probably just need to be able to perform extra validation, and have access to the certs.

Our client side is where most of the complexity lives, and I haven't yet had time to figure out exactly how it maps to GRPC: we read custom properties from the proto service method descriptors to figure out which calls are idempotent, perform automatic retries, and interact with our Global Naming Service (a sort of gslb/gns combo) to do discovery and geographic failover on retries. At a minimum, for now, I know we intercept all calls and read custom proto properties to set default timeouts.

Apologies: this is neither as clear nor succinct as I would wish, but I'm pressed for time right now. Happy to elaborate.
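
A hedged sketch of how the server-side needs in the list above (adding to the context, reading the peer's TLS certs, checking the method name, aborting with an error) map onto the unary server interceptor signature grpc-go later adopted; the context key and ACL check are placeholders:

    package main

    import (
        "context"

        "google.golang.org/grpc"
        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/credentials"
        "google.golang.org/grpc/peer"
        "google.golang.org/grpc/status"
    )

    type traceIDKey struct{}

    // allowed is a placeholder per-method ACL check.
    func allowed(fullMethod string) bool { return true }

    func authInterceptor(ctx context.Context, req interface{},
        info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {

        // Access the certs used for authentication (e.g. read the OrganizationalUnit).
        if p, ok := peer.FromContext(ctx); ok {
            if tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo); ok {
                _ = tlsInfo.State.PeerCertificates
            }
        }

        // info.FullMethod carries "/package.Service/Method" for logging and ACLs.
        if !allowed(info.FullMethod) {
            // Abort the call and return an error to the client.
            return nil, status.Error(codes.PermissionDenied, "not allowed")
        }

        // Add things to the context for downstream handler code.
        ctx = context.WithValue(ctx, traceIDKey{}, "example-trace-id")
        return handler(ctx, req)
    }

    func main() {
        _ = grpc.NewServer(grpc.UnaryInterceptor(authInterceptor))
    }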

@iamqizhao
Contributor

@zellyn

Are they unary RPCs or streaming RPCs? If they are streaming RPCs, do you need to intercept every operation (message read and write) for an RPC on the server side?

@zellyn
Contributor

zellyn commented Feb 2, 2016

Right now, we have no equivalent of streaming RPCs (although we have use cases where we'd love to have them), so we haven't thought as much about that side of things.

@jhump can probably add more thoughts, but I'd say the most important stuff would be at the start of the call. The only things that spring to mind as being per-message are validation and setting Dapper-ish span IDs. I'm assuming the initial context modifications would be preserved.

@jhump
Member

jhump commented Feb 2, 2016

Agreed that the most important part is intercepting the start of a call. However, certain metrics (like total request and response message bytes, or histograms) require seeing each message in a stream. Validation of messages would also require seeing each piece of a stream.

It would be great to have something like the interceptor interfaces in the Java implementation. Although those interfaces are also lacking (as of this writing) in that, for authentication and quota enforcement, we also need access to the client IP address and authenticated identity (e.g. X509 principal).
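
A sketch of per-message interception on the server side by wrapping grpc.ServerStream inside the stream interceptor API that grpc-go later added; the counting here is purely illustrative:

    package main

    import "google.golang.org/grpc"

    // countingStream wraps grpc.ServerStream so every message on the stream
    // can be observed (counted, validated, measured, ...).
    type countingStream struct {
        grpc.ServerStream
        sent, received int
    }

    func (s *countingStream) SendMsg(m interface{}) error {
        s.sent++
        return s.ServerStream.SendMsg(m)
    }

    func (s *countingStream) RecvMsg(m interface{}) error {
        if err := s.ServerStream.RecvMsg(m); err != nil {
            return err
        }
        s.received++
        return nil
    }

    func streamInterceptor(srv interface{}, ss grpc.ServerStream,
        info *grpc.StreamServerInfo, handler grpc.StreamHandler) error {
        wrapped := &countingStream{ServerStream: ss}
        err := handler(srv, wrapped)
        // wrapped.sent and wrapped.received could be exported as metrics here,
        // keyed by info.FullMethod.
        return err
    }

    func main() {
        _ = grpc.NewServer(grpc.StreamInterceptor(streamInterceptor))
    }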

@zellyn
Contributor

zellyn commented Feb 2, 2016

Indeed, it would be lovely if the Go and Java (and Ruby, and ...) versions used similar terminology, although there's a tension with keeping the Go code idiomatic.

@sasha-s

sasha-s commented Feb 20, 2016

In the meantime we are using https://github.com/sasha-s/grpc-instrument for server-side latency instrumentation.

@willtrking

I wrote a boilerplate code generator which leverages go generate to easily implement the interceptor/middleware pattern on gRPC service calls. It's been of great help to me in my current project, so I figured other people might find it useful.

https://github.com/willtrking/grpcintercept

@xiang90
Contributor

xiang90 commented Mar 29, 2016

@iamqizhao mentioned that there will be an official API soon.

@willtrking

My method allows you to do things other than metrics/logging. Generally, it allows you to initialize new objects (database sessions, custom auth, generalized validation, etc.) on each incoming gRPC service call, and makes them available in your gRPC service logic through a single additional input parameter.

Unless my sleep-deprived brain is missing something and things beyond metrics will be supported?

@stevvooe
Contributor

stevvooe commented Apr 5, 2016

Rather than considering each use case individually, it seems more prudent to create a flexible abstraction that can handle cross-cutting concerns. Couldn't a lot of these use cases be covered by an Invoker interface, as proposed in #239 (comment) and https://docs.google.com/document/d/1weUMpVfXO2isThsbHU8_AWTjUetHdoFe6ziW0n5ukVg (thanks, @zellyn)?

@iamqizhao
Contributor

@stevvooe, I am pretty sure we can address all of your concerns about LB. But to be clear, your Invoker interface is a no-go because it breaks a lot of user-facing API. Can you hold on for a while for our new proposal?

For the issue here, the server-side interceptor is under internal code review and the client-side interceptor has not been designed yet. https://docs.google.com/document/d/1weUMpVfXO2isThsbHU8_AWTjUetHdoFe6ziW0n5ukVg is one proposal, from zellyn.

@rlmcpherson

Is there any update or ETA for client-side interceptors? We need this as well at Yik Yak, for the reasons @zellyn described above and in his proposal document. Being able to instrument client calls is especially helpful and would also solve the problem with instrumenting the Bigtable client described in googleapis/google-cloud-go#270, if the client-side interceptor is a grpc.DialOption.

@iamqizhao
Contributor

We need to sort out some pre-GA issues, including some API-breaking changes, first. After that, the client-side interceptor will be my priority. Because the GA date is still a bit murky, I cannot provide an ETA for the client interceptor now. But I do hope to get it done (at least the design) this month.

@inconshreveable

inconshreveable commented Jun 1, 2016

Looking forward to client-side interceptors as well for two major use cases:

  1. Automatically unmarshalling rich error objects from trailer metadata and overriding the returned error to the caller.
  2. Tracing instrumentation and logging/error reporting

@NeoCN

NeoCN commented Jul 14, 2016

Is there any update on client-side interceptors? Without client interceptors, we have to wrap the gRPC client to do client-side instrumentation, and copy & paste method names again and again.

@iamqizhao
Contributor

The design is on the way. Sorry about the delay. I will keep you posted.

@troylelandshields

@iamqizhao, this is fantastic news. Are you willing to give an ETA now that you've started working on it?

@alecthomas
Contributor

Just wanted to say that the interceptor interface is great, works exactly as I'd hoped. Thanks @iamqizhao
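
For reference, a minimal client-side unary interceptor using the interface this issue was eventually closed with; the dial target and log line are placeholders:

    package main

    import (
        "context"
        "log"
        "time"

        "google.golang.org/grpc"
    )

    func timingInterceptor(ctx context.Context, method string, req, reply interface{},
        cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
        start := time.Now()
        err := invoker(ctx, method, req, reply, cc, opts...)
        log.Printf("method=%s duration=%s err=%v", method, time.Since(start), err)
        return err
    }

    func main() {
        conn, err := grpc.Dial("localhost:50051",
            grpc.WithInsecure(),
            grpc.WithUnaryInterceptor(timingInterceptor))
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
    }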

lock bot locked as resolved and limited conversation to collaborators on Sep 26, 2018