Instrumentation hooks #240
We use https://godoc.org/golang.org/x/net/trace to gather these stats.
Trace looks amazing! I'm really happy that go-rpc will have RPC debug pages straight out of the box :) However, gRPC already has:

```go
if EnableTracing {
	c.traceInfo.tr = trace.New("Sent."+methodFamily(method), method)
	defer c.traceInfo.tr.Finish()
}
```

However, if you were wrapping trace, it would mean that the actual … I think it may be better to extend …
Also, it seems that the …
The trace package is not the place for hooks. It is the reporting mechanism. I think #131 hashed out the stance on hooks/interceptors in grpc-go. We are very nervous about permitting hooks because a badly behaved hook has so much ability to drag things down. I would advocate you start with your option (a) for now. (If anyone's at GopherCon, I'm happy to have a sit down and chat about the options tomorrow.) |
@dsymonds I'm at GopherCon and I'm going to be at the Prometheus hackday meetup at 14:00 in the CoreOS room along with a bunch of other Prometheans and friends (or happy to chat elsewhere). I haven't used grpc yet, but it could be interesting to talk about this more. I wonder if @peterbourgon from https://github.com/go-kit/kit has thoughts about this as well. I guess this is mostly a matter of judgement. I would also really like to see instrumentation options that are less cumbersome than option a). For hooks I think it's a fair expectation that users have to write well-behaved hooks or suffer the consequences. At least that seems like pretty straightforward, non-surprising behavior. But I also appreciate being conservative about what features to add, and being really careful about not giving users any rope to hang themselves with. |
Please ping me if you sit down together. I'd like to at least listen :) |
I'm a little tied to the Google room, but maybe we can catch up later in the afternoon? |
@dsymonds @peterbourgon Sounds good - best channel to reach you (I know Twitter works for Peter)? Want to stop spamming the issue here :) |
Email me at dsymonds at golang.org. |
@dsymonds ack, thanks! |
So we talked about this in an illustrious round of people (@dsymonds, @peterbourgon, multiple Prometheus people). A brief summary:
If someone desires the latter approach, they should write a design document detailing the exact counters and dimensions needed, as well as what the asynchronous hook interface and backoff behavior should look like. Overall, it seemed like nobody was really happy with any of the approaches, so it looks like this is going to be a situation where you can just focus on choosing the least bad solution... |
Why can't we assume the user would do the correct thing?
Won't this approach actually complicate gRPC the most? Otherwise gRPC itself would have to implement and maintain all of the metrics. This sounds even more difficult than agreeing on a hook interface.
@xiang90 As I understood it, the worry was that the end user of grpc might not even be the one supplying the hook, such as when using grpc as part of a framework or similar. While the approach where grpc itself maintains counters is more complex, it would ensure that only predictable, fast calculations would happen in the hot path. FWIW, personally I agree that a normal hook interface would be the best (least bad) solution. I can see the above-mentioned downsides of it, but also think the other approaches are even nastier. |
Perhaps we should consider a lossy subscription interface, something like a grpc.Subscribe function. The implementation of grpc.Subscribe can avoid blocking on slow subscribers by buffering up to bufSize records, then dropping them. How the queuing is implemented is up to grpc, which keeps the critical path under the package's control (in particular, we will eventually want to be able to mitigate lock contention on multiprocessors under very high load). This externalizes all the metric aggregation and reporting; we may even want to move the net/trace integration behind this API.

The main question I have is whether lossy reporting is OK for the kinds of metric aggregation and reporting that applications require. This API is not suitable for critical accounting (e.g., for billing), but it should be sufficient for monitoring and debugging. A secondary question is what policy to use when dropping records: drop oldest, drop newest, drop random, something else. We don't want to get into the business of pushing filters into Subscribe; all filtering and aggregation should happen outside grpc.

Also: the only piece of this that's grpc-specific is TraceRecord. The rest could be provided by a separate Go package for building in-process publish-subscribe networks.
So, I decided to take a first stab at this in #299. gRPC seems to allow users to override the marshalling/unmarshalling codec with pretty much whatever they want; as such, users can already significantly impact the performance of gRPC by deciding to use something wonky. In the PR, I use a simple callback-based approach that leaves a lot of flexibility to implement the monitoring as the user wishes. The choice of which implementation is used is made through … An example server-side implementation is provided for Prometheus, mimicked after the level of instrumentation one can find in Google servers by default. Here's an example of a …
@Sajmani, speaking from the perspective of internal, first-party use, I would want the fidelity of a lossless API (I am sure plenty of others would agree if you put the question to them). Most discussions around service-level indicators (SLIs) to date have not had the luxury of speaking in terms of reporting error; they just assume perfect measurement, for better or worse.
@Sajmani @matttproud As long as bufSize is configurable, I do not see a problem here. The user can configure a large bufSize if they need lossless data and the machine can afford it. |
@iamqizhao I tend to agree with @matttproud regarding lossless monitoring. As an SRE or a systems operator, I want to have full confidence that the counter I'm observing has been incremented on every RPC. If it were lossy, I'd probably want the monitoring to export a counter saying how many values were dropped... but that breeds a chicken-and-egg problem.
@iamqizhao, what's your take about the change proposed in #299 ? I know @matttproud commented on it, and @pires said that API would be great for adding InfluxDB support as well (another popular monitoring framework). Given that https://groups.google.com/d/msg/grpc-io/K8Ar3wom5CM/ushngUvb5GYJ suggests that third party monitoring support is on the books, can we get some traction on the subject? Or if the PR doesn't fit your needs, can you at least comment if the edge cases of client-side monitoring are covered in it correctly? We're already using it and I would sleep better knowing that it's correct. |
@mwitkow-io Sorry for the delay. I got caught by some tight deadlines on some other stuff these past two weeks. I will have a close look early next week.
For the record, the C++ tree has early code for a Census implementation: https://github.com/grpc/grpc/tree/master/src/core/census |
Just an additional perspective here. grpc-go needs an interceptor/hook interface. The lack of an interceptor/hook interface makes grpc-go a complete non-starter for production applications. Exposing metrics is not a reasonable compromise. The metrics important to me will be different from everyone else's, and the implementation will undoubtedly be unable to satisfy everyone. There are other hooks unrelated to metrics that would be important to add, like logging, authentication, panic reporting, etc. Of those I know using go-grpc in production, one of the following is true:
While I sympathize with @dsymonds's worry about misbehaving middleware, it is unfounded. It is akin to arguing that there should be no …
Well written, @inconshreveable. I agree with every single point. I need an interceptor/hook interface for logging, authentication, and panic reporting. I already thought about writing a custom code generator...
Yes, we are currently going down the second path (code generation) and it's really not ideal. |
Allow HTTP-level server-side authentication (e.g., oauth2) and some sort of Before/After hooks. |
Context: at Square, we are converting from our in-house Stubby-alike ("Sake") to GRPC. We're starting with just Server-side, since it takes little effort to also serve GRPC. In fact, Sake and GRPC method signatures are alike enough that I was able to refactor Sake into the same interface as GRPC in one big, ugly, but mostly mechanical set of commits. Our methods now take context.Context as the first arg, instead of a SakeContext, but they still expect a Sake context embedded in the context.Context. We need the following abilities in server-side per-call interceptors:
In server-side per-connection interceptors, we probably just need to be able to perform extra validation, and have access to the certs. Our client side is where most complexity lives, and I haven't yet had time to figure out exactly how it maps to GRPC: we read custom properties from the proto service method descriptors to figure out which calls are idempotent, perform automatic retries, and interact with our Global Naming Service (a sort of gslb/gns combo) to do discovery and geographic failover on retries. At a minimum for now, I know we intercept all calls and read custom proto properties to set default timeouts. Apologies: this is neither as clear nor succinct as I would wish, but I'm pressed for time right now. Happy to elaborate.
Are they unary rpcs or streaming rpcs? If they are streaming rpcs, do you need to intercept every operation (message read and write) for an RPC on the server side? |
Right now, we have no equivalent of streaming RPCs (although we have use cases where we'd love to have them), so we haven't thought as much about that side of things. @jhump can probably add more thoughts, but I'd say the most important stuff would be at the start of the call. The only things that spring to mind as being per-message are validation, setting dapper-ish span ids. I'm assuming the initial context modifications would be preserved. |
Agreed that the most important part is intercepting the start of a call. However, certain metrics (like total request and response message bytes, or histograms) require seeing each message in a stream, and validation of messages would also require seeing each piece of a stream. It would be great to have something like the interceptor interfaces in the Java implementation. Albeit those interfaces are also lacking (as of this writing) in that, for authentication and quota enforcement, we also need access to the client IP address and authenticated identity (e.g. X509 principal).
Indeed, it would be lovely if the Go and Java (and Ruby, and...) versions used similar terminology, although there's a tension with keeping the Go code idiomatic.
In the meantime we are using https://github.com/sasha-s/grpc-instrument for server-side latency instrumentation.
I wrote a boilerplate code generator which leverages go generate to easily implement the interceptor/middleware pattern on gRPC service calls. It's been of great help to me in my current project, so I figured other people might find it useful. |
@iamqizhao mentioned that there will be an official API soon. |
My method allows you to do things other than metrics/logging. Generally it allows you to initialize new objects (database sessions, custom auth, generalized validation, etc.) on each incoming gRPC service call, and makes them available in your gRPC service logic through a single additional input param. Unless my sleep-deprived brain is missing something and things beyond metrics will be supported?
Rather than considering each use case, it seems more prudent to create a flexible abstraction that can handle cross-cutting concerns. Couldn't a lot of these use cases be covered by an Invoker interface?
@stevvooe, I am pretty sure we can address all of your concerns about LB. But to be clear, your Invoker interface is a no-go because it breaks a lot of the user-facing API. Can you hold off for a while for our new proposal? For the issue here, the server-side interceptor is under internal code review and the client-side interceptor has not been designed yet. https://docs.google.com/document/d/1weUMpVfXO2isThsbHU8_AWTjUetHdoFe6ziW0n5ukVg is one proposal from zellyn.
Is there any update or ETA for client-side interceptors? We need this as well at Yik Yak, for the reasons @zellyn described above and in his proposal document. Being able to instrument client calls is especially helpful and would also solve the problem with instrumenting the bigtable client described in googleapis/google-cloud-go#270 if the client-side interceptor is a …
We need to sort out some pre-GA issues, including some API breaking changes, now. And then the client-side interceptor will be my priority. Because the GA date is still a bit murky, I cannot provide an ETA for the client interceptor now. But I do hope to get it done (at least the design) this month.
Looking forward to client-side interceptors as well for two major use cases:
|
Is there any update on client-side interceptors? Without client interceptors, we have to wrap the gRPC client to do client-side instrumentation, and copy & paste method names again and again.
The design is on the way. Sorry about the delay. I will keep you posted.
@iamqizhao, this is fantastic news. Are you willing to give an ETA now that you've started working on it?
Just wanted to say that the interceptor interface is great, works exactly as I'd hoped. Thanks @iamqizhao |
We're currently experimenting with GRPC and wondering how we'll monitor the client code/server code dispatch using Prometheus metrics (should look familiar ;)
I've been looking for a place in grpc-go to hook up the gathering of `ServiceName`, `MethodName`, `bytes`, and `latency` data, and found none. Reading the thread in #131 about RPC interceptors, the suggestion is to add the instrumentation in our Application Code (a.k.a. the code implementing the auto-generated Proto interfaces). I see the point about not cluttering the grpc-go implementation and being implementation agnostic.
However, adding instrumentation into Application Code means that we need to do one of the following:
a) add a lot of repeatable code inside Application Code to handle instrumentation
b) use the `callFoo` pattern proposed in #131 [only applicable to the client]
c) add a thin implementation of each Proto-generated interface that wraps the "real" Proto-generated method calls with metrics [only applicable to the client]
There are downsides to each solution though:
a) leads to a lot of clutter and errors related to copy pasting, and some of these will be omitted or badly done
b) means that we lose the best (IMHO) feature of Proto-generated interfaces: the "natural" syntax that allows for easy mocking in unit tests (through injection of the Proto-generated interface); and it is only applicable on the client side
c) is very tedious because each time we re-generate the Proto (add a method or a service) we need to go and manually copy-paste some boilerplate. This would be a huge drag on our coding workflow, since we really want to rely on Proto-generated code as much as possible. And it is also only applicable on the client side.
I think the cleanest solution would be a pluggable set of pre-call/post-call callbacks on client and server that would grant access to `ServiceName`, `MethodName`, and an `RpcContext` (provided the latter exposes stats about bytes transferred and the start time of the call). This would allow people to plug in an instrumentation mechanism of their choice (statsd, Grafana, Prometheus), and shouldn't have the performance impact that the interceptors described in #131 could have had (the double serialization/deserialization). Having seen how amazingly useful RPC instrumentation was inside Google, I'm sure you've been thinking about solving this in gRPC, and I'm curious to know what you're planning :)