Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite gRPC server interceptor #1171

Merged
merged 17 commits into from
Oct 29, 2020
Merged

Conversation

alertedsnake
Copy link
Contributor

Description

Rework the gRPC server interceptor to follow current practices with gRPC, and to fix several problems:

  • The previous method would wrap the grpc.Server() improperly in order to add its interceptor, preventing interceptors from be added after the OpenTelemetry interceptor. I've removed the wrapper entirely, as it's not necessary, and made the Interceptor be part of the public API, to allow users to instantiate this and insert it into the interceptor chain anywhere they want. There is still the instrument and uninstrument methods to basically monkeypatch the gRPC library, should you choose to use that, but these are reworked to provide the Interceptor in a more modern way.

  • The wrapper around the gRPC Context was incompletely implemented, probably before new features were added.

  • The method used to try to catch errors in RPCs was incomplete - interceptors cannot inspect the results of a call, so we now use our Context wrapper to obtain response codes and messages. We do still catch non-abort exceptions (i.e. those which are not an empty Exception()) in the Interceptor, and provide any message via the span.

New features:

  • There are now additional data added to the span for each call: status code and message, peer network info, and user-agent.
  • Reporting on bare (i.e. non gRPC abort()) exceptions
  • Less code :)

Fixes #1112

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

This change is slightly breaking, depending on how you used the library. If you follow the minimalistic example in the docs, nothing will have changed, but if you implemented like in the sample application, there will be changes required.

How Has This Been Tested?

I've used the standard tests provided, and implemented a detailed application with this library as it is now. I could probably add some tests for the new data in the spans, but I'm not very familiar with that part of OpenTelemetry.

Checklist:

  • Followed the style guidelines of this project
  • Changelogs have been updated
  • Unit tests have been added
  • Documentation has been updated

I'm not sure which changelog needs to be updated, or what to put in it.

@alertedsnake alertedsnake requested a review from a team September 28, 2020 01:23
@codeboten codeboten added the instrumentation Related to the instrumentation of third party libraries or frameworks label Oct 1, 2020
@NathanielRN
Copy link
Contributor

Hello! I have a PR to move some files you have in this PR to the Contrib repo, please let me know if this gets merged before the PR in the Contrib repo. Please see https://github.com/open-telemetry/opentelemetry-python-contrib/pulls/59

@alertedsnake
Copy link
Contributor Author

I'd be happy to make this PR in the other repo if you all prefer...


# Split up the peer to keep with how other telemetry sources
# do it. This looks like ipv6:[::1]:57284 or ipv4:127.0.0.1:57284.
host, port = context.peer().split(":", 1)[1].rsplit(":", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this guaranteed to b the same format?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, but I could put a check in there to be sure.

Copy link
Contributor Author

@alertedsnake alertedsnake Oct 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found some references to expected formats in the C library, so it looks like yes, this is the format it will be.... except for that one which is comma-separated, so I definitely need to update this. Though I'm not sure I understand why there are two addresses there.

Good catch!

https://github.com/grpc/grpc/blob/master/test/core/client_channel/resolvers/sockaddr_resolver_test.cc#L90

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I'd suggest to be defensive around code like this and perhaps wrap it in a try .. except block with an error log. We don't want instrumentation to break apps :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated it and made the log a warning rather than error, since it's not really an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't read it. I'd say it actually is an error as an exception is raised. We are just silencing the error so it doesn't crash users' apps. IMO it should surface in the loudest possible manner without crashing apps so we get to know about it and fix the instrumentation. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree. This condition is not something I as the application developer would have any control over, and by making it an error rather than warning, now I have to consider suppressing messages from the tracing library in order to not have messages about errors I can't fix in my logs, which I may do things like trigger alerts with via tools such as sentry.io.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough.

Copy link
Contributor

@owais owais left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Looks like a nice improvement. Left some comments/questions.

# we handle in our context wrapper.
# Here, we're interested in uncaught exceptions.
# pylint:disable=unidiomatic-typecheck
if type(error) != Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concerned that this might result in some non-gRPC abort exceptions being ignored. Do gRPC aborts raise Exception instead of some specialized type like GRPCAbortException? Is there any other way to test for this like exception message?

Also, what if we did record gRPC abort exceptions? Is that unexpected? Does gRPC use exceptions as flow control instead of error in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah unfortunately once we're at this point, if the code called grpc.abort() the exception here is a bare Exception with no useful details at all (see here: https://github.com/grpc/grpc/blob/master/src/python/grpcio/grpc/_server.py#L319), and there's no other place to capture it. I've tried a whole bunch of ways to get this over the past year or so, and this is the best I've come up with.

If abort() was not called, the exception here is whatever was raised, and ultimately it'll get wrapped up in a grpc.RpcError with status = UNKNOWN (HTTP 500 equivalent) and the client will receive that result.

I've certainly tried to find some better way to handle this, but there's just not any other information available to the interceptor here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the code you linked, it seems the function is setting always sets code to a non-OK value indicating this is always an error condition. In that case, may be we should always record it with span.record_exception()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay that works for me - I'm not as familiar with Otel itself as you guys :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh actually - I forgot how this actually works here. A bare exception is going to usually be a grpc abort, but not all of those are actually exceptions, there are user errors such as the equivalents of HTTP 400 series which aren't really server exceptions. Do we want to mark those as exceptions in Otel? I wouldn't think so...

I handle grpc aborts in the abort() call in _OpenTelemetryServicerContext (and can certainly use span.record_exception() for the non-user stuff there), this block here was intended to catch anything that is NOT an abort, and therefore a real exception, which are unlikely to be bare Exception (because of good practice).

There's no other way to distinguish a true bare Exception from a grpc abort here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

404s are not treated as errors generally, no. Does this case happen in case of a 404?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - gRPC unfortunately uses exceptions as program flow, so all 400-like errors which you cause by running grpc.abort() show up here, as bare exceptions. That's the part I was trying to work around and why I didn't treat those as exceptions.

I can add some checks for the "not user error" types of exceptions in the abort() call in _OpenTelemetryServicerContext, it's a small number of them that are actual exceptional conditions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, let's do it if it is easy enough and doesn't complicate the instrumentation too much. Otherwise I think it's fine the way you have it as it is not that likely that users will raise bare Exception (but still possible)

Copy link
Contributor Author

@alertedsnake alertedsnake Oct 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh this is complicated - at the point where I'd like to record this as an exception, there's not a useful exception to record, because aborts aren't actually exceptions.

Following conventions in the docs on HTTP details in traces, I'll set the Status properly.
Edit: it seems I already do, but I'll make sure these are set properly, they aren't right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay did a bit more digging - it's been a while since I wrote this. The parts in the Context wrapper where I set the status code line up with how Otel status codes work - they match 1:1 with the grpc.StatusCode values and so there's no change needed here to handle these cases.

# Here, we're interested in uncaught exceptions.
# pylint:disable=unidiomatic-typecheck
if type(error) != Exception:
span.set_attribute("error", repr(error))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exceptions should be recorded as events in Otel. I think we should use span.record_exception(error) here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I didn't know that was a thing - will do.

@codeboten
Copy link
Contributor

Since this PR has been opened for some time and there's already some comments from @owais, my preference would be to get it merged in this repo and then move it over. @aabmass any chance you'll get time to review it?

@NathanielRN
Copy link
Contributor

The last PR for the move to Contrib just needs this change to go in so that we can take the changes and be fully moved over!

Copy link
Member

@aabmass aabmass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thank you for the PR! LGTM just a few comments on the attributes

@alertedsnake
Copy link
Contributor Author

FYI the linter test passes locally, and the code I changed in the most recent PR doesn't change enough functionality to start failing a test - I assume something changed? Do I rebase?

Copy link
Member

@aabmass aabmass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@NathanielRN
Copy link
Contributor

I also find it weird that the lint changes failed, but I think a rebase would be the right way to go either way since it looks like f3cdfa2 changed StatusCanonicalCodes to StatusCode.

Michael Stella and others added 16 commits October 29, 2020 14:49
I really dislike this pattern, changing other libraries, but it's the
way that most of these things work, so I'll stick with it.  One can
still use the Interceptor directly if desired.

Also reworked the wrapped server so it's no longer a wrapper, but simply
adds on our interceptor.  This allows adding any custom interceptors as
well.

A caveat here is that if you need to specifically order interceptors,
you'll need to do it manually rather than using the instrumentor-enable
thing.
Based on a suggestion by @owais - thanks!
Apparently the peer value can include a comma-separated address, see the
tests here:

https://github.com/grpc/grpc/blob/master/test/core/client_channel/resolvers/sockaddr_resolver_test.cc#L90

This updates the parsing, and adds a try/catch block with a warning log.
…emetry/instrumentation/grpc/__init__.py

Co-authored-by: Aaron Abbott <[email protected]>
Per suggestion, these attributes should be in the `rpc` namespace.
I was using Zipkin and some useful flags were missing, but when using
Jaeger they're present, so I didn't actually need to set these
explicitly.
Copy link
Contributor

@codeboten codeboten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR and for working through all the feedback! This looks good, just one question that comes to mind, do we need to create an issue to handle errors differently now that they no longer map to grpc status codes after this change #1214, if so we can address in a separate issue/pr

@codeboten codeboten merged commit ffa3b39 into open-telemetry:master Oct 29, 2020
@alertedsnake
Copy link
Contributor Author

Thanks for the PR and for working through all the feedback! This looks good, just one question that comes to mind, do we need to create an issue to handle errors differently now that they no longer map to grpc status codes after this change #1214, if so we can address in a separate issue/pr

Oh I didn't see that. It does seem like we should... do you have a link to the current official doc? The PR you linked and the master branch both still have StatusCanonicalCode whereas that's now just StatusCode.

@alertedsnake
Copy link
Contributor Author

Oops, I was looking at the wrong version of this file - I see what you mean now.

I'll make another PR with the fix tonight.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
instrumentation Related to the instrumentation of third party libraries or frameworks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

gRPC instrumentation - how to add extra context data to spans?
5 participants