Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add jaeger/opentracing tracing to m3query #1321

Merged
merged 4 commits into from
Feb 21, 2019
Merged

Conversation

andrewmains12
Copy link
Contributor

This diff adds opentracing support to m3query. Spans are started for every HTTP request (courtesy middleware from github.com/opentracing-contrib/go-stdlib). I then have child spans started for compilation + execution, with drilldowns into each node of the query. Here's a sample trace:

Query is sum(increase(coordinator_engine_datapoints{type="fetched"}[10s]))

jaeger_trace_no_children

One implementation note: node instrumentation is a bit less clean than I would have liked; since all nodes call their children before exiting, i.e.:

nextBlock := doStuff()
n.controller.Process(nextBlock)

I instrumented them like:

startSpan()
nextBlock := doStuff()
stopSpan()

n.controller.Process(nextBlock)

The alternative would be to do:

startSpan()
nextBlock := doStuff()

n.controller.Process(nextBlock)
stopSpan()

but that gives unnecessary nesting imo:

jaeger_tracing_children

@andrewmains12 andrewmains12 force-pushed the andrewmains12/tracing branch 2 times, most recently from d17e4a7 to 4ca756b Compare January 28, 2019 21:21
@codecov
Copy link

codecov bot commented Jan 28, 2019

Codecov Report

❗ No coverage uploaded for pull request base (andrewmains12/cost_accounting@35d47f1). Click here to learn what that means.
The diff coverage is 47.8%.

Impacted file tree graph

@@                       Coverage Diff                       @@
##             andrewmains12/cost_accounting   #1321   +/-   ##
===============================================================
  Coverage                                 ?   64.3%           
===============================================================
  Files                                    ?     771           
  Lines                                    ?   64218           
  Branches                                 ?       0           
===============================================================
  Hits                                     ?   41333           
  Misses                                   ?   20009           
  Partials                                 ?    2876
Flag Coverage Δ
#aggregator 73.6% <ø> (?)
#cluster 74.9% <ø> (?)
#collector 78.4% <ø> (?)
#dbnode 75.5% <ø> (?)
#m3em 68.3% <ø> (?)
#m3ninx 73.5% <ø> (?)
#m3nsch 28.4% <ø> (?)
#metrics 17.8% <ø> (?)
#msg 74.9% <ø> (?)
#query 47.5% <43.9%> (?)
#x 75.3% <77%> (?)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35d47f1...207d695. Read the comment docs.

@@ -0,0 +1,5 @@

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops this was pretty much just my own notes--removing.

func (r *Request) execute(ctx context.Context, pp plan.PhysicalPlan) (*ExecutionState, error) {
sp := startSpan(r.engine.metrics.executingHist, r.engine.metrics.executing)
func (r *Request) generateExecutionState(ctx context.Context, pp plan.PhysicalPlan) (*ExecutionState, error) {
sp, ctx := opentracing.StartSpanFromContext(ctx, "generateExecutionState")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the naming semantics for opentracing, but do they recommend camel case or using _ or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yuri says _--I'll change it. Good catch!

ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error)
}

// ProcessSimpleBlock is a utility for OpNode instances which simply propagate their data after doing their own
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment!

src/query/executor/request.go Show resolved Hide resolved

func TestProcessSimpleBlock(t *testing.T) {
t.Run("closes next block", func(t *testing.T) {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove extra line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I should have filled these tests in before pushing. Doing that now.

// Execute runs the fetch node operation
func (n *FetchNode) Execute(queryCtx *models.QueryContext) error {
ctx := queryCtx.Ctx
blockResult, err := n.fetch(queryCtx.Ctx, queryCtx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use ctx instead of queryCtx.Ctx?

docs/operational_guide/monitoring.md Outdated Show resolved Hide resolved
docs/operational_guide/monitoring.md Outdated Show resolved Hide resolved
docs/operational_guide/monitoring.md Outdated Show resolved Hide resolved
src/query/util/opentracing/context_test.go Show resolved Hide resolved
Error: respErr.Err.Error(),
},
RqID: logging.ReadContextID(ctx),
}, logging.WithContext(ctx))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to log id in the error returned to the client? The failing request ID is still logged since the logger has the context id on it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. So the idea here is to help aid in debugging; by returning the request id to the client, we make it easy to tie an error to logs + traces. Basically, we can just ask the user for the RqID from grafana's query inspector, and go directly from that to traces + logs.

Does that seem reasonable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, doesn't seem like it would cause issues. We've only got tracing on read right? Might want to add the same thing to read_instantaneous, too... Would it make sense to take this block and put it into xhttp as a method like WriteJSONResponseWithRequestID?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good call, I shouldn't have been lazy about it here :). I'll factor it out.

@@ -88,6 +88,10 @@ type countValuesNode struct {
controller *transform.Controller
}

func (n *countValuesNode) Params() parser.Params {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this function (and other snowflakes existing outside of various base.go files) be refactored to the ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good call, missed these. This is one reason it's too bad we can't put tracing etc into Controller (at least, not without having weird traces); it makes the fanout of this change (and potentially other similar ones) much wider. That said I don't have a good alternative model immediately handy so I'll go through and do the switch.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a bit annoying unfortunately; I guess it's a downside that we have to deal with in return for the nice properties of functions calling their own downstream processing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah for a brief moment I was like "we should just explicitly walk the graph instead" and then I realized that would be a large rewrite 😄

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, was looking at making a few nicety fixes to DAG generation/execution/mapping, and it's a pretty hefty beast unfortunately to make it more usable for things like this (although it would definitely not go astray!)

ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error)
}

// ProcessSimpleBlock is a utility for OpNode instances which simply propagate their data after doing their own
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this describe every block type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha yeah true. The "exception" is nodes which need to wait for multiple blocks before actually executing, e.g. binary nodes. I'll update the comment to clarify that.

"github.com/m3db/m3/src/query/util/opentracing"
)

type simpleOpNode interface {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit weird having an interface that a bunch of nodes satisfy be private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is definitely a golang quirk, but I'd argue that having it be private makes reasonable sense. I don't necessarily want other packages using the interface directly; its only reason for existence is ProcessSimpleBlock. I would even make it anonymous if that were a thing go supported.

We can also always make it public later :)

require.NoError(t, doCall(tctx))
})

t.Run("errors on process error", func(t *testing.T) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: same comment about the t.Run, seems to swallow up useful information for failing tests

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I think about it, mostly had problems with it in tabled tests, when iterating over them... when one of twenty fails it's sometimes a pain to find exactly which test failed exactly where.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, totally agreed on that! I always try to use searchable names in those cases, which helps, but it can still be annoying.

src/query/functions/temporal/base.go Outdated Show resolved Hide resolved
@andrewmains12 andrewmains12 changed the base branch from andrewmains12/copy_cors_jaeger to andrewmains12/cost_accounting February 14, 2019 16:22
@@ -32,6 +33,12 @@ type sinkNode struct {
block block.Block
}

// Params for sinkNode returns an params object for lazy sinks. It doesn't appear in queries (since sink nodes are
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think this one may have fallen through the cracks (thanks github for collapsing comments for some reason 🙄 )

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good catch--lint was complaining at me earlier about not having a comment on this; turns out that's because I don't actually need it to implement the interface. Removed.

return transform.ProcessSimpleBlock(n, n.controller, queryCtx, ID, b)
}

func (n *takeNode) ProcessBlock(queryCtx *models.QueryContext, ID parser.NodeID, b block.Block) (block.Block, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -142,6 +147,14 @@ func (n *baseNode) Process(queryCtx *models.QueryContext, ID parser.NodeID, b bl
return n.controller.Process(queryCtx, nextBlock)
}

func (n *baseNode) processWithTracing(queryCtx *models.QueryContext, lhs block.Block, rhs block.Block) (block.Block, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but this will show up on the span as two function calls, one of which will be pretty much instantaneous, right? Should be fine, and probably not worth overengineering here to avoid that case (like starting the span only when both LHS and RHS satisfied), but just trying to get an idea of how it will look :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, it might be necessary to add a parent span to NewScalarOp, since there is a potential (highly unlikely, but still) scenario where you have a binary function between a series and a scalar, e.g. up + 1, and somehow the up block resolves first, which may cut off your trace early if I'm understanding the flow correctly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but this will show up on the span as two function calls, one of which will be pretty much instantaneous, right?

Ah yeah good question. It actually shows up as a single trace, since I am in fact starting the span only when both LHS and RHS are satisfied.

Here's a sample:

screen shot 2019-02-15 at 3 21 15 pm

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

blocks, err := c.processCompletedBlocks(queryCtx, processRequests, maxBlocks)
defer closeBlocks(blocks)

if err != nil {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So probably(definitely!) out of scope for this PR, but would be good to revisit temporal functions at some point and add in some logic that if any incoming block errored, it should drop any incoming blocks with the same error instead of trying to add them and starting processing before hitting it... May add an issue regarding this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure! What do you mean by "any incoming block errored"? As in, there was an error processing it or an error upstream? I'm definitely not opposed to exiting earlier when possible though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general function of temporal functions is that blocks come in, and are added to a cache. When enough blocks come in to fully calculate a downstream block, it is processed (which can error). At the moment, if the processing fails, that particular request will fail but there will still be in-flight blocks coming into the temporal function, which do not necessarily need to be computed since the entire query will error anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok cool, thanks for the clarification. That seems like a kind of generic problem--other parts of the query should be cancelled once one part fails. Do we have any general cancellation mechanism already? I suppose we could check the context periodically during each node's Process; that would probably be the idiomatic way to do it. No idea how often "periodically" should be though haha.

@andrewmains12 andrewmains12 force-pushed the andrewmains12/cost_accounting branch from 35d47f1 to 9d59070 Compare February 15, 2019 22:06
@andrewmains12 andrewmains12 force-pushed the andrewmains12/cost_accounting branch from 9d59070 to add5124 Compare February 15, 2019 22:29
@andrewmains12 andrewmains12 changed the base branch from andrewmains12/cost_accounting to andrewmains12/query_ctx February 21, 2019 04:13
@andrewmains12 andrewmains12 force-pushed the andrewmains12/tracing branch 2 times, most recently from 7a885dc to fc0609e Compare February 21, 2019 17:13
@andrewmains12 andrewmains12 force-pushed the andrewmains12/query_ctx branch from a2b2d93 to 1275da2 Compare February 21, 2019 17:21
andrewmains12 added a commit that referenced this pull request Feb 21, 2019
Adds a `queryContext` argument to `OpNode.Process` to hold any per query state. I use this in both my [cost accounting](#1207) and [tracing](#1321) PR's. At the time, I based my tracing branch off of the cost accounting branch. Tracing is closer to landing though, so I've now factored out the common changes, and rebased them both against this branch.
@andrewmains12 andrewmains12 changed the base branch from andrewmains12/query_ctx to master February 21, 2019 19:10
@@ -167,6 +170,9 @@ func (h *PromReadHandler) ServeHTTPWithEngine(

result, err := read(ctx, engine, h.tagOpts, w, params)
if err != nil {
sp := opentracingutil.SpanFromContextOrRoot(ctx)
sp.LogFields(opentracinglog.Error(err))
opentracingext.Error.Set(sp, true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you have sp.Finish() here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, no; that function is intended to extract the span from context if it exists, and return a noop/dummy if not. In the first case, the calling function is already in charge of calling Finish(); in the second, it shouldn't be needed. Does that make sense?

I updated the function to be opentracingutil.SpanFromContextOrNoop, making it return a noop span, which should ensure that it doesn't need to be closed even if called with a context without a span.

src/query/api/v1/httpd/handler.go Outdated Show resolved Hide resolved
src/query/api/v1/httpd/handler.go Outdated Show resolved Hide resolved
insight into query performance and errors.

### Configuration
Currently, only https://www.jaegertracing.io/ is supported as a backend.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe only [Jaeger](https://www.jaegertracing.io/) is supported as a tracing backend.

docs/operational_guide/monitoring.md Show resolved Hide resolved
return sp
}

return opentracing.StartSpan("SpanFromContextOrRoot - dummy")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does this string appear re: trace? Could be good to make it a constant if nothing else

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah good question. So in properly configured production usage, this branch will never be hit; the general expectation is that functions calling this aren't at the root level, and will already have a span passed in. Having the dummy span case instead of panicking is basically a convenience for tests and callers who don't really care about tracing. Does that make sense?

One other thing I could do is have this function take in the root name as an argument instead--that might help folks track down any misconfigured call paths if they're hooking up tracing.

src/query/util/logging/log.go Outdated Show resolved Hide resolved
RqID string `json:"rqID"`
}

// ErrorWithReqID writes an xhttp.ErrorResponse with an added request id (RqId) field read from the request
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit1: could be better to call it ErrorWithReq rather than specifying it must be a reqID?

nit2: Split the NB out

// context.
//
// NB: ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r.e. 1--sure, changed it to ErrorWithReqInfo (more generic + still accurate)

nit2: Split the NB out

Sure, done.


## Logs

TODO
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably better to remove these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh I don't see much harm in having stubs. I switched it to "TODO: document how to retrieve metrics for M3DB components." so that people don't think the metrics themselves are TODO.

@@ -139,6 +136,23 @@ func NewHandler(
return h, nil
}

func applyMiddleware(base http.Handler) http.Handler {
rtn := http.Handler(&cors.Handler{
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better var name?

src/query/api/v1/httpd/handler.go Outdated Show resolved Hide resolved
src/query/util/opentracing/context.go Outdated Show resolved Hide resolved
src/query/util/opentracing/context_test.go Outdated Show resolved Hide resolved
src/query/util/httperrors/errors.go Outdated Show resolved Hide resolved
@andrewmains12 andrewmains12 merged commit fa7efee into master Feb 21, 2019
@andrewmains12 andrewmains12 deleted the andrewmains12/tracing branch February 21, 2019 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants