Skip to content
This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

cpp-client loses spans on short-lived processes; Close() not flushing buffers? #52

Closed
ringerc opened this issue Feb 5, 2018 · 9 comments

Comments

@ringerc
Copy link
Contributor

ringerc commented Feb 5, 2018

If a process:

  • Creates a tracer with a typical configuration (like the following yml)
  • Creates a span
  • Finishes the span
  • Closes the tracer
  • Exits

the span should be reported to the collector. But it isn't unless a short sleep is inserted before the tracer->Close() call.

If the sleep() is added after the tracer close, the spans are not sent. It looks like Close() must be failing to flush buffers, instead dropping any pending buffers on the floor.

Configuration, loaded with the YAML config support:

disabled: false
sampler:
    type: const
    param: 1
reporter:
    queueSize: 100
    bufferFlushInterval: 10
    logSpans: false
    localAgentHostPort: 127.0.0.1:6831
headers:
    jaegerDebugHeader: debug-id
    jaegerBaggageHeader: baggage
    TraceContextHeaderName: trace-id
    traceBaggageHeaderPrefix: "testctx-"
baggage_restrictions:
    denyBaggageOnInitializationFailure: false
    hostPort: 127.0.0.1:5778
    refreshInterval: 60
@ringerc
Copy link
Contributor Author

ringerc commented Feb 5, 2018

Related #53

@ringerc
Copy link
Contributor Author

ringerc commented Feb 5, 2018

RemoteReport::close() makes no effort to flush by the looks:

void RemoteReporter::close()
{
    {
        std::unique_lock<std::mutex> lock(_mutex);
        if (!_running) {
            return;
        }
        _running = false;
        lock.unlock();
        _cv.notify_one();
    }
    _thread.join();
}

ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 5, 2018
The RemoteReporter buffers spans, but didn't flush them on close.
So any spans Finish()ed between the last flush interval and the
Close() of the Tracer would be lost.

Fixes jaegertracing#52

Signed-off-by: Craig Ringer <[email protected]>
@ringerc
Copy link
Contributor Author

ringerc commented Feb 7, 2018

It probably makes sense to log dropped buffers on close instead, with a hint that an explicit flush before close is advised (using #53 for flush). But that'd really want the opentracing API to adopt the Flush method.

@rnburn
Copy link
Contributor

rnburn commented Feb 7, 2018

The Close method is meant to be optional; so if you don't want to wait on the tracer to flush all it's spans you can just not call it.

I think the right solution is for jaeger to move the current code in close to its destructor and change close to flush the spans, but require it to be called explicitly.

ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

Signed-off-by: Craig Ringer <[email protected]>
@ringerc
Copy link
Contributor Author

ringerc commented Feb 9, 2018

@rnburn Makes sense. I'll amend my push that adds an optional Tracer.flush() to Jaeger's tracer so that it also calls it on explicit Close(), but from the dtor.

@isaachier I've written a small test program to reproduce this bug. It reliably loses spans. ringerc@2a36dd5 (on master) or ringerc@ff654d9 (on my fix/feature integration branch, but not using flush).

jaegertracing::Tracer::flush() does not seem to be working/sufficient. If you uncomment

        //static_cast<jaegertracing::Tracer*>(tracer.get())->flush();

in src/jaegertracing/shortlived.cpp, it still often loses spans. Wireshark will show that the span was never sent.

@ringerc
Copy link
Contributor Author

ringerc commented Feb 9, 2018

Additional test runs show that without flush() it loses spans pretty much 100% of the time.

If I uncomment flush() on my integration test branch it often takes several seconds to exit so I'm not waking a thread properly or something.

Anyway, current behaviour clearly broken.

ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

This variant is enhanced with arguments to control flush mode and sleep after
span finish.

Signed-off-by: Craig Ringer <[email protected]>
@ringerc
Copy link
Contributor Author

ringerc commented Feb 9, 2018

I pushed an improved test case (ringerc@e5a6234). It clearly shows that we lose spans unless:

  • We explicitly flush() with ./src/jaegertracing/shortlived "" 1 and the flush actually works (per below)
  • We sleep > 10 seconds with ./src/jaegertracing/shortlived "" 0 11

The flush must actually take effect. Sometimes it seems to get missed;

Test with opname "shortlived--flush-sleep0s, flush 1
flushing...  flushed (flush took 0ms)
not sleeping before exit 

in which case the span is also lost. So there's something wrong with my flush method. Locking issues?

Lowering the configured bufferFlushInterval reduces the necessary sleep, it's obviously an issue with not waking the flush thread properly.

@isaachier I could use advice at this point, my C++ and threading is ... limited.

I also have no idea how to integrate this into the regression tests. Crossdock somehow?

ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

This variant is enhanced with arguments to control flush mode and sleep after
span finish.

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
Extend the OpenTracing API with an explicit jaegertracing::Tracer::flush()
method to force spans to be flushed eagerly without closing the tracer. It
returns only when the spans are flushed.

Fixes jaegertracing#53

Call flush() from Close(), but not from the Tracer dtor. So we follow the spec
and ensure we flush buffers on explicit Close only.

Fixes jaegertracing#52

WIP. For as-yet-undiagnosed reasons `flush()` sometimes waits a full
bufferFlushInterval before returning.

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
Extend the OpenTracing API with an explicit jaegertracing::Tracer::flush()
method to force spans to be flushed eagerly without closing the tracer. It
returns only when the spans are flushed.

To support this a new condition variable is introduced in the reporter to allow
the main thread to wait on notification from the reporter flush thread.

Fixes jaegertracing#53

Call flush() from Close(), but not from the Tracer dtor. So we follow the spec
and ensure we flush buffers on explicit Close only.

Fixes jaegertracing#52

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 9, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

This variant is enhanced with arguments to control flush mode and sleep after
span finish.

Signed-off-by: Craig Ringer <[email protected]>
@isaachier
Copy link
Contributor

In general there might be a problem with blocking IO and threading. I'm about to call it a night, but will try to look at this tomorrow. Worst case, will be able to look over the weekend.

@isaachier
Copy link
Contributor

I think I have a simpler approach for this specific issue. About the flush method being exposed overall, that can be a new PR. See #59 for first fix.

ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 12, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

This variant is enhanced with arguments to control flush mode and sleep after
span finish.

Signed-off-by: Craig Ringer <[email protected]>
ringerc added a commit to ringerc/jaeger-cpp-client that referenced this issue Feb 16, 2018
The RemoteReporter loses spans during short-lived executions (jaegertracing#52).

This is a test case not a fix.

This variant is on top of my integration branch for my feature branches.

This variant is enhanced with arguments to control flush mode and sleep after
span finish.

Signed-off-by: Craig Ringer <[email protected]>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants