Intermittent timeouts sending data to satellites when clients are under heavy load #92

hotchkj · 2020-08-13T17:26:50Z

Our satellite is recording a significant number of dropped spans from a deployed NET client, and after digging through the code, I believe I've hit on a potential bug in the tracing code, and would like feedback/thoughts on whether this is a valid concern.

What I observe is that the tracer has no mechanism to prevent ServicePoint.ConnectionLimit being exceeded, and if we end up with, say 10-20 Flushes in progress with only 2 HTTP connections to process them, many of them may time out and then have their spans dropped.

Requested Info	Value
Tested library version	v0.8.0.362
Runtime	NET Framework 4.8
Assembly target	NET Standard 2.0

A timeout visible on the client via proxy might look something like:
Example request:

POST https://satellite/api/v2/reports HTTP/1.1
Accept: application/octet-stream
Content-Type: application/octet-stream
Lightstep-Access-Token: token
Host: satellite
Content-Length: 509
Expect: 100-continue
(no body sent)

Satellite response recorded by proxy 30 seconds later:

HTTP/1.1 408 Request body incomplete
Date: Thu, 13 Aug 2020 13:12:07 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache, must-revalidate
Timestamp: 14:12:07.486
The request body did not contain the specified number of bytes. Got 0, expected 509

This looks to be a problem if sending data to the satellite takes a while and there are a number of concurrent Flushes in progress - it's only been observed when the clients are very busy with lots of spans. It seems unlikely that it takes >30 seconds to send 509 bytes of body yet the same system can send 221 header characters just fine before that point. My reasoning is as follows.

Preamble:

On NET Framework in Windows, HTTP is delegated via ServicePoint (per base address) which has a ConnectionLimit property (defaults to 2 courtesy of ServicePointManager.DefaultConnectionLimit default)
In addition, we're forced to use HTTP 1 rather than HTTP 2 because NET Framework & supported versions of Windows don't have HTTP 2 available
The older WinHttpHandler sends request header & body as two separate Win32 instructions

The tracing Flush is run via

lightstep-tracer-csharp/src/LightStep/Tracer.cs

Line 71 in bdb38ba

    
           _reportLoop = new Timer(async e => await Flush(), null, TimeSpan.Zero, _options.ReportPeriod);

I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts which gets worse the greater the load:

The above Timer means that every Report Period (default 5 seconds), a new Flush is called i.e. N Flush operations can occur concurrently
Each Flush calls HttpClient.SendAsync which will take some time to complete, so there can be N attempts to make an HTTP call in flight at once, getting worse the more spans occurring & shorter reporting periods chosen
If the underlying ServicePoint connection isn't made available back to WinHttpHandler in time, the C# operation will time out due to the internal cancellation token, which is exactly what I see reported in the associated exception handler : Timed out sending report to satellite

If you agree, I would propose having a WithConcurrentReports option that links to a SemaphoreSlim & connection limit on the HttpClientHandler, so that NET Framework consumers can control & match outgoing concurrency safely.

The text was updated successfully, but these errors were encountered:

austinlparker · 2020-08-13T18:33:56Z

Great idea! I can take a stab at this tomorrow. Thank you for the in-depth issue as well!

austinlparker · 2020-08-14T13:39:28Z

once again, .net 4.5 is my enemy - seems like the only option for setting max connections is modifying servicepointmanager, which hits the entire appdomain (unless there's something extremely obvious I'm missing). I could always set it to whatever value is passed in through WithConcurrentReports and change it back, or I could allow you to pass in your own handler (so you could tweak whatever values you want on HttpClient)

hotchkj · 2020-08-14T14:10:10Z

Ah yes, NET 4.5 does make life difficult. Unfortunately ServicePointManager doesn't actually apply until the first HTTP call is made, so I think for Framework usage where this is most relevant, it'd be best to leave that in the caller's hands.

One option to consider is building against net471 & netstandard2.1 as well and then only in those targets set HttpClientHandler.MaxConnectionsPerServer since it's available in those targets. Naturally there's overhead of extra targets to test & bundle in the NuGet package, so I don't know how palatable that is.

austinlparker · 2020-08-14T14:56:45Z

Let me check around internally and see if I can get us off net45, or at least leave it on an older version of the library.

austinlparker · 2020-08-24T20:42:33Z

Circling back on this, since I haven't really gotten a great answer - what if I let you pass in your own httpclient?

hotchkj · 2020-08-25T09:06:39Z

@austinlparker Funnily enough we've been researching that. The existing code is already pretty close to allowing this sort of flexibility. I think some minor modifications might make it easier.

Here's an approximation of what I've been trying out:

...
// Normal tracer code in DI
var httpClient = sp.GetRequiredService<ILightStepHttpClient>();
// Swapping the order of these would make extension that bit nicer
return new Tracer(options, new LightStepSpanRecorder(), httpClient);
...

public class CustomLightstepHttpClient : ILightStepHttpClient
{
  private readonly LightStepHttpClient _defaultImpl;
  private readonly ICustomLightstepClient _client;

  public CustomLightstepHttpClient(ICustomLightstepClient client, LightStep.Options options)
  {
    Check.IsNotNull(client, nameof(client));

    _client = client;
    
    // Has to forward on to the default implementation because I only want to override SendReport
    _defaultImpl = new LightStepHttpClient("hopeUrlNotUsedInTranslate", options);
  }

  public Task<ReportResponse> SendReport(ReportRequest report)
  {
    return _client.SendReport(report);
  }

  public ReportRequest Translate(ISpanRecorder spanBuffer)
  {
    // Ideally this would be a separate concern & interface
    return _defaultImpl.Translate(spanBuffer);
  }
}

public sealed class CustomLightstepClient : ICustomLightstepClient
{
  public async Task<ReportResponse> SendReport(ReportRequest report)
  {
    // Base address is easy to compute from Options
    // Has to hardcode "Lightstep-Access-Token" header, ideally added to LightStepConstants
    // I can impose as much or as little control over things like concurrency setup as I want in here
    // Lets' say my service requires all outgoing HTTP to have metrics recorded or Envoy mesh or other complexities, I can add that here
    
    using (var response = await _client.BuildRequest(nameof(SendReport))
                                       .POST
                                       .AddRelativeUrl(LightStepConstants.SatelliteReportPath)
                                       .SendingBody(() => new ByteArrayContent(report.ToByteArray()))
                                       .WithContentType(new MediaTypeHeaderValue("application/octet-stream"))
                                       .WithHeader("Lightstep-Access-Token", report.Auth.AccessToken)
                                       .GetResponseAsStreamAsync(CancellationToken.None).ConfigureAwait(false))
    {
      return ReportResponse.Parser.ParseFrom(response.Body);
    }
  }
}

With a separate interface for Translate and a constant for the Lightstep-Access-Token header, I think this would allow integrators complete flexibility over how to make the HTTP calls.

Hope that helps explain the direction we might want to take.

austinlparker · 2020-08-25T14:20:36Z

Take a look at #94 and see what you think - it should make it easier to do what you're looking for here.

hotchkj · 2020-08-25T14:58:54Z

@austinlparker That looks great, thank you for the quick turnaround!

austinlparker · 2020-08-25T21:39:20Z

Closed by #94 and 0.10.0 release

andrewhsu assigned austinlparker Aug 25, 2020

austinlparker closed this as completed Aug 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intermittent timeouts sending data to satellites when clients are under heavy load #92

Intermittent timeouts sending data to satellites when clients are under heavy load #92

hotchkj commented Aug 13, 2020

austinlparker commented Aug 13, 2020

austinlparker commented Aug 14, 2020

hotchkj commented Aug 14, 2020

austinlparker commented Aug 14, 2020

austinlparker commented Aug 24, 2020

hotchkj commented Aug 25, 2020 •

edited

Loading

austinlparker commented Aug 25, 2020

hotchkj commented Aug 25, 2020

austinlparker commented Aug 25, 2020

Intermittent timeouts sending data to satellites when clients are under heavy load #92

Intermittent timeouts sending data to satellites when clients are under heavy load #92

Comments

hotchkj commented Aug 13, 2020

austinlparker commented Aug 13, 2020

austinlparker commented Aug 14, 2020

hotchkj commented Aug 14, 2020

austinlparker commented Aug 14, 2020

austinlparker commented Aug 24, 2020

hotchkj commented Aug 25, 2020 • edited Loading

austinlparker commented Aug 25, 2020

hotchkj commented Aug 25, 2020

austinlparker commented Aug 25, 2020

hotchkj commented Aug 25, 2020 •

edited

Loading