Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Intermittent timeouts sending data to satellites when clients are under heavy load #92

Closed
hotchkj opened this issue Aug 13, 2020 · 9 comments
Assignees

Comments

@hotchkj
Copy link

hotchkj commented Aug 13, 2020

Our satellite is recording a significant number of dropped spans from a deployed NET client, and after digging through the code, I believe I've hit on a potential bug in the tracing code, and would like feedback/thoughts on whether this is a valid concern.

What I observe is that the tracer has no mechanism to prevent ServicePoint.ConnectionLimit being exceeded, and if we end up with, say 10-20 Flushes in progress with only 2 HTTP connections to process them, many of them may time out and then have their spans dropped.

Requested Info Value
Tested library version v0.8.0.362
Runtime NET Framework 4.8
Assembly target NET Standard 2.0

A timeout visible on the client via proxy might look something like:
Example request:

POST https://satellite/api/v2/reports HTTP/1.1
Accept: application/octet-stream
Content-Type: application/octet-stream
Lightstep-Access-Token: token
Host: satellite
Content-Length: 509
Expect: 100-continue
(no body sent)

Satellite response recorded by proxy 30 seconds later:

HTTP/1.1 408 Request body incomplete
Date: Thu, 13 Aug 2020 13:12:07 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache, must-revalidate
Timestamp: 14:12:07.486
The request body did not contain the specified number of bytes. Got 0, expected 509

This looks to be a problem if sending data to the satellite takes a while and there are a number of concurrent Flushes in progress - it's only been observed when the clients are very busy with lots of spans. It seems unlikely that it takes >30 seconds to send 509 bytes of body yet the same system can send 221 header characters just fine before that point. My reasoning is as follows.

Preamble:

  • On NET Framework in Windows, HTTP is delegated via ServicePoint (per base address) which has a ConnectionLimit property (defaults to 2 courtesy of ServicePointManager.DefaultConnectionLimit default)
  • In addition, we're forced to use HTTP 1 rather than HTTP 2 because NET Framework & supported versions of Windows don't have HTTP 2 available
  • The older WinHttpHandler sends request header & body as two separate Win32 instructions

The tracing Flush is run via

_reportLoop = new Timer(async e => await Flush(), null, TimeSpan.Zero, _options.ReportPeriod);

I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts which gets worse the greater the load:

  • The above Timer means that every Report Period (default 5 seconds), a new Flush is called i.e. N Flush operations can occur concurrently
  • Each Flush calls HttpClient.SendAsync which will take some time to complete, so there can be N attempts to make an HTTP call in flight at once, getting worse the more spans occurring & shorter reporting periods chosen
  • If the underlying ServicePoint connection isn't made available back to WinHttpHandler in time, the C# operation will time out due to the internal cancellation token, which is exactly what I see reported in the associated exception handler : Timed out sending report to satellite

If you agree, I would propose having a WithConcurrentReports option that links to a SemaphoreSlim & connection limit on the HttpClientHandler, so that NET Framework consumers can control & match outgoing concurrency safely.

@austinlparker
Copy link
Contributor

Great idea! I can take a stab at this tomorrow. Thank you for the in-depth issue as well!

@austinlparker
Copy link
Contributor

once again, .net 4.5 is my enemy - seems like the only option for setting max connections is modifying servicepointmanager, which hits the entire appdomain (unless there's something extremely obvious I'm missing). I could always set it to whatever value is passed in through WithConcurrentReports and change it back, or I could allow you to pass in your own handler (so you could tweak whatever values you want on HttpClient)

@hotchkj
Copy link
Author

hotchkj commented Aug 14, 2020

Ah yes, NET 4.5 does make life difficult. Unfortunately ServicePointManager doesn't actually apply until the first HTTP call is made, so I think for Framework usage where this is most relevant, it'd be best to leave that in the caller's hands.

One option to consider is building against net471 & netstandard2.1 as well and then only in those targets set HttpClientHandler.MaxConnectionsPerServer since it's available in those targets. Naturally there's overhead of extra targets to test & bundle in the NuGet package, so I don't know how palatable that is.

@austinlparker
Copy link
Contributor

Let me check around internally and see if I can get us off net45, or at least leave it on an older version of the library.

@austinlparker
Copy link
Contributor

Circling back on this, since I haven't really gotten a great answer - what if I let you pass in your own httpclient?

@hotchkj
Copy link
Author

hotchkj commented Aug 25, 2020

@austinlparker Funnily enough we've been researching that. The existing code is already pretty close to allowing this sort of flexibility. I think some minor modifications might make it easier.

Here's an approximation of what I've been trying out:

...
// Normal tracer code in DI
var httpClient = sp.GetRequiredService<ILightStepHttpClient>();
// Swapping the order of these would make extension that bit nicer
return new Tracer(options, new LightStepSpanRecorder(), httpClient);
...

public class CustomLightstepHttpClient : ILightStepHttpClient
{
  private readonly LightStepHttpClient _defaultImpl;
  private readonly ICustomLightstepClient _client;

  public CustomLightstepHttpClient(ICustomLightstepClient client, LightStep.Options options)
  {
    Check.IsNotNull(client, nameof(client));

    _client = client;
    
    // Has to forward on to the default implementation because I only want to override SendReport
    _defaultImpl = new LightStepHttpClient("hopeUrlNotUsedInTranslate", options);
  }

  public Task<ReportResponse> SendReport(ReportRequest report)
  {
    return _client.SendReport(report);
  }

  public ReportRequest Translate(ISpanRecorder spanBuffer)
  {
    // Ideally this would be a separate concern & interface
    return _defaultImpl.Translate(spanBuffer);
  }
}

public sealed class CustomLightstepClient : ICustomLightstepClient
{
  public async Task<ReportResponse> SendReport(ReportRequest report)
  {
    // Base address is easy to compute from Options
    // Has to hardcode "Lightstep-Access-Token" header, ideally added to LightStepConstants
    // I can impose as much or as little control over things like concurrency setup as I want in here
    // Lets' say my service requires all outgoing HTTP to have metrics recorded or Envoy mesh or other complexities, I can add that here
    
    using (var response = await _client.BuildRequest(nameof(SendReport))
                                       .POST
                                       .AddRelativeUrl(LightStepConstants.SatelliteReportPath)
                                       .SendingBody(() => new ByteArrayContent(report.ToByteArray()))
                                       .WithContentType(new MediaTypeHeaderValue("application/octet-stream"))
                                       .WithHeader("Lightstep-Access-Token", report.Auth.AccessToken)
                                       .GetResponseAsStreamAsync(CancellationToken.None).ConfigureAwait(false))
    {
      return ReportResponse.Parser.ParseFrom(response.Body);
    }
  }
}

With a separate interface for Translate and a constant for the Lightstep-Access-Token header, I think this would allow integrators complete flexibility over how to make the HTTP calls.

Hope that helps explain the direction we might want to take.

@austinlparker
Copy link
Contributor

Take a look at #94 and see what you think - it should make it easier to do what you're looking for here.

@hotchkj
Copy link
Author

hotchkj commented Aug 25, 2020

@austinlparker That looks great, thank you for the quick turnaround!

@austinlparker
Copy link
Contributor

Closed by #94 and 0.10.0 release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants