-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intermittent timeouts sending data to satellites when clients are under heavy load #92
Comments
Great idea! I can take a stab at this tomorrow. Thank you for the in-depth issue as well! |
once again, .net 4.5 is my enemy - seems like the only option for setting max connections is modifying servicepointmanager, which hits the entire appdomain (unless there's something extremely obvious I'm missing). I could always set it to whatever value is passed in through |
Ah yes, NET 4.5 does make life difficult. Unfortunately One option to consider is building against |
Let me check around internally and see if I can get us off net45, or at least leave it on an older version of the library. |
Circling back on this, since I haven't really gotten a great answer - what if I let you pass in your own httpclient? |
@austinlparker Funnily enough we've been researching that. The existing code is already pretty close to allowing this sort of flexibility. I think some minor modifications might make it easier. Here's an approximation of what I've been trying out: ...
// Normal tracer code in DI
var httpClient = sp.GetRequiredService<ILightStepHttpClient>();
// Swapping the order of these would make extension that bit nicer
return new Tracer(options, new LightStepSpanRecorder(), httpClient);
...
public class CustomLightstepHttpClient : ILightStepHttpClient
{
private readonly LightStepHttpClient _defaultImpl;
private readonly ICustomLightstepClient _client;
public CustomLightstepHttpClient(ICustomLightstepClient client, LightStep.Options options)
{
Check.IsNotNull(client, nameof(client));
_client = client;
// Has to forward on to the default implementation because I only want to override SendReport
_defaultImpl = new LightStepHttpClient("hopeUrlNotUsedInTranslate", options);
}
public Task<ReportResponse> SendReport(ReportRequest report)
{
return _client.SendReport(report);
}
public ReportRequest Translate(ISpanRecorder spanBuffer)
{
// Ideally this would be a separate concern & interface
return _defaultImpl.Translate(spanBuffer);
}
}
public sealed class CustomLightstepClient : ICustomLightstepClient
{
public async Task<ReportResponse> SendReport(ReportRequest report)
{
// Base address is easy to compute from Options
// Has to hardcode "Lightstep-Access-Token" header, ideally added to LightStepConstants
// I can impose as much or as little control over things like concurrency setup as I want in here
// Lets' say my service requires all outgoing HTTP to have metrics recorded or Envoy mesh or other complexities, I can add that here
using (var response = await _client.BuildRequest(nameof(SendReport))
.POST
.AddRelativeUrl(LightStepConstants.SatelliteReportPath)
.SendingBody(() => new ByteArrayContent(report.ToByteArray()))
.WithContentType(new MediaTypeHeaderValue("application/octet-stream"))
.WithHeader("Lightstep-Access-Token", report.Auth.AccessToken)
.GetResponseAsStreamAsync(CancellationToken.None).ConfigureAwait(false))
{
return ReportResponse.Parser.ParseFrom(response.Body);
}
}
} With a separate interface for Hope that helps explain the direction we might want to take. |
Take a look at #94 and see what you think - it should make it easier to do what you're looking for here. |
@austinlparker That looks great, thank you for the quick turnaround! |
Closed by #94 and 0.10.0 release |
Our satellite is recording a significant number of dropped spans from a deployed NET client, and after digging through the code, I believe I've hit on a potential bug in the tracing code, and would like feedback/thoughts on whether this is a valid concern.
What I observe is that the tracer has no mechanism to prevent
ServicePoint.ConnectionLimit
being exceeded, and if we end up with, say 10-20 Flushes in progress with only 2 HTTP connections to process them, many of them may time out and then have their spans dropped.A timeout visible on the client via proxy might look something like:
Example request:
Satellite response recorded by proxy 30 seconds later:
This looks to be a problem if sending data to the satellite takes a while and there are a number of concurrent Flushes in progress - it's only been observed when the clients are very busy with lots of spans. It seems unlikely that it takes >30 seconds to send 509 bytes of body yet the same system can send 221 header characters just fine before that point. My reasoning is as follows.
Preamble:
ServicePoint
(per base address) which has aConnectionLimit
property (defaults to 2 courtesy ofServicePointManager.DefaultConnectionLimit
default)WinHttpHandler
sends request header & body as two separate Win32 instructionsThe tracing
Flush
is run vialightstep-tracer-csharp/src/LightStep/Tracer.cs
Line 71 in bdb38ba
I believe there's a subtlety in the C# tracing code that can lead to intermittent timeouts which gets worse the greater the load:
Timer
means that every Report Period (default 5 seconds), a newFlush
is called i.e. NFlush
operations can occur concurrentlyFlush
callsHttpClient.SendAsync
which will take some time to complete, so there can be N attempts to make an HTTP call in flight at once, getting worse the more spans occurring & shorter reporting periods chosenServicePoint
connection isn't made available back toWinHttpHandler
in time, the C# operation will time out due to the internal cancellation token, which is exactly what I see reported in the associated exception handler :Timed out sending report to satellite
If you agree, I would propose having a
WithConcurrentReports
option that links to aSemaphoreSlim
& connection limit on theHttpClientHandler
, so that NET Framework consumers can control & match outgoing concurrency safely.The text was updated successfully, but these errors were encountered: