-
Notifications
You must be signed in to change notification settings - Fork 20
[MCC-261721] Don't break on 128bit X-B3-TraceId by tossing high bits #100
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some notes on the unit test. thanks!
@@ -218,5 +218,91 @@ public void GetNext() | |||
Assert.AreEqual(sut.SpanId, nextTraceProvider.ParentSpanId); | |||
Assert.AreEqual(sut.IsSampled, nextTraceProvider.IsSampled); | |||
} | |||
|
|||
[TestMethod] | |||
public void IdsWithLessThan32Characters() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LessThan16?
} | ||
|
||
[TestMethod] | ||
public void IdsWith32Characters() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With16?
} | ||
|
||
[TestMethod] | ||
public void IdsWith64Characters() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With32?
var fixture = new Fixture(); | ||
var traceIdLower16Chars = "48485a3953bb6124"; | ||
var traceId = "18485a3953bb6124" + traceIdLower16Chars; | ||
var spanIdLower16Chars = "48485a3953bb6125"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please only consider the trace id, not the span or parent span ids, as spans within a trace are always 64-bit
@adriancole Thanks for the comments. Please check out new updates addressing the comments f9dc4ff. Only the traceId is being "sanitized" to get lower 16 characters of the incoming traceid. Thanks! |
Please merge if ok 👍 |
heh I don't have merge rights. merge away! |
private string GetLower16Characters(string value) | ||
{ | ||
if (string.IsNullOrWhiteSpace(value)) return value; | ||
return value.Length > 16 ? value.Substring(value.Length - 16) : value; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question, is this trying to return the first or the last 16 characters?
E.g., if value = "abcdefghijklmnopq", what should be returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will return the last 16 🎆
This will accept 128bits and propagate 64. |
@adriancole @jcarres-mdsol Updated code 8247ac4
Please review and merge if you are cool with these changes. Thanks! FYI @kenyamat @hkurniawan-mdsol @lschreck-mdsol |
The zipkin server is ready to accept 128bits or 64 bits. Just send anything you get |
@adriancole @jcarres-mdsol
Is the server expecting a hex string for the traceid part? FYI: @lschreck-mdsol @kenyamat @hkurniawan-mdsol |
@bvillanueva-mdsol good question, the new api doc is not published it seems but I would expect it to be the same format than now |
The new docs when published will say this |
yep same hex encoding. only thing that changes is length (in json/http On Sun, Nov 13, 2016 at 10:49 PM, Jordi Polo Carres <
|
@adriancole @jcarres-mdsol Thanks, Im trying to change the implementation to send 128 or 64 bit to zipkin server. Will update you guys if it is done :) |
great. so initiating128-bit traces should probably be a config flag, ex in brave we say traceId128Bit(true) to start traces as 128-bit On Sun, Nov 13, 2016 at 11:48 PM, Brent Villanueva
|
@@ -33,6 +33,7 @@ service/environment. | |||
- `NotToBeDisplayedDomainList`(optional) - It will be used when logging host name by excluding these strings in service name attribute | |||
e.g. domain: ".xyz.com", host: "abc.xyz.com" will be logged as "abc" only | |||
- `ExcludedPathList`(optional) - Path list that is not needed for tracing. Each item must start with "/". | |||
- `Create128BitTraceId` - Create new traces using 128 bit (32 hex character) traceId |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jcarres-mdsol @adriancole Added bool config Create128BitTraceId to control client if we need to create 128bit or 64 bit traceid
FYI @kenyamat @hkurniawan-mdsol
internal static bool IsParsableToGuid(this string value) | ||
{ | ||
Guid result; | ||
return Guid.TryParse(value, out result); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenyamat @hkurniawan-mdsol This is the only way I researched that is convenient to check if hex string is a valid 128 bit value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure this is normal hex format (16 or 32 char) as opposed to UUID format (with hyphens), In general, I'd avoid GUID as usually it implies the latter (which will break people)
|
||
namespace Medidata.ZipkinTracer.Models | ||
{ | ||
public class Span | ||
{ | ||
public long TraceId { get; set; } | ||
public string TraceId { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kenyamat @hkurniawan-mdsol @jcarres-mdsol I discovered that the long type is not being used. Since TraceProvider already checked if this values are already correct and valid, these values may remain as strings and can be sent as is (string) to zipkin server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah cool
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good as long as it creates trace ids in the correct format (16 or 32 lowerhex characters without hyphens).
It would be nice to see tests using a regular expression or something similar to prove this is the cas.
TraceId = Parse(headerTraceId) ? headerTraceId : GenerateHexEncodedInt64FromNewGuid(); | ||
SpanId = Parse(headerSpanId) ? headerSpanId : TraceId; | ||
ParentSpanId = Parse(headerParentSpanId) ? headerParentSpanId : string.Empty; | ||
TraceId = headerTraceId.IsParsableTo128Or64Bit() ? headerTraceId : GenerateHexFromNewGuid(config.Create128BitTraceId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to check the validity of the trace id here (IsParsableTo128Or64Bit
).
The only thing that we need to check is whether headerTraceId
is null or empty and generate a new one if it does.
Generating a new one due to invalid will break the request flow.
We should just give it to Zipkin server and if it blows up we can at least trace it back to the original request that created the invalid trace id.
What do you think @adriancole, @jcarres-mdsol, @kenyamat, @lschreck-mdsol?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adriancole Let me just move your comment under this thread.
my 2p is that unit tests should be where the code exists. punting all the
way to the server to tell if the id is well formatted seems against that
principle.
How would you track the request back to the original request/service call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so there's two things going on:
- GenerateHexFromNewGuid should be tested particularly that the exact formatting is known (as opposed to presumed). This was the point I was making about ensuring it doesn't have hyphens etc.
- It is pretty common to either restart a trace when headers passed to you malformed, or silently generate a new one. I've seen it done both ways and there are valid arguments some of them captured here: Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId openzipkin/b3-propagation#3 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one way out is to have a flag as to whether to break or restart on malformed ID. For example, in spring-cloud-sleuth we rolled back breaking on propagation error (vs restarting the trace) because the middleware that had the bug couldn't be updated with a fix fast enough and breaking caused production failures. Such bugs are rare, though, so not necessarily the best answer for everyone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree on 1
, I think I just saw a commit on this.
Regarding 2
, I'm not suggesting the middleware to throw exceptions.
I think the middleware should just accept whatever trace id it receives and sends it to Zipkin server. Errors will occur on the server but this should not affect requests and we can see the log (if any) on who initiated the malformed id.
This is what I worry about.
Say that the trace id is malformed when calling Svc 1
and a new one is created.
Apparently the total call of Service A
violates the SLA.
How can we determine which service under Service A
that causes the slowdown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS you can go ahead and test this, too. just start zipkin and send bad
trace ids to its endpoint. Nothing will show up, so you will have no means
to debug the diagram mentioned.
Will try adhoc-ing this with @bvillanueva-mdsol.
Seems like Zipkin is not logging or returning BadRequest
when it receives malformed data.
In that case re-starting the trace is a solution.
This means by swallowing the malformed id we would never know that we have a bug somewhere and we would have requests with broken traces.
If everyone is okay with this I'm okay as well.
my 2p is that unit tests should be where the code exists. punting all the On Wed, Nov 16, 2016 at 10:15 AM, Herry Kurniawan [email protected]
|
… with hyphens (Guid format)
@adriancole Updated unit test to check lowercase hex strings are generated from traceprovider |
|
PS you can go ahead and test this, too. just start zipkin and send bad
trace ids to its endpoint. Nothing will show up, so you will have no means
to debug the diagram mentioned.
|
|
||
namespace Medidata.ZipkinTracer.Core.Test | ||
{ | ||
[TestClass] | ||
public class TraceProviderTests | ||
{ | ||
private const string regex128BitPattern = @"^[a-f0-9]{32}$"; | ||
private const string regex64BitPattern = @"^[a-f0-9]{16}$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
ps feel free to hop on https://gitter.im/openzipkin/zipkin if you like (could be quicker). Seems the idea here is to have a 3rd option, here's all three
In ideal case, the third scenario is best, since upstream is written properly, there are no problems. Ex. let's say we are using the trust approach and service A calls B calls C
In ideal case, this doesn't have any impact, as the correct format is written in the first place, when zipkin gets it, it can parse it. We have seen bugs where people use invalid format, such as signed numbers and even an
Alternatively, let's say we restart a trace when its ID is malformed.
When we restart, not only do we at least see a partial trace, but we also can increment a counter or log the calling context of who propagated a malformed ID. This would include any metadata in the HTTP request. This could help identify and troubleshoot service A faster than seeing no data at all. |
|
$ curl -s localhost:9411/api/v1/spans -H'Content-Type: application/json' -d Malformed reading List from json: hello world Also, there's a counter.. $ curl -s localhost:9411/metrics|jq 2 |
@adriancole @hkurniawan-mdsol The same situation with zipkin dot net client as it is submitted later on another thread. Spans are submitted by batch. |
|
@adriancole, @bvillanueva-mdsol just adhoc-ed with malformed trace id and we get 400
This actually throws an exception on |
@adriancole @hkurniawan-mdsol @kenyamat @jcarres-mdsol Will go for regenerating new traceid if client encounters malformed traceid. Ill take out NRTM (Not ready for merge) tag if I am finish adhocing it on two services. Thanks! |
good luck and thanks for all the effort here! |
Adhoc-ed the changes successfully. @jcarres-mdsol @kenyamat @hkurniawan-mdsol This is ready for merge. Please merge if it is ok. Thanks! @adriancole Thanks for your input in this PR. |
👍 👍 👍 |
great! let me know which version this is released as so I can update the
tracking doc on 128-bit support
|
@adriancole will let you know. I'll be submitting another PR to solve the problem on "X-B3-Sampled" to accept "1" as true and "0" as false. Currently library only supports "true" and "false". Thanks! |
We released a new zipkin dotnet client nuget. Thanks, |
fantastic. thanks! On Mon, Nov 21, 2016 at 1:23 PM, Brent Villanueva [email protected]
|
@jcarres-mdsol do you think you'll have cycles for similar support for zipkin-ruby? |
Information about the problem: #96
This PR will prevent the zipkin client to create a new trace if it encounters a 128 bit trace id. It will accept the trace but only recognizes the lower 16 characters (64 bit).
@kenyamat @lschreck-mdsol @hkurniawan-mdsol @jcarres-mdsol
Please review and merge
FYI
@adriancole