Skip to content
This repository has been archived by the owner on Apr 11, 2019. It is now read-only.

[MCC-261721] Don't break on 128bit X-B3-TraceId by tossing high bits #100

Merged
merged 8 commits into from
Nov 17, 2016

Conversation

bvillanueva-mdsol
Copy link
Contributor

@bvillanueva-mdsol bvillanueva-mdsol commented Nov 2, 2016

Information about the problem: #96

This PR will prevent the zipkin client to create a new trace if it encounters a 128 bit trace id. It will accept the trace but only recognizes the lower 16 characters (64 bit).

@kenyamat @lschreck-mdsol @hkurniawan-mdsol @jcarres-mdsol
Please review and merge

FYI
@adriancole

Copy link

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some notes on the unit test. thanks!

@@ -218,5 +218,91 @@ public void GetNext()
Assert.AreEqual(sut.SpanId, nextTraceProvider.ParentSpanId);
Assert.AreEqual(sut.IsSampled, nextTraceProvider.IsSampled);
}

[TestMethod]
public void IdsWithLessThan32Characters()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LessThan16?

}

[TestMethod]
public void IdsWith32Characters()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With16?

}

[TestMethod]
public void IdsWith64Characters()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With32?

var fixture = new Fixture();
var traceIdLower16Chars = "48485a3953bb6124";
var traceId = "18485a3953bb6124" + traceIdLower16Chars;
var spanIdLower16Chars = "48485a3953bb6125";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please only consider the trace id, not the span or parent span ids, as spans within a trace are always 64-bit

@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole Thanks for the comments. Please check out new updates addressing the comments f9dc4ff. Only the traceId is being "sanitized" to get lower 16 characters of the incoming traceid. Thanks!

@bvillanueva-mdsol
Copy link
Contributor Author

Please merge if ok 👍

@codefromthecrypt
Copy link

heh I don't have merge rights. merge away!

private string GetLower16Characters(string value)
{
if (string.IsNullOrWhiteSpace(value)) return value;
return value.Length > 16 ? value.Substring(value.Length - 16) : value;
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, is this trying to return the first or the last 16 characters?
E.g., if value = "abcdefghijklmnopq", what should be returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will return the last 16 🎆

@jcarres-mdsol
Copy link
Contributor

This will accept 128bits and propagate 64.
I think it is better to accept both and propagate whatever we get so eventually when 128 go throught the system, everything is 128

@bvillanueva-mdsol
Copy link
Contributor Author

bvillanueva-mdsol commented Nov 7, 2016

@adriancole @jcarres-mdsol Updated code 8247ac4
Changes:

  • Zipkin client will accept and propagate 128 bit trace id
  • when trace is prepared for to go inside the queue to be sent to zipkin server, lower 16 characters is only considered.

Please review and merge if you are cool with these changes. Thanks!

FYI @kenyamat @hkurniawan-mdsol @lschreck-mdsol

@jcarres-mdsol
Copy link
Contributor

The zipkin server is ready to accept 128bits or 64 bits. Just send anything you get

@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole @jcarres-mdsol
Question:

  1. When sending a request to server with these conditions:
    path: /api/v1/spans
    method "POST"
    traceId: 128bit

Is the server expecting a hex string for the traceid part?

FYI: @lschreck-mdsol @kenyamat @hkurniawan-mdsol

@jcarres-mdsol
Copy link
Contributor

@bvillanueva-mdsol good question, the new api doc is not published it seems but I would expect it to be the same format than now
BTW, you can explore the api docs here
http://zipkin.io/zipkin-api/#/paths/%252Fspans/post/parameters/span

@jcarres-mdsol
Copy link
Contributor

jcarres-mdsol commented Nov 14, 2016

The new docs when published will say this 8-byte or 16-byte identifier respectively encoded as 16 or 32 lowercase hex characters.

@codefromthecrypt
Copy link

yep same hex encoding. only thing that changes is length (in json/http
headers anyway)

On Sun, Nov 13, 2016 at 10:49 PM, Jordi Polo Carres <
[email protected]> wrote:

The new docs when published will say this "'8-byte or 16-byte identifier
respectively encoded as 16 or 32 lowercase hex characters.'"


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#100 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAD618_Sx7aZlJL__3vSy2UkTy6NVyo9ks5q-AR9gaJpZM4Kmz1u
.

@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole @jcarres-mdsol Thanks, Im trying to change the implementation to send 128 or 64 bit to zipkin server. Will update you guys if it is done :)

@codefromthecrypt
Copy link

great. so initiating128-bit traces should probably be a config flag,
propagating them shouldn't require config.

ex in brave we say traceId128Bit(true) to start traces as 128-bit
(32-character length in headers/json or set traceIdHigh in thrift)

openzipkin/b3-propagation#6

On Sun, Nov 13, 2016 at 11:48 PM, Brent Villanueva
[email protected] wrote:

@adriancole @jcarres-mdsol Thanks, Im trying to change the implementation to
send 128 or 64 bit to zipkin server. Will update you guys if it is done :)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@@ -33,6 +33,7 @@ service/environment.
- `NotToBeDisplayedDomainList`(optional) - It will be used when logging host name by excluding these strings in service name attribute
e.g. domain: ".xyz.com", host: "abc.xyz.com" will be logged as "abc" only
- `ExcludedPathList`(optional) - Path list that is not needed for tracing. Each item must start with "/".
- `Create128BitTraceId` - Create new traces using 128 bit (32 hex character) traceId
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcarres-mdsol @adriancole Added bool config Create128BitTraceId to control client if we need to create 128bit or 64 bit traceid

FYI @kenyamat @hkurniawan-mdsol

internal static bool IsParsableToGuid(this string value)
{
Guid result;
return Guid.TryParse(value, out result);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kenyamat @hkurniawan-mdsol This is the only way I researched that is convenient to check if hex string is a valid 128 bit value

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure this is normal hex format (16 or 32 char) as opposed to UUID format (with hyphens), In general, I'd avoid GUID as usually it implies the latter (which will break people)


namespace Medidata.ZipkinTracer.Models
{
public class Span
{
public long TraceId { get; set; }
public string TraceId { get; set; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kenyamat @hkurniawan-mdsol @jcarres-mdsol I discovered that the long type is not being used. Since TraceProvider already checked if this values are already correct and valid, these values may remain as strings and can be sent as is (string) to zipkin server.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah cool

Copy link

@codefromthecrypt codefromthecrypt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good as long as it creates trace ids in the correct format (16 or 32 lowerhex characters without hyphens).

It would be nice to see tests using a regular expression or something similar to prove this is the cas.

TraceId = Parse(headerTraceId) ? headerTraceId : GenerateHexEncodedInt64FromNewGuid();
SpanId = Parse(headerSpanId) ? headerSpanId : TraceId;
ParentSpanId = Parse(headerParentSpanId) ? headerParentSpanId : string.Empty;
TraceId = headerTraceId.IsParsableTo128Or64Bit() ? headerTraceId : GenerateHexFromNewGuid(config.Create128BitTraceId);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to check the validity of the trace id here (IsParsableTo128Or64Bit).
The only thing that we need to check is whether headerTraceId is null or empty and generate a new one if it does.
Generating a new one due to invalid will break the request flow.
We should just give it to Zipkin server and if it blows up we can at least trace it back to the original request that created the invalid trace id.

What do you think @adriancole, @jcarres-mdsol, @kenyamat, @lschreck-mdsol?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriancole Let me just move your comment under this thread.

my 2p is that unit tests should be where the code exists. punting all the
way to the server to tell if the id is well formatted seems against that
principle.

How would you track the request back to the original request/service call?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so there's two things going on:

  1. GenerateHexFromNewGuid should be tested particularly that the exact formatting is known (as opposed to presumed). This was the point I was making about ensuring it doesn't have hyphens etc.
  2. It is pretty common to either restart a trace when headers passed to you malformed, or silently generate a new one. I've seen it done both ways and there are valid arguments some of them captured here: Define what behavior is expected when X-B3-TraceId is present, but not X-B3-SpanId openzipkin/b3-propagation#3 (comment)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one way out is to have a flag as to whether to break or restart on malformed ID. For example, in spring-cloud-sleuth we rolled back breaking on propagation error (vs restarting the trace) because the middleware that had the bug couldn't be updated with a fix fast enough and breaking caused production failures. Such bugs are rare, though, so not necessarily the best answer for everyone.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on 1, I think I just saw a commit on this.

Regarding 2, I'm not suggesting the middleware to throw exceptions.
I think the middleware should just accept whatever trace id it receives and sends it to Zipkin server. Errors will occur on the server but this should not affect requests and we can see the log (if any) on who initiated the malformed id.

This is what I worry about.
image
Say that the trace id is malformed when calling Svc 1 and a new one is created.
Apparently the total call of Service A violates the SLA.
How can we determine which service under Service A that causes the slowdown?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS you can go ahead and test this, too. just start zipkin and send bad
trace ids to its endpoint. Nothing will show up, so you will have no means
to debug the diagram mentioned.

Will try adhoc-ing this with @bvillanueva-mdsol.

Seems like Zipkin is not logging or returning BadRequest when it receives malformed data.
In that case re-starting the trace is a solution.
This means by swallowing the malformed id we would never know that we have a bug somewhere and we would have requests with broken traces.
If everyone is okay with this I'm okay as well.

@codefromthecrypt
Copy link

my 2p is that unit tests should be where the code exists. punting all the
way to the server to tell if the id is well formatted seems against that
principle.

On Wed, Nov 16, 2016 at 10:15 AM, Herry Kurniawan [email protected]
wrote:

@hkurniawan-mdsol commented on this pull request.

In src/Medidata.ZipkinTracer.Core/TraceProvider.cs
#100 (review)
:

@@ -72,11 +73,11 @@ internal TraceProvider(IZipkinConfig config, IOwinContext context = null)
headerSampled = context.Request.Headers[SampledHeaderName];
}

  •        TraceId = Parse(headerTraceId) ? headerTraceId : GenerateHexEncodedInt64FromNewGuid();
    
  •        SpanId = Parse(headerSpanId) ? headerSpanId : TraceId;
    
  •        ParentSpanId = Parse(headerParentSpanId) ? headerParentSpanId : string.Empty;
    
  •        TraceId = headerTraceId.IsParsableTo128Or64Bit() ? headerTraceId : GenerateHexFromNewGuid(config.Create128BitTraceId);
    

I don't think we need to check the validity of the trace id here (
IsParsableTo128Or64Bit).
The only thing that we need to check is whether headerTraceId is null or
empty and generate a new one if it does.
Generating a new one due to invalid will break the request flow.
We should just give it to Zipkin server and if it blows up we can at least
trace it back to the original request that created the invalid trace id.

What do you think @adriancole https://github.com/adriancole,
@jcarres-mdsol https://github.com/jcarres-mdsol, @kenyamat
https://github.com/kenyamat, @lschreck-mdsol
https://github.com/lschreck-mdsol?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#100 (review),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAD61z55jlD-RgDYAafL5navod7GBw52ks5q-mdbgaJpZM4Kmz1u
.

@h-kurniawan h-kurniawan reopened this Nov 16, 2016
@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole Updated unit test to check lowercase hex strings are generated from traceprovider
05c90b6 8d222f7

@codefromthecrypt
Copy link

Regarding 2, I'm not suggesting the middleware to throw exceptions.
I think the middleware should just accept whatever trace id it receives
and sends it to Zipkin server. Errors will occur on the server but this
should not affect requests and we can see the log (if any) on who initiated
the malformed id.

This is what I worry about.
[image: image]
https://cloud.githubusercontent.com/assets/21052449/20336050/b75ea3e8-ac0a-11e6-95ea-9a8b285cab2a.png
Say that the trace id is malformed when calling Svc 1 and a new one is
created.
Apparently the total call of Service A violates the SLA.
How can we determine which service under Service A that causes the
slowdown?

If a malformed ID is sent to the server, then you won't see trace at all as
the span will be dropped. Your only signal would be a metric increment for
span drop (due to parse failure). Restarting the trace allows you to see
the operation even if its predecessor was buggy. always encoding malformed
IDs will result in seeing nothing.

@codefromthecrypt
Copy link

codefromthecrypt commented Nov 16, 2016 via email


namespace Medidata.ZipkinTracer.Core.Test
{
[TestClass]
public class TraceProviderTests
{
private const string regex128BitPattern = @"^[a-f0-9]{32}$";
private const string regex64BitPattern = @"^[a-f0-9]{16}$";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@codefromthecrypt
Copy link

ps feel free to hop on https://gitter.im/openzipkin/zipkin if you like (could be quicker).

Seems the idea here is to have a 3rd option, here's all three

  • restart on malformed
  • break caller on malformed
  • trust (propagate and report transparently)

In ideal case, the third scenario is best, since upstream is written properly, there are no problems.

Ex. let's say we are using the trust approach and service A calls B calls C

  • service A chooses the trace id format and this is used as is when reporting (with json)
  • if service A writes a correct format, if B and C report via json, they might use those values directly (depending on how they encode things).
  • So, we are saying each downstream service blindly trusts and reports on basis of the trace ID format propagated to them upstream.

In ideal case, this doesn't have any impact, as the correct format is written in the first place, when zipkin gets it, it can parse it. We have seen bugs where people use invalid format, such as signed numbers and even an - sign by accident. These bugs are rare and easily fixed, but here's the impact on exception.

  • service A chooses and invalid traceId, like abs-123-💩, and passes it via X-B3-TraceId
    • when service A reports the span, the zipkin collector drops it, because abs-123-💩 is malformed.
  • service B uses the trustful approach and makes a new span id without looking at the trace id abs-123-💩. It propagates that downstream
    • when service B reports its span, the zipkin collector drops it, because abs-123-💩 is malformed.
  • this repeats for each service that trusts the trace id without looking at it. Nothing ends up in zipkin, as all spans were dropped

Alternatively, let's say we restart a trace when its ID is malformed.

  • service A chooses and invalid traceId, like abs-123-💩, and passes it via X-B3-TraceId
    • when service A reports the span, the zipkin collector drops it, because abs-123-💩 is malformed.
  • service B looks at abs-123-💩 and sees it is malformed, restarts the trace with a new ID.
    • when service B reports its span, it appears as the root span, so you don't see service A directly

When we restart, not only do we at least see a partial trace, but we also can increment a counter or log the calling context of who propagated a malformed ID. This would include any metadata in the HTTP request. This could help identify and troubleshoot service A faster than seeing no data at all.

@codefromthecrypt
Copy link

Will try adhoc-ing this with @bvillanueva-mdsol
https://github.com/bvillanueva-mdsol.

Seems like Zipkin is not logging or returning BadRequest when it receives
malformed data.
In that case re-starting the trace is a solution.
This means by swallowing the malformed id we would never know that we have
a bug somewhere and we would have requests with broken traces.
If everyone is okay with this I'm okay as well.

It is worse than this, actually. In most libraries, reporting to zipkin is
asynchronous, which means the call that had the malformed ID already took
place, sometime later it is sent to zipkin. Moreover, many users send via
Kafka or other async transports which would offer no up-front validation.

@codefromthecrypt
Copy link

Seems like Zipkin is not logging or returning BadRequest when it receives
malformed data.

if you are running latest zipkin, it does actually show an error when
there's malformed data. I think this was recently noticed.

$ curl -s localhost:9411/api/v1/spans -H'Content-Type: application/json' -d
'hello world'

Malformed reading List from json: hello world

Also, there's a counter..

$ curl -s localhost:9411/metrics|jq
'."counter.zipkin_collector.messages_dropped.http"'

2

@bvillanueva-mdsol
Copy link
Contributor Author

bvillanueva-mdsol commented Nov 16, 2016

It is worse than this, actually. In most libraries, reporting to zipkin is
asynchronous, which means the call that had the malformed ID already took
place, sometime later it is sent to zipkin. Moreover, many users send via
Kafka or other async transports which would offer no up-front validation.

@adriancole @hkurniawan-mdsol The same situation with zipkin dot net client as it is submitted later on another thread. Spans are submitted by batch.

@codefromthecrypt
Copy link

It is worse than this, actually. In most libraries, reporting to zipkin is
asynchronous, which means the call that had the malformed ID already took
place, sometime later it is sent to zipkin. Moreover, many users send via
Kafka or other async transports which would offer no up-front validation.

@adriancole https://github.com/adriancole @hkurniawan-mdsol
https://github.com/hkurniawan-mdsol The same situation with zipkin dot
net client as it is submitted later on another thread. Spans are submitted
by batch

ok yeah, so in this case, even if you look at the http response from
zipkin in the batch sender, you have less context about the caller that
created one of the spans, unless in your zipkin span you are logging
CLIENT_ADDR(ca). Still debuggable, but a bit more tricky vs linting at the
point of propagation.

@h-kurniawan
Copy link

@adriancole, @bvillanueva-mdsol just adhoc-ed with malformed trace id and we get 400 BadRequest back from Zipkin.

Failed to send spans to Zipkin server (HTTP status code returned: BadRequest). Exception message: The remote server returned an error: (400) Bad Request., response from server: Cannot decode spans due to IllegalArgumentException(47318d050e42707x should be a 1 to 32 character lower-hex string with no prefix reading List<Span> from json: [{"traceId":"47318d050e42707x","name":"get","id":"452b4857980c7431","annotations":[{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"value":"sr","timestamp":1479284551122624},{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"value":"ss","timestamp":1479284551132626}],"binaryAnnotations":[{"endpoint":{"ipv4":"192.168.56.1","port":12008,"serviceName":"localhost"},"key":"http.path","value":"/help_center"}]}])

This actually throws an exception on ZipkinTracerModule which being catched and logged to presumably SumoLogic.
Since we are logging these errors so generating trace id should not be a problem.

@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole @hkurniawan-mdsol @kenyamat @jcarres-mdsol Will go for regenerating new traceid if client encounters malformed traceid. Ill take out NRTM (Not ready for merge) tag if I am finish adhocing it on two services. Thanks!

@codefromthecrypt
Copy link

good luck and thanks for all the effort here!

@bvillanueva-mdsol
Copy link
Contributor Author

bvillanueva-mdsol commented Nov 17, 2016

Adhoc-ed the changes successfully.
Checked that server will accept 128 bit as well as 64 bit. and the client propagates the traceid (128 or 64 bit)

@jcarres-mdsol @kenyamat @hkurniawan-mdsol This is ready for merge. Please merge if it is ok. Thanks!

@adriancole Thanks for your input in this PR.

@kenyamat
Copy link
Contributor

👍 👍 👍

@kenyamat kenyamat merged commit 0965857 into develop Nov 17, 2016
@bvillanueva-mdsol bvillanueva-mdsol deleted the feature/accept128bit-ids branch November 17, 2016 02:05
@codefromthecrypt
Copy link

codefromthecrypt commented Nov 17, 2016 via email

@bvillanueva-mdsol
Copy link
Contributor Author

bvillanueva-mdsol commented Nov 17, 2016

@adriancole will let you know. I'll be submitting another PR to solve the problem on "X-B3-Sampled" to accept "1" as true and "0" as false. Currently library only supports "true" and "false".

Thanks!

@bvillanueva-mdsol
Copy link
Contributor Author

@adriancole

We released a new zipkin dotnet client nuget.
Please check out release page
https://github.com/mdsol/Medidata.ZipkinTracerModule/releases/tag/v3.2.0
and nuget page
https://www.nuget.org/packages/Medidata.ZipkinTracer.Core/3.2.0

Thanks,
Brent

@codefromthecrypt
Copy link

fantastic. thanks!

On Mon, Nov 21, 2016 at 1:23 PM, Brent Villanueva [email protected]
wrote:

@adriancole https://github.com/adriancole

We released a new zipkin dotnet client nuget.
Please check out release page
https://github.com/mdsol/Medidata.ZipkinTracerModule/releases/tag/v3.2.0
and nuget page
https://www.nuget.org/packages/Medidata.ZipkinTracer.Core/3.2.0

Thanks,
Brent


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#100 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAD612aTdDYh1UApbhKZGYcu-L8GzkHJks5rASrugaJpZM4Kmz1u
.

@codefromthecrypt
Copy link

@jcarres-mdsol do you think you'll have cycles for similar support for zipkin-ruby?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants