-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use local service name as X-Ray subsegment name #88
Conversation
thanks. maybe a unit test to make sure no one breaks this later?
|
I added the test now. One thing related to that: I had to add a test dependency to be able to mock system libraries as described here: https://stackoverflow.com/a/35393623. I hope that's ok! |
I have to admit I don't understand how that test failed :( I haven't touched anything related to SQS publishing. |
That SQS test has been flaking for a month or so. Neither I nor @llinder have had time to fix it it seems |
// using "unknown" subsegment name will help to detect missing names but will also | ||
// result in the service map displaying services depending on an "unknown" one. | ||
if (span.remoteServiceName() == null) { | ||
if ("true".equalsIgnoreCase(System.getenv("AWS_XRAY_USE_LOCAL_SPAN_FOR_MISSING_REMOTES"))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this would be the only time we use environment variables directly in java code of a component. Let's not do that. At worst a flag, but this also seems bad semantically. Also, localServiceName can be null in odd situations (not in brave, but possible if people are writing spans themselves).
@devinsba since you use this code what do you think? Is there another option? What's the impact of the graph not having a correct remote service name?
I suppose it would be telling to see a trace where this happens, too.. When thinking about tests, I was actually more thinking that but didn't mention it. Sometimes seeing the trace data can show what options might not be obvious from looking at the code without any context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not a big fan about using environment variables in java code, but as the address of the X-Ray server is passed as AWS_XRAY_DAEMON_ADDRESS
env var I used the same for consistency.
Having unknown
as X-Ray subsegment name creates the issues described here: #87. I believe that unknown segments when looking at a single trace are not a big issue, but having almost all services reported as depending on unknown in the service map is a problem if the main reason for using Zipkin or X-Ray is to have a map of the service dependencies.
I have copy/pasted examples of a trace (received from a Spring Boot application that uses the Spring Cloud stack, OpenFeign and Hystrix) and the related data sent to X-Ray, will do it again here:
First span (the request is received)
{
"traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
"id": "70cbdfe07b779e32",
"kind": "SERVER",
"name": "get",
"timestamp": 1527232127970111,
"duration": 29131,
"localEndpoint": {
"serviceName": "security-gateway-aws",
"ipv4": "192.168.0.171"
},
"remoteEndpoint": {
"ipv4": "127.0.0.1",
"port": 33060
},
"tags": {
"http.method": "GET",
"http.path": "/test-app/v1/public"
}
}
and the related X-Ray segment:
{
"trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
"id": "70cbdfe07b779e32",
"name": "security-gateway-aws",
"start_time": 1.527232127970111E9,
"end_time": 1.527232127999242E9,
"http": {
"request": {
"method": "GET"
}
},
"annotations": {
"operation": "get",
"http_path": "/test-app/v1/public"
}
}
No problems here.
Second span (the service invokes a remote API via OpenFeign)
{
"traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
"parentId": "70cbdfe07b779e32",
"id": "02e913691d305aa0",
"kind": "CLIENT",
"name": "get",
"timestamp": 1527232127972240,
"duration": 23838,
"localEndpoint": {
"serviceName": "security-gateway-aws",
"ipv4": "192.168.0.171"
},
"tags": {
"http.method": "GET",
"http.path": "/test-app/v1/public"
}
}
and the related X-Ray subsegment:
{
"trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
"parent_id": "70cbdfe07b779e32",
"id": "02e913691d305aa0",
"type": "subsegment",
"namespace": "remote",
"name": "unknown",
"start_time": 1.52723212797224E9,
"end_time": 1.527232127996078E9,
"http": {
"request": {
"method": "GET"
}
},
"annotations": {
"operation": "get",
"http_path": "/test-app/v1/public"
}
}
Besides the conceptual issue that this is not a remote call, yet (but Zipkin cannot know this) here the first time the unknown problem appears.
Third span (Hystrix circuit breaker used by OpenFeign)
{
"traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
"parentId": "02e913691d305aa0",
"id": "f3d698d7b353651f",
"name": "hystrix",
"timestamp": 1527232127972476,
"duration": 23595,
"localEndpoint": {
"serviceName": "security-gateway-aws",
"ipv4": "192.168.0.171"
}
}
and the related X-Ray segment:
{
"trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
"parent_id": "02e913691d305aa0",
"id": "f3d698d7b353651f",
"type": "subsegment",
"name": "unknown",
"start_time": 1.527232127972476E9,
"end_time": 1.527232127996071E9
}
This is correctly not reported as remote segment but the name is unknown. Of all I believe this is the most wrong one as it should definitely use the local service name.
Fourth span (the real HTTP client)
{
"traceId": "5b07b67fed3a0b4570cbdfe07b779e32",
"parentId": "f3d698d7b353651f",
"id": "164d63874559fbdf",
"kind": "CLIENT",
"name": "get",
"timestamp": 1527232127973431,
"duration": 22109,
"localEndpoint": {
"serviceName": "security-gateway-aws",
"ipv4": "192.168.0.171"
},
"tags": {
"http.method": "GET",
"http.path": "/test-app/v1/public"
}
}
and the X-Ray segment:
{
"trace_id": "1-5b07b67f-ed3a0b4570cbdfe07b779e32",
"parent_id": "f3d698d7b353651f",
"id": "164d63874559fbdf",
"type": "subsegment",
"namespace": "remote",
"name": "unknown",
"start_time": 1.527232127973431E9,
"end_time": 1.52723212799554E9,
"http": {
"request": {
"method": "GET"
}
},
"annotations": {
"operation": "get",
"http_path": "/test-app/v1/public"
}
}
Same situation as for the second span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering, as the AWS X-Ray documentation (https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html) states that for subsegments the name should be "The logical name of the subsegment. For downstream calls, name the subsegment after the resource or service called. For custom subsegments, name the subsegment after the code that it instruments (e.g., a function name)." should the span name be used as subsegment name for spans that don't have any kind set and use the current unknown (or the local service name) for CLIENT spans only?
I haven't seen this on the graph yet myself actually, we haven't traced anything that we didn't have end to end. (other than AWS SDK, but that sets the remote endpoint correctly) I honestly think I'd prefer to see unknown in the graph, over using the local service name, as that would indicate to me that I have some more code to instrument. I understand not everyone is doing greenfield development so I get huge luxury of knowing all of my applications external touchpoints |
@devinsba That's why I left it optional, as it all depends by the context one is in. For greenfield development or when you're in full control of the production of traces and spans it makes sense to immediately see if you have missed something, but when delegating the instrumentation and report to Zipkin to libraries like Sleuth you just have to accept the way they decided to report (or try to send PRs there). |
sleuth we have control over. all the folks replying here contribute to it.
let's focus on fixing the problem! by raising issues clearly about the
misses in sleuth.
|
…name is not present
I will go on testing and report here my findings. So far two things:
|
1.
X-Ray mandates segment ids inside traces to be unique while
Sleuth/Zipkin expects the CLIENT span from one services and the SERVER span
from the called service to share the same span id. I don't know how to
easily solve this.
add this to your application properties
# don't use the same span ID on the client and server side of an RPC
spring.sleuth.supports-join=false
1.
When using zuul (we use it in our security gateway) every incoming
request forwarded to services produces 2 CLIENT spans in the trace (as in
the example I posted before). I believe one is the ribbon command and the
second is the actual call to the service.
open an issue in sleuth to see if we can work around the instrumentation
miss? A client span shouldn't be a child of a client span even if zipkin's
server code permits this, we can't necessarily make others (like amazon or
google) do the same.
|
I think marcin adjusted the ribbon stuff so to not create duplicate remote spans. Can you verify this? If so I'd like to keep your changes except the special-cased flag.. |
Yes that has been fixed and verified. There's still one thing I'd like to try for those remote calls without remote service in the span: use the span name as segment name and not mark that as remote. As the X-Ray documentation talks about using the name of the function called that would maybe make more sense than using the local service name and mark the segment as remote, but I don't know how does it play out with client responses and such, so I've to try. The best of course would be that the spans would contain the remote service section, I don't know how many places in sleuth would have to be changed to add that to the instrumentation. |
I don't think it is reasonable to expect the "remote service names" for all
client spans in any zipkin instrumentation. For example, it is valid to not
know the name of the downstream service. We don't want to encourage putting
"dummy values" here as that makes the service functionality worse. Also
worse would be losing the signal that it was an RPC call (kind.client)
Of course, if something was incorrectly marked as client, that's a bug...
just we don't want to scrub the client just to make the service diagram
believe there is no missing data.
I'd suggest to use sleuth's span adjuster for the things unsure about (like
scrubbing the span.kind) to basically decouple pressure here
http://cloud.spring.io/spring-cloud-static/spring-cloud-sleuth/2.0.0.RC2/single/spring-cloud-sleuth.html#_customization_of_reported_spans
It seems about time to get some clarifications from amazon (beyond
documentation), too. Could you possibly raise a support ticket on their
side about the mapping when you don't know the end service name?
|
I have contacted now the AWS Business Support explaining the issue, won't make any code change until I get a reply from them. Usually they reply in 24 hours. |
thx!
|
It took a while but I got this reply from AWS: X-Ray does not have zipkin's concept of client vs server spans. It does have the concept of remote segments and inferred segments. When making a downstream call to a service the segment may include a child subsegment with any name and an ID to show the Client side experience (errors, latency). Typically the SDK populates the CNAME of the downstream service in the subsegment name field. The downstream service generates a new segment with the parent_id set to the id of the previous subsegment. The name of this segment is used as the name of the service graph node. If the downstream service does not send any segment with the parent id set to the id of the caller subsegment then an inferred segment is created for the remote subsegment with the name of the segment set to the name of the subsegment. These inferred segments may result into new nodes in the service graph with the name of the subsegment. Although since these nodes are inferred, the statistics of the node will be inferred from the edges. In this case, please let us know why are you sending subsegments with name set to "unknown" and namespace as "remote". The namespace is correct as long as that indicates a RPC. It is not clear why the name of the subsegment is not to the correct name of the downstream service and is being referring to as "unknown". Looking in the link: #59, that has an update by @adriancole which reads: |
it is probably mutually difficult to follow these conversations ( I mean
from AWS and non AWS point of view ).
The various statements about remote segments and inferred segments elude to
an absence of intermediate or in-process spans, and absence of spans that
originate without an incoming call. This will happen a lot.. there will
definitely be intermediate spans in routine tracing.
Ex A calls B, but along the way there are several sub-operations
in-between. or A calls B, but it wasn't due to an incoming call, rather a
schedule or a circumstance of bootstrap.
So there's one thing to clarify, which is the presence of intermediate
spans, representing local activity, sometimes that local activity is the
root span.
On the other hand there's a strange circumstance of an assumption that you
always know the name of a remote downstream. Such an assumption is so
unrealistic I feel it is more likely we just aren't on the same page at
all. I'm cc'ing abhishek maybe we can find a more direct way to collaborate
as cadence like this isn't likely to resolve things quickly.
@abhiksingh, I will be in Boston prior to a meeting 9th July. Any chance we
can do a video call or even better something in-person to run through
various interop things between Zipkin and AWS X-Ray? cc also @devinsba who
is timezone close for such a thing.
|
AWS is asking to have a phone call with me, but I start feeling I'll be just the man-in-the-middle here adding little if not no value at all. Plus the timing will be inconvenient because I'm located in Finland and the AWS X-Ray team somewhere in the PST time zone. Could I suggest AWS to have some kind of video calls with you guys instead? |
yeah this was my suggestion. i am in Malaysia but on july 9 i am in boston
much closer. can certainly do something meanwhile too like next week end of
their day. you can have someone reach out directly if july is too late.
appreciated!
|
I just replied them asking to have the call done between you and them directly. Hopefully they'll be ok with it! |
The AWS X-Ray service team is ok to have the call with you, could you pass me your contact details? I'll forward those to the support contact. |
Abhishek made a meeting already. I created a google doc to put some key
questions/discussion points in
https://docs.google.com/document/d/1ZFg2SX2DUJFRE8MpodV9DL9uYgNHe-90PMQIb1DrY8k/edit
…On Wed, Jun 27, 2018 at 12:38 PM, robyf ***@***.***> wrote:
The AWS X-Ray service team is ok to have the call with you, could you pass
me your contact details? I'll forward those to the support contact.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#88 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAD61whAy1MHtbRDWb9cWI-cfQTVhqTKks5uAwxggaJpZM4UNlNX>
.
|
plan resulting from feedback from meeting with AWS For the remote subsegment name, use a fallback model:
When users report a problem, tell them to configure a valid remote service name (possibly by decorating a reporter if there's no other way). Also tell users to fix or ask for fixes when there is double-instrumentation (client child of client) as this is a problem and affects statistics. Meanwhile, Amazon will look into skipping bad data when rendering the service graph on their end. |
With reference to #87 this PR adds the possibility to use local service names as segmente name for span of type different than SERVER and CONSUMER.
By default "unknown" is still sent if the remote service name is null, only if the AWS_XRAY_USE_LOCAL_SPAN_FOR_MISSING_REMOTES environment variable is set to
true
the local service name is used.As an example (using spring-boot 2 based Java applications) by default these spans:
result in these X-Ray segments:
while with the environment variable set these spans:
result in these X-Ray segments: