async operations with patched aiobotocore give error "Already ended segment and subsegment cannot be modified." #164
Comments
Not sure if it is relevant, but I've noticed that the X-Ray segments are built up in a deep tree when using asyncio.gather(). It yields segments:
which doesn't really feel correct, as I would expect a similar outcome to when I call them sequentially:
I'm just raising this as an observation, as maybe this construction is the cause of the issue. If not, then my comment can be discarded. |
I've added annotations in the AsyncContext, where the (sub)segment id and name are being printed. The output when aiobotocore is not being patched:
The output when aiobotocore is being patched:
The number of subsegments that get closed before the segment is closed varies between runs. |
The problem seems to be that aws_xray_sdk keeps a single list of open trace entities, and it runs into trouble when segments are closed out of order (which is a normal pattern when asyncio.gather() is used, and I think common for async code in general). With some debugging I found that async_recorder.py::record_subsegment_async() keeps a reference to the subsegment it created. As soon as it reaches the finally block, it adds the required annotations to the subsegment it has a reference to. But when it then calls end_subsegment, it does not close the subsegment it has a reference to; it closes whatever subsegment is on top of the entities stack, which might be a completely different subsegment. |
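A toy illustration of that mechanism (plain Python, not the SDK's actual code): a single shared LIFO stack of entities only works if whoever opened a subsegment last is also the first to close it, which asyncio.gather() does not guarantee.

# Toy model: two concurrently running handlers share one entity stack.
entities = []

def begin_subsegment(name):
    entities.append(name)
    return name

def end_subsegment():
    # Pops whatever is on top of the stack, not a specific subsegment.
    return entities.pop()

handler1 = begin_subsegment("handler1")  # handler1 opens its subsegment first
handler2 = begin_subsegment("handler2")  # handler2 opens second

# handler1 happens to finish first, but end_subsegment() closes handler2's entity.
closed = end_subsegment()
assert closed == "handler2"  # the wrong subsegment was closed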
I'm experiencing this as well. This happens if I have an async function that uses Once it's switched to |
Hey, Thank you for reporting the issue you are having and giving a detailed explanation of it. This will definitely help us in determining the root cause and finding a solution for it. It sounds like the issue is that the asynchronous tasks aren't holding the appropriate references to their parents, and so when the tasks are finished and the subsegments are closed, only the top-most parents in the context entity list are being referenced. We'll have to investigate this and see what changes are necessary to provide a fix. Please stay tuned. Thanks again for reporting this issue! |
Any update on this issue? I'm experiencing a similar problem, both with and without patching aiobotocore.
This is how I used aws_xray_sdk:
These are the modules I'm using:
|
Hey, Sorry for the delayed response. We're still investigating this internally and are really breaking it down so that we can isolate whether the issue is truly coming from the way we are patching aiobotocore or if it's a bug with how we propagate the context information using TaskLocalStorage. @dangtrinhnt-mmt You mentioned that you are experiencing the problem with or without patching aiobotocore. Is there a way you can provide a minimal reproduction code sample to us? |
Hey, I'm afraid you are staring at the wrong thing. Here is my updated script, which still demonstrates the issue but does not use aiobotocore.
The problem is this:
As you can see, steps 7, 10 and 12 all perform bad operations. If the handlers tried to put an annotation on their segment, handler2 would discover that its segment has already been closed. The X-Ray code assumes that the handlers passed to asyncio.gather() complete in the exact reverse of the order in which they were provided to the asyncio.gather() call. There is no such guarantee; the work is async, so one handler might be much faster than another. I guess what needs to happen is:
|
Thanks for confirming that this is not related to aiobotocore. In our async tasks, we store the context information inside task local storage here. It makes me wonder why this mechanism breaks inside the asyncio.gather() call. By design, each context should be confined to the particular task that is running the code. It seems asyncio.gather() might spin off its own tasks, which might be why we're seeing nested subsegments instead. I think we have a good starting point to work with here. For reference, here is where asyncio.gather() is implemented. |
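A small standalone check of that hypothesis (plain asyncio, no X-Ray involved): asyncio.gather() wraps bare coroutines in new Tasks, so each handler runs in a child Task created by the loop's task factory rather than in the caller's Task.

import asyncio

async def handler():
    # Each gathered coroutine reports the Task it is running in.
    return asyncio.current_task().get_name()

async def main():
    parent = asyncio.current_task().get_name()
    children = await asyncio.gather(handler(), handler())
    print(parent, children)  # e.g. Task-1 ['Task-2', 'Task-3']

asyncio.run(main())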
Oh sorry, I missed your comment. It's a little bit complicated for me to provide the minimal reproduction code sample because the error comes from many of our services in the microservice stack. Let me see if I can do something. Thanks for looking into this. |
Hi! I'm having a similar issue, however not with asyncio but with the segmentio library, which uses a background thread to send HTTP requests. I am doing patch_all and essentially my background threads are failing and not performing their task. |
I'm seeing this using gevent as well, with the regular, non-aio versions of |
Hi all, |
#203 might be relevant. Please see if that's an easier repro. |
I'm running into this as well, any use of |
@stfairy Thanks for the simpler code. Taking a look at it, it doesn't use
The issue with @IvDoorn's repro code is that it holds a reference to the subsegment in order to put an annotation on it after the sleep. When the subsegments are closed out of order, this reference is stale, since the subsegment has already been closed by some other coroutine's execution. This is clearly an issue and we are working on finding an efficient way to address it. |
Hi @jonathan-kosgei |
@srprash Thanks, here is a simplified version of my code: lambda_handler.py
xray.py
|
Hello, any progress on the issue? Would you mind sharing information and thoughts so others can join in the thinking? :) |
Hi @cbuschka, Sorry we haven't been able to figure this one out yet. I spent quite a bit of time trying to fix this issue myself, to no avail. I think the root cause might lie with us using a stack, stored in each task, when using AsyncContext. Unfortunately I spent a bit too much time trying to fix our current setup instead of trying to replace it altogether. It might be worth looking into using contextvars instead. |
Thank you for sharing your findings. I agree that contextvars is the way to go; I think we just have to check backward compatibility. IMHO the problem is caused by the add/remove order mentioned by IvDoorn in comment #164 (comment). I hope I can find some time on the weekend to investigate the rewrite of AsyncContext. I'll come back to you when I have something to share. |
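For anyone evaluating that direction, here is a minimal standalone demonstration (plain asyncio, not wired into the SDK) of why contextvars fits this problem: each Task created by asyncio.gather() receives a copy of the caller's context, so values set inside one child are invisible to its siblings and to the parent.

import asyncio
from contextvars import ContextVar

current_entity = ContextVar("current_entity", default=None)

async def child(name):
    current_entity.set(name)  # only affects this task's copy of the context
    await asyncio.sleep(0)
    return current_entity.get()

async def main():
    current_entity.set("parent-segment")
    results = await asyncio.gather(child("sub-a"), child("sub-b"))
    print(results, current_entity.get())  # ['sub-a', 'sub-b'] parent-segment

asyncio.run(main())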
I have started working on it, but unfortunately I am currently too busy to make any real progress, so I wanted to share my results. You can find them in the repo https://github.com/cbuschka/aws-async-xray-issue. IMHO the general strategy for handling the segments in an async context should be:
If someone is interested in working on it and has the time, feel free to reuse the code if you consider it useful. I may have time to work on it in the last weeks of the year. Feel free to contact me if you have questions regarding the code. |
Hey, any progress on this issue? |
Hi @Padwicker |
Hi, I am sorry for not responding in a timely fashion. I am currently very busy with my job and have had no time to dig into it yet, so do not expect progress on this from my side in the near future. |
Hello, any updates on this issue? |
I tried with |
Surely something like this would work:

from contextvars import ContextVar

entities = ContextVar("entities")
segment = ContextVar("segment")


class ContextProperty:
    # Descriptor that proxies attribute access to the underlying ContextVar.
    def __init__(self, context_var):
        self.context_var = context_var

    def __set__(self, instance, value):
        self.context_var.set(value)

    def __get__(self, instance, owner):
        return self.context_var.get()


class ContextVarStorage(object):
    entities = ContextProperty(entities)
    segment = ContextProperty(segment)

    def clear(self):
        entities.set(None)
        segment.set(None)

The issue is that it's not generic like the other solutions. |
I encountered the OP's original error, and solved it by patching
Detailed Explanation
I agree with @IvDoorn's analysis that this problem has to do with asyncio usage rather than any specific library, and it additionally shows up as a symptom of endlessly nested subsegments (rather than parallel subsegments). Specifically, every asyncio Task will use a shared segment stack (see task_factory), which means subsegments will be added and removed by parallel processing, whereas the segment stack is designed to be used in sequential processing only. My solution was to propagate a copy of the segment stack to child asyncio Tasks. This means that any subsegments created by the Task will have the correct parent. It was over a year ago, so I don't remember exactly what happens to the unused top part of the segment stack in the child Task, but IIRC it gets ignored and just works out in the wash.
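A minimal sketch of that copy-the-stack idea (not the actual library code; it assumes the SDK's TaskLocalStorage keeps its per-task state in a task.context dict with the entity stack under an "entities" key):

import asyncio
import copy

def copying_task_factory(loop, coro):
    # Create the child Task as the default factory would.
    task = asyncio.tasks.Task(coro, loop=loop)
    parent = asyncio.current_task(loop=loop)
    parent_context = getattr(parent, "context", None) if parent else None
    if parent_context is not None:
        # Give the child its own copy of the segment stack so it can push and
        # pop subsegments without corrupting the parent's (or a sibling's) stack.
        task.context = copy.copy(parent_context)
        if "entities" in task.context:
            task.context["entities"] = list(task.context["entities"])
    return task

# Install it on the loop that runs the handlers, e.g.:
# asyncio.get_event_loop().set_task_factory(copying_task_factory)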
|
Hi @garyd203, Thank you for developing this library & proposing your solution. We will add it to our backlog to investigate using your |
@NathanielRN I don't think this PR should be closed. I have updated to a version that includes #340 and am still seeing the original issue.
My package versions are:
I'm planning to try @garyd203's library next to see if it resolves the issue. |
I'm experiencing the same issue and I have the same versions as @jall. |
Reopening this issue as there are multiple reports of it persisting even after PR #340.
I did try & hit this cannot find the current segment/subsegment, please make sure you have a segment open which I couldn't resolve with some brief debugging. IIRC it seemed like the patch library could've been out of date? I ended up removing aiboto from the patched libraries until this is fixed: from aws_lambda_powertools import Tracer
from aws_xray_sdk.core.patcher import SUPPORTED_MODULES
tracer = Tracer(service=service_name, auto_patch=False)
modules_to_patch = [
m for m in SUPPORTED_MODULES if m not in {"aioboto3", "aiobotocore"}
]
tracer.provider.patch(modules_to_patch, raise_errors=False) |
@jall do you happen to have some code that reproduces your error? Context propagation is dependent on the actual calling code, so if it fails (which is what If you don't know where the error comes from, IIRC you can change the config on your dev environment (with something like |
@carolabadeer note that there are multiple similar-looking issues here.
|
I'm still getting the original error as well when using In any case, I don't think that this error should throw an exception out to the caller. Logging shouldn't ever cause my system to break, and xray is effectively a logger. So, similar to the If it helps, I'm using |
@mdnorman can you confirm my suspicion that your strawberry resolver functions are not async (as a consequence of using non-async |
My resolver functions are asynchronous, because they do other things besides the Postgres calls |
Fair enough. How do you call the non-async postgres functions? |
Not at my computer right now, but I don't do anything special. It's just a call to a most-likely already-open connection to make a query. I use a connection pool, so it's not exactly straightforward, but I'm also not explicitly passing any additional context down to it. |
Oh, I mean how do you execute the non-async function from within an async function (like the resolver)? Do you call it directly, put it in a thread pool, or use some other approach? No rush to respond, I'm just trying to figure out how to support your use case. Ta! |
The async function is in a class, and the sync function is in the same class, so it's something like this:

class MyClass:
    async def do_some_async_work(self, params: list[str]):
        value_ids = await asyncio.gather(*[self.get_a_value(param) for param in params])
        return self.get_some_results(value_ids)

    async def get_a_value(self, param: str) -> str:
        response = await call_some_service(param)
        return response["id"]

    def get_some_results(self, value_ids: list[str]) -> list:
        with self.get_db_connection() as conn:
            results = conn.execute("select something from somewhere where value_id in %s", tuple(value_ids)).fetchall()
            return [result[0] for result in results]

No thread-pools for this right now. |
Thanks @mdnorman, it's helpful to know how people are using X-Ray. In your case, I suspect the problem is that the instrumented sync code for
Cheers, |
I'm having this error when patching
My solution was to patch the httpx patcher (pun intended): you can see that the error happens after the response. For some reason the subsegment gets closed before the context manager is exited.

import httpx  # needed for the type hints below

from aws_xray_sdk.core import patch, xray_recorder
from aws_xray_sdk.core.exceptions.exceptions import AlreadyEndedException
from aws_xray_sdk.core.models import http
from aws_xray_sdk.ext.httpx.patch import AsyncInstrumentedTransport
from aws_xray_sdk.ext.util import get_hostname, inject_trace_header


def configure_xray():
    xray_recorder.configure(
        context_missing="IGNORE_ERROR",
    )


async def handle_async_request(
    self: AsyncInstrumentedTransport,
    request: httpx.Request,
) -> httpx.Response:
    async with xray_recorder.in_subsegment_async(
        get_hostname(str(request.url)), namespace="remote"
    ) as subsegment:
        if subsegment is not None:
            subsegment.put_http_meta(http.METHOD, request.method)
            subsegment.put_http_meta(
                http.URL,
                str(
                    request.url.copy_with(password=None, query=None, fragment=None)
                ),
            )
            inject_trace_header(request.headers, subsegment)
        response = await self._wrapped_transport.handle_async_request(request)
        if subsegment is not None:
            # Added try-except here
            try:
                subsegment.put_http_meta(http.STATUS, response.status_code)
            except AlreadyEndedException:
                pass
        return response


AsyncInstrumentedTransport.handle_async_request = handle_async_request

patch(
    [
        "aioboto3",
        "aiobotocore",
        "boto3",
        "botocore",
        "httpx",
    ],
    raise_errors=False,
) |
I'm getting this while using aioboto3 with aws-xray-sdk = "2.13.0". Any progress on a fix? |
Same here, I'm getting this error when using
I'm using AWS Lambda Powertools + aioboto3. Looking forward to a much-needed fix soon. |
I have seen this when trying to enable X-Ray for a Lambda web app using aws-powertools. It fails when making multiple simultaneous aiobotocore calls in
Snippets to demonstrate (incomplete, not runnable as is):

async def get_ssm_param(ssm_client, param_name: str):
    response = await ssm_client.get_parameter(
        Name=param_name,
        WithDecryption=True,
    )
    return response["Parameter"]["Value"]

async def create_web_app() -> App:
    app = App(routes=create_routes())
    aio_session = get_session()
    async with aio_session.create_client("ssm") as ssm:
        username, password = await gather(
            get_ssm_param(ssm, DB_USERNAME),
            get_ssm_param(ssm, DB_PASSWORD),
        )

@metrics.log_metrics(capture_cold_start_metric=True)
@tracer.capture_lambda_handler()
@logger.inject_lambda_context(
    correlation_id_path=correlation_paths.API_GATEWAY_REST,
    log_event=True,
    clear_state=True,
)
def main(event, context):
    global app
    global mangum
    if mangum is ...:
        app = loop.run_until_complete(create_web_app())
        mangum = Mangum(app, lifespan="off")
    return mangum(event, context)
I agree with the above posters that:
|
I'm running into some problems with XRay in combination with async functions and aiobotocore.
Packages:
I've created a very small sample script to demonstrate my issue:
The expected outcome should be something like:
The timings are irrelevant, but it means that both approaches of using aiobotocore and X-Ray work together. However, the actual outcome is:
Using asyncio.gather() doesn't seem to play nice with X-Ray: by the time the call to clouddirectory is made, the subsegment that was used for tracing the call has already been closed. X-Ray only works nicely when the async calls are made sequentially, but that obviously would kind of defeat the benefits of using async. :)
NOTE: if I remove the lines:
from the above test script, everything works well, but obviously no tracing occurs on the clouddirectory calls.
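For reference, here is a minimal sketch of the failing pattern: patched aiobotocore calls fanned out with asyncio.gather() under a single segment. This is a reconstruction rather than the original script; it assumes a recent aiobotocore (aiobotocore.session.get_session), AWS credentials and a region available from the environment, and uses S3 in place of clouddirectory.

import asyncio

from aiobotocore.session import get_session
from aws_xray_sdk.core import patch, xray_recorder
from aws_xray_sdk.core.async_context import AsyncContext

patch(["aiobotocore"])

loop = asyncio.get_event_loop()
xray_recorder.configure(service="xray-gather-repro", context=AsyncContext(loop=loop), sampling=False)

async def list_buckets(session):
    # Each call goes through the patched aiobotocore client and gets its own subsegment.
    async with session.create_client("s3") as s3:
        await s3.list_buckets()

async def main():
    session = get_session()
    xray_recorder.begin_segment("repro")
    try:
        # The gathered calls may complete out of order, which is the pattern
        # reported in this thread to trigger the "Already ended ..." error.
        await asyncio.gather(list_buckets(session), list_buckets(session))
    finally:
        xray_recorder.end_segment()

loop.run_until_complete(main())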