-
-
Notifications
You must be signed in to change notification settings - Fork 346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Failing to close a CancelScope produces super confusing errors #882
Comments
That does looks pretty weird. Apparently we have a task that thinks it's inside a certain cancel scope, but the cancel scope doesn't think that task is inside it. That really shouldn't be possible. I can't even think of any way "user space" code could even mess this up, unless you're literally hacking around at private attributes of trio internal objects. I just read through all the code in trio that manages I don't think #552 is related – that's specifically about a user programming error where someone did What might be related is #837, which is another mysterious case involving a cancel scope and a I gather that you can't share your actual code. I'm not sure from your description how easily you can reproduce this. (One of the things that makes #837 difficult to track down is that it only happened once, in a program that was running 16+ hours, so kinda hard to dig into...) Can you reproduce it quickly and reliably? If so then I guess the next step might be just start systematically deleting chunks of the program until it stops crashing. It's kind of an annoying way to debug, but the nice thing is that its guaranteed to tell you something... |
(BTW, just in case you weren't aware, you can now use quart without trio-asyncio. But we should still track this down :-).) |
This actually might be a good hunch, more on that in the code "snippet" at bottom.
I can reliably repro; I'll take a look tomorrow to see if I can make any better headway on what the bare minimum repro is, which hopefully I should be able to share.
Indeed! I'm planning on upgrading that when I also upgrade hypercorn to 0.5.0 to avoid this gross hack -- which incidentally like I mentioned above, might be related. I'm not sure if this will run in isolation, because I've never tested it in isolation (and wanted to strip out some application-specific stuff), but this is at least the general gist. QUART_ENGINE: ContextVar[List[Any]] = ContextVar('quart_engine')
class TaskEngine:
'''Base class for things that need to manage themselves, and are
therefore "booted" during startup.
'''
def __init_subclass__(cls, *args, context, **kwargs):
super().__init_subclass__(*args, **kwargs)
cls.ENGINE_CONTEXT = context
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self._engine_nursery = None
self._engine_token = None
self._engine_cancel_scope = None
async def __aenter__(self):
previous_engines = self.ENGINE_CONTEXT.get(None)
if previous_engines is not None:
# Make a copy to preserve the original context
engines = [*previous_engines, self]
else:
engines = [self]
self._engine_token = self.ENGINE_CONTEXT.set(engines)
self._engine_nursery_manager = trio.open_nursery()
self._engine_cancel_scope = trio.open_cancel_scope().__enter__()
self._engine_nursery = await self._engine_nursery_manager.__aenter__()
await self._engine_nursery.start(self.run)
async def __aexit__(self, exc_type, exc, exc_tb):
self._engine_cancel_scope.cancel()
await self._engine_nursery_manager.__aexit__(exc_type, exc, exc_tb)
self._engine_cancel_scope.__exit__(exc_type, exc, exc_tb)
self.ENGINE_CONTEXT.reset(self._engine_token)
@classmethod
def get(cls):
'''Gets a single (the outermost) task engine.'''
engines = cls.ENGINE_CONTEXT.get()
if engines is None:
return None
else:
return engines[-1]
@classmethod
def get_all(cls):
'''Gets all task engines.'''
return cls.ENGINE_CONTEXT.get([])
async def run(self, *args, task_status=trio.TASK_STATUS_IGNORED, **kwargs):
raise NotImplementedError('Engine subclasses need to implement run!')
class QuartEngine(TaskEngine, context=QUART_ENGINE):
def __init__(self, *args, host, port, **kwargs):
super().__init__(*args, **kwargs)
self.host = host
self.port = port
def _patched_quart_init(self, app, config, loop):
self.app = app
self.config = config
self.loop = loop
@aio_as_trio
async def _hypercorn_run(self, task_status):
'''This is basically a copypasta-ish of run_single inside
hyperccorn.asyncio.run, ported to start from within a coroutine
inside an already-running asyncio loop, and stripping out extra
junk we don't need.
'''
# Keep the namespace pollution inside here please!
from hypercorn.asyncio.lifespan import Lifespan
from hypercorn.asyncio.run import Server
from hypercorn.utils import create_socket
app = self.app
config = self.config
loop = self.loop
loop.set_debug(config.debug)
lifespan = Lifespan(app, config)
lifespan_task = asyncio.ensure_future(lifespan.handle_lifespan())
await lifespan.wait_for_startup()
ssl_context = config.create_ssl_context()
sock = create_socket(config)
server = await loop.create_server(
lambda: Server(app, loop, config), backlog=config.backlog,
ssl=ssl_context, sock=sock
)
task_status.started()
try:
# Biggest non-removal change is here.
await server.serve_forever()
finally:
server.close()
await server.wait_closed()
await lifespan.wait_for_shutdown()
lifespan_task.cancel()
await lifespan_task
async def run(self, task_status=trio.TASK_STATUS_IGNORED):
global _CNC_HOST
global _CNC_PORT
try:
# This sets our app, config, and loop
with patch('quart.app.run_single', self._patched_quart_init):
cnc_app.run(host=self.host, port=self.port)
_CNC_HOST = self.host
_CNC_PORT = self.port
await self._hypercorn_run(task_status)
finally:
self.app._got_first_request = False Side note, this is not my proudest code! But of special note, the nested context manager that's having to manually invoke the nursery dunders might be relevant. |
Umm. Yeah, that is a lot of risky-looking code, and I don't really understand what you're trying to do even, so I can't give you any clear feedback on whether it makes sense. One thing that jumps out at me is that this bit in await self._engine_nursery_manager.__aexit__(exc_type, exc, exc_tb)
self._engine_cancel_scope.__exit__(exc_type, exc, exc_tb) It's very common that All that said... I still don't see any obvious way that this would cause the traceback that you showed at the beginning of the thread :-(. So there's still an open question about that. |
Short answer: don't do that, it's tricky to get right at best. I'd recommend to handle this kind of situation with an AsyncExitStack, assuming that you can't easily do the |
Thanks for the catch! For the sake of debugging, I updated two different ways. First: async def __aexit__(self, exc_type, exc, exc_tb):
self._engine_cancel_scope.cancel()
try:
try:
await self._engine_nursery_manager.__aexit__(exc_type, exc,
exc_tb)
except BaseException:
cancel_exc_info = sys.exc_info()
else:
cancel_exc_info = exc_type, exc, exc_tb
finally:
self._engine_cancel_scope.__exit__(*cancel_exc_info)
finally:
self.ENGINE_CONTEXT.reset(self._engine_token) After that, the strange traceback was still there. However, when I more-properly updated it to... async def __aenter__(self):
previous_engines = self.ENGINE_CONTEXT.get(None)
if previous_engines is not None:
# Make a copy to preserve the original context
engines = [*previous_engines, self]
else:
engines = [self]
self._engine_token = self.ENGINE_CONTEXT.set(engines)
async with AsyncExitStack() as stack:
self._engine_cancel_scope = stack.enter_context(
trio.open_cancel_scope())
self._engine_nursery = await stack.enter_async_context(
trio.open_nursery())
self._engine_exitstack = stack.pop_all()
await self._engine_nursery.start(self.run)
async def __aexit__(self, exc_type, exc, exc_tb):
self._engine_cancel_scope.cancel()
try:
await self._engine_exitstack.aclose()
finally:
self.ENGINE_CONTEXT.reset(self._engine_token) ...and now a ctrl+C results in a I did also notice that in the self._engine_nursery_manager = trio.open_nursery()
self._engine_cancel_scope = trio.open_cancel_scope().__enter__()
self._engine_nursery = await self._engine_nursery_manager.__aenter__() (it should instead be cancel scope Is it worthwhile for me to try and get the traceback back so we can figure out what was actually causing it? If so, it'll have to wait until next weekend, when I'll have some more time to revisit.
Like I said, it's gross code. I wrote it without an internet connection, so I was from-memorying the stdlib and didn't remember the API for AsyncExitStack (and didn't want to waste time That being said, I think there's a documentation problem here, because it didn't really occur to me that the cancel scope |
I'm still super-confused about how you're managing to get that specific error, so if you do find a chance to dig into it then please let us know :-). But, you've solved your problem, and it seems very likely it was related somehow to your code breaking Python semantics that Trio relies on, so at this point it's more a fascinating mystery than an urgent fix. By the way... I still don't understand what this engine thing is trying to do, but I think you can get rid of that cancel scope entirely? Nurseries already have a built-in cancel scope that includes the exact same code as the nursery block; you can access it as |
Will do! Sometime soon, I hope (famous last words).
D'oh. Of course, absolutely. I'd completely forgotten about the built-in nursery cancel scope.
That's a much longer conversation (which isn't really relevant here). The short version is that it's a WIP/prototype of a pattern I'm trying to figure out for running in-process async background tasks while avoiding module-level singletons. There's a lot of "lessons learned the hard way" baked into that, which at some point I should probably write a blog post about or something... in all my spare time (ha!) |
Right. But explicitly calling multiple sub-objects' |
I've run into an error like this a number of times in the course of some of my low-level cancellation work. I'm pretty sure the sequence of events is:
If you look at This could definitely be made less brittle, but we're planning to revamp a bunch of this infrastructure with #958, so maybe we should put this on hold for now and look again after #958 lands? |
Ahh, I'd completely forgotten I said I'd try to get an easy repro of this -- work got especially crazy and I'm not likely to have time anywhen soon. I'm going to defer everything to you guys on this one, because I think at the end of the day the actual problem was, well, me! But I agree that improving the messaging in the resulting errors could help make that more obvious. |
This appears to be the same bug and I have simple code that reproduces the error. Maybe this helps: python-trio/pytest-trio#82 |
Another related issue report in #1055. A first step here might be to have TrioInternalError link to part of the documentation that explains all of the various undefined/misguided/... things we've seen people do that result in TrioInternalErrors. The "please report a bug!" instruction is probably misleading more often than it's right -- the problem is empirically a lot more likely to be in something clever the user is trying to do that violates our invariants, rather than Trio misbehaving despite being used as documented. |
When exiting an outer cancel scope before some inner cancel scope nested inside it, Trio would previously raise a confusing chain of internal errors. One common source of this misbehavior was an async generator that yielded within a cancel scope. Now the mis-nesting condition will be noticed and in many cases the user will receive a single RuntimeError with a useful traceback. This is a best-effort feature, and it's still possible to fool Trio if you try hard enough, such as by closing a nursery's cancel scope from within some child task of the nursery. Fixes python-trio#882. Relevant to python-trio#1056, python-trio#264.
(update: this appears to have been due to failing to close the CancelScope; see: #882 (comment))
I'm not sure either way if this is ultimately a dupe of #552, so feel free to close if so. I'm also having a hell of a hard time figuring out how to reliably reproduce this, and it may actually be an issue in trio_asyncio?
What's the issue?
During teardown, there's a KeyError in
trio._core._run:CancelScope._remove_task
.Steps to reproduce
I haven't been able to repro this outside of one specific script, but I've tried a lot of different things. For the sake of brevity, I'll call that script "voidreader", since that's what I'm actually calling it and if you can surmise any secret sauce from that name then you deserve to know anyways. Background:
trio.run
trio.run
, but I'm pretty confident that isn't related, because I still see the error when I comment out the logging system. However, that library does some patching of the stdlib logging module, which... may or may not be relevant?aio_as_trio
; there's other uses of it as wellI've tried reproing with stuff like this:
but to no avail -- note that control+C still works here, raising a keyboardinterrupt. My hunch is that this may have something to do with the Quart stuff that's running in an asyncio context.
Traceback
Expected behavior
At the very least, a better traceback -- at one point I thought I was seeing that traceback during startup, when ultimately the problem was a recursion issue in my logging patch. While I was trying to repro the trio issue, I stumbled upon the problem in my code, fixed it, and suddenly things ran happily again -- until shutdown, that is, which is where the traceback was originating from anyways. Having more information about surrounding exceptions would have been really helpful early on, to help me isolate which problems were mine vs which problems might be upstream.
Of course also, I'd expect a cleaner shutdown! :)
The text was updated successfully, but these errors were encountered: