Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[7.13.0] CoreCrashedError when REST HTTP server is starting (with explanation) #7592

Closed
kozlovsky opened this issue Sep 5, 2023 · 2 comments
Assignees
Milestone

Comments

@kozlovsky
Copy link
Contributor

kozlovsky commented Sep 5, 2023

According to Sentry, this is the most popular error in the latest release.

Note that ALL recorded occurrences of the problem in 7.13.0 are on Win32 (Windows 7, 8, 10) and never on Win64.
It might not be by accident

Now, after extended logging was added in 7.13, it is possible to decipher what is happening.

Analysis

The error happens at Tribler start when GUI tries to connect to Core, and Core is starting the REST HTTP API.

tribler.gui.utilities.CreationTraceback:
  File "run_tribler.py", line 115, in <module>
  File "tribler\gui\start_gui.py", line 79, in run_gui
  File "tribler\gui\upgrade_manager.py", line 249, in on_worker_finished
   File "tribler\gui\utilities.py", line 435, in trackback_wrapper
  File "tribler\gui\core_manager.py", line 132, in start_tribler_core
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "tribler\gui\utilities.py", line 438, in trackback_wrapper
  File "tribler\gui\utilities.py", line 435, in trackback_wrapper
  File "tribler\gui\core_manager.py", line 304, in on_core_finished
tribler.gui.exceptions.CoreCrashedError: The Tribler core has unexpectedly finished with exit code 1 and status: 0.
Error message: Application error

The traceback says that the Core unexpectedly stopped with some error. We can see the Core traceback in the last core stdout/stderr output fragment:

RESTComponent: exception in RESTComponent.start(): TimeoutError:
Traceback (most recent call last):
  File "tribler\core\components\component.py", line 39, in start
  File "tribler\core\components\restapi\restapi_component.py", line 139, in run
  File "tribler\core\components\restapi\rest\rest_manager.py", line 149, in start
  File "tribler\core\components\restapi\rest\rest_manager.py", line 167, in start_http_site
  File "asyncio\tasks.py", line 490, in wait_for
asyncio.exceptions.TimeoutError

The corresponding code is:

async def start_http_site(self):
    api_port = ...
    self.site = web.TCPSite(self.runner, self.http_host, api_port, shutdown_timeout=self.shutdown_timeout)
    self._logger.info(...)
    try:
        # The self.site.start() is expected to start immediately. It looks like on some machines, it hangs.
        # The timeout is added to prevent the hypothetical hanging.
        await asyncio.wait_for(self.site.start(), timeout=SITE_START_TIMEOUT)
    except ...

The call of self.site.start() should be fast; why do we have this TimeoutError? An additional newly added log information reveals this (right before the TimeoutError):

ERROR <watching_thread:96> tribler.core.utilities.slow_coro_detection._report_freeze():
Still executing <Task pending name='DownloadManager:start_handle 2' coro=<DownloadManager.start_handle() running at tribler\core\components\libtorrent\download_manager\download_manager.py:634> cb=[TaskManager.register_task.<locals>.done_cb() at ipv8\taskmanager.py:128]>
Stack for <Task pending name='DownloadManager:start_handle 2' coro=<DownloadManager.start_handle() running at tribler\core\components\libtorrent\download_manager\download_manager.py:634> cb=[TaskManager.register_task.<locals>.done_cb() at ipv8\taskmanager.py:128]> (most recent call last):
  File "asyncio\events.py", line 81, in _run
  File "tribler\core\components\libtorrent\download_manager\download_manager.py", line 634, in start_handle
...
ERROR <watching_thread:96> tribler.core.utilities.slow_coro_detection._report_freeze():
Still executing <Task pending name='DownloadManager:start_handle 2' coro=<DownloadManager.start_handle() running at tribler\core\components\libtorrent\download_manager\download_manager.py:634> cb=[TaskManager.register_task.<locals>.done_cb() at ipv8\taskmanager.py:128]>
Stack for <Task pending name='DownloadManager:start_handle 2' coro=<DownloadManager.start_handle() running at tribler\core\components\libtorrent\download_manager\download_manager.py:634> cb=[TaskManager.register_task.<locals>.done_cb() at ipv8\taskmanager.py:128]> (most recent call last):
  File "asyncio\events.py", line 81, in _run
  File "tribler\core\components\libtorrent\download_manager\download_manager.py", line 634, in start_handle
...
ERROR <patch:52> tribler.core.utilities.slow_coro_detection._report_long_duration():
Slow coroutine step execution (duration=7.290 seconds):
<Task pending name='DownloadManager:start_handle 2' coro=<DownloadManager.start_handle() running at tribler\core\components\libtorrent\download_manager\download_manager.py:662> wait_for=<Future pending cb=...

That means, when wait_for was about to start its inner coroutine self.site.start(), another unrelated coroutine DownloadManager.start_handle occupied the asyncio loop for more than seven seconds, and as a result, the five-second timeout of wait_for(self.site.start(), ...) was triggered even before the server has any chance to start.

You can see from the log that DownloadManager:start_handle spends most of its time on line 634. This line is:

ltsession = self.get_session(download.config.get_hops())

Where get_session is:

def get_session(self, hops=0):
    if hops not in self.ltsessions:
        self.ltsessions[hops] = self.create_session(hops)

    return self.ltsessions[hops]

On my machine, the typical call of DownloadManager.create_session takes about 20-30 milliseconds to complete. But it looks on some machines for some downloads it takes more than seven seconds to create and initialize a libtorrent session.

Conclusion

  • The immediate fix should be removing the wait_for wrapper off the self.site.start() call. An execution of completely unrelated coroutines triggers the timeout, so it does not help us to manage the DownloadManager.site.start() call. Or at least we can significantly increase the timeout, say, to one minute or more.
  • Then we should move the slow synchronous call DownloadManager.create_session(...) out of the asyncio loop. It is hard to say why the execution of this method is surprisingly slow sometimes, but from Sentry error reports, we can see that in all cases when the TimeoutError was triggered, it was a DownloadManager.create_session(...) call which was responsible for that.
  • To better understand why the DownloadManager.create_session(...) call is so slow sometimes, we can apply a recently added @profile decorator to it.
  • Also, we can add a way to attach profiling results to the Sentry report to be able to analyze them.

Partially related to #7065

@sentry-for-tribler
Copy link

Sentry issue: TRIBLER-T0

@kozlovsky kozlovsky self-assigned this Sep 5, 2023
@kozlovsky kozlovsky added this to the 7.13.1 milestone Sep 5, 2023
kozlovsky added a commit to kozlovsky/tribler that referenced this issue Oct 2, 2023
…start()` when starting HTTP REST API server
kozlovsky added a commit to kozlovsky/tribler that referenced this issue Oct 2, 2023
…start()` when starting HTTP REST API server
kozlovsky added a commit that referenced this issue Oct 2, 2023
Fixes #7592: remove the `wait_for` wrapper around `await site.start()` when starting HTTP REST API server
@kozlovsky
Copy link
Contributor Author

Fixed in #7617

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

1 participant