You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
CoreConnectTimeoutError is our most frequent error. According to Sentry, there are many different reasons for this error. We have eliminated many of them, but there are still many remaining reasons. Eliminating, say, 90% of them should be a significant improvement. I want to list most of the possible reasons for this issue. Then, if necessary, we can add separate sub-issues for some of them.
We can split CoreConnectTimeoutError cases into two categories.
Core continues to work. It executes some slow coroutines but cannot communicate with GUI until the timeout, which is currently equal to 120 seconds.
Cory apparently completely "freezes" and, after some point, does not put any output into stdout.
We can remedy the problem in the first case by increasing the timeout time. While it is preferable to connect with the GUI as quickly as possible, it is better to connect with a delay than to have a CoreConnectTimeoutError. For that reason, I suggest increasing the timeout to a longer time, say, 240 seconds.
Cases when Core continues to work till the timeout
The most frequent case is the delay in Component.start(), a generic method all components use. Sometimes, it freezes for 90 seconds or more. It calls the component-specific run method. To see the exact place inside the run method where a specific component freezes, we should enable slow coroutine stack tracing by setting SLOW_CORO_STACK_TRACING to 1.
DownloadManager.start() sometimes takes more than 52 seconds for a single step at await self.load_checkpoints(). I believe it is already fixed in the main branch.
DownloadManager.start_handle(): up to 60 seconds. Apparently, it freezes at the line
_ProactorDatagramTransport._loop_reading sometimes requires up to 5 seconds
<IPv8.start.<locals>.ticker() running at ipv8_service.py:148>: 16 seconds
GuiProcessWatcher:check_gui_process sometimes freezes for 9.74 seconds. It is surprising because the method is straightforward and only contains psutil.Process.is_running() and psutil.Process.status() calls. It looks like they can sometimes be pretty slow.
Cases when Core completely "freezes"
In Sentry, we have multiple events of the same case; the last stdout line is:
INFO - tribler.core.check_os(121) - Check and enable code tracing. Process name: "core". Log dir: "C:\Users\<officer>\AppData\Roaming\.Tribler\7.13\log"
This is the following code fragment. In it, the method check_and_enable_code_tracing actually does nothing except logging because it first checks some command line arguments and does not find them:
defrun_tribler_core_session(...) ->int:
...
trace_logger=check_and_enable_code_tracing('core', log_dir)
loop=asyncio.get_event_loop()
exception_handler=default_core_exception_handlerloop.set_exception_handler(exception_handler.unhandled_error_observer)
try:
exit_code=loop.run_until_complete(core_session(config, components=list(components_gen(config))))
finally:
iftrace_logger:
trace_logger.close()
# Flush the logs to the file before exitingforhandlerinlogging.getLogger().handlers:
handler.flush()
returnexit_code
Core successfully executed the check_and_enable_code_tracing call, but where can it freeze after that? list(components_gen(config)) should output other log records, so the freeze should happen before that.
In my opinion, as a minimum change, we should significantly increase the timeout value and enable slow coroutine stack tracing by default for the binary build. Enabling stack tracing should slow down Core code a bit, but it is not too visible for users (as UI works at the same speed as before) and is crucial to discovering the actual reason for freezes.
The text was updated successfully, but these errors were encountered:
CoreConnectTimeoutError is our most frequent error. According to Sentry, there are many different reasons for this error. We have eliminated many of them, but there are still many remaining reasons. Eliminating, say, 90% of them should be a significant improvement. I want to list most of the possible reasons for this issue. Then, if necessary, we can add separate sub-issues for some of them.
We can split CoreConnectTimeoutError cases into two categories.
We can remedy the problem in the first case by increasing the timeout time. While it is preferable to connect with the GUI as quickly as possible, it is better to connect with a delay than to have a CoreConnectTimeoutError. For that reason, I suggest increasing the timeout to a longer time, say, 240 seconds.
Cases when Core continues to work till the timeout
The most frequent case is the delay in
Component.start()
, a generic method all components use. Sometimes, it freezes for 90 seconds or more. It calls the component-specificrun
method. To see the exact place inside therun
method where a specific component freezes, we should enable slow coroutine stack tracing by settingSLOW_CORO_STACK_TRACING
to 1.DownloadManager.start()
sometimes takes more than 52 seconds for a single step atawait self.load_checkpoints()
. I believe it is already fixed in the main branch.DownloadManager.start_handle()
: up to 60 seconds. Apparently, it freezes at the lineSSLProtocol.connection_made()
sometimes freezes for 2.76 secondsTorrentChecker.check_random_tracker()
sometimes freezes for 30 seconds or more. It is possible that Fix slow database queries, by using partial indexidx_torrentstate__last_check__partial
#7955 fixes it, but other reasons can also cause it. To be sure, we should look at the stack trace after enabling slow coroutine stack tracing._ProactorDatagramTransport._loop_reading
sometimes requires up to 5 seconds<IPv8.start.<locals>.ticker() running at ipv8_service.py:148>
: 16 secondsGuiProcessWatcher:check_gui_process
sometimes freezes for 9.74 seconds. It is surprising because the method is straightforward and only containspsutil.Process.is_running()
andpsutil.Process.status()
calls. It looks like they can sometimes be pretty slow.Cases when Core completely "freezes"
In Sentry, we have multiple events of the same case; the last stdout line is:
This is the following code fragment. In it, the method
check_and_enable_code_tracing
actually does nothing except logging because it first checks some command line arguments and does not find them:Core successfully executed the
check_and_enable_code_tracing
call, but where can it freeze after that?list(components_gen(config))
should output other log records, so the freeze should happen before that.In my opinion, as a minimum change, we should significantly increase the timeout value and enable slow coroutine stack tracing by default for the binary build. Enabling stack tracing should slow down Core code a bit, but it is not too visible for users (as UI works at the same speed as before) and is crucial to discovering the actual reason for freezes.
The text was updated successfully, but these errors were encountered: