Doing a rolling restart caused the typing worker to spin at 100% CPU until it was restarted #11750
Comments
I think this is related to #8524.
I took the tiniest stab at this but was overwhelmed by the size of the log and the fact that I wasn't really sure what I was looking for; it was a little like looking for a needle in a haystack when you don't know what a needle looks like. But I did find some patterns after the rolling restart (at 12:03) that looked like this:
Specifically, you can see that the TCP handler is attempting to fetch the same rows for the cache over and over (in this case, the rows between 409607869 and 409608239, but there are many other examples of rows it seemed to get stuck on). I have no idea why this happened, but it does explain the graphs a little (CPU getting chewed up trying to fetch the same thing while the other queues pile up). This might be a total red herring, but I thought I'd mention it because the logs for this will be rotated out very shortly. Oh, and for the record, this view is the log after filtering out everything but "fetching replication rows", in case that matters.
Nothing conclusive, just suspicious and making a note.
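As a side note, here is a minimal sketch of the kind of log filtering described above: it keeps only the "Fetching replication rows" lines and counts how often each (stream, start, end) range appears, so a range the handler is stuck on stands out immediately. The log path and surrounding line layout are assumptions; the message format is taken from the log call in the loop quoted below.

```python
# Hypothetical helper: count how often each replication fetch range appears
# in a worker log, to spot ranges the handler keeps re-fetching.
# The log path is an assumption; the message format matches the
# "Fetching replication rows for '%s' between %i and %i" log line below.
import re
from collections import Counter

LINE_RE = re.compile(
    r"Fetching replication rows for '(?P<stream>\w+)' "
    r"between (?P<start>\d+) and (?P<end>\d+)"
)


def count_fetch_ranges(log_path: str) -> Counter:
    counts: Counter = Counter()
    with open(log_path, errors="replace") as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                counts[(m["stream"], m["start"], m["end"])] += 1
    return counts


if __name__ == "__main__":
    # Print the ten most-repeated ranges; a huge count for one range would
    # match the "fetching the same rows over and over" pattern above.
    for key, n in count_fetch_ranges("typing1.log").most_common(10):
        print(n, key)
```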
```python
missing_updates = cmd.prev_token != current_token
while missing_updates:
    logger.info(
        "Fetching replication rows for '%s' between %i and %i",
        stream_name,
        current_token,
        cmd.new_token,
    )
```
I was a little wiser than usual and took a profile: roughly 20% of the time was going to each of these lines: 426, 427 and 434 in synapse/handlers/typing.py (lines 416 to 444 at e24ff8e).
Of course that's just a symptom, probably of the fact that the position resets back to zero, but it's at least mildly interesting to see what's costly here.
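To make that suspected failure mode concrete, here is a simplified, self-contained sketch (not Synapse's actual code; the names and numbers are made up) of how a catch-up loop like the one quoted above can end up fetching the same rows over and over when the stored stream position keeps being reset to zero:

```python
# Simplified sketch (not Synapse's actual code) of the suspected failure mode:
# the worker's stored stream position keeps being reset to zero, so every
# catch-up pass re-fetches the same early window of rows instead of making
# progress, and the worker burns CPU doing it.

ROWS_PER_FETCH = 100  # arbitrary batch size for this sketch


def fetch_rows(start: int, end: int, limit: int = ROWS_PER_FETCH):
    """Stand-in for fetching replication rows in (start, end]; returns the
    rows and the token we actually got up to."""
    upto = min(start + limit, end)
    return list(range(start + 1, upto + 1)), upto


def catch_up(stored_position: dict, new_token: int, passes: int = 3) -> None:
    for _ in range(passes):
        current_token = stored_position["typing"]
        while current_token != new_token:
            print(f"Fetching replication rows between {current_token} and {new_token}")
            _rows, current_token = fetch_rows(current_token, new_token)
        # The bug being illustrated: something resets the stored position back
        # to zero, so the next pass starts over and fetches everything again.
        stored_position["typing"] = 0


catch_up({"typing": 0}, new_token=500)
```

In the real system each pass would cover far more rows, which would match both the repeated log lines and the sustained CPU usage.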
Have we seen this recently? If not, I'd be inclined to close unless it resurfaces.
Rather than keeping them around forever in memory, slowing things down. Fixes #11750.
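That description is only a one-liner, so purely as an illustration of the pattern it names (pruning stale entries from memory rather than keeping them forever), here is a minimal, hypothetical sketch; the class, names and cutoff below are made up and this is not the actual Synapse change:

```python
# Hypothetical sketch of pruning stale entries from an in-memory map instead
# of keeping them forever. The names, structure and cutoff are assumptions;
# this is not the actual Synapse change referenced above.
import time

PRUNE_AFTER_SECS = 60 * 60  # prune anything not touched for an hour (arbitrary)


class StaleEntryCache:
    def __init__(self) -> None:
        # key -> (value, last_touched_timestamp)
        self._entries: dict = {}

    def set(self, key, value) -> None:
        self._entries[key] = (value, time.monotonic())

    def get(self, key, default=None):
        entry = self._entries.get(key)
        return entry[0] if entry is not None else default

    def prune(self) -> None:
        """Drop entries that have not been touched recently, so the map does
        not grow (and slow every scan over it) without bound."""
        cutoff = time.monotonic() - PRUNE_AFTER_SECS
        self._entries = {
            k: (v, ts) for k, (v, ts) in self._entries.items() if ts >= cutoff
        }
```

Calling prune() periodically keeps the map bounded, so scans over it stay cheap.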
When performing a rolling restart of matrix.org (updating 5cc41f1 .. 8e8a008), the `typing1` worker suddenly started using a lot of CPU.

Judging by eye (on Grafana; not with these zoomed-out shots below, which are more to give an idea of the usual expected trend), it looks like it started occurring just after the main process was restarted (the `typing1` worker had not restarted by this point), but equally some synchrotrons and client readers were started at a similar time (and it's hard to see exactly, since the granularity of the graph is not very fine).

It seems to coincide with a high rate of `GET ReplicationGetStreamUpdates`.

RDATA commands (particularly for `caches` and `events`) seem to be queueing up, and that sounds bad.

The oscillatory 'Replication connections' graph seems strange; I don't know what it means exactly as of right now, but it's something that might be worth looking into.
I restarted the worker myself (at approximately 12:44 UTC) to get it to return to normalcy.
Here is an excerpt of the schedule for the worker restarts, in case that's useful: