
synapse wedges after logrotation #8892

Closed
michaelkaye opened this issue Dec 8, 2020 · 1 comment · Fixed by #8918

@michaelkaye (Contributor)

No obvious activity or CPU spinning in py-spy top; everything looked hung and the process was just sitting idle.

Did see this in a py-spy dump:

Thread 52103 (idle): "MainThread"
    start (synapse/metrics/background_process_metrics.py:266)
    set_current_context (synapse/logging/context.py:628)
    __exit__ (synapse/logging/context.py:587)
    call_later (synapse/util/__init__.py:131)
    run_sighup (synapse/app/_base.py:263)
    start (synapse/metrics/background_process_metrics.py:267)
    set_current_context (synapse/logging/context.py:628)
    __enter__ (synapse/logging/context.py:584)
    messageReceived (synapse/replication/tcp/redis.py:108)
    replyReceived (txredisapi.py:1919)
    dataReceived (txredisapi.py:1857)
    _dataReceived (twisted/internet/tcp.py:249)
    doRead (twisted/internet/tcp.py:243)
    _doReadOrWrite (twisted/internet/posixbase.py:614)
    callWithContext (twisted/python/context.py:85)
    callWithContext (twisted/python/context.py:122)
    callWithContext (twisted/python/log.py:86)
    callWithLogger (twisted/python/log.py:103)
    doPoll (twisted/internet/epollreactor.py:235)
    mainLoop (twisted/internet/base.py:1295)
    run (twisted/internet/base.py:1283)
    run (synapse/app/_base.py:117)
    start_reactor (synapse/app/_base.py:133)
    start_worker_reactor (synapse/app/_base.py:80)
    start (synapse/app/generic_worker.py:988)
    <module> (synapse/app/generic_worker.py:993)
    _run_code (runpy.py:85)
    _run_module_as_main (runpy.py:193)
Thread 52107 (idle): "Thread-1"
    select (selectors.py:415)
    serve_forever (socketserver.py:232)
    run (threading.py:865)
    _bootstrap_inner (threading.py:917)
    _bootstrap (threading.py:885)
Thread 52108 (idle): "PoolThread-140238481588800-0"
    wait (threading.py:296)
    get (queue.py:170)
    work (twisted/_threads/_threadworker.py:45)
    run (threading.py:865)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:917)
    _bootstrap (threading.py:885)

... more of the same
Thread 52118 (idle): "PoolThread-twisted.internet.reactor-0"
    wait (threading.py:296)
    get (queue.py:170)
    work (twisted/_threads/_threadworker.py:45)
    run (threading.py:865)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:917)
    _bootstrap (threading.py:885)
Thread 44785 (idle): "raven-sentry.BackgroundWorker"
    wait (threading.py:296)
    get (queue.py:170)
    _target (sentry_sdk/worker.py:122)
    run (threading.py:865)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:917)
    _bootstrap (threading.py:885)
Thread 56726 (idle): "Thread-19758"
    collect (synapse/metrics/background_process_metrics.py:121)
    collect (prometheus_client/registry.py:75)
    collect (synapse/metrics/__init__.py:58)
    generate_latest (synapse/metrics/_exposition.py:97)
    do_GET (synapse/metrics/_exposition.py:203)
    handle_one_request (http/server.py:414)
    handle (http/server.py:426)
    __init__ (socketserver.py:720)
    finish_request (socketserver.py:360)
    process_request_thread (socketserver.py:650)
    run (threading.py:865)
    run (sentry_sdk/integrations/threading.py:67)
    _bootstrap_inner (threading.py:917)
    _bootstrap (threading.py:885)
... and this stack trace repeated as well

Symptoms: log rotation ran, the SIGHUP was sent, logrotate was happy, and then the service stopped responding to monitoring requests. No logs, and I think no processing either.

@erikjohnston (Member)

Thread 52103 (idle): "MainThread"
    start (synapse/metrics/background_process_metrics.py:266)
    set_current_context (synapse/logging/context.py:628)
    __exit__ (synapse/logging/context.py:587)
    call_later (synapse/util/__init__.py:131)
    run_sighup (synapse/app/_base.py:263)
    start (synapse/metrics/background_process_metrics.py:267)
    ...

where background_process_metrics.py:266-267 is:

    with _bg_metrics_lock:
        _background_processes_active_since_last_scrape.add(self._proc)

Gah, it looks like the SIGHUP arrived while _bg_metrics_lock was already held on the main thread, and when the signal handler registers the callback it fiddles with logcontexts (which tries to update the background-process stats and so needs the same, non-reentrant lock), hence the wedge.
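
To make the failure mode concrete, here is a minimal standalone sketch (not Synapse code, and it deliberately hangs if run) of the same pattern: a non-reentrant threading.Lock held on the main thread, and a SIGHUP handler running on that same thread that tries to take the lock again.

    # Deliberately deadlocks: the main thread holds _lock, the SIGHUP handler
    # runs on the main thread too and blocks forever trying to re-acquire it.
    import os
    import signal
    import threading

    _lock = threading.Lock()  # stand-in for _bg_metrics_lock; not reentrant

    def on_sighup(signum, frame):
        # In Synapse this is run_sighup -> call_later -> logcontext bookkeeping
        # -> _bg_metrics_lock; here we just take the lock directly.
        with _lock:
            print("handler got the lock")  # never reached

    signal.signal(signal.SIGHUP, on_sighup)

    with _lock:
        # Signal delivered while the lock is held by this (the main) thread.
        os.kill(os.getpid(), signal.SIGHUP)
        print("after kill")  # never reached: we are wedged in the handler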

I think the answer here is we shouldn't be using callLater but instead reactor.callFromThread (which is thread and signal safe, and bypasses all Synapse code entirely).
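
A rough sketch of that suggestion, assuming a hypothetical do_sighup_work() for the real work: reactor.callFromThread just queues the callable and wakes the reactor, so nothing lock- or logcontext-related runs inside the signal handler itself.

    # Sketch only: the handler does nothing but hand the work to the reactor.
    import signal
    from twisted.internet import reactor

    def do_sighup_work():
        # Runs later on the reactor thread, where taking Synapse locks and
        # entering logcontexts is safe.
        print("reopening log files, refreshing TLS certs, ...")

    def sighup_handler(signum, frame):
        # Thread- and signal-safe: no locks, no logcontexts touched here.
        reactor.callFromThread(do_sighup_work)

    signal.signal(signal.SIGHUP, sighup_handler)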

We might also want to add timeouts to those locks, since they should be short-lived.
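
For the timeout idea, a hedged sketch using only the standard library is below; threading.Lock.acquire() accepts a timeout, so a stuck lock can fail loudly rather than hanging the process. The 5-second figure and the exception type are illustrative choices, not what Synapse actually does.

    # Illustrative only: wrap a plain Lock so acquisition gives up after a bound.
    import threading
    from contextlib import contextmanager

    _bg_metrics_lock = threading.Lock()

    @contextmanager
    def acquire_with_timeout(lock, timeout=5.0):
        if not lock.acquire(timeout=timeout):
            raise RuntimeError("lock still held after %.1fs, giving up" % timeout)
        try:
            yield
        finally:
            lock.release()

    # Would replace a bare `with _bg_metrics_lock:` block:
    with acquire_with_timeout(_bg_metrics_lock):
        pass  # touch the shared metrics state here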

erikjohnston added the z-bug (Deprecated Label) and p1 labels on Dec 8, 2020
erikjohnston self-assigned this on Dec 10, 2020
erikjohnston added a commit that referenced this issue Dec 10, 2020
erikjohnston added a commit that referenced this issue Dec 10, 2020