Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

trial test runs leak memory #15622

Closed
3 of 6 tasks
squahtx opened this issue May 18, 2023 · 8 comments · Fixed by #15630
Closed
3 of 6 tasks

trial test runs leak memory #15622

squahtx opened this issue May 18, 2023 · 8 comments · Fixed by #15630
Assignees
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact

Comments

@squahtx
Copy link
Contributor

squahtx commented May 18, 2023

#15334 introduced a bug where HomeServer instances no longer get garbage collected during trial test runs. As a result, memory usage grows to multiple gigabytes during test runs.

See https://github.com/matrix-org/synapse/pull/15334/files#diff-f50ec854a65082453f4086048f20c291ca6020fab973486a9da2285025394cc8R180.
When we garbage collect generation 0, self.hs still holds a reference to the HomeServer, which gets promoted to generation 1 and then never collected.

Apart from that, there are also a couple of other leaks at play.


The last 3 are fairly annoying to fix, so I've left them alone.

@squahtx squahtx added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience labels May 18, 2023
@clokep clokep changed the title trial runs leak memory trial tests runs leak memory May 18, 2023
@clokep clokep changed the title trial tests runs leak memory trial test runs leak memory May 18, 2023
@squahtx
Copy link
Contributor Author

squahtx commented May 18, 2023

When I force a full GC every 50 or so gen0 GCs, a trial run no longer OOMs on my laptop. But memory usage still keeps growing and the GCs get slower and slower. So there's a separate memory leak going on.

According to tracemalloc, the most leaked objects are

    /home/squah/repos/synapse/synapse/util/caches/lrucache.py:483: size=26.2 MiB, count=217769, average=126 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/jinja2/environment.py:704: size=23.5 MiB, count=183329, average=134 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/jinja2/environment.py:1299: size=16.5 MiB, count=1515, average=11.2 KiB
    /home/squah/repos/synapse/synapse/storage/database.py:469: size=11.2 MiB, count=140743, average=83 B
    /usr/lib/python3.10/functools.py:58: size=10.3 MiB, count=168891, average=64 B
    /usr/lib/python3.10/functools.py:52: size=8257 KiB, count=93189, average=91 B
    /usr/lib/python3.10/linecache.py:137: size=6795 KiB, count=67319, average=103 B
    /home/squah/repos/synapse/synapse/util/caches/lrucache.py:759: size=4862 KiB, count=8643, average=576 B
    /home/squah/repos/synapse/synapse/storage/background_updates.py:547: size=4747 KiB, count=42392, average=115 B
    /home/squah/repos/synapse/synapse/config/_base.py:168: size=4571 KiB, count=55214, average=85 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/jinja2/environment.py:349: size=4557 KiB, count=4106, average=1137 B
    /home/squah/repos/synapse/synapse/util/caches/deferred_cache.py:108: size=4185 KiB, count=105808, average=40 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/twisted/internet/base.py:114: size=3669 KiB, count=12692, average=296 B
    /home/squah/repos/synapse/synapse/util/caches/lrucache.py:502: size=2971 KiB, count=25931, average=117 B
    /home/squah/repos/synapse/synapse/storage/database.py:192: size=2807 KiB, count=35984, average=80 B
    /usr/lib/python3.10/unittest/mock.py:415: size=2564 KiB, count=5466, average=480 B
    /home/squah/repos/synapse/synapse/storage/background_updates.py:509: size=2560 KiB, count=19521, average=134 B
    /usr/lib/python3.10/functools.py:61: size=2388 KiB, count=14553, average=168 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/jinja2/environment.py:350: size=2359 KiB, count=4108, average=588 B
    /home/squah/repos/synapse/.venv/lib/python3.10/site-packages/twisted/internet/defer.py:461: size=2306 KiB, count=29910, average=79 B

but this isn't too helpful in tracking down the leak.

@squahtx squahtx self-assigned this May 18, 2023
@squahtx
Copy link
Contributor Author

squahtx commented May 18, 2023

One of the leaks is due to

_rate_limiter_instances: Set["FederationRateLimiter"] = set()

Another "leak" is due to the config cache added in #15284 growing large (~80 entries).

@squahtx
Copy link
Contributor Author

squahtx commented May 18, 2023

Another "leak" is due to the config cache added in #15284 growing large.

I'm going to leave this one alone. I'd replace things with functools.lru_cache, except config dicts are unhashable and there's no way to give lru_cache a custom cache key.

@DMRobertson
Copy link
Contributor

except config dicts are unhashable

FWIW I think immutabledict (previously frozendict) is hashable?

squahtx added a commit that referenced this issue May 19, 2023
This change fixes two memory leaks during `trial` test runs.

Garbage collection is disabled during each test case and a gen-0 GC is
run at the end of each test. However, when the gen-0 GC is run, the
`TestCase` object usually still holds references to the `HomeServer`
used during the test. As a result, the `HomeServer` gets promoted to
gen-1 and then never garbage collected.

Fix this by periodically running full GCs.

Additionally, fix `HomeServer`s leaking after tests that touch inbound
federation due to `FederationRateLimiter`s adding themselves to a global
set, by turning the set into a `WeakSet`.

Resolves #15622.

Signed-off-by: Sean Quah <[email protected]>
@squahtx
Copy link
Contributor Author

squahtx commented May 19, 2023

except config dicts are unhashable

FWIW I think immutabledict (previously frozendict) is hashable?

That's true, we could frozendictify the config dicts and use them as cache keys. I'd balk at the copying involved, but the current code does a json.dumps to derive the cache key every time it's called.

@squahtx squahtx reopened this May 19, 2023
@squahtx
Copy link
Contributor Author

squahtx commented May 22, 2023

There are two three more HomeServer "leaks":

caches_by_name and CACHE_METRIC_REGISTRY indirectly hold references to multiple homeservers from previous tests. Luckily, the number of homeservers leaked is bounded and comes out to fewer than 10 in practice.

squahtx added a commit that referenced this issue May 22, 2023
…5646)

...to try to control memory usage. `HomeServerConfig`s hold on to
many Jinja2 objects, which come out to over 0.5 MiB per config.

Over the course of a full test run, the cache grows to ~360 entries.
Limit it to 8 entries.

Part of #15622.

Signed-off-by: Sean Quah <[email protected]>
@squahtx
Copy link
Contributor Author

squahtx commented May 22, 2023

There are two three more HomeServer "leaks":

caches_by_name and CACHE_METRIC_REGISTRY indirectly hold references to multiple homeservers from previous tests. Luckily, the number of homeservers leaked is bounded and comes out to fewer than 10 in practice.

caches_by_name can be turned into a WeakValueDictionary.

CACHE_METRIC_REGISTRY._pre_update_hooks[].__self__._cache is a lot messier. A weakref can be used for some of the caches. Other caches are simple lists or dicts, which weakref does not support.

The cache resizing callbacks are also painful to fix, even when using WeakMethods.

I'm going to leave these alone.

@squahtx squahtx closed this as completed May 22, 2023
@squahtx
Copy link
Contributor Author

squahtx commented May 22, 2023

I've left the hacky code I used on the squah/trial_memory_leak_tracking branch.
df61544

Usage: run the tests, wait for the breakpoint to be hit, count objects of each type using Counter(str(type(o)) for o in gc.get_objects(2)), pick out a suspicious looking object and run dump_paths() on it with increasing depth.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Dev-Wishlist Makes developers' lives better, but doesn't have direct user impact
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants