
memory leak in subscribeRepos rollback window #39

Closed · snarfed opened this issue Aug 20, 2024 · 17 comments


snarfed commented Aug 20, 2024

atproto-hub hung itself just now. Evidently we made and emitted a ton of commits all of a sudden, >20qps sustained during 10:45-11:15a PT, so ~36k total. Sheesh.


snarfed commented Aug 20, 2024

Restarting seems to have fixed it, but atproto-hub CPU is pegged at 100% working through the backlog right now, so we're not out of the woods just yet.


snarfed commented Aug 20, 2024

So weird. I don't see a pattern in the usage spike yet. Mostly posts, from a range of users and AP instances and web sites. A few examples from 11:05-11:15a:

Doesn't look like we were backed up and then suddenly caught up either.

[screenshot]


snarfed commented Aug 20, 2024

Out of the woods, everything looks back to normal. Hrmph.


snarfed commented Aug 29, 2024

Happened again just now due to the influx of Brazil users and usage.



snarfed commented Aug 30, 2024

Seems like a memory leak. atproto-hub's memory footprint is constant when it's caught up, but increases linearly, quickly, when it's behind. ☹️ Not 100% sure if this is in our firehose server or our client. subscribeRepos clients are reconnecting often right now, every 1-10m, but we're consistently catching up from their cursor and then serving new commits in realtime, so I suspect the memory leak is in our client.
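
One way to narrow down where the leak is (a minimal debugging sketch, not something that's deployed; it assumes psutil is installed, and the serving_rollback flag is a hypothetical hook we'd have to set ourselves): periodically log process RSS next to whether we're serving rollback or live events, then line the two up.

```python
# Rough debugging sketch: sample this process's RSS every minute so memory
# growth can be correlated with subscribeRepos activity in the logs.
# Assumes psutil; `serving_rollback` is a hypothetical flag that the
# subscription handler would toggle when it switches between rollback and live.
import logging
import threading
import time

import psutil

logger = logging.getLogger(__name__)
serving_rollback = False


def log_memory(interval=60):
    proc = psutil.Process()
    while True:
        rss_mb = proc.memory_info().rss / 1024 / 1024
        logger.info(f'RSS {rss_mb:.0f} MB, serving_rollback={serving_rollback}')
        time.sleep(interval)


threading.Thread(target=log_memory, daemon=True).start()
```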


snarfed referenced this issue in snarfed/bridgy-fed Aug 30, 2024
snarfed referenced this issue in snarfed/bridgy-fed Sep 3, 2024
trying to offload more CPU from the firehose client. for #1266
snarfed referenced this issue in snarfed/bridgy-fed Sep 3, 2024
switch to putting raw websocket frame bytes onto queue, then threads parse it. for #1266

snarfed commented Sep 8, 2024

Related: snarfed/bridgy-fed#1295


snarfed commented Nov 6, 2024

Haven't seen this since we optimized and switched from dag_cbor to libipld. Tentatively closing.
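
(For reference, the decode swap is roughly the sketch below; it assumes dag_cbor's decode() and libipld's decode_dag_cbor() public APIs rather than our exact call sites.)

```python
# Roughly what switching decoders looks like; not our actual call sites.
import dag_cbor   # pure-Python DAG-CBOR codec
import libipld    # Rust-backed bindings

record_bytes = b'\xa1cfoocbar'  # DAG-CBOR encoding of {'foo': 'bar'}

before = dag_cbor.decode(record_bytes)
after = libipld.decode_dag_cbor(record_bytes)
assert before == after == {'foo': 'bar'}
```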

snarfed closed this as completed Nov 6, 2024

snarfed commented Nov 17, 2024

Reopening, still happening. Only when we're behind serving events over our firehose, so it's hard to debug, but definitely happening right now. 😕


snarfed commented Nov 17, 2024

Bumping hub memory up to 6G as a band-aid.


snarfed commented Nov 17, 2024

Ugh, we're flapping:

[screenshot]


snarfed commented Nov 18, 2024

I'm pretty confident this is in the rollback window part of subscribeRepos:

arroba/arroba/xrpc_sync.py, lines 179 to 189 in 69846b5:

```python
if window := os.getenv('ROLLBACK_WINDOW'):
    rollback_start = max(cur_seq - int(window) - 1, 0)
    if cursor < rollback_start:
        logger.warning(f'Cursor {cursor} is before our rollback window; starting at {rollback_start}')
        yield ({'op': 1, 't': '#info'}, {'name': 'OutdatedCursor'})
        cursor = rollback_start

logger.info(f'fetching existing events from seq {cursor}')
for event in server.storage.read_events_by_seq(start=cursor):
    yield handle(event)
```

arroba/arroba/storage.py, lines 309 to 325 in 69846b5:

```python
seen = []  # CIDs
for block in self.read_blocks_by_seq(start=start, repo=repo):
    assert block.seq
    if block.seq != seq:  # switching to a new commit's blocks
        if commit_block:
            yield make_commit()
        else:
            # we shouldn't have any dangling blocks that we don't serve
            assert not blocks
        seq = block.seq
        blocks = {}  # maps CID to Block
        commit_block = None

    if block.decoded.get('$type', '').startswith(
            'com.atproto.sync.subscribeRepos#'):  # non-commit message
        yield block.decoded
        continue
```

```python
while True:
    ctx = context.get_context(raise_context_error=False)
    with ctx.use() if ctx else self.ndb_client.context():
        # lexrpc event subscription handlers like subscribeRepos call this
        # on a different thread, so if we're there, we need to create a new
        # ndb context
        try:
            query = AtpBlock.query(AtpBlock.seq >= cur_seq).order(AtpBlock.seq)
            if repo:
                query = query.filter(AtpBlock.repo == AtpRepo(id=repo).key)
            # unproven hypothesis: need strong consistency to make sure we
            # get all blocks for a given seq, including commit
            # https://console.cloud.google.com/errors/detail/CO2g4eLG_tOkZg;service=atproto-hub;time=P1D;refresh=true;locations=global?project=bridgy-federated
            for atp_block in query.iter(read_consistency=ndb.STRONG):
                if atp_block.seq != cur_seq:
                    cur_seq = atp_block.seq
                    cur_seq_cids = []
                if atp_block.key.id() not in cur_seq_cids:
                    cur_seq_cids.append(atp_block.key.id())
                    yield atp_block.to_block()
            # finished cleanly
            break
        except ContextError as e:
            logging.warning(f'lost ndb context! re-querying at {cur_seq}. {e}')
            # continue loop, restart query
```

Moving this issue to the arroba repo.

snarfed transferred this issue from snarfed/bridgy-fed Nov 18, 2024

snarfed commented Nov 18, 2024

Recent example: two clients from the same IP connected to our subscribeRepos at the same time with a ~4h-old cursor. We leaked memory while we were serving them events from the rollback window, then reclaimed that memory as soon as we caught up and switched to live.

[screenshots]


snarfed commented Nov 18, 2024

I wonder if this is our tracking of seen CIDs in Storage.read_events_by_seq? Doesn't seem like that should be too big, just the CIDs of each emitted block in the rollback window, but that could still add up. Worth looking at.
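
Back-of-envelope on that hypothesis (made-up block count, not a measurement): even at ~100 bytes per tracked CID once Python string and list overhead is included, a few hundred thousand blocks is only tens of MB, which seems small next to the kind of growth that needs a 6G band-aid.

```python
# Unmeasured estimate of what a seen-CIDs list could cost.
# ~100 bytes/entry assumes a 32-byte digest plus Python object and list overhead;
# 500k blocks in the rollback window is an arbitrary example, not a real count.
bytes_per_cid = 100
blocks_in_window = 500_000
print(f'~{bytes_per_cid * blocks_in_window / 1024 / 1024:.0f} MB')  # ~48 MB
```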

snarfed changed the title from "atproto-hub hung after a big spike of commits" to "memory leak in subscribeRepos rollback window" Nov 18, 2024

snarfed commented Nov 21, 2024

> I wonder if this is our tracking of seen CIDs in Storage.read_events_by_seq?

Never mind, we don't actually do that. seen there is unused. 😆

Maybe ndb query caching?
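
If it is the ndb in-context cache, that would fit: google.cloud.ndb caches entities it reads within a context by default, so a long rollback query could pin every AtpBlock it streams until the context closes. A sketch of how to rule it out, assuming the per-query use_cache / use_global_cache options (worth double-checking against the ndb version we run), applied to the AtpBlock query shown above:

```python
# Hypothetical tweak, not what's deployed: disable ndb's caches for the rollback
# query so streamed AtpBlock entities aren't retained by the context.
query = AtpBlock.query(AtpBlock.seq >= cur_seq).order(AtpBlock.seq)
for atp_block in query.iter(read_consistency=ndb.STRONG,
                            use_cache=False,          # skip the in-context cache
                            use_global_cache=False):  # skip the global cache
    ...
```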


snarfed commented Nov 21, 2024

It's not a fix for the memory leak, but one thing that would help here would be to cache all of the rollback window's blocks in memory and serve them from there. That would also be half of #30.
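
Rough shape of what that cache could look like (a sketch only; the class and method names are invented, not existing arroba APIs, and it glosses over non-commit events and startup backfill):

```python
# Sketch of an in-memory rollback window cache keyed by seq.
from collections import OrderedDict
from threading import Lock


class RollbackCache:
    def __init__(self, window_size):
        self.window_size = window_size
        self.events = OrderedDict()  # seq -> already-built subscribeRepos event
        self.lock = Lock()

    def add(self, seq, event):
        """Called for each newly emitted event; evicts the oldest past the window."""
        with self.lock:
            self.events[seq] = event
            while len(self.events) > self.window_size:
                self.events.popitem(last=False)

    def read_from(self, cursor):
        """Yields events with seq >= cursor, oldest first."""
        with self.lock:
            snapshot = [e for s, e in self.events.items() if s >= cursor]
        yield from snapshot
```

subscribeRepos would then only need to fall back to storage when a cursor predates the oldest cached seq.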


snarfed commented Dec 19, 2024

Deprioritizing; this hasn't been happening much anymore, but #30 is getting acute.

snarfed removed the "now" label Dec 19, 2024
snarfed added a commit to snarfed/bridgy-fed that referenced this issue Dec 19, 2024
snarfed added a commit to snarfed/bridgy-fed that referenced this issue Dec 19, 2024

snarfed commented Jan 28, 2025

Tentatively closing; this hasn't been a problem for a long time. Example 12h window below: the times when atproto-hub CPU was pegged at 100% were the times we were serving subscribeRepos from rollback.

[screenshot]

snarfed closed this as completed Jan 28, 2025