
memory leak in subscribeRepos rollback window #39

Closed · snarfed opened this issue Aug 20, 2024 · 17 comments


snarfed commented Aug 20, 2024

atproto-hub hung itself just now. Evidently we made and emitted a ton of commits all of a sudden, >20qps sustained during 10:45-11:15a PT, so ~36k total. Sheesh.


snarfed commented Aug 20, 2024

Restarting seems to have fixed it, but atproto-hub CPU is pegged at 100% working through the backlog right now, so we're not out of the woods just yet.


snarfed commented Aug 20, 2024

So weird. I don't see a pattern in the usage spike yet. Mostly posts, from a range of users and AP instances and web sites. A few examples from 11:05-11:15a:

Doesn't look like we were backed up and then suddenly caught up either.

[screenshot]


snarfed commented Aug 20, 2024

Out of the woods, everything looks back to normal. Hrmph.


snarfed commented Aug 29, 2024

Happened again just now due to the influx of Brazil users and usage.



snarfed commented Aug 30, 2024

Seems like a memory leak. atproto-hub's memory footprint is constant when it's caught up, but increases linearly, quickly, when it's behind. ☹️ Not 100% sure if this is in our firehose server or our client. subscribeRepos clients are reconnecting often right now, every 1-10m, but we're consistently catching up from their cursor and then serving new commits in realtime, so I suspect the memory leak is in our client.
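
One way to narrow down where the leak is (a minimal debugging sketch, not something that's deployed; it assumes psutil is installed, and the serving_rollback flag is a hypothetical hook we'd have to set ourselves): periodically log process RSS next to whether we're serving rollback or live events, then line the two up.

```python
# Rough debugging sketch: sample this process's RSS every minute so memory
# growth can be correlated with subscribeRepos activity in the logs.
# Assumes psutil; `serving_rollback` is a hypothetical flag that the
# subscription handler would toggle when it switches between rollback and live.
import logging
import threading
import time

import psutil

logger = logging.getLogger(__name__)
serving_rollback = False


def log_memory(interval=60):
    proc = psutil.Process()
    while True:
        rss_mb = proc.memory_info().rss / 1024 / 1024
        logger.info(f'RSS {rss_mb:.0f} MB, serving_rollback={serving_rollback}')
        time.sleep(interval)


threading.Thread(target=log_memory, daemon=True).start()
```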


snarfed referenced this issue in snarfed/bridgy-fed Aug 30, 2024
snarfed referenced this issue in snarfed/bridgy-fed Sep 3, 2024
trying to offload more CPU from the firehose client. for #1266
snarfed referenced this issue in snarfed/bridgy-fed Sep 3, 2024
switch to putting raw websocket frame bytes onto queue, then threads parse it. for #1266

snarfed commented Sep 8, 2024

Related: snarfed/bridgy-fed#1295


snarfed commented Nov 6, 2024

Haven't seen this since we optimized and switched from dag_cbor to libipld. Tentatively closing.
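
(For reference, the decode swap is roughly the sketch below; it assumes dag_cbor's decode() and libipld's decode_dag_cbor() public APIs rather than our exact call sites.)

```python
# Roughly what switching decoders looks like; not our actual call sites.
import dag_cbor   # pure-Python DAG-CBOR codec
import libipld    # Rust-backed bindings

record_bytes = b'\xa1cfoocbar'  # DAG-CBOR encoding of {'foo': 'bar'}

before = dag_cbor.decode(record_bytes)
after = libipld.decode_dag_cbor(record_bytes)
assert before == after == {'foo': 'bar'}
```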

snarfed closed this as completed Nov 6, 2024

snarfed commented Nov 17, 2024

Reopening, still happening. Only when we're behind serving events over our firehose, so it's hard to debug, but definitely happening right now. 😕


snarfed commented Nov 17, 2024

Bumping hub memory up to 6G as a band-aid.


snarfed commented Nov 17, 2024

Ugh, we're flapping:

[screenshot]


snarfed commented Nov 18, 2024

I'm pretty confident this is in the rollback window part of subscribeRepos:

arroba/arroba/xrpc_sync.py, lines 179 to 189 in 69846b5:

```python
if window := os.getenv('ROLLBACK_WINDOW'):
    rollback_start = max(cur_seq - int(window) - 1, 0)
    if cursor < rollback_start:
        logger.warning(f'Cursor {cursor} is before our rollback window; starting at {rollback_start}')
        yield ({'op': 1, 't': '#info'}, {'name': 'OutdatedCursor'})
        cursor = rollback_start

logger.info(f'fetching existing events from seq {cursor}')
for event in server.storage.read_events_by_seq(start=cursor):
    yield handle(event)
```

arroba/arroba/storage.py, lines 309 to 325 in 69846b5:

```python
seen = []  # CIDs
for block in self.read_blocks_by_seq(start=start, repo=repo):
    assert block.seq
    if block.seq != seq:  # switching to a new commit's blocks
        if commit_block:
            yield make_commit()
        else:
            # we shouldn't have any dangling blocks that we don't serve
            assert not blocks
        seq = block.seq
        blocks = {}  # maps CID to Block
        commit_block = None

    if block.decoded.get('$type', '').startswith(
            'com.atproto.sync.subscribeRepos#'):  # non-commit message
        yield block.decoded
        continue
```

```python
while True:
    ctx = context.get_context(raise_context_error=False)
    with ctx.use() if ctx else self.ndb_client.context():
        # lexrpc event subscription handlers like subscribeRepos call this
        # on a different thread, so if we're there, we need to create a new
        # ndb context
        try:
            query = AtpBlock.query(AtpBlock.seq >= cur_seq).order(AtpBlock.seq)
            if repo:
                query = query.filter(AtpBlock.repo == AtpRepo(id=repo).key)
            # unproven hypothesis: need strong consistency to make sure we
            # get all blocks for a given seq, including commit
            # https://console.cloud.google.com/errors/detail/CO2g4eLG_tOkZg;service=atproto-hub;time=P1D;refresh=true;locations=global?project=bridgy-federated
            for atp_block in query.iter(read_consistency=ndb.STRONG):
                if atp_block.seq != cur_seq:
                    cur_seq = atp_block.seq
                    cur_seq_cids = []
                if atp_block.key.id() not in cur_seq_cids:
                    cur_seq_cids.append(atp_block.key.id())
                    yield atp_block.to_block()
            # finished cleanly
            break
        except ContextError as e:
            logging.warning(f'lost ndb context! re-querying at {cur_seq}. {e}')
            # continue loop, restart query
```

Moving this issue to the arroba repo.

snarfed transferred this issue from snarfed/bridgy-fed Nov 18, 2024

snarfed commented Nov 18, 2024

Recent example: two clients from the same IP connected to our subscribeRepos at the same time with a ~4h-old cursor. We leaked memory while we were serving them events from the rollback window, then reclaimed that memory as soon as we caught up and switched to live.

[screenshots]


snarfed commented Nov 18, 2024

I wonder if this is our tracking of seen CIDs in Storage.read_events_by_seq? Doesn't seem like that should be too big, just the CIDs of each emitted block in the rollback window, but that could still add up. Worth looking at.
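
Back-of-envelope on that hypothesis (made-up block count, not a measurement): even at ~100 bytes per tracked CID once Python string and list overhead is included, a few hundred thousand blocks is only tens of MB, which seems small next to the kind of growth that needs a 6G band-aid.

```python
# Unmeasured estimate of what a seen-CIDs list could cost.
# ~100 bytes/entry assumes a 32-byte digest plus Python object and list overhead;
# 500k blocks in the rollback window is an arbitrary example, not a real count.
bytes_per_cid = 100
blocks_in_window = 500_000
print(f'~{bytes_per_cid * blocks_in_window / 1024 / 1024:.0f} MB')  # ~48 MB
```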

snarfed changed the title from "atproto-hub hung after a big spike of commits" to "memory leak in subscribeRepos rollback window" Nov 18, 2024

snarfed commented Nov 21, 2024

> I wonder if this is our tracking of seen CIDs in Storage.read_events_by_seq?

Never mind, we don't actually do that. seen there is unused. 😆

Maybe ndb query caching?
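
If it is the ndb in-context cache, that would fit: google.cloud.ndb caches entities it reads within a context by default, so a long rollback query could pin every AtpBlock it streams until the context closes. A sketch of how to rule it out, assuming the per-query use_cache / use_global_cache options (worth double-checking against the ndb version we run), applied to the AtpBlock query shown above:

```python
# Hypothetical tweak, not what's deployed: disable ndb's caches for the rollback
# query so streamed AtpBlock entities aren't retained by the context.
query = AtpBlock.query(AtpBlock.seq >= cur_seq).order(AtpBlock.seq)
for atp_block in query.iter(read_consistency=ndb.STRONG,
                            use_cache=False,          # skip the in-context cache
                            use_global_cache=False):  # skip the global cache
    ...
```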


snarfed commented Nov 21, 2024

It's not a fix for the memory leak, but one thing that would help here would be to cache all of the rollback window's blocks in memory and serve them from there. That would also be half of #30.
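
Rough shape of what that cache could look like (a sketch only; the class and method names are invented, not existing arroba APIs, and it glosses over non-commit events and startup backfill):

```python
# Sketch of an in-memory rollback window cache keyed by seq.
from collections import OrderedDict
from threading import Lock


class RollbackCache:
    def __init__(self, window_size):
        self.window_size = window_size
        self.events = OrderedDict()  # seq -> already-built subscribeRepos event
        self.lock = Lock()

    def add(self, seq, event):
        """Called for each newly emitted event; evicts the oldest past the window."""
        with self.lock:
            self.events[seq] = event
            while len(self.events) > self.window_size:
                self.events.popitem(last=False)

    def read_from(self, cursor):
        """Yields events with seq >= cursor, oldest first."""
        with self.lock:
            snapshot = [e for s, e in self.events.items() if s >= cursor]
        yield from snapshot
```

subscribeRepos would then only need to fall back to storage when a cursor predates the oldest cached seq.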


snarfed commented Dec 19, 2024

Deprioritizing; this hasn't been happening much anymore, but #30 is getting acute.

snarfed removed the "now" label Dec 19, 2024
snarfed added a commit to snarfed/bridgy-fed that referenced this issue Dec 19, 2024
snarfed added a commit to snarfed/bridgy-fed that referenced this issue Dec 19, 2024

snarfed commented Jan 28, 2025

Tentatively closing; this hasn't been a problem for a long time. Example 12h window below: the times when atproto-hub CPU was pegged at 100% were the times we were serving subscribeRepos from rollback.

[screenshot]

snarfed closed this as completed Jan 28, 2025