DB Archival Failing (out of space causing site-wide outage) #10082

Open
4 of 6 tasks
zorae opened this issue Nov 25, 2024 · 11 comments
Assignees: @mekarpeles
Labels: Affects: Operations · Lead: @mekarpeles · Needs: Breakdown · Priority: 2 · Type: Post-Mortem

Comments

zorae commented Nov 25, 2024

Jump to the postmortem tracking comment here.

Problem

Every URL in the openlibrary.org domain returns the following error message:
{"error": "internal_error", "message": "get_site(openlibrary.org) failed"}

This issue started some time within the last 24 hours.

Next Steps

The most recent case is #10082 (comment), where @mekarpeles writes:

  • Manually run openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_openlibrary* 2>&1 | logger -t pg-backups
    • see if it even works or fix it
  • Investigate if ol-home0 cron container trying to run olsystem/etc/cron.d/pg-backups cron (and if it is, why it's failing)
    • The cron is being run on ol-backup0 from /etc/cron.d which has symlinks to olsystem 😄
  • Figure out why pg_xlog_archive files are owned by munin:ssl-cert and not openlibrary:openlibrary (a check/fix sketch follows this list)
  • Update cron parameters so we don't run out of space as the DB backup size grows
    • likely no longer needed as the problem may have been related to ownership of pg_xlog_archive and failure of cron to clear old files
  • Update ol-db1 / ol-backup0 migration plan Upgrade All Trusty Nodes to Ubuntu Latest #2036 Provisioning & Deployment Plan #4423 Update production version of postgres 😬 #5675 based on these learnings
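For the ownership item above, a minimal sketch of how this could be checked and (once nothing is mid-transfer) fixed on ol-backup0; the expected openlibrary:openlibrary ownership and the pg_xlog_archive path come from this checklist, the rest is illustrative:

# list archive files not owned by the expected openlibrary user (read-only check)
find /1/postgres-backups/pg_xlog_archive -type f ! -user openlibrary -ls | head

# if the wrong ownership is confirmed, fix it recursively (as root on ol-backup0)
chown -R openlibrary:openlibrary /1/postgres-backups/pg_xlog_archive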
@zorae zorae added Needs: Breakdown This big issue needs a checklist or subissues to describe a breakdown of work. [managed] Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Type: Bug Something isn't working. [managed] labels Nov 25, 2024
RydalWater commented Nov 25, 2024

Confirmed, this bug appeared within the last 16 hours; it was still working when I last tested prior to then.

https://openlibrary.org/isbn/9780061120084.json


Spaarsh (Contributor) commented Nov 25, 2024

Same issue for me, but it doesn't happen for all URLs. On /search, searches by Author and Subject do return results, though when I click a result to open its page, the error shows up.

A database error is being displayed on certain pages like this one https://openlibrary.org/works/OL18203673W/The_Seven_Husbands_of_Evelyn_Hugo :
[screenshot of the database error]

While a lot of the other ones like this search attempt show this error:
{"error": "internal_error", "message": "'NoneType' object has no attribute 'things'"}

The successful results for Subject- and Author-based searches may be helpful in identifying which resources are isolated from the bug.

@mekarpeles (Member)

Thank you all, staff is debugging, we're converting this into a postmortem issue. We broadly understand what is happening and are freeing up space on two servers to unblock an archival process that has been stalled due to recent network changes within our infrastructure.

@mekarpeles mekarpeles added Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Type: Post-Mortem Log for when having to resolve a P0 issue Affects: Operations Affects the IA DevOps folks and removed Type: Bug Something isn't working. [managed] Needs: Lead Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Nov 25, 2024
@RydalWater

> Thank you all, staff is debugging, we're converting this into a postmortem issue. We broadly understand what is happening and are freeing up space on two servers to unblock an archival process that has been stalled due to recent network changes within our infrastructure.

Appreciate the hard work, thanks for the update.

mekarpeles (Member) commented Nov 25, 2024

Summary

  • What is wrong?

Every URL in the openlibrary.org domain returns the following error message:
{"error": "internal_error", "message": "get_site(openlibrary.org) failed"}

This issue started some time within the last 24 hours.

  • What caused it?

Website offline due to running out of space on ol-db1:/1

Several days later, we were able to determine the cause of the problem and the fix.

Archival from ol-db1 had failed, presumably because ol-backup0 ran out of storage. We think that, because of all the network changes and services being offline, ol-backup0 ran out of space and things got stuck.

We removed a 2 GB ballast file that @mekarpeles and @samuel-archive had created back in the day (presciently predicting such an event?), which cleared space on ol-backup0... This seemed to unblock a stuck archival process, because once those 2 GB were freed, we then saw ~30 GB free on ol-backup0.

Still, archival and the defill/drain on ol-db1 were not happening. Our next steps were:

  1. we moved some of the xlogs from /2/postgresql/9.3/main/pg_xlog to a tmp dir, /1/2024_backup, to free up space on ol-db1, which allowed us to restart postgres (for good measure) and get the site back live
  2. because archival needed the oldest files, which we had moved out, archival broke; we learned this by checking the logs at ol-db1:/var/log/postgresql/postgresql-9.3-main.log, which showed that the files it needed to archive were now in /1/2024_backup and not where it expected them
  3. we moved small batches of the earliest WALs back from /1 to /2 and the archival drain continued.

TL;DR: moving some xlogs (not the oldest, not the newest) temporarily out of /2 is what we should have done from the start (see the sketch after this list), as this would have:

  • freed up space
  • not interrupted new backups
  • not broken archival of the next-up wals
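A hedged sketch of that "middle slice" move, done while preparing to restart postgres as in step 1 above, assuming WAL segment names sort chronologically (true for a single timeline) and using the paths from this incident; the slice boundaries (201-800) are illustrative, not what was actually run:

cd /2/postgresql/9.3/main/pg_xlog
mkdir -p /1/2024_backup
# oldest-first listing of the 24-hex-char WAL segments; leave the oldest ~200
# (next up for archival) and the newest in place, move only a middle slice
ls -1 | grep -E '^[0-9A-F]{24}$' | sort | sed -n '201,800p' | xargs -I{} mv {} /1/2024_backup/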

During this process, there were a few times when weird collisions occurred where ol-db1 couldn't rsync a file to ol-backup0 because it "already exists". In these cases, we manually ssh'd to ol-backup0, cd'd into the directory under ol-backup0:/1/postgres-backups/pg_xlog_archive/ for the item that ol-db1's postgres log (i.e. ol-db1:/var/log/postgresql/postgresql-9.3-main.log) was complaining about, e.g. 000000010000169F, and mv'd its .gz file to include a leading _:

root@ol-backup0:/1/postgres-backups/pg_xlog_archive/000000010000169F# mv 000000010000169F000000BD.gz _000000010000169F000000BD.gz

We also had a case where two files in ol-db1:/2/postgresql/9.3/main/pg_xlog were somehow owned by root, so we chown'd those files to postgres:postgres and then things were happy.

  • Five Whys
    • Why? ol-db1:/1 filled up because of psql archival partials (and possibly replication to ol-db2 failing)
    • Why? Because these are typically transferred to ol-backup0, which is/was out of space.
    • Why? We're not sure yet; we should figure this out. The first step is to free up ol-backup0:/1
  • What fixed it?

First, we removed a 2gb ballast file that was created in the past via dd if=/dev/zero of=/1/2gb-ballast-prevent-diskfull bs=1M count=2048
By doing this, we noticed some other job was apparently able to complete, which freed up an additional 500 GB of space.
We then recreated the ballast file to pay it forward...
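For reference, a minimal sketch of the ballast pattern described here, reusing the filename and size from the dd command above; the fallocate alternative is an assumption (it needs filesystem support) but is faster:

# on ol-backup0: free the emergency headroom so stuck archival/rsync jobs can finish
rm /1/2gb-ballast-prevent-diskfull
df -h /1    # watch space recover while things drain

# once things settle, put the ballast back for next time
dd if=/dev/zero of=/1/2gb-ballast-prevent-diskfull bs=1M count=2048
# (or: fallocate -l 2G /1/2gb-ballast-prevent-diskfull)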

Next, we began the process of freeing up space on ol-db1 to restore service to openlibrary, but we needed to do so in a way that wouldn't impact the ol-db1 db, break replication to ol-db2, or disrupt our backups on ol-backup0.

Ultimately, we moved ~1000 old 16 MB xlogs temporarily from /2 to /1 to free up ~16 GB of space on ol-db1, then restarted postgres and created a 2 GB ballast on /2 as a buffer in case we run out of space again.

The remaining step is to check the state of replication from primary vs. replica.
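A hedged sketch of how that replication check could look on postgres 9.3, using standard system views/functions; the hostnames are the ones from this issue:

# on ol-db1 (primary): one row per connected standby, with sent/replay positions
sudo -u postgres psql -c "SELECT client_addr, state, sent_location, replay_location FROM pg_stat_replication;"

# on ol-db2 (replica): how far behind replay is
sudo -u postgres psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"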

  • What was the impact?

Several hours of downtime on the website.

  • What could have gone better?

Alerting when space was low on ol-backup0, or some sort of error alert surfacing the underlying issue.

  • Followup actions:

Document the process by which ol-db1 replicates to ol-db2 and drains wal/xlog into ol-backup0 via /olsystem/bin/postgres/pg_wal_archive.sh (run from ol-db1 by postgres)... which we'll do here:

ol-db1's postgres config specifies how many WAL files it keeps and defines an archive command that ships segments to ol-backup0... which it does by taking them from /2/, gzipping them, moving them to /1, rsyncing them to ol-backup0, and then removing them.
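A hedged reconstruction of that flow, pieced together from the description above, the failed-archive log line later in this thread (pg_wal_archive.sh is invoked with a %p/%f style pair), and the per-prefix directories seen on ol-backup0; the real olsystem script and config almost certainly differ in details:

# ol-db1 postgresql.conf (sketch)
archive_mode = on
# wal_keep_segments = ...   # the "x#" of WAL files kept; value not recorded here
archive_command = '/olsystem/bin/postgres/pg_wal_archive.sh %p %f'

# /olsystem/bin/postgres/pg_wal_archive.sh (sketch of the gzip -> /1 -> rsync -> rm flow)
#!/bin/bash
set -e
WAL_PATH="$1"    # e.g. pg_xlog/000000010000169F000000BD (relative to the data dir on /2)
WAL_NAME="$2"    # e.g. 000000010000169F000000BD
STAGING=/1/wal_staging    # assumption: some staging dir on ol-db1:/1
DEST="ol-backup0:/1/postgres-backups/pg_xlog_archive/${WAL_NAME:0:16}/"    # per-prefix dir, as in the collision example above
mkdir -p "$STAGING"
gzip -c "$WAL_PATH" > "$STAGING/$WAL_NAME.gz"
rsync -a "$STAGING/$WAL_NAME.gz" "$DEST"
rm "$STAGING/$WAL_NAME.gz"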

Replication setup remains something of a mystery for a different issue :)

Steps to close

  1. Assignment: Is someone assigned to this issue? (notetaker, responder)
  2. Labels: Is there an Affects: label applied?
  3. Diagnosis: Add a description and scope of the issue
  4. Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. "What caused it?" - please answer in summary
  6. "What fixed it?" - please answer in summary
  7. "Followup actions:" actions added to summary

@mekarpeles (Member)

The site is back up, we're still monitoring and keeping this issue open as:

  1. the primary db server has very little free space and needs to have archival run
  2. we need to check that replication is up to date on the servers before manually moving or cleaning up space

@mekarpeles mekarpeles added Priority: 1 Do this week, receiving emails, time sensitive, . [managed] and removed Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] labels Nov 25, 2024
@mekarpeles mekarpeles self-assigned this Nov 25, 2024
@mekarpeles mekarpeles changed the title Internal server error breaking OL completely DB Archival Failing (out of space causing site-wide outage) Dec 2, 2024
@mekarpeles (Member)

We are at 100% full again on ol-backup0

@mekarpeles mekarpeles reopened this Jan 12, 2025
@mekarpeles mekarpeles added this to the Sprint 2025-01 milestone Jan 12, 2025
mekarpeles (Member) commented Jan 12, 2025

Problem: ol-backup0 out of space, not pruning backups

Description: ol-backup0 is at 100% fill (11 TB), ol-db1 looks a bit better with 78 GB free. We have a 1gb ballast on ol-backup0 as a stop-gap.

Core Question:

  • Shouldn't old backups in /1/postgres-backups/backups be eventually getting pruned?
    • yes, they should, as per the olsystem/etc/cron.d/pg-backups cron
  • Why aren't they? Or when did they stop getting pruned?
    • See Investigation; likely because the relevant cron would run on ol-home0 and was either never set up or the setup broke when we locked down the cluster ssh

Symptoms

ol-backup0 is out of space and pg logs at ol-db1:/var/log/postgresql/postgresql-9.3-main.log are showing a similar problem:

2025-01-12 14:06:23 UTC  DETAIL:  The failed archive command was: /olsystem/bin/postgres/pg_wal_archive.sh pg_xlog/00000001000017740000008F 00000001000017740000008F

/1/postgres-backups/backups in particular seems to have ~50 large backups (up to 70 GB each) spanning a year (which seems like more than I'd expect we'd want to keep).

pgdump_coverstore_20240201/  pgdump_coverstore_20250103/   pgdump_openlibrary_20241001/
pgdump_coverstore_20240301/  pgdump_coverstore_20250104/   pgdump_openlibrary_20241126/
pgdump_coverstore_20240401/  pgdump_coverstore_20250105/   pgdump_openlibrary_20241201/

A single backup at ol-backup0:/1/postgres-backups/backups/pgdump_openlibrary_20250111 is 71 GB, so this backups dir is likely VERY large, ~4 TB (50 backups).
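One quick way to confirm that estimate (the total, plus a per-dump breakdown sorted largest-last):

du -sh /1/postgres-backups/backups
du -sh /1/postgres-backups/backups/pgdump_* | sort -h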

Previous solution

What fixed ol-db1 out-of-space last time was removing the 2gb ballast file on ol-backup0 and letting archival catch up...

I removed the 2gb ballast and added a 1gb ballast, and ol-db1 is back to archiving... However, ol-backup0 doesn't seem to be doing its part.

After catching up with archives from ol-db1, ol-backup0 was left with 300MB of free space out of 11GB (+ 1GB ballast) and falling.

Investigation

Where to start? What is creating these backups? No scripts in /1/ really, so my next guess is olsystem which, I happen to know via "inside baseball", has a pattern of symlinking files from the olsystem/etc directory into the root /etc so they get run.

ol-backup0 is a legacy trusty machine which has a very old version of olsystem at /opt/openlibrary/olsystem on the root partition. So I ran: openlibrary@ol-backup0:/opt/openlibrary/olsystem$ grep -rli pgdump

/opt/openlibrary/olsystem/bin/postgres/pg-backup.sh
/opt/openlibrary/olsystem/etc/cron.d/pg-backups

This reveals that pg-backup.sh is the only script in olsystem that creates/touches these pgdump_ backups, and that the pg-backups cron is the only obvious thing that runs pg-backup.sh.

https://github.com/internetarchive/olsystem/blob/master/etc/cron.d/pg-backups

# Cron file to be installed on the db backup node (currently ol-home)
# once a day, take pg_dump backup of openlibrary and coverstore databases
0 6 * * * openlibrary /olsystem/bin/postgres/pg-backup.sh -h ol-db1 -d /1/postgres-backups/backups openlibrary 2>&1 | logger -t pg-backups
30 7 * * * openlibrary /olsystem/bin/postgres/pg-backup.sh -h ol-db1 -d /1/postgres-backups/backups coverstore 2>&1 | logger -t pg-backups

# every sunday, take base backup of production database
0 9 * * 0 openlibrary /olsystem/bin/postgres/pg-basebackup.sh ol-db1 replicator /etc/postgres-replication-pw.txt /1/postgres-backups/base_backups 2>&1 | logger -t pg-backups

# once a day, delete WAL logs older than 55 days
0 4  * * * openlibrary find /1/postgres-backups/pg_xlog_archive -type f -mtime +55 -delete 2>&1 | logger -t pg-backups

# prune backups
20 4  * * * openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_openlibrary* 2>&1 | logger -t pg-backups
30 4  * * * openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_coverstore* 2>&1 | logger -t pg-backups
40 4  * * * openlibrary python /petabox/sw/postgres/prune-backups.py -d 2 -w 2 -m 3 /1/postgres-backups/base_backups/pgbase_* 2>&1 | logger -t pg-backups

We can rule out the file pg-backup.sh (which creates our dumps) as being responsible for rotating old backups because it tells us explicitly on https://github.com/internetarchive/olsystem/blob/master/bin/postgres/pg-backup.sh#L12-L13:

# This does not remove old backups, that should taken care separately
# either using a cron or logrotate to avoid disk fill up.

In the cron, however, prune-backups.py seems like what we want to be running... So why isn't it?

Where (hypothetically) is this cron run?

Neither ol-backup0 nor ol-db1 seems to have any crons listed in crontab -l, and so the question arises: where is (or should) this cron be running from?

The cron file gives us a hint that maybe ol-home (since replaced by the newer ol-home0) is responsible. It's unclear whether ol-home0 is even trying to run this cron. Even if it is, the recent tightening of our network would likely have broken most of these scripts (if they were running on ol-home0 on behalf of ol-backup0).

@mekarpeles (Member)

Freeing Up Space on ol-backup0 & Testing Cron Command

I manually ran the command from the cron, directly from ol-backup0, i.e.
python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_openlibrary* (i.e. keep 7 daily backups, 6 weekly backups and 12 monthly backups)

And it reported:

KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250112
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250111
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250110
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250109
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250108
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250107
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250106
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250105
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20250101
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241229
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241222
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241215
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241208
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241201
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241126
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20241001
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240901
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240801
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240701
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240601
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240501
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240401
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240301
KEEP /1/postgres-backups/backups/pgdump_openlibrary_20240201

So for now, I dropped 2 months from the command:
python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 10 /1/postgres-backups/backups/pgdump_openlibrary*

And we now have 131 GB free. I'm hoping this frees up our other crons (e.g. ol-backup0 archival), but it may require that we permanently update our prune command to save fewer backups.
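If we do decide to keep fewer backups permanently, the change would presumably go in the two prune lines of olsystem/etc/cron.d/pg-backups quoted earlier; a sketch using the -m 10 retention run manually above (the final numbers are a policy call):

# olsystem/etc/cron.d/pg-backups (proposed, not applied)
20 4  * * * openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 10 /1/postgres-backups/backups/pgdump_openlibrary* 2>&1 | logger -t pg-backups
30 4  * * * openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 10 /1/postgres-backups/backups/pgdump_coverstore* 2>&1 | logger -t pg-backups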

mekarpeles (Member) commented Jan 12, 2025

Crons do somehow seem to be running on ol-backup0.

mek@ol-backup0:~$ sudo grep CRON /var/log/syslog | grep pg

Jan 12 04:00:01 ol-backup0.us.archive.org CRON[25174]: (openlibrary) CMD (find /1/postgres-backups/pg_xlog_archive -type f -mtime +55 -delete 2>&1 | logger -t pg-backups)
Jan 12 04:20:02 ol-backup0.us.archive.org CRON[27553]: (openlibrary) CMD (python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_openlibrary* 2>&1 | logger -t pg-backups)
Jan 12 04:30:01 ol-backup0.us.archive.org CRON[28750]: (openlibrary) CMD (python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_coverstore* 2>&1 | logger -t pg-backups)
Jan 12 04:40:01 ol-backup0.us.archive.org CRON[29630]: (openlibrary) CMD (python /petabox/sw/postgres/prune-backups.py -d 2 -w 2 -m 3 /1/postgres-backups/base_backups/pgbase_* 2>&1 | logger -t pg-backups)
Jan 12 06:00:01 ol-backup0.us.archive.org CRON[3241]: (openlibrary) CMD (/olsystem/bin/postgres/pg-backup.sh -h ol-db1 -d /1/postgres-backups/backups openlibrary 2>&1 | logger -t pg-backups)
Jan 12 07:30:01 ol-backup0.us.archive.org CRON[10154]: (openlibrary) CMD (/olsystem/bin/postgres/pg-backup.sh -h ol-db1 -d /1/postgres-backups/backups coverstore 2>&1 | logger -t pg-backups)
Jan 12 09:00:01 ol-backup0.us.archive.org CRON[17096]: (openlibrary) CMD (/olsystem/bin/postgres/pg-basebackup.sh ol-db1 replicator /etc/postgres-replication-pw.txt /1/postgres-backups/base_backups 2>&1 | logger -t pg-backups)

No idea yet how they are getting registered, because crontab -l (for both root and openlibrary) shows no matching entries.

EDIT: Remember how we said olsystem does some symlinking stuff? Exhibit A:

mek@ol-backup0:~$ ls -althr /etc/cron.d
total 32K
lrwxrwxrwx   1 root root   31 Feb 17  2015 pg-backups -> /olsystem/etc/cron.d/pg-backups

mekarpeles (Member) commented Jan 12, 2025

Even if our pg-backups are large, it's worth investigating why our pg_xlog_archive is so large.

mek@ol-backup0:/1/postgres-backups/pg_xlog_archive$ sudo du -sh
7.1T

It's likely related to this cron:

0 4  * * * openlibrary find /1/postgres-backups/pg_xlog_archive -type f -mtime +55 -delete 2>&1 | logger -t pg-backups

The files in /1/postgres-backups/pg_xlog_archive seem to be owned by munin:ssl-cert, but the cron is run as openlibrary, so maybe that's it?

UPDATE: Running sudo find /1/postgres-backups/pg_xlog_archive -type f -mtime +55 -delete has freed up +200GB and counting

The final question is what is causing these files to be owned by munin:ssl-cert, and how to fix it.

For this, on ol-backup0 we run: sudo apt-get install inotify-tools and then sudo inotifywait -m /1/postgres-backups/pg_xlog_archive to see how those files are getting created.
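A slightly more targeted version of that watch, assuming inotify-tools as installed above: log only create/move events with timestamps so the writer's schedule (and therefore the responsible cron or daemon) can be inferred, and separately check the ownership of whatever shows up:

sudo inotifywait -m -r -e create -e moved_to --timefmt '%F %T' --format '%T %w%f %e' /1/postgres-backups/pg_xlog_archive | tee /tmp/pg_xlog_archive_watch.log

# in another shell: ownership of anything written in the last few minutes
sudo find /1/postgres-backups/pg_xlog_archive -mmin -5 -ls | tail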

@mekarpeles mekarpeles added Priority: 2 Important, as time permits. [managed] and removed Priority: 1 Do this week, receiving emails, time sensitive, . [managed] labels Jan 21, 2025