DB Archival Failing (out of space causing site-wide outage) #10082
Confirmed, this bug appeared within the last 16 hours. In my last test prior to then, it was working.
Same issue for me, but it doesn't happen for all URLs. On /search, searches by Author and Subject do return results, though when I click a result to open its page, the error shows up. A database error is displayed on certain pages, like https://openlibrary.org/works/OL18203673W/The_Seven_Husbands_of_Evelyn_Hugo, while a lot of the other ones, like this search attempt, show a different error. The successful Subject- and Author-based search results may be helpful in identifying which resources are isolated from the bug.
Thank you all, staff is debugging; we're converting this into a postmortem.
Appreciate the hard work, thanks for the update. |
Summary
Every URL in the openlibrary.org domain returned an internal error. This issue started some time within the last 24 hours.
Website offline due to out of space on ol-db1. Several days later, we were able to determine the cause of the problem and the fix. Archival from ol-db1 to ol-backup0 had failed. We removed a 2gb ballast file that @mekarpeles and @samuel-archive had created back in the day (presciently predicting such an event?), which cleared space on ol-db1. Still, archival and defill/drain on ol-db1 was not happening. Our next steps were:
TL;DR: moving some xlogs (not the oldest, not the newest) temporarily out of the way is what unblocked archival.
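A rough sketch of that stop-gap, using throw-away paths (`/tmp/demo_pg_xlog` and `/tmp/demo_spare` are stand-ins for the real xlog directory and a volume with free space; the real segment names and counts differed):

```shell
#!/bin/sh
# Sketch: move a middle slice of WAL segments aside to free space,
# leaving the oldest (may still be needed downstream) and the newest
# (still being written/archived) in place. Paths are stand-ins.
set -eu

SRC=/tmp/demo_pg_xlog    # stand-in for the real xlog directory
DST=/tmp/demo_spare      # stand-in for a volume with free space
mkdir -p "$SRC" "$DST"

# fake segment names for the demo (real ones are 16mb WAL files)
for i in 1 2 3 4 5 6 7 8 9; do
  touch "$SRC/00000001000000000000000$i"
done

# oldest-first; keep the first 2 and last 2, move the middle slice out
ls "$SRC" | sort | head -n -2 | tail -n +3 | while read -r seg; do
  mv "$SRC/$seg" "$DST/$seg"
done

echo "left behind: $(ls "$SRC" | wc -l), moved aside: $(ls "$DST" | wc -l)"
```

Care about ordering matters here: the oldest segments may still be wanted by replication, and the newest are still in flight, which is why the slice comes from the middle.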
During this process, there were a few times where weird collisions occurred. We also had a case where two files in the archive conflicted with each other.
First, we removed the 2gb ballast file that had been created in the past for exactly this kind of event. Next, we began the process of freeing up space on ol-db1 to restore service to openlibrary, but we needed to do so in a way that wouldn't impact the ol-db1 db or break replication to ol-db2. Ultimately, we moved ~1000 old 16mb xlogs temporarily off the full volume. The remaining step is to check the state of replication, primary vs. replica.
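The ballast-file trick generalizes well; here is a minimal sketch (demo size and path are stand-ins, not the 2gb file on the real db volume):

```shell
#!/bin/sh
# Sketch: pre-allocate a ballast file on the database volume. If the disk
# ever hits 100%, deleting it instantly buys headroom. The 2MB here is a
# demo size; production used a 2gb file on the real db volume.
set -eu

BALLAST=/tmp/demo_ballast
# fallocate is fast on filesystems that support it; dd is the fallback
fallocate -l 2M "$BALLAST" 2>/dev/null ||
  dd if=/dev/zero of="$BALLAST" bs=1M count=2 status=none

ls -l "$BALLAST"
# Emergency release: rm "$BALLAST"
```

The design point is that `rm` needs no free space to run, so the ballast converts a hard 100%-full outage into a one-command recovery.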
Several hours of downtime on the website.
Add reporting when space is low on ol-backup0, or some sort of error alert as to the underlying issue.
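A sketch of the kind of check that would have surfaced this earlier (the threshold and target mount are assumptions; in practice this would run from cron on ol-db1/ol-backup0 and page somewhere instead of echoing):

```shell
#!/bin/sh
# Sketch: warn when a filesystem crosses a usage threshold.
set -eu

THRESHOLD=90
MOUNT=/tmp   # stand-in; in production: the postgres / archive volume

# second line of portable df output; strip the % from the use column
usage=$(df -P "$MOUNT" | awk 'NR==2 { sub("%","",$5); print $5 }')

if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "ALERT: $MOUNT is ${usage}% full"
else
  echo "OK: $MOUNT is ${usage}% full"
fi
```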
Document the process by which ol-db1 replicates to ol-db2 and drains wal/xlog into ol-backup0.
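For reference, the standard PostgreSQL mechanism for this kind of draining is `archive_command`; a sketch of what such a setup looks like (the actual olsystem configuration may differ, and the rsync destination path here is an assumption):

```
# postgresql.conf on the primary (hypothetical values)
wal_level = replica
archive_mode = on
# run once per completed 16mb segment; %p = path to segment, %f = file name
archive_command = 'rsync -a %p ol-backup0:/1/pg_xlog_archive/%f'
```

If `archive_command` fails, PostgreSQL retains the unarchived segments locally, which is exactly how a broken archive target fills the primary's disk.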
Replication setup remains something of a mystery, left for a different issue :)

Steps to close
The site is back up; we're still monitoring and keeping this issue open.
We are at 100% full again on |
Problem:
Freeing Up Space on
Crons do somehow seem to be running on ol-backup0.
No idea yet how they are getting registered. EDIT: Remember how we said the crons are run from /etc/cron.d, which has symlinks to olsystem?
Even if our cron were running, it likely couldn't clear the files. It's likely related to the `pg-backups` cron. The files in `pg_xlog_archive` are owned by `munin:ssl-cert` and not `openlibrary:openlibrary`. UPDATE: Running the cron command manually to check. The final question is what is causing these files to be owned by `munin:ssl-cert`; for this, we're investigating on ol-backup0.
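A sketch of how to confirm the ownership mismatch and what the fix would look like (the path is a stand-in, and the current user stands in for `openlibrary`; the real chown would be run as root on ol-backup0):

```shell
#!/bin/sh
# Sketch: count archive files not owned by the expected user. In
# production the expected owner is openlibrary; here the current user
# stands in so the demo is runnable anywhere.
set -eu

ARCHIVE=/tmp/demo_owner_check   # stand-in for pg_xlog_archive
mkdir -p "$ARCHIVE"
touch "$ARCHIVE/segment_a"

# production equivalent: find /path/to/pg_xlog_archive -type f ! -user openlibrary
find "$ARCHIVE" -type f ! -user "$(id -un)" | wc -l

# the fix, as root on the real host:
#   chown -R openlibrary:openlibrary /path/to/pg_xlog_archive
```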
Jump to the postmortem tracking comment here.
Problem
Every URL in the `openlibrary.org` domain returns the following error message:

```
{"error": "internal_error", "message": "get_site(openlibrary.org) failed"}
```
This issue started some time within the last 24 hours.
Next Steps
Most recent case is #10082 (comment) and @mekarpeles writes:
- Cron: `openlibrary python /petabox/sw/postgres/prune-backups.py -d 7 -w 6 -m 12 /1/postgres-backups/backups/pgdump_openlibrary* 2>&1 | logger -t pg-backups`
- Is this the `ol-home0` cron container trying to run the `olsystem/etc/cron.d/pg-backups` cron (and if it is, why is it failing)?
- The cron is being run on `ol-backup0` from `/etc/cron.d`, which has symlinks to `olsystem` 😄
- `pg_xlog_archive` files are owned by `munin:ssl-cert` and not `openlibrary:openlibrary`
- `pg_xlog_archive` and failure of cron to clear old files