Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node.js March 17th Infrastructure #4

Open
diaspd opened this issue Mar 29, 2023 · 0 comments
Open

Node.js March 17th Infrastructure #4

diaspd opened this issue Mar 29, 2023 · 0 comments

Comments

@diaspd
Copy link
Owner

diaspd commented Mar 29, 2023

Starting on March 15th and going through to March 17th (with much of the issue being mitigated on the 16th), users were receiving intermittent 404 responses when trying to download Node.js from nodejs.org, or even accessing parts of the website.

What surfaced as one issue, users seeing 404s, was actually the result of multiple compounding issues all happening at once:

The Cloudflare config for nodejs.org had multiple Page Rules configured with "Cache Everything" enabled alongside "Browser/Edge Cache TTLs" being set. This resulted in any 404 responses from the origin server being cached by Cloudflare.

The origin server for nodejs.org was being overloaded, and due to a recent configuration change, was now erroneously returning 404 responses for valid files when it was overloaded and couldn't read them from its disk.

A build script bug caused by the recent migration of the website to Next.js resulted in the site being rebuilt and the entire Cloudflare cache being purged every five minutes, sending an abnormally high number of requests to the origin server and putting it under excessive load.

Together, these resulted in the origin server receiving far more requests than normal, failing to read files from its disk that should've existed, returning incorrect 404 responses, and Cloudflare then caching those 404s.

A large team came together to investigate this issue, with the full fix being put in place at approximately 8 PM UTC on the 16th. Many of the Cloudflare Page Rules that had been removed as part of this fix (that were causing the 404s to be cached) had cache TTLs for four hours, and so the full impact of this incident was not resolved until midnight UTC on the 17th.

Related to the 404s users were receiving, it was also discovered that requests to localised pages on nodejs.org weren't resolving as expected when a localised version didn't exist. Instead of falling back to the English version of the pages, they were returning a 404. A secondary update was made to the NGINX config for nodejs.org on the 17th that fixed this, as well as a further set of updates to reduce the load on the origin server.

@diaspd diaspd changed the title Node.js March 17th Infrastructure Incident Post-mortem Node.js March 17th Infrastructure Mar 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant