Postmortem: broken site #2805

Closed
julieg18 opened this issue Sep 7, 2021 · 4 comments

Labels: A: website (Area: website), postmortem (After-failure research and next steps)

Comments

@julieg18
Contributor

julieg18 commented Sep 7, 2021

This ticket contains a postmortem about dvc.org breaking yesterday.

High Level Summary

dvc.org broke for a couple of hours yesterday. None of the buttons and links were working correctly. The first sign of the breakage was a Sentry alert, but a team member only noticed about two hours later when accessing the site. @shcheklein found that #2779 was most likely what broke the site. After we reverted the pull request, the site started working correctly again.

Timeline

All times are in UTC

Perf indicators:

  • Time to notice: about two hours
  • Fix Developed After: 4 minutes
  • Resolved After: 17 minutes

Impact

  • No users were able to use dvc.org correctly

Root cause analysis

  • Pull request #2779 upgraded some packages and fixed some TypeScript/formatting errors. Most likely one or more of the upgraded packages caused the site to break, though further testing/research will be needed to find out which one and why exactly.

Prevention and next steps

  • Maybe keep a closer eye on Sentry errors, though I don't think any members of the websites team were online at the time.
@julieg18 julieg18 added A: website Area: website postmortem After-failure research and next steps labels Sep 7, 2021
@jorgeorpinel
Contributor

jorgeorpinel commented Sep 7, 2021

2021-4-20

I think all the dates were 2021-09-06 🙂

33:03

Interesting time of day 😬

As for prevention, I think #2779 should've been 2 PRs: one for the yarn update, and another for formatting files. The former could be a task for the website team to do periodically, while the docs team can handle formatting when/if needed.

@shcheklein
Member

Thanks @julieg18 for the postmortem. A great summary. One thing we need to discuss further is this:

Maybe keep a closer eye on Sentry errors, though I don't think any members of the websites team were online at the time.

I think that one of the items that we should really prioritize is Sentry. We need to denoise the channel. The noise is what prevented us from paying attention to the problem until we personally hit it.

I think I was online and @jorgeorpinel was around.

And for critical issues we should all be paying more attention and agree that someone keeps an eye on that channel (or all of us do our best, which would be the easiest) even when we are on vacation.

@julieg18
Contributor Author

julieg18 commented Sep 7, 2021

I think that one of the items that we should really prioritize is Sentry. We need to denoise the channel. The noise is what prevented us from paying attention to the problem until we personally hit it.

Makes sense! We could update Sentry to only send alerts to Slack when a certain number of users are hitting the error. Looking at our current dvc.org errors, that should lower the number of alerts, since over half of our dvc.org issues have 1 or 2 users.
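
To make the idea concrete, here is a minimal sketch of the threshold logic in Python. It does not call Sentry or Slack. The `Issue` class, the `issues_worth_alerting` function, the sample data, and the cutoff of 3 users are all assumptions made up for illustration; the real rule would be configured in Sentry itself.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """Hypothetical stand-in for an error group; not a Sentry API type."""
    title: str
    affected_users: int


def issues_worth_alerting(issues, min_users=3):
    """Return only the issues seen by at least `min_users` distinct users."""
    return [issue for issue in issues if issue.affected_users >= min_users]


# Made-up sample data, for illustration only.
sample = [
    Issue("ChunkLoadError", 1),
    Issue("TypeError in link handler", 2),
    Issue("Navigation broken site-wide", 40),
]

for issue in issues_worth_alerting(sample):
    print(f"alert: {issue.title} ({issue.affected_users} users)")
```

With a cutoff of 3, the issues with 1 or 2 users stay out of the channel, while a site-wide breakage like yesterday's would still come through.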

@shcheklein
Member

@julieg18 yep, sounds good
