This repository has been archived by the owner on Feb 7, 2025. It is now read-only.

CA Game Day Exercises #1132

Closed
11 tasks done
JohnNKing opened this issue Jun 5, 2024 · 7 comments

Comments

@JohnNKing
Contributor

JohnNKing commented Jun 5, 2024

Backlog Task

To ensure we're ready to support unexpected issues in production, as a team we need to run through some sample scenarios.

We have the option of using a collaborative coding paradigm so that the whole team becomes familiar with the troubleshooting process.

Potential Exercises:

  • LIMS / EHR offline
  • TI or RS offline
  • Inter-service auth failure
  • Invalid message contents
  • Corrupt zip file via SFTP
  • Changed SFTP encryption key
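
For the corrupt zip file scenario, sample data could be produced along the lines below. This is only a rough sketch under stated assumptions: the file name and message body are hypothetical placeholders, truncation is just one way an archive can end up corrupt, and the SFTP upload itself is not covered.

```python
import io
import zipfile


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build a valid zip archive in memory from a name -> content mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def truncate(data: bytes, keep_fraction: float = 0.5) -> bytes:
    """Simulate a partial upload by keeping only the leading part of the archive."""
    return data[: int(len(data) * keep_fraction)]


def is_readable_zip(data: bytes) -> bool:
    """Return True only if the archive opens and every member passes its CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


if __name__ == "__main__":
    # Hypothetical placeholder content; real exercises would use prepared test messages.
    good = make_zip({"sample-order.hl7": b"placeholder message body"})
    bad = truncate(good)
    print("intact archive readable:", is_readable_zip(good))
    print("truncated archive readable:", is_readable_zip(bad))
```

Checking the truncated archive locally before submitting it helps confirm that the file really is unreadable, so the exercise tests the failure we intend.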

Pre-conditions

  • Access to PagerDuty (or similar tool for notification) (PagerDuty Setup #177)
  • Access to RS metabase
  • Access to appropriate RS Slack channels
  • Access to RS Azure logs & environment (storage blob containers)

Completion Criteria

  • We've completed at least two exercises:
    • One demonstrating an unexpected failure within the TI service.
    • One demonstrating an unexpected failure within ReportStream or the receiving service.

Tasks

  • Determine the scenarios
  • Determine when this exercise will take place
    • Get support from the RS team
  • Document an intro to the session
  • Submit sample data for scenarios
  • Staging configuration change to support one of the scenarios
  • Conduct a blameless post-mortem
  • Conduct a retro for the whole exercise

Other Notes

  • Any other notes to help clarify this task for the team
@JohnNKing added the foundational label (A foundational backlog task) on Jun 5, 2024
@JohnNKing changed the title from "Game Day Exercises" to "CA Game Day Exercises" on Jun 5, 2024
@JohnNKing added the California label (CA - Essential) and removed the foundational label (A foundational backlog task) on Jun 5, 2024
@JohnNKing
Contributor Author

Our Game Day is slated for tomorrow. Two scenarios have been prepped. Test messages and staging config updates are planned for tomorrow.

@JohnNKing
Contributor Author

Pushing this back as we've had to defer due to staging access issues.

@JohnNKing
Contributor Author

The Game Day is now scheduled for this Friday at Noon ET.

@JohnNKing
Contributor Author

Game Day Intro: The goal is to be better prepared when an incident takes place in production. Today, we’ll be treating a number of issues in staging as if they were real issues taking place in production. We should leverage our documented incident response process, as well as the troubleshooting document that shows how to detect ETOR errors in TI and ReportStream. Both are in Notion. Any questions?

@JohnNKing
Contributor Author

We completed one of the exercises last Friday, with the next scheduled for this coming Friday.

@JohnNKing
Contributor Author

Part 2 of the Game Day completed today. We're planning to host a retro for part 2 on Monday.

@JohnNKing
Contributor Author

The retro has started, but we'd like to continue it tomorrow (Tuesday).
