This repository has been archived by the owner on Feb 24, 2022. It is now read-only.

Incident Response Checklist

Porta Antiporta edited this page Jan 17, 2022 · 14 revisions

This is a quick checklist for any incident (security, privacy, outage, degraded service, etc.) to ensure the team can focus on time-critical mitigation and remediation while still communicating appropriately.

ℹ️ This is a checklist for quick reference during an incident. The full guide is located here

Checklist

Initiate

  • Incident declared in #tts-covidtest-situation using @here to get everyone's attention
  • Notify the rest of the cloud.gov team in #cloud-gov using @cg-team and @here
  • Situation Lead and team assemble in War Room (See the Topic in #covid-home-test-situation channel for the link)
  • Roles assigned and duties started (see By Role below)

Assess

  • Incident confirmed
    • System security potentially compromised
    • System unavailable or functionality degraded
    • System under significant active attack from outside or inside threat
    • System integrity in question
  • Severity assigned (can be changed later as new information is collected)
    • High: Confirmed PII breach, confirmed security penetration, complete outage
    • Medium: Suspected PII breach, suspected security penetration, partial outage
    • Low: Suspected attack, outage of non-prod persistent system (stage)
  • If user- or partner-impacting, communicate this as an @channel in #test-website
  • If a secure shared notepad is needed, Google Doc opened and shared: https://drive.google.com/drive/folders/1TWTMp_w55niNuqC7vTPDEe5vkxaiP4P0 (contents should be copied to the official issue)
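
The severity levels above can be read as a simple lookup. The following is a minimal sketch only, not part of the official process: the finding labels (such as `complete_outage`) and the function name `assign_severity` are hypothetical, and the rule that the highest matching level wins is an assumption the checklist does not state.

```python
from enum import Enum


class Severity(Enum):
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Hypothetical finding labels that paraphrase the checklist's severity criteria.
CRITERIA = {
    Severity.HIGH: {
        "confirmed_pii_breach",
        "confirmed_security_penetration",
        "complete_outage",
    },
    Severity.MEDIUM: {
        "suspected_pii_breach",
        "suspected_security_penetration",
        "partial_outage",
    },
    Severity.LOW: {"suspected_attack", "nonprod_outage"},
}


def assign_severity(findings):
    """Return the highest severity with a matching finding, or None if nothing matches.

    Assumption: when findings match several levels, the highest level wins.
    """
    findings = set(findings)
    for level in (Severity.HIGH, Severity.MEDIUM, Severity.LOW):
        if findings & CRITERIA[level]:
            return level
    return None
```

Because severity can change as new information is collected, a function like this would be called again whenever the set of findings changes.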

Remediate

  • For security incidents, consult official policy before destroying ANY evidence! To contain, detach a compromised instance; do not destroy it!

Loop through per-role items until remediation is complete.

By Role

  • Situation Lead (SL)
    • Wellbeing of group monitored, including self (Tired and stressed humans make poor decisions)
    • Rotations of all roles planned and performed to prevent any responder spending more than 3 hours in role
  • Technical Lead (TL)
    • Technical response led until the issue is remediated OR the role is handed off
  • Comms Lead (CL)
    • Regular updates to interested parties provided
    • StatusPage updated as status changes
  • Scribe (SC)
    • Full record of the incident maintained in Slack
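
The Situation Lead's 3-hour rotation limit could be tracked with a small helper. This is an illustrative sketch under assumptions: the function name `overdue_rotations` and the mapping of role abbreviations to start times are hypothetical, and "more than 3 hours" is read as strictly greater than 3 hours.

```python
from datetime import datetime, timedelta

MAX_TIME_IN_ROLE = timedelta(hours=3)  # limit stated in the checklist


def overdue_rotations(role_started_at, now):
    """Return, sorted, the roles whose holder has been in role for more than 3 hours.

    role_started_at maps a role abbreviation (e.g. "SL") to the datetime
    at which the current responder took over that role.
    """
    return sorted(
        role
        for role, started in role_started_at.items()
        if now - started > MAX_TIME_IN_ROLE
    )
```

For example, with the SL starting at 09:00 and the TL at 11:00, a check at 12:30 would flag only the SL for rotation.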

Upon remediation:

Retrospect

  • Postmortem doc started from copy of Postmortem Template
  • Postmortem meeting scheduled with entire incident response team

Resources

Kudos

To the Login.gov handbook from which 99% of this was taken.