User Story
As a user of the data platform, I want to enhance our observability and monitoring capabilities to ensure the reliability, performance, and security of our systems.
I expect that by implementing a comprehensive set of tools, processes, and best practices, we will empower our teams to proactively detect, diagnose, and respond to issues.
So that we ultimately deliver a better experience to our users.
Proposal
Improved Data Collection: Enhance our data collection mechanisms to gather more relevant and actionable information from our systems.
Real-time Visibility: Achieve real-time visibility into the health and performance of all critical systems and services.
Alerting and Notification System: Implement a robust alerting and notification system that delivers timely alerts to the right stakeholders.
Root Cause Analysis: Develop tools and processes for efficient root cause analysis to reduce downtime and improve system reliability.
Scalability and Flexibility: Ensure that our observability and monitoring solutions can scale with our growing infrastructure and adapt to changing requirements.
Security Monitoring: Strengthen our security monitoring to detect and respond to security threats and vulnerabilities proactively.
Documentation and Training: Provide comprehensive documentation and training for teams to effectively utilise observability and monitoring tools.
Integration and Automation: Integrate observability and monitoring into our CI/CD pipelines and automate common responses to known issues.
Definition of Done
README has been updated
User docs have been updated
Another team member has reviewed
Tests are green
All critical systems and services are instrumented for observability.
Real-time dashboards provide key performance metrics and system health information.
Alerts are triggered based on predefined thresholds, and notifications are sent to the appropriate stakeholders.
Security monitoring is in place, and suspicious activities are investigated promptly.
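To make the alerting criterion concrete, here is a minimal sketch of threshold-based alert evaluation with per-rule routing. Everything in it is an assumption for illustration only: the `AlertRule` type, the `evaluate` helper, the metric names, the thresholds, and the recipient names are hypothetical and not taken from the platform. In practice this logic would more likely live in Prometheus alerting rules and Grafana notification routing rather than in application code.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when `metric` rises above `threshold`."""
    name: str
    metric: str
    threshold: float
    notify: Tuple[str, ...]  # stakeholders to notify (illustrative names)


def evaluate(rules: List[AlertRule],
             samples: Dict[str, float]) -> List[Tuple[str, Tuple[str, ...]]]:
    """Return (rule name, recipients) for each rule whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = samples.get(rule.metric)
        # A missing metric is skipped here; a real system may want to alert on absence.
        if value is not None and value > rule.threshold:
            fired.append((rule.name, rule.notify))
    return fired


# Example rules and samples; all values are made up for this sketch.
RULES = [
    AlertRule("HighErrorRate", "http_error_ratio", 0.05, ("platform-oncall",)),
    AlertRule("DiskAlmostFull", "disk_used_ratio", 0.90,
              ("platform-oncall", "storage-team")),
]

fired = evaluate(RULES, {"http_error_ratio": 0.12, "disk_used_ratio": 0.40})
for name, recipients in fired:
    print(f"{name} -> {', '.join(recipients)}")
```

Keeping the recipients on the rule itself means each alert carries its own routing, which is one simple way to satisfy "notifications are sent to the appropriate stakeholders".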
🚀 Add Managed Prometheus and Managed Grafana #1825
Implement Auth0 Log Collection in Modernisation Platform #1742
⚗️ Spike: Using ADOT for Observability - Open Telemetry #1830
Spike: Integrate Labs' Logging and metrics into our Monitoring Stack #1888
🔱 Investigate using kube-prometheus-stack instead of vanilla Prometheus #1895
:alert: Add Alerting #1924
📈 Collect List of Signals for New World EKS Clusters #1926
Define and Implement Metrics for Monitoring of Cloud Platform-migrated AP Apps #1095
⚡ Implement Prometheus Operator Linter #1939
🕵️ Investigate if it's possible to programmatically control Observability Platform Grafana #2028
🕵️ Investigate if it's possible to sync AWS IAM Identity Center groups with Grafana Teams #2029
SPIKE- Investigation on Mapping all logging to Grafana Dashboard #2034
📊 Pull OpenMetaData metrics logs into our observability stack #2099
Implement Collection of CloudWatch Logs & Metrics from Labs #2145
🔍 Setup DPAT Alerting Routes in Alpha Grafana #2234
🔍 Define Alerts for our Modernisation Platform and route them #2235
🔎 Implement RBAC in Grafana to allow Users to configure dashboards and alerts #2236
⚡ Add XRay to Observability Platform Role to allow Metrics collection #2237
🚨 Alerting on Eventbridge #2331
💵 Capture AWS costs for AP #2009