This Software Reliability Model (SRM) provides a flexible, explainable model of software reliability in terms of technical foundations, socio-technical constraints, and human factors. It is designed to help people in a range of roles involved in building and running software systems explore and explain software reliability (especially when onboarding new team members), and to track progress in improving reliability. The SRM is also designed to make it easy to generate and update hierarchical metrics for product health scores across multiple teams.
The software reliability model is designed to be relevant to several different kinds of software systems:
- internet and cloud-based software
- desktop software
- IoT and embedded software
- (combinations of the above)
The different team measures, contexts, and genres of influences are linked to create a graph that helps to explain the dynamics around software reliability. The graph is then visualised in Kumu.
This software reliability model was co-created by people from TELUS Digital (@telus) and Conflux (@ConfluxHQ), with significant contributions from:
- Bojan Savic of TELUS Digital
- Matthew Skelton of Conflux
The SRM is aimed at these kinds of people:
- Product Owner / Product Manager
- SRE Manager / SRE Lead
- Software Architect / Systems Architect / Test Architect
- Software Developer
- Software Tester
The SRM helps these people to explore and discuss different aspects of software reliability to help make targeted improvements.
The SRM is composed of 2 main parts:
- definitions of reliability factors in CSV format suitable for import to Kumu 📄
- graphs in Kumu generated by importing the CSV definitions 📊
The CSV files (and visualisation settings) are imported into Kumu to generate explorable graphs.
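As an illustration only, the element and connection definitions might look something like the following, assuming separate element and connection CSVs with `Label`, `Type`, `Tags`, `From`, and `To` headers. The example rows, the connection type `influences`, and the multi-word type names are assumptions for illustration, not copied from the actual files (which use the selector-friendly, single-word naming described further below):

```csv
Label,Type,Tags
Reliability,reliability,
Speed of remediation,context,Reliability and SRE
MTTR is tracked,team measure,4 Key Metrics
```

```csv
From,To,Type
MTTR is tracked,Speed of remediation,influences
Speed of remediation,Reliability,influences
```

Kumu renders each element type differently and uses the tags for filtering, as described in the sections below.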
There are several ways to use the SRM. Here are some suggestions:
- Freeform exploration: use the Kumu graphs to investigate different aspects of reliability in an open-ended, exploratory way.
- Guided workshops: use the context groupings to do a deep dive into specific aspects of reliability. For example, run a 90-minute workshop on Decoupling and isolation or Speed of remediation. Use the workshop to build awareness within the team of the team-level practices and measures that sit under that context parent node. Then repeat the workshop with a new context.
- Metrics roll-up: use the SRM to score teams on their current reliability practices and status. The Metric and Measure details for each leaf node describe what to measure and the type of measurement. Aggregate the measures into the parent nodes until you have a single Reliability score for that team (see the roll-up sketch after this list).
- All of the above: combine all three approaches for maximum benefit, helping team members understand how they can improve reliability day to day.
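As a rough sketch of the roll-up idea only: the hierarchy fragment, node names, 0-to-1 scoring scale, and simple averaging below are illustrative assumptions, not the model's actual scoring rules.

```python
# Minimal sketch of a metrics roll-up: leaf nodes carry scores,
# each parent takes the mean of its children, up to a single
# Reliability score. Node names here are hypothetical examples.

from statistics import mean

# Hypothetical fragment of the SRM hierarchy: parent -> child factors
hierarchy = {
    "Reliability": ["Decoupling and isolation", "Speed of remediation"],
    "Decoupling and isolation": ["Deployment is decoupled from release"],
    "Speed of remediation": ["MTTR is tracked", "On-call rota exists"],
}

# Scores recorded for leaf nodes only (0.0 = not in place, 1.0 = fully in place)
leaf_scores = {
    "Deployment is decoupled from release": 0.5,
    "MTTR is tracked": 1.0,
    "On-call rota exists": 0.75,
}

def score(node: str) -> float:
    """Return the leaf score, or the mean of the children's scores."""
    children = hierarchy.get(node)
    if not children:
        return leaf_scores[node]
    return mean(score(child) for child in children)

print(f"Reliability score: {score('Reliability'):.2f}")  # e.g. 0.69
```

In practice you may want to weight children differently, or aggregate using the measurement types defined in the Metric and Measure details for each leaf node.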
Visit the latest stable version of the reliability model on Kumu: https://kumu.io/reliability-model/latest
See all versions of the model: https://kumu.io/reliability-model
There are several types of factors in the SRM - each factor type is shown differently in the Kumu graph:
- team measure - team-level measures that influence reliability
- context - the context in which measures are taken
- genre - the high-level grouping of measures
- reliability - the ultimate goal of all these factors
Tags are used to explore different dimensions of the model:
- 4 Key Metrics - from the book Accelerate:
  - Lead time
  - Deployment frequency
  - Mean Time to Restore (MTTR)
  - Change failure rate
- CodeScene - measures from the tool CodeScene (see CodeScene.io)
- Continuous Delivery - measures from the Continuous Delivery dimension of MSDA
- Deployment - measures from the Deployment dimension of MSDA
- Deployment technique - techniques for reliability focused on deployment aspects
- Flow - measures from the Flow dimension of MSDA
- Human technique - techniques for reliability focused on human aspects
- MSDA - measures from Multi-team Software Delivery Assessment (MSDA)
- On-call - measures from the On-call dimension of MSDA
- Operability - measures from the Operability dimension of MSDA
- RTCE - measures from the Reliability Through Customer Eyes (RTCE) principles devised by TELUS and Conflux
- Reliability and SRE - measures from the Reliability dimension of MSDA
- Runtime technique - techniques for reliability focused on runtime aspects
- Team Health - measures from the Team Health dimension of MSDA
- Team Topologies - measures derived from the book Team Topologies
- Team API - measures relating to the 'Team API' concept in Team Topologies
- Team Autonomy - measures relating to team autonomy as discussed in Team Topologies
- Team Cognitive Load - measures relating to the 'Team Cognitive Load' concept in Team Topologies
- Testability - measures from the Testability dimension of MSDA
- UX - measures relating to end-user experience
- Version Control Hygiene - measures relating to good version control practices
- Names of Kumu element types, connection types, and tags are "selector friendly" for the Kumu Advanced Editor: a single word each.
- The graph layout visualisation is controlled by the settings in the `*.css` view files (imported into the Advanced Editor settings in Kumu).
- The CSV data import in Kumu needs some attention to detail. Be sure to follow the CSV import details.
These books influenced the reliability model significantly:
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim
- Agile Testing by Lisa Crispin and Janet Gregory
- Continuous Delivery by Jez Humble and Dave Farley
- Growing Object-Oriented Software by Steve Freeman and Nat Pryce
- Principles of Product Development Flow by Don Reinertsen
- Seeking SRE edited by David N. Blank-Edelman
- Site Reliability Engineering by Betsy Beyer, Chris Jones, Jennifer Petoff, & Niall Murphy
- Team Guide to Metrics for Business Decisions by Mattia Battiston and Chris Young
- Team Guide to Software Operability by Matthew Skelton, Alex Moore, & Rob Thatcher
- Team Guide to Software Testability by Ash Winter and Rob Meaney, and the companion website TestabilityQuestions.com
- The Site Reliability Workbook edited by Betsy Beyer, Niall Richard Murphy, David K. Rensin, Kent Kawahara, & Stephen Thorne
- Working Effectively with Legacy Code by Michael Feathers
- Use CI to test Pull Requests against the Kumu import (see the sketch below), checking for:
  - Duplicate nodes
  - Dangling connectors
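A minimal sketch of such a check, assuming separate `elements.csv` and `connections.csv` files with `Label`, `From`, and `To` columns; the file names and column headers are assumptions and should be adjusted to match the actual repository layout before wiring this into CI:

```python
# Sketch of a CI check for the Kumu CSV definitions:
# - duplicate nodes: the same element label defined more than once
# - dangling connectors: connections whose endpoints are not defined as elements
# File names and column headers below are assumptions.

import csv
import sys

def read_column(path: str, column: str) -> list[str]:
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column].strip() for row in csv.DictReader(f)]

labels = read_column("elements.csv", "Label")
froms = read_column("connections.csv", "From")
tos = read_column("connections.csv", "To")

errors = []

duplicates = {label for label in labels if labels.count(label) > 1}
if duplicates:
    errors.append(f"Duplicate elements: {sorted(duplicates)}")

known = set(labels)
dangling = [end for end in froms + tos if end not in known]
if dangling:
    errors.append(f"Dangling connection endpoints: {sorted(set(dangling))}")

if errors:
    print("\n".join(errors))
    sys.exit(1)  # fail the CI job

print("Kumu CSV definitions look consistent.")
```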