Proposal For Reliability and Failure System Redesign
- Failures tend to happen in clumps, all at the same time
- Reliability % is abstract and doesn't mean a whole lot
- Different polling intervals can produce more or fewer failures overall, even at the same reliability
- Parts have no control over timing in the system
- Pull based system lacks flexibility and can cause performance issues
- Rebuild existing polling system to move work for updates onto the parts. This will move from a Pull system to a Push system where the parts can directly determine their state.
- Moving forward, reliability and failure will be handled by one (or both) of two systems: Simple Modeling and Complex Modeling.
- Simple Modeling is designed to provide better, simpler gameplay for a more Stock-style KSP playthrough, while Complex Modeling is designed to allow more complex and realistic modeling for Realism players. The two systems are not mutually exclusive, however, and can be used together as long as some common sense is applied when combining them.
- The Simple Model will work much as the system does now, except that Reliability will be based on an MTBF system. In the Simple Model, the part will have a base reliability rating, e.g. an MTBF of 6 hours, or 2 flights, and a triggered failure will be chosen at random from the failures assigned to the part, as in the existing system.
- The Complex Model instead focuses on specific events triggering specific failures. While the part still has the same overall MTBF rating, individual actions modify that reliability and can trigger specific failures. E.g. if an engine is being gimbaled excessively, that action can trigger a gimbal failure.
Under the current system, a part is given a flat % reliability, and a check is made each polling cycle by rolling against that reliability. If the roll fails, a random failure from the list of part-specific failures is triggered.
The new system will move to a more realistically modeled Failure Rate and Mean Time Between Failure (MTBF) rating for parts. As flight data is accumulated, the Failure Rate will decrease and the MTBF for the parts will increase, meaning parts will fail less often.
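The proposal does not say how Failure Rate and MTBF relate numerically. One common convention, assumed here purely for illustration, is a constant failure rate equal to 1 / MTBF, which gives a failure probability of 1 - exp(-rate * dt) over an interval dt. The sketch below uses hypothetical function names and shows that, under this assumption, the chance of surviving a fixed span does not depend on how often the part is checked.

```python
import math

def failure_rate(mtbf_seconds: float) -> float:
    """Failures per second, assuming a constant failure rate of 1 / MTBF."""
    return 1.0 / mtbf_seconds

def failure_probability(mtbf_seconds: float, dt_seconds: float) -> float:
    """Probability of at least one failure during an interval of dt_seconds."""
    return 1.0 - math.exp(-failure_rate(mtbf_seconds) * dt_seconds)

def survival_over(mtbf_seconds: float, total_seconds: float, polls: int) -> float:
    """Chance of surviving total_seconds when checked in `polls` equal steps."""
    dt = total_seconds / polls
    return (1.0 - failure_probability(mtbf_seconds, dt)) ** polls

if __name__ == "__main__":
    mtbf = 6 * 3600.0  # the 6 hour MTBF used as an example in this proposal
    print(survival_over(mtbf, 3600.0, polls=1))
    print(survival_over(mtbf, 3600.0, polls=360))
```

Both calls cover the same hour of operation, one with a single check and one with 360 checks, and they agree up to floating-point rounding. A flat per-poll percentage does not have this property, which is the polling-interval problem listed at the top of this proposal.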
With the Simple Model, the TestFlightReliability modules on the part will be used to compute an aggregate reliability based on available flight data and the part's current operating conditions. This will be reported to the player as two numbers: resting reliability and momentary reliability. The resting reliability is the static base reliability of the part, calculated from the flight data available when the part was created, and it will not change during the flight. The momentary reliability is the moment-to-moment dynamic reliability, calculated from the part's current operating conditions. Those conditions generate a modifier that is applied to the resting reliability to produce a final indicator of current reliability.
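As a rough sketch of the resting/momentary split, the Python below stores the resting value once and derives the momentary value on demand. The class name, the use of MTBF in seconds as the reliability unit, and the choice to multiply modifiers together are all assumptions made for illustration; the proposal only states that a modifier is applied to the resting reliability.

```python
from dataclasses import dataclass
from typing import Iterable

@dataclass(frozen=True)
class PartReliability:
    # Resting value, fixed when the part is created (hypothetical unit: seconds of MTBF).
    resting_mtbf: float

    def momentary_mtbf(self, modifiers: Iterable[float]) -> float:
        """Apply operating-condition modifiers to the resting value.

        Combining modifiers by multiplication is an assumption of this sketch.
        """
        combined = 1.0
        for modifier in modifiers:
            combined *= modifier
        return self.resting_mtbf * combined

if __name__ == "__main__":
    engine = PartReliability(resting_mtbf=6 * 3600.0)
    print(engine.momentary_mtbf([]))          # no stress: equals the resting value
    print(engine.momentary_mtbf([0.5, 0.5]))  # two hypothetical stress modifiers
```

Making the dataclass frozen mirrors the statement that the resting reliability does not change during the flight; only the momentary value varies with conditions.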
When a part is deemed to fail, a random failure from the list of part-specific failures is triggered. Once a part is in a failed state, it will not be checked for failure again until the existing failure is repaired. Once the failure has been repaired, the part is considered to be "like new".
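The failure lifecycle described above can be sketched as a small state machine. The `Part` class, its method names, and the example failure names are hypothetical, and "like new" is modeled here only as clearing the active failure; the proposal does not define it further.

```python
import random
from typing import List, Optional

class Part:
    def __init__(self, failures: List[str]):
        self.failures = failures                    # part-specific failure list
        self.active_failure: Optional[str] = None   # None means the part is healthy

    def check(self, failure_probability: float, rng: random.Random) -> Optional[str]:
        """Roll for failure; return the newly triggered failure, if any."""
        if self.active_failure is not None:
            return None  # already failed: no further checks until repaired
        if self.failures and rng.random() < failure_probability:
            self.active_failure = rng.choice(self.failures)
            return self.active_failure
        return None

    def repair(self) -> None:
        """Clear the failure so the part is checked again on later rolls."""
        self.active_failure = None

if __name__ == "__main__":
    rng = random.Random(0)
    engine = Part(["Shutdown", "Gimbal lock"])
    print(engine.check(1.0, rng))  # forced failure for demonstration
    print(engine.check(1.0, rng))  # None: part is already failed
    engine.repair()
    print(engine.active_failure)   # None: part is healthy again
```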
The Complex Model starts with the same resting reliability and momentary reliability system described for the Simple Model, but then attempts to model more realistically how flight conditions interact with related failures. In the complex system a general failure can still occur based on checks against the momentary reliability, resulting in a random failure, though it is possible to assign zero general failures if desired.
In addition, the system introduces a new module called TestFlightFailureTrigger, which is essentially a hybrid between a Reliability module and a Failure module. The FailureTrigger module starts by implementing a reliability calculation just like a base Reliability module, using the momentary reliability of the part as its base. It then applies modifiers to that reliability based on specific operating conditions that might cause a specific failure. This modified reliability is reported back to the TestFlightCore. When a part has multiple FailureTrigger modules, each affecting momentary reliability based on operating conditions, only the lowest momentary reliability will be reported to the player, along with a tooltip indicating the reason. Secondly, the FailureTrigger module contains its own internal Failure implementation, which can only be triggered by the module itself, based on its own calculations. The end result is that operating conditions can be linked to the failures they cause, so for example an engine operating at extremely high temperatures can trigger an explosion.
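The "lowest momentary reliability plus a reason" rule can be sketched as follows. This is not the TestFlight API: the `FailureTrigger` class here, its fields, the multiplicative modifier, and the example reasons and numbers are assumptions chosen only to illustrate the selection logic.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FailureTrigger:
    reason: str      # text for the tooltip shown to the player
    modifier: float  # hypothetical multiplier derived from operating conditions

    def momentary(self, base: float) -> float:
        """Momentary reliability after this trigger's modifier is applied."""
        return base * self.modifier

def reported_reliability(base: float, triggers: List[FailureTrigger]) -> Tuple[float, str]:
    """Return the lowest momentary value across triggers and the reason for it."""
    if not triggers:
        return base, ""  # no triggers: report the unmodified value
    worst = min(triggers, key=lambda t: t.momentary(base))
    return worst.momentary(base), worst.reason

if __name__ == "__main__":
    triggers = [
        FailureTrigger("Excessive gimbal", 0.8),
        FailureTrigger("High temperature", 0.25),
    ]
    print(reported_reliability(1000.0, triggers))
```

Because each trigger keeps its own modified value, it can also run its own failure roll against that value and fire only the failure it owns, which is how an overheating engine would lead specifically to an explosion rather than a random failure.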