It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.
— Martin Kleppmann, Designing Data-Intensive Applications
While we can't engineer away every failure, we can document the most likely failure modes and design against them so one fault doesn't take down the whole system.