Technical Resilience
What it describes and how it can help enterprises succeed
From One Change to Blackout
Sometimes outages cannot be avoided. A small change in a large, complex system can lead to an outage that affects a billion people for hours. Recent events, such as the Facebook outage [1] that affected most readers of this article, have shown this. In recent years a new engineering discipline has emerged, technical resilience, which offers a promising way to deal with such disturbances.
What does Technical Resilience mean?
A core assumption behind resilience research is that outages and disruptive events will happen. Resilience approaches therefore aim to reduce the negative outcomes of those events as far as possible.
As the Facebook example shows, outages are not just minor inconveniences: they can damage a company’s reputation and cause lasting financial losses.
Phases of Resilience
Resilience activities fall into five phases: preparedness, prevention, protection, response, and recovery. Conceptually, these phases form a cycle that is run repeatedly, with learning steps in between.
How could Resilience have helped to prevent the outage?
The recent outages motivated us to share our thoughts on what could have been done to mitigate the consequences of such events.
Analysis is one of the most time-consuming activities during such events. The shorter the analysis time, the sooner countermeasures can be applied to mitigate or even prevent negative consequences. With CuriX we go one step further and identify potentially problematic conditions within systems before critical system states are reached.
If you take these hints seriously, you can start analyzing events even before they occur. The early-warning and predictive capabilities of CuriX give you the means to gain valuable time for crisis and disaster management.
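To illustrate how a prediction can buy time, here is a minimal sketch in Python. It is not the CuriX algorithm; it assumes a single metric sampled at regular intervals and a simple straight-line trend. The function name steps_until_threshold, the disk usage values and the 90 percent limit are hypothetical and chosen only for illustration.

```python
def steps_until_threshold(samples, threshold):
    """Fit a least-squares line to the samples and estimate how many
    future sampling steps remain until the line reaches `threshold`.

    Returns None if there are too few samples or the trend is not rising.
    """
    n = len(samples)
    if n < 2:
        return None

    # Sample positions are 0, 1, ..., n-1 (regular sampling is assumed).
    x_mean = (n - 1) / 2
    y_mean = sum(samples) / n

    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(samples))

    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    if slope <= 0:
        return None  # flat or falling trend: no crossing expected

    # Position at which the fitted line crosses the threshold,
    # expressed relative to the most recent sample.
    x_cross = (threshold - intercept) / slope
    return max(0.0, x_cross - (n - 1))


# Hypothetical disk usage in percent, one sample per hour.
disk_usage_pct = [70, 72, 74, 76, 78]
print(steps_until_threshold(disk_usage_pct, 90))
```

With these made-up values the trend rises by two points per hour, so the sketch reports roughly six hours of lead time before the limit is reached. Real metrics are rarely this linear, which is why production systems tend to rely on more robust forecasting models.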
We do this by predicting individual measured metrics, detecting anomalies, identifying critical system states, and intelligently analyzing system topologies. All of this is fully automated and based on the latest lessons learned in IT management, risk assessment and, last but not least, resilience research.
It’s time for modern approaches such as CuriX: recent outages have shown that even big companies have not yet reached an optimal level of technical resilience.
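As a rough illustration of what anomaly detection on a single metric can look like, the following Python sketch flags values that deviate strongly from a trailing window. This is a generic rolling z-score, not the method CuriX uses; the function name detect_anomalies, the window size, the threshold and the latency values are all assumptions made for this example.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indices of values that lie more than `threshold`
    standard deviations away from the mean of the preceding `window` values.
    """
    history = deque(maxlen=window)
    anomalies = []

    for i, value in enumerate(values):
        # Only judge a value once a full window of history is available.
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # A constant window has sigma == 0; skip it to avoid dividing by zero.
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)

    return anomalies


# Hypothetical response times in milliseconds with one obvious spike.
latency_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(detect_anomalies(latency_ms))  # [7]
```

The sketch looks at one metric in isolation. Identifying critical system states would additionally require combining many metrics and the relationships between components, which is beyond what this example shows.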
[1] https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/