Resilience goals for Oak

This page is an effort to clarify the concept of resilience and related terms, and to define goals for Oak in that respect.

Resilience

Resilience refers to the ability to withstand, contain and recover from failures.

A single failure means that only one component fails at any given time, while multiple failures means that more than one component may fail at the same time.

To withstand a failure means to stay operational and sufficiently responsive while the failure occurs.

To contain a failure means that its adverse effects do not spread beyond its initial scope, i.e. there is no collateral damage.

To recover from a failure means undoing the impact caused by the failure (e.g. automatically or by manual intervention) and returning to normal operation.

The impact of a failure roughly falls into one of six levels where each level is worse than its predecessor:

Goals for Oak

Oak should be resilient against single failures such that complete outages (level 5) do not occur. Oak is not resilient against multiple failures, however; sufficient redundancy needs to be built into the system to cope with those.
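
The goal above can be read as a rule that a failure scenario either satisfies or violates. The following is a minimal sketch for illustration only and is not part of Oak. It assumes the six impact levels are numbered 0 to 5, with level 5 being the complete outage mentioned above. The names Failure, COMPLETE_OUTAGE and meets_goal are hypothetical.

```python
from dataclasses import dataclass

# Assumption: six impact levels numbered 0..5, each worse than the previous.
# Only level 5 (complete outage) is named in the text above.
COMPLETE_OUTAGE = 5


@dataclass(frozen=True)
class Failure:
    """Hypothetical record of an observed failure scenario."""
    failed_components: int  # how many components failed at the same time
    impact_level: int       # observed impact, 0 (least severe) to 5

    @property
    def is_single(self) -> bool:
        # A single failure: exactly one component failing at a given time.
        return self.failed_components == 1


def meets_goal(failure: Failure) -> bool:
    """Return True if the scenario is consistent with the stated goal.

    The goal only constrains single failures: they must not lead to a
    complete outage. Multiple failures are outside the goal, so they
    never violate it in this sketch.
    """
    if not 0 <= failure.impact_level <= COMPLETE_OUTAGE:
        raise ValueError("impact level must be between 0 and 5")
    if failure.is_single:
        return failure.impact_level < COMPLETE_OUTAGE
    return True
```

Under these assumptions, meets_goal(Failure(1, 5)) reports a violation, while meets_goal(Failure(2, 5)) does not, since coping with multiple failures is left to redundancy built into the surrounding system.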

Failures and their impact

Resilience (last edited 2014-05-21 09:42:42 by MichaelDürig)