Role of the Software Engineer

Software engineers should be concerned about dependability

Software should perform the required function
Software should perform the required function correctly!
Software and system are co-designed for fault-tolerance.
Software often needs to take action when failures occur.
Software defects have become common causal factors in failures.

Dependability Requirements

Simply requiring a system that does not fail is not sufficient!

We need to consider

the hardware has a finite life
the software is inevitably buggy
we cannot predict precisely when any given system will fail

Terminology

Dependable: capable of being depended on: RELIABLE

Reliable: that may be relied on: DEPENDABLE in achievement

Rely: 1. to depend confidently; 2. to put trust in

Dependability

The dependability of a system is the ability to avoid service failures that are more frequent and more severe than is acceptable.

A system is an entity that interacts with other entities. These other entities are the environment of the given system.

The definition of "system" is recursive.

Service Failures

Correct service is delivered when the service implements the system function.

Service failure is an event that occurs when the delivered service deviates from correct service.

There are many kinds of failures

Domain

Content failure: the system delivers the wrong result
Timing failure: the system delivers a late result
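
As a rough sketch of the two failure domains, the check below classifies a single delivered result. It assumes the correct result and a deadline are known in advance; the names `Failure` and `classify_service` are hypothetical and only illustrate the distinction.

```python
from enum import Enum


class Failure(Enum):
    CONTENT = "content failure"  # the wrong result was delivered
    TIMING = "timing failure"    # the result was delivered late


def classify_service(result, expected, elapsed_s, deadline_s):
    """Return the set of failure kinds exhibited by one delivered result."""
    failures = set()
    if result != expected:
        failures.add(Failure.CONTENT)
    if elapsed_s > deadline_s:
        failures.add(Failure.TIMING)
    return failures
```

An empty set means correct service was delivered; a result can also be both wrong and late, in which case both kinds are reported.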

Consistency

In the case of Byzantine failures, different users experience the failure in different ways.

Detectability

Failures may or may not be signaled by a detection mechanism.

When a failure is signaled, the users can take proactive action, e.g., perform a planned shutdown of a nuclear reactor or drive a car more carefully.

Severity

Severity is important information to guide engineering efforts.

Reliability

\(R(t)=\) probability that the system will operate correctly in a specified operating environment for a period of time of length \(t\).

\(t\) is the mission time
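
As a worked illustration only: under the common modeling assumption of a constant failure rate \(\lambda\) (an assumption introduced here, not part of the definition above), reliability decays exponentially with mission time:

\[
R(t) = e^{-\lambda t}, \qquad \mathrm{MTTF} = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}
\]

For a hypothetical \(\lambda = 10^{-4}\) failures per hour and a mission time of \(t = 1000\) hours, \(R(1000) = e^{-0.1} \approx 0.905\).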

Availability

\(A(t)=\) probability that the system will be operational at time \(t\).

In steady state, \(A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR})\), where MTTF is the mean time to failure and MTTR is the mean time to repair.

Reliability vs. availability: reliability is a stronger requirement than availability.

A system design can meet high availability requirements through

the ability to detect failed components
the ability to replace them

High reliability, by contrast, requires continuity of service without any failure.
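
The difference can be made concrete with a small calculation. This is a sketch under stated assumptions: the reliability function assumes a constant failure rate of 1/MTTF (an exponential model that the notes do not prescribe), and all figures are hypothetical.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def reliability(mission_hours: float, mttf_hours: float) -> float:
    """R(t) under an ASSUMED constant failure rate of 1/MTTF."""
    return math.exp(-mission_hours / mttf_hours)


# Hypothetical system: fails every 100 hours on average,
# but is detected and repaired within 36 seconds (0.01 hours).
a = availability(mttf_hours=100.0, mttr_hours=0.01)       # about 0.9999
r = reliability(mission_hours=1000.0, mttf_hours=100.0)   # about 4.5e-5

print(f"availability = {a:.6f}")
print(f"reliability over a 1000-hour mission = {r:.2e}")
```

Fast detection and replacement make this system almost always operational, yet it is very unlikely to complete a 1000-hour mission without a single failure.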

Safety

Absence of catastrophic consequences for the users and the environment

Tempered by reality, since there is always a small probability that catastrophic failures can happen

Other Attributes

Maintainability: The ability to undergo repairs and modifications.

Confidentiality: absence of unauthorized disclosure of information

Integrity: absence of unauthorized alterations of information

The Fundamental Principle of Dependability

Why do failures occur?

A failure occurs when: 1. the system enters an erroneous state; 2. the erroneous state manifests itself in the system's external state.

The cause of an error is called a fault.

A fault can be internal or external to the system.

Fault-error-failure Chain

Hazard: a hazard is a failure that exposes the system to potential accidents

Types of Faults

Degradation faults

A component used to work, but no longer does
The system itself has experienced a fault. If the design is done right, the failed component does not lead to a system failure.

Design faults

Bad design can affect both hardware and software.
Design faults can be missed by quality checks
The fault is present from the start: nothing changes during the lifetime of the equipment

Dealing with Faults

Four fundamental approaches: fault avoidance, fault elimination, fault tolerance, fault forecasting

Fault Avoidance

Best approach!

Building systems where faults are absent and cannot arise

Limited in capability and applicability

Fault Elimination

Find and eliminate faults from the system before use.

(Software testing...)

No guarantees that all faults are found!

Fault Tolerance

Faults will be present in the system during operational use.

The system can either mask the failure, or switch to degraded service

Error recovery

Forward Error Recovery: removes the failed component and brings the system into a new state
Backward Error Recovery: brings the system to a state that existed in the past
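
Backward error recovery can be sketched as checkpoint and rollback. The class below is a hypothetical example, not a prescribed design: it saves a copy of its state before each operation and restores that copy when an error (here, a negative balance) is detected.

```python
import copy


class CheckpointedAccount:
    """Hypothetical component whose state can be rolled back."""

    def __init__(self, balance: int):
        self.state = {"balance": balance}
        self._checkpoint = None

    def checkpoint(self) -> None:
        # Save a deep copy so later changes cannot corrupt the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint to roll back to")
        # Return to a state that existed in the past.
        self.state = copy.deepcopy(self._checkpoint)

    def withdraw(self, amount: int) -> bool:
        self.checkpoint()
        self.state["balance"] -= amount
        if self.state["balance"] < 0:  # erroneous state detected
            self.rollback()
            return False
        return True
```

Forward error recovery would instead move the component to a new, corrected state (for example, by capping the withdrawal) rather than restoring the old one.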

Software diversity

The most powerful forms of software fault tolerance: design diversity, data diversity, environment diversity
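
Design diversity is often illustrated with N-version programming: independently developed versions compute the same function and a voter picks the majority result. The sketch below is a toy under stated assumptions. The three versions and the seeded fault in `version_c` are hypothetical, and real versions would be written by separate teams.

```python
import math
from collections import Counter


def version_a(n: int) -> int:
    """Integer square root using the standard library."""
    return math.isqrt(n)


def version_b(n: int) -> int:
    """Integer square root using binary search."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def version_c(n: int) -> int:
    """Deliberately seeded design fault: off by one on perfect squares."""
    root = math.isqrt(n)
    return root - 1 if n > 0 and root * root == n else root


def majority_vote(versions, n: int) -> int:
    """Run every version and return the result a strict majority agrees on."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError(f"no majority among results {results}")
    return value
```

With `majority_vote([version_a, version_b, version_c], 49)`, the faulty version is outvoted and its fault is masked. The scheme only helps when the versions do not share the same design fault.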

Fault Forecasting

If we cannot deal with faults, we can

evaluate their expected frequency and severity
accept that they exist and will be neither removed nor tolerated

Software faults are quite unpredictable :-(

Other forms of Targeted Fault Tolerance

Safety Kernel

A component that extends the software architecture
Checks for system transitions that would lead to hazards
Stops the system before such a transition occurs

Benefits:

can tolerate the bulk of software failures
the relative simplicity of safety kernels makes verification more manageable
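
A minimal sketch of the idea, assuming a hypothetical heater controller whose only hazard is heating above a safe temperature. The class name, the commands, and the threshold are illustrative assumptions.

```python
class SafetyKernel:
    """Hypothetical safety kernel guarding a heater controller."""

    def __init__(self, max_safe_temp_c: float):
        self.max_safe_temp_c = max_safe_temp_c
        self.halted = False

    def permit(self, current_temp_c: float, command: str) -> bool:
        """Return True if the command may run; otherwise halt the system."""
        if self.halted:
            return False  # once stopped, nothing is permitted
        if command == "heat" and current_temp_c >= self.max_safe_temp_c:
            # This transition would lead to a hazard: stop before it happens.
            self.halted = True
            return False
        return True
```

The application software calls `permit` before every actuator command. Because the kernel contains only the hazard check, it is far smaller than the application it guards, which is what makes its verification manageable.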

Application Isolation

An application failure does not interfere with other applications of the system

Supported by a separation kernel

Watchdog Timers

In real-time systems, missed deadlines tend to be hazards

A watchdog timer is a special form of safety kernel for enforcing deadlines.
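
A software watchdog can be sketched with Python's `threading.Timer`. This is an illustration only: real-time systems usually rely on a hardware timer, and the class name `WatchdogTimer` and its methods are hypothetical.

```python
import threading


class WatchdogTimer:
    """Calls on_timeout if kick() is not called again within timeout_s seconds."""

    def __init__(self, timeout_s: float, on_timeout):
        self._timeout_s = timeout_s
        self._on_timeout = on_timeout
        self._timer = None
        self._lock = threading.Lock()

    def kick(self) -> None:
        """Restart the countdown; the monitored task calls this every cycle."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout_s, self._on_timeout)
            self._timer.daemon = True
            self._timer.start()

    def stop(self) -> None:
        """Disarm the watchdog."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None
```

The monitored task calls `kick()` each time it completes a cycle on schedule. If it hangs or overruns its deadline, the countdown expires and `on_timeout` runs, for example to put the system into a safe state.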

Exceptions

Exceptions are language constructs for asynchronous events.

Limited utility as a fault-tolerance mechanism!

Incomplete as an error-detection mechanism, and recovery is difficult

Runtime Checking

Software developers can implement their own error-detection in software

Woven through the rest of the software

Assertion

Pre/post conditions

Incomplete as an error-detection mechanism, with no built-in support for recovery
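
As a small sketch, pre- and postconditions can be written with Python's `assert` statement. The function `integer_sqrt` is a hypothetical example chosen because its postcondition is easy to state.

```python
def integer_sqrt(n: int) -> int:
    """Return the largest integer whose square does not exceed n."""
    # Precondition: the caller must supply a non-negative integer.
    assert isinstance(n, int) and n >= 0, "precondition: n must be a non-negative integer"

    root = 0
    while (root + 1) * (root + 1) <= n:
        root += 1

    # Postcondition: root is the floor of the square root of n.
    assert root * root <= n < (root + 1) * (root + 1), "postcondition violated"
    return root
```

The checks only detect the errors they were written to detect, and a violated assertion merely raises `AssertionError`; any recovery is left to the caller. Note also that Python skips `assert` statements when run with the `-O` option.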

Conclusion

Dependability is a key concern for software engineers

Dependability requirements must be established, in cooperation with system engineers

Rigorous terminology is needed

Design faults are a major source of failures

Fault avoidance and elimination do not fully prevent them

There are no rigorous dependability guarantees