Role of Software Engineer
Software engineers should be concerned about dependability
- Software should perform the required function
- Software should perform the required function correctly!
- Software and system are co-designed for fault-tolerance.
- Software often needs to take action when failures occur.
- Software defects have become common causal factors in failures.
Dependability Requirements
Simply requiring "a system that does not fail" is not sufficient!
We need to consider
- the hardware has a finite life
- the software is inevitably buggy
- we cannot predict precisely when any given system will fail
Terminology
Dependable: capable of being depended on: RELIABLE
Reliable: that may be relied on: DEPENDABLE in achievement
Rely: 1. to depend confidently; 2. to put trust in
Dependability
The dependability of a system is the ability to avoid service failures that are more frequent and more severe than is acceptable.
A system is an entity that interacts with other entities. These other entities are the environment of the given system.
The definition of "system" is recursive.
Service Failures
Correct service is delivered when the service implements the system function.
Service failure is an event that occurs when the delivered service deviates from correct service.
There are many kinds of failures
Domain
- Content failure: the system delivers the wrong result
- Timing failure: the system delivers a late result
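The two failure domains above can be illustrated with a small classifier. This is a minimal sketch, assuming a known expected result and a known deadline are available for comparison; the function name `classify_service` and its parameters are hypothetical and not part of the original material.

```python
def classify_service(expected, delivered, deadline_s, elapsed_s):
    """Return the failure domains observed for one delivered result.

    An empty list means correct service: right content, delivered on time.
    """
    failures = []
    if delivered != expected:
        failures.append("content")  # wrong result
    if elapsed_s > deadline_s:
        failures.append("timing")   # late result
    return failures


on_time_and_right = classify_service(42, 42, deadline_s=1.0, elapsed_s=0.3)
late_but_right = classify_service(42, 42, deadline_s=1.0, elapsed_s=2.5)
late_and_wrong = classify_service(42, 41, deadline_s=1.0, elapsed_s=2.5)
```

A single service delivery can fail in both domains at once, which is why the sketch returns a list rather than a single label.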
Consistency
In the case of Byzantine failures, different users experience the failure in different ways.
Detectability
Failures may or may not be signaled by a detection mechanism
When a failure is signaled, the users can take proactive actions, e.g., perform a planned shutdown of a nuclear reactor; drive a car more carefully.
Severity
Severity is important information to guide engineering efforts.
Reliability
\(R(t)=\) probability that the system will operate correctly in a specified operating environment for a period of time of length \(t\).
\(t\) is mission time
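To make \(R(t)\) concrete, the sketch below evaluates it under one common simplifying assumption that the material above does not state: a constant failure rate \(\lambda\), which gives \(R(t) = e^{-\lambda t}\). The function name and the example numbers are illustrative only.

```python
import math


def reliability(mission_time_h, failure_rate_per_h):
    """R(t) = exp(-lambda * t), valid only under a constant failure rate."""
    if mission_time_h < 0 or failure_rate_per_h < 0:
        raise ValueError("mission time and failure rate must be non-negative")
    return math.exp(-failure_rate_per_h * mission_time_h)


# Assumed numbers: 1 failure per 10,000 hours, 10-hour mission.
r_short = reliability(10.0, 1e-4)
# Same system, 1,000-hour mission: reliability drops as mission time grows.
r_long = reliability(1000.0, 1e-4)
```

Under this assumption, reliability always decreases as the mission time \(t\) grows, so a reliability figure is meaningless unless the mission time is given with it.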
Availability
\(A(t)=\) probability that the system will be operational at time \(t\).
Steady-state availability = Mean Time to Failure ÷ (Mean Time to Failure + Mean Time to Repair)
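The ratio above can be evaluated directly. The following is a minimal sketch; the function names, the example MTTF and MTTR values, and the 8760-hour year are assumptions chosen for illustration.

```python
def steady_state_availability(mttf_h, mttr_h):
    """Availability = MTTF / (MTTF + MTTR), with both times in hours."""
    if mttf_h <= 0 or mttr_h < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_h / (mttf_h + mttr_h)


def expected_downtime_h(availability, period_h=8760.0):
    """Expected hours of downtime over a period (default: one year)."""
    return (1.0 - availability) * period_h


# Assumed example: the system fails on average every 999 hours
# and takes on average 1 hour to repair.
a = steady_state_availability(999.0, 1.0)
downtime = expected_downtime_h(a)
```

With these assumed values the availability is 0.999, which corresponds to about 8.76 hours of downtime per year. The same availability could come from frequent short outages or rare long ones, which is one reason availability alone says little about reliability.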
Reliability VS Availability: Reliability is a stronger requirement than availability
System design can meet high availability requirements by
- ability to detect failed components
- ability to replace them
High reliability, in contrast, requires continuity of service without any failure.
Safety
Absence of catastrophic consequences on the users and the environment
Tempered by reality, since there is always a small probability that catastrophic failures can happen
Other Attributes
Maintainability: The ability to undergo repairs and modifications.
Confidentiality: absence of unauthorized disclosure of information
Integrity: absence of unauthorized alterations of information
The Fundamental Principle of Dependability
Why do failures occur?
A failure occurs when: 1. the system enters an erroneous state; 2. the erroneous state manifests itself in the system's external state.
The cause of an error is called a fault.
A fault can be internal or external to the system.
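The progression from fault to error to failure can be traced in a toy component. This is an illustrative sketch only: the `EventCounter` class and its deliberate wrap-around defect are hypothetical, chosen to show a fault that stays dormant until a specific input activates it.

```python
class EventCounter:
    """Counts events; contains a deliberate internal design fault."""

    def __init__(self):
        self._count = 0  # internal state

    def record(self):
        # FAULT: the counter silently wraps at 256 (a design defect).
        self._count = (self._count + 1) % 256

    def report(self):
        # External state: the value delivered at the service interface.
        return self._count


counter = EventCounter()
for _ in range(255):
    counter.record()
before_activation = counter.report()  # fault is dormant, service is correct

counter.record()              # fault activated -> ERROR: state is 0, not 256
delivered = counter.report()  # error reaches the interface -> FAILURE
```

The fault exists from the first line of the program, but no failure occurs until the 256th event turns it into an error and `report()` exposes that error to the user.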
Fault-error-failure Chain
Hazard: A hazard is a failure that exposes the system to potential accidents
Types of Faults
Degradation faults
- A component used to work, but no longer does
- The system itself has experienced a fault. If the system is designed properly, the failed component does not lead to a system failure.
Design faults
- Bad design can affect both hardware and software.
- Design faults can be missed by quality checks
- Unlike degradation faults, nothing changes during the lifetime of the equipment: the fault is present from the start
Dealing with Faults
Four fundamental approaches: fault avoidance, fault elimination, fault tolerance, fault forecasting
Fault Avoidance
Best approach!
Building systems where faults are absent and cannot arise
Limits on capability, applicability
Fault Elimination
Find and eliminate faults from the system before use.
(Software testing...)
No guarantees that all faults are found!
Fault Tolerance
Faults will be present in the system during operational use.
The system can either mask the failure, or switch to degraded service
Error recovery
- Forward Error Recovery: removes the failed component and brings the system into a new state
- Backward Error Recovery: brings the system to a state that existed in the past
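Backward error recovery is commonly realized with checkpoints and rollback. The sketch below is one minimal way to do it, not a definitive design: the `CheckpointedAccount` class, its state layout, and the negative-balance check used as the error detector are all hypothetical.

```python
import copy


class CheckpointedAccount:
    """Saves a checkpoint before each operation and rolls back on error."""

    def __init__(self, balance):
        self.state = {"balance": balance}
        self._checkpoint = None

    def checkpoint(self):
        # Save a copy of a state that is known to be good.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint to roll back to")
        # Restore the state that existed in the past.
        self.state = copy.deepcopy(self._checkpoint)

    def withdraw(self, amount):
        self.checkpoint()
        self.state["balance"] -= amount
        if self.state["balance"] < 0:  # error detected in the new state
            self.rollback()
            return False
        return True


account = CheckpointedAccount(100)
first_ok = account.withdraw(30)    # succeeds, balance becomes 70
second_ok = account.withdraw(500)  # erroneous state detected, rolled back
```

Forward error recovery would instead move on to a new corrected state, for example by reconfiguring around the failed component, rather than returning to a saved one.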
Software diversity
- The most powerful forms of software fault tolerance: design diversity, data diversity, environment diversity
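Design diversity is often illustrated with several independently written versions of a function whose outputs are compared by a voter. The sketch below assumes three hypothetical versions of a sum function, one of which has a deliberately seeded fault; the names and the majority-voting rule are illustrative choices.

```python
from collections import Counter


def sum_loop(n):
    return sum(range(1, n + 1))


def sum_formula(n):
    return n * (n + 1) // 2


def sum_faulty(n):
    # Deliberately seeded design fault: off-by-one upper bound.
    return sum(range(1, n))


def vote(versions, n):
    """Run every version and return the strict-majority result."""
    results = [version(n) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among versions")
    return value


masked = vote([sum_loop, sum_formula, sum_faulty], 10)
```

The voter masks the faulty version only because the other two agree. If diverse versions share the same design fault, they can agree on a wrong answer, and voting does not help.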
Fault Forecasting
If we cannot deal with faults, we can
- evaluate their expected frequency and severity
- accept their existence and that they will not be removed/tolerated
Software faults are quite unpredictable :-(
Other forms of Targeted Fault Tolerance
Safety Kernel
- Components that extend the software architecture
- They check for system transitions that lead to hazards
- They stop the system before the transition
Benefits:
- can tolerate the bulk of software failures
- the relative simplicity of safety kernels makes verification more manageable
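A safety kernel of this kind can be sketched as a small component that vets every proposed state before the application is allowed to enter it. The example below is a hypothetical illustration: the `SafetyKernel` and `Oven` classes, the state fields, and the hazard predicate are all assumptions.

```python
class SafetyKernel:
    """Vets proposed states against a set of named hazard predicates."""

    def __init__(self, hazards):
        # hazards: dict mapping a hazard name to a predicate over a state dict
        self._hazards = hazards

    def violations(self, proposed_state):
        return [name for name, is_hazardous in self._hazards.items()
                if is_hazardous(proposed_state)]


class Oven:
    """Application code; every transition goes through the kernel."""

    def __init__(self, kernel):
        self._kernel = kernel
        self.state = {"door_open": False, "heating": False}

    def request(self, **changes):
        proposed = {**self.state, **changes}
        if self._kernel.violations(proposed):
            return False  # stop before the hazardous transition
        self.state = proposed
        return True


kernel = SafetyKernel({
    "heating_with_door_open": lambda s: s["door_open"] and s["heating"],
})
oven = Oven(kernel)
started = oven.request(heating=True)    # safe transition, permitted
blocked = oven.request(door_open=True)  # would create a hazard, refused
```

The kernel knows nothing about how the application decides what to do; it only knows which states are hazardous. That narrow responsibility is what keeps it small enough to verify.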
Application Isolation
An application failure does not interfere with other applications of the system
Supported by a separation kernel
Watchdog Timers
In real-time systems, missed deadlines tend to be hazards
A watchdog timer is a special form of safety kernel for enforcing deadlines.
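The idea can be sketched in software as a timer that the monitored task must reset ("kick") before each deadline. Real watchdogs are usually hardware timers that reset the processor; the polled `Watchdog` class below, with its injectable clock, is a simplified, hypothetical stand-in for illustration.

```python
import time


class Watchdog:
    """Expires when the monitored task fails to kick it within the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so the behavior can be tested
        self._last_kick = clock()

    def kick(self):
        # Called by the monitored task each time it meets its deadline.
        self._last_kick = self._clock()

    def expired(self):
        # Polled by a supervisor, which moves the system to a safe state.
        return self._clock() - self._last_kick > self._timeout_s


# Simulated time so the example is deterministic.
now = [0.0]
watchdog = Watchdog(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
watchdog.kick()                      # task reports in on time
now[0] = 1.4
healthy = not watchdog.expired()     # 0.9 s since last kick: still fine
now[0] = 1.6
missed_deadline = watchdog.expired() # 1.1 s since last kick: expired
```

A supervisor that observes `expired()` returning true would then take the predefined safe action, such as resetting the task or shutting the system down.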
Exceptions
Exceptions are language constructs for asynchronous events.
Limited utility as a fault-tolerance mechanism!
Incomplete for detecting errors; recovery is difficult
Runtime Checking
Software developers can implement their own error-detection in software
Woven through the rest of the software
Assertion
Pre/post conditions
Incomplete to detect errors, unsupported to recover
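Preconditions and postconditions can be written directly into a function as runtime checks. The sketch below uses a hypothetical `integer_sqrt` function; the choice of function and the specific checks are assumptions made for illustration.

```python
def integer_sqrt(n):
    """Return the largest integer r with r * r <= n."""
    # Precondition: reject inputs the function is not defined for.
    if n < 0:
        raise ValueError("n must be non-negative")

    # Integer Newton iteration, starting from an upper bound.
    r = n
    while r * r > n:
        r = (r + n // r) // 2

    # Postcondition: detect an erroneous result before it is delivered.
    assert r * r <= n < (r + 1) * (r + 1), "postcondition violated"
    return r


root = integer_sqrt(10)
```

The checks detect only the errors the developer thought to check for, and a failed assertion says nothing about how to recover, which matches the limitations noted above. In Python, `assert` statements are also removed when the interpreter runs with `-O`, so the precondition here uses an explicit `raise` instead.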
Conclusion
Dependability is a key concern for software engineers
Dependability requirements must be established, in cooperation with system engineers
Rigorous terminology
Design faults are a major source of failures
Fault avoidance and elimination do not fully prevent them
There are no rigorous dependability guarantees