The Safety Analysis Toolbox for Mission-Critical Systems

Anunay Krishnamurthy
Jan 12

Updated: Feb 10

During the development of mission-critical systems, safety risks emerge at different stages of the system lifecycle. No single safety analysis can address all of these risks. Instead, an effective safety strategy relies on a combination of safety analyses.

The selection of appropriate analyses depends on the stage of the project, the information available at that point, and the types of issues or failure mechanisms that need to be identified.

Some of the most common analyses used in industry are shown in the diagram below.

Figure: Different types of Safety Analyses

FMEA (Failure Mode and Effects Analysis)

FMEA is an inductive, or bottom-up, safety analysis. It is called “bottom-up” because the analysis starts with the design of an existing system or process and examines what can go wrong at the level of individual components, functions, or subprocesses, and how those failures could lead to hazards at the system level.

This approach makes FMEA highly practical in product development. At any stage of the lifecycle, teams can systematically identify potential failure modes and define mitigations before those failures occur in the field.

For example, during the concept phase, when specific components may not yet be selected, a Functional FMEA can be performed to identify what could go wrong at the functional level and how those risks can be mitigated. As the design matures, Hardware FMEA and Software FMEA can be used to analyze failure modes specific to implemented hardware and software elements.
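
As a minimal sketch of how a Functional FMEA worksheet might be captured, the snippet below ranks a few failure modes by Risk Priority Number (RPN = severity × occurrence × detection). The RPN heuristic, the 1 to 10 rating scales, the function names, and every rating are illustrative assumptions; none of them come from this article or from a specific standard.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    function: str    # function under analysis (hypothetical)
    mode: str        # how the function can fail
    effect: str      # system-level effect of the failure
    severity: int    # 1 (negligible) .. 10 (catastrophic), assumed scale
    occurrence: int  # 1 (rare) .. 10 (frequent), assumed scale
    detection: int   # 1 (always detected) .. 10 (undetectable), assumed scale

    @property
    def rpn(self) -> int:
        # Risk Priority Number: a common, but not universal, ranking heuristic.
        return self.severity * self.occurrence * self.detection


# Hypothetical worksheet rows for a robot joint; all ratings are made up.
worksheet = [
    FailureMode("Measure joint torque", "Reading stuck at zero",
                "Excessive torque applied", 9, 3, 4),
    FailureMode("Measure joint torque", "Reading noisy",
                "Jerky motion", 4, 5, 3),
    FailureMode("Limit motor current", "Limit not enforced",
                "Motor overheats", 7, 2, 6),
]

# Highest RPN first, so mitigation effort goes to the riskiest items.
ranked = sorted(worksheet, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.function}: {fm.mode} -> {fm.effect}")
```

The same structure can be reused for Hardware or Software FMEA by replacing the function column with a component or software unit. Teams that follow a different prioritization scheme would swap out the `rpn` property.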

Recommended Use:

While FMEAs can be performed throughout the product lifecycle, they are most effective when the system has reached a sufficient level of maturity. For instance, Hardware FMEA is typically most valuable once the hardware design is largely complete, and Software FMEA once the software architecture and detailed design are defined.

Limitations:

FMEA is not well suited for analyzing multi-point or dependent failures, as it focuses on single-component or single-process failure modes. For scenarios involving combinations of failures or complex fault interactions, Fault Tree Analysis (FTA) is generally a more appropriate method.

More information on the FMEA process, its challenges, and examples is available in this article.

FMEDA (Failure Modes, Effects, and Diagnostic Analysis)

FMEDA is also an inductive, or bottom-up, safety analysis. Like FMEA, it starts from the system architecture and examines individual elements and their potential failure modes. In addition, FMEDA evaluates how effectively diagnostics detect or control those failures.

FMEDA is a powerful analysis for quantitatively assessing the robustness of a system design. By combining failure mode identification with diagnostic coverage and failure rate data, it provides insight into how well the design prevents, detects, or mitigates dangerous failures.

Recommended Use:

FMEDA is most commonly applied in hardware-based development, particularly in mechanical, electrical, and electronic (E/E) systems. It is recommended that FMEDA be performed once the system architecture and hardware design have reached a high level of maturity.

In E/E systems, FMEDA results are often used to calculate key hardware safety metrics such as the Single Point Fault Metric (SPFM), the Latent Fault Metric (LFM), and the Probabilistic Metric for random Hardware Failures (PMHF). These metrics provide a quantitative measure of how well the hardware design tolerates random hardware failures, and they are frequently required for compliance with functional safety standards.
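
To make the metrics concrete, here is a simplified sketch of how SPFM and LFM could be derived from FMEDA rows. It uses the usual ISO 26262-style definitions: SPFM = 1 − Σ(single-point + residual faults) / Σλ, and LFM = 1 − Σ(latent multi-point faults) / Σ(λ − single-point − residual faults). The element names, FIT values, safe fractions, and diagnostic coverage figures are all hypothetical, and the model assumes that every dangerous fault not caught by a diagnostic is a residual fault.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    fit: float            # total failure rate in FIT (failures per 1e9 hours)
    safe_fraction: float  # share of failures that cannot violate the safety goal
    dc_residual: float    # diagnostic coverage against direct violation (0..1)
    dc_latent: float      # diagnostic coverage against latent faults (0..1)


def hardware_metrics(elements):
    """Return (SPFM, LFM) for a list of safety-related elements."""
    total = sum(e.fit for e in elements)
    residual = 0.0  # single-point and residual fault rate
    latent = 0.0    # latent multi-point fault rate
    for e in elements:
        dangerous = e.fit * (1.0 - e.safe_fraction)
        residual += dangerous * (1.0 - e.dc_residual)
        multi_point = dangerous * e.dc_residual  # caught, so not a direct violation
        latent += multi_point * (1.0 - e.dc_latent)
    spfm = 1.0 - residual / total
    lfm = 1.0 - latent / (total - residual)
    return spfm, lfm


# Hypothetical FMEDA rows; every number is an assumption for illustration.
design = [
    Element("Microcontroller", 100.0, 0.5, 0.99, 0.9),
    Element("Torque sensor", 50.0, 0.2, 0.90, 0.6),
    Element("Motor driver", 50.0, 0.4, 0.90, 0.9),
]

spfm, lfm = hardware_metrics(design)
print(f"SPFM = {spfm:.2%}, LFM = {lfm:.2%}")
```

A real FMEDA works at the level of individual failure modes rather than whole elements and takes its failure rates from field data or a reliability database, so this should be read as an outline of the arithmetic only.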

Limitations:

FMEDA is not well suited for analyzing multi-point or dependent failures, as it primarily focuses on individual element failure modes. In addition, estimating accurate quantitative failure rates can be complex and data-intensive, often requiring assumptions, field data, or standardized reliability databases.

FTA (Fault Tree Analysis) / ETA (Event Tree Analysis)

FTA is a deductive, or top-down, safety analysis. It is called “top-down” because the analysis starts with a known problem or hazard and systematically works downward to identify what in the system could cause that hazard. This is done by repeatedly asking the question, “What can cause this?” and decomposing the hazard into progressively smaller contributing events or failures. ETA is its usual companion: it starts from an initiating event and works forward through the possible outcomes, depending on whether each safety barrier succeeds or fails. Used together, the two are particularly effective for breaking down complex hazards into understandable and traceable causes and consequences.

For example, when developing a humanoid robot, a top-level hazard might be a collision between the robot and a human. From there, the analysis can explore potential causes such as perception errors, incorrect processing or decision-making, or failures in the actuation system. The analysis can then be taken further down, for instance into the actuation chain, by asking what could cause excessive or unintended torque at a motor. Possible causes might include incorrect torque commands, overvoltage conditions, or a malfunctioning motor driver.
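
The torque branch of that example can be sketched as a small fault tree. The snippet below models AND and OR gates, computes the top-event probability, and lists the cut sets. The split of the overvoltage branch into two events, and every probability, are assumptions added for illustration; the calculation also assumes independent basic events that each appear only once in the tree.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Basic:
    name: str
    p: float  # assumed probability of the event over the mission time


@dataclass
class Gate:
    name: str
    kind: str       # "AND" or "OR"
    children: list  # Basic or Gate nodes

    def __post_init__(self):
        if self.kind not in ("AND", "OR"):
            raise ValueError(f"unsupported gate type: {self.kind}")


def probability(node) -> float:
    """Probability of a node, assuming independent, non-repeated basic events."""
    if isinstance(node, Basic):
        return node.p
    child_p = [probability(c) for c in node.children]
    if node.kind == "AND":
        return prod(child_p)
    return 1.0 - prod(1.0 - p for p in child_p)  # OR gate


def cut_sets(node) -> list:
    """Combinations of basic events that are sufficient to cause the node."""
    if isinstance(node, Basic):
        return [frozenset([node.name])]
    child_sets = [cut_sets(c) for c in node.children]
    if node.kind == "OR":
        return [cs for sets in child_sets for cs in sets]
    combined = [frozenset()]
    for sets in child_sets:  # AND: take one cut set from every child
        combined = [a | b for a in combined for b in sets]
    return combined


# Hypothetical tree for "excessive or unintended torque at a motor".
top = Gate("Unintended motor torque", "OR", [
    Basic("Incorrect torque command", 1e-4),
    Gate("Overvoltage reaches motor", "AND", [
        Basic("Supply regulator fault", 1e-3),
        Basic("Overvoltage protection fails", 1e-2),
    ]),
    Basic("Motor driver malfunction", 5e-5),
])

print(f"P(top event) = {probability(top):.3e}")
for cs in cut_sets(top):
    print(sorted(cs))
```

Cut sets with a single event point to single-point failures, while the two-event cut set shows a multi-point failure that only occurs when both the regulator and its protection fail.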

Recommended Use:

FTA and ETA are best applied as early as possible in the product development lifecycle. They help teams identify the causes of hazards and design safety mechanisms to prevent or control them before detailed implementation begins. These methods are particularly well suited for analyzing multi-point and dependent failures, as well as understanding combinations of events that can lead to hazardous outcomes.

Limitations:

FTAs and ETAs are not effective at identifying new or previously unrecognized hazards. The analysis is limited to exploring the causes of hazards that are already known. They do not address whether a component or function could introduce a new hazard, which is something better handled by hazard identification methods such as HARA or inductive analyses like FMEA.

HARA (Hazard Analysis and Risk Assessment)

HARA is a high-level safety analysis typically performed at the beginning of a project. Its purpose is to systematically identify potential hazards in a system and assess the risk associated with each hazard. This early analysis helps establish a safety baseline for the entire development effort.

Risk is generally defined as a combination of the severity of harm that could result from a hazard and the likelihood that the hazard will lead to that harm. How risk is expressed varies by industry and applicable safety standards. For example, in automotive development under ISO 26262, risk is expressed using Automotive Safety Integrity Levels (ASIL), ranging from QM (no safety integrity requirement) through ASIL A to ASIL D, with ASIL D representing the highest safety criticality. In aerospace software developed under DO-178C, criticality is expressed using Design Assurance Levels (DAL), ranging from DAL E (lowest criticality) to DAL A (highest criticality). Other industries use similar classification schemes tailored to their regulatory and operational contexts. The levels defined by different standards can be mapped to one another approximately. This article discusses that mapping in more detail.
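
As an illustration of how a risk class can be derived, the sketch below determines an ASIL from the three ISO 26262 parameters: severity (S0 to S3), exposure (E0 to E4), and controllability (C0 to C3). It relies on a widely used shortcut in which the sum of the three class numbers selects the level; the shortcut is understood to reproduce the determination table in ISO 26262-3, but the standard itself remains the authority. The function name and the example ratings are assumptions.

```python
def determine_asil(severity: int, exposure: int, controllability: int) -> str:
    """Map S (0..3), E (0..4), C (0..3) class numbers to an ASIL string."""
    if not (0 <= severity <= 3 and 0 <= exposure <= 4 and 0 <= controllability <= 3):
        raise ValueError("expected S in 0..3, E in 0..4, C in 0..3")
    # Class 0 for any parameter means no safety integrity requirement.
    if 0 in (severity, exposure, controllability):
        return "QM"
    # Shortcut: the sum of the class numbers selects the level.
    total = severity + exposure + controllability
    return {10: "ASIL D", 9: "ASIL C", 8: "ASIL B", 7: "ASIL A"}.get(total, "QM")


# Hypothetical hazardous events with made-up ratings.
examples = {
    "Unintended acceleration on a highway": (3, 4, 3),
    "Loss of rear-view camera while reversing": (2, 3, 2),
    "Cabin light flickers": (0, 4, 1),
}
for event, (s, e, c) in examples.items():
    print(f"{event}: S{s} E{e} C{c} -> {determine_asil(s, e, c)}")
```

Other standards derive their levels differently, so this logic applies to the ISO 26262 case only.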

Recommended Use:

HARA should be performed at the very beginning of any safety-critical system development. It provides an overall view of the system’s risk profile and drives key downstream decisions, including safety goal definition, component selection, architectural design choices, and project planning and management activities.

Limitations:

HARA focuses on identifying hazards and classifying their associated risks. It does not analyze detailed failure mechanisms or propose specific technical solutions. Those aspects are addressed later through analyses such as FMEA, FMEDA, and FTA.

HAZOP (Hazard and Operability Study)

HAZOP is a high-level safety analysis typically performed at the beginning of a project to identify potential problems or hazards in a system. It can be conducted as part of a broader HARA or as an independent analysis.

HAZOP uses a systematic approach that combines predefined guide words (malfunction keywords such as “no”, “more”, “less”, or “late”) with system functions to uncover deviations from intended operation, potential malfunctions, and associated hazards. This structured method helps ensure that even subtle or unexpected failure scenarios are considered.
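
The combination step lends itself to a short sketch. The snippet below crosses a set of guide words with a list of functions to produce the rows of a HAZOP worksheet, leaving the hazard column empty for the review team to fill in. The guide words shown are a typical subset and the two functions are hypothetical; a real study would use the guide-word list agreed for the project.

```python
from itertools import product

# Illustrative guide words and functions; both lists are assumptions.
GUIDE_WORDS = ["no", "more", "less", "reverse", "early", "late"]
FUNCTIONS = ["apply joint torque", "report obstacle distance"]


def hazop_rows(functions, guide_words):
    """Build one worksheet row per (function, guide word) combination."""
    return [
        {
            "function": fn,
            "guide_word": gw,
            "deviation": f"{gw.upper()} {fn}",
            "hazard": None,  # to be assessed by the HAZOP team
        }
        for fn, gw in product(functions, guide_words)
    ]


rows = hazop_rows(FUNCTIONS, GUIDE_WORDS)
for row in rows:
    print(row["deviation"])
```

Not every combination is meaningful. Part of the team's job is to discard the ones that make no physical sense and to record a hazard, or "none", for the rest.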

Recommended Use:

For all safety-critical systems, it is recommended to perform a HAZOP early in the project, ideally as part of the HARA. This ensures that hazards are identified upfront, allowing safety requirements and mitigation strategies to be incorporated into the system design from the start.

Limitations:

HAZOP focuses exclusively on identifying hazards. It does not analyze how to mitigate those hazards or evaluate solutions. These tasks are addressed by downstream analyses such as FMEA, FMEDA, or FTA.

STPA (System-Theoretic Process Analysis)

STPA is a high-level safety analysis used to identify hazards, and the conditions that lead to them, in complex systems. The methodology involves:
- Identifying the system’s control structure.
- Identifying unsafe control actions across that structure (controllers, control actions, feedback loops, and controlled processes).
- Determining causal factors and creating scenarios that could trigger hazards.

STPA is widely used in the aerospace and automotive industries. It is particularly powerful for identifying not only hazards but also the triggering conditions that could lead to those hazards, making it well suited to modern, software-intensive, or autonomous systems.
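
The step of identifying unsafe control actions can be outlined in code. The sketch below takes the control actions from a control structure and pairs each with the four ways a control action is commonly considered unsafe in STPA guidance. The controllers, actions, and processes are hypothetical, and the four category labels are paraphrased assumptions rather than wording from this article.

```python
from dataclasses import dataclass
from itertools import product

# Commonly used categories of unsafe control action (paraphrased).
UCA_TYPES = [
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
]


@dataclass(frozen=True)
class ControlAction:
    controller: str  # element issuing the command
    action: str      # the command itself
    process: str     # controlled process receiving the command


def uca_candidates(actions):
    """List every (control action, category) pair for the team to review."""
    return [
        f"{a.controller} -> {a.process}: '{a.action}' {kind}"
        for a, kind in product(actions, UCA_TYPES)
    ]


# Hypothetical control structure for a humanoid robot arm.
structure = [
    ControlAction("Motion planner", "torque command", "Joint motor driver"),
    ControlAction("Safety monitor", "emergency stop", "Joint motor driver"),
]

candidates = uca_candidates(structure)
for line in candidates:
    print(line)
```

Each candidate still needs a context, such as "while a human is within reach", before it becomes an unsafe control action. Supplying that context, and the causal scenarios behind it, is the analytical work that the tool cannot do.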

Recommended Use:

In the automotive industry, STPA can support SOTIF (Safety of the Intended Functionality) assessments. For example, hazards identified during a HAZOP can be further analyzed with STPA to determine triggering conditions.

STPA can also be applied to AI/ML-based systems (e.g., under ISO/PAS 8800) to identify conditions in algorithms or models that could result in unsafe behavior.

Limitations:

STPA is a generic hazard identification method. It does not:
- Quantify risk.
- Determine the severity or probability of hazards.
- Evaluate the adequacy of mitigation measures in quantitative terms.

As a result, STPA should be complemented with other analyses (e.g., FMEA, FMEDA, FTA) to fully assess risk and verify safety solutions.

Others (e.g., Ishikawa, Human Factors Analyses)

Ishikawa diagrams, also known as fishbone diagrams, are a visual tool for root cause analysis. They help teams systematically explore potential causes of a problem or hazard across categories such as people, processes, equipment, materials, and environment. In principle, an Ishikawa diagram is similar to an FTA or ETA, but it is presented in a different visual format.

Human Factors Analyses focus on the interaction between humans and systems to identify potential errors, misuse, or unsafe behaviors. These analyses evaluate ergonomics, interface design, cognitive load, biases, and operational procedures, ensuring that systems are designed to reduce human error and improve overall safety.

Conclusions:

No single safety analysis can cover all aspects of a complex system. A combination of inductive and deductive analyses, applied at the right stage of development, provides a comprehensive understanding of hazards, risks, and mitigations. By integrating high-level analyses (like HARA, HAZOP, and STPA) with detailed analyses (like FMEA, FMEDA, and FTA), teams can design safer, more reliable systems and make informed decisions throughout the product lifecycle.
