Hardware random-failure metrics for ASIL D — PMHF, SPFM, LFM
ISO 26262 Part 5 imposes three quantitative hardware-level metrics that any ASIL D component must clear: SPFM (Single Point Fault Metric), LFM (Latent Fault Metric), and PMHF (Probabilistic Metric for Hardware Failures). Each measures a different way the architecture can fail, and passing one says nothing about whether the others pass. The most common audit finding is a team that computed PMHF, declared the design ASIL D, and never noticed an SPFM at 96% on a 99% target. This guide explains the underlying fault classification, defines each metric formally, walks an AEB ECU through all three, and names the operational mistakes.
What's actually being measured — the hardware fault classification
Before any of the three metrics make sense, the underlying classification of hardware faults has to be in place. ISO 26262 Part 5 §B.1 splits every hardware fault — every transistor failure, every solder-joint break, every register stuck-bit — into one of four buckets, each contributing differently to the safety case. The classification is per-failure-mode, not per-component: a single resistor can have an "open" failure mode that is safe and a "short" failure mode that is dangerous, contributing to different buckets.
| Bucket | Symbol | Definition | Contributes to |
|---|---|---|---|
| Safe fault | λS | Failure mode that cannot violate the safety goal in any operational context. E.g. a status LED dies — informational only. | Nothing harmful; ignored by all three metrics. |
| Single-point fault | λSPF | Single hardware failure that directly causes safety goal violation, with no diagnostic mechanism in place to detect it. | SPFM (must be small) and PMHF (large contribution). |
| Residual fault | λRF | Single hardware failure that would cause safety goal violation, but is partially covered by a diagnostic — the residual is the uncovered fraction. | SPFM (must be small) and PMHF (residual contribution). |
| Multi-point / latent fault | λMPF | Failure that, on its own, doesn't violate the safety goal, but combined with a second independent failure does. "Latent" is the subset that isn't detected before the second failure occurs. | LFM (must be small). |
The total per-component failure rate is decomposed as λ = λS + λSPF + λRF + λMPF. The bucket assignment is the work of the safety analyst — typically driven by an FMEDA (Failure Mode, Effects and Diagnostic Analysis), which is FMEA augmented with diagnostic-coverage scoring. ISO 26262 Part 5 Annex B walks the FMEDA template; tools like Reliasoft, Ansys medini, and FTA Studio's FMEDA mode produce the per-component breakdown automatically once failure modes and diagnostic coverage percentages are entered.
Three things to internalise about this classification before getting to the metrics:
- "Safe fault" is a strong claim. A failure mode is safe only if it provably cannot violate the safety goal; "unlikely to" or "only matters in edge cases" does not qualify. The fail-safe direction of fault propagation has to be argued. Safe faults are excluded from the SPFM denominator, so over-claiming safe-fault status is the most common way to fudge the metric.
- Diagnostic coverage is a per-fault-mode percentage. A diagnostic that covers 90% of stuck-at faults but 0% of bridging faults gets credit for the stuck-at modes only; the bridging modes count as fully residual. Lumping the two together at "90% DC" is a frequent FMEDA error.
- Multi-point latent faults are easy to ignore and hard to find. They look harmless in isolation. The classical example is a stuck-OK output of a watchdog timer — the watchdog stops doing anything useful, but until the watched-for fault occurs, nothing visible breaks. LFM is the metric that forces this category to be accounted for.
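The per-fault-mode point about diagnostic coverage can be made concrete with a small sketch. The component and its numbers below are hypothetical (a 10 FIT part split 6/4 between stuck-at and bridging modes); only the 90% and 0% coverage figures echo the example in the text.

```python
# Hypothetical component: 10 FIT total, split across two failure modes.
# The 6/4 split is an illustrative assumption, not taken from any datasheet.
failure_modes = [
    # (name, rate in FIT, diagnostic coverage for that mode)
    ("stuck-at", 6.0, 0.90),
    ("bridging", 4.0, 0.00),
]


def residual_per_mode(modes):
    """Residual rate when each failure mode carries its own DC."""
    return sum(rate * (1.0 - dc) for _, rate, dc in modes)


def residual_lumped(modes, lumped_dc):
    """Residual rate when one DC figure is (wrongly) applied to every mode."""
    return sum(rate for _, rate, _ in modes) * (1.0 - lumped_dc)


per_mode = residual_per_mode(failure_modes)    # 0.6 FIT + 4.0 FIT
lumped = residual_lumped(failure_modes, 0.90)  # 10 FIT * 0.1

print(f"per-mode residual: {per_mode:.2f} FIT")
print(f"lumped residual:   {lumped:.2f} FIT")
```

Under these assumed numbers the lumped figure understates the residual rate by more than a factor of four, which is the kind of gap that later shows up as an optimistic SPFM.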
Step 1. The fault classification, applied: choosing buckets
The bucket assignment for any given component is mechanical once you have the FMEA and the diagnostic mechanism descriptions. Take a representative AEB ECU's main MCU as the running example. The MCU has a per-hour failure rate of λ = 100 FIT (= 10⁻⁷/h), broken down by failure mode and the diagnostic coverage in place:
| Failure mode | λ contribution | Effect | Diagnostic | DC% | Bucket |
|---|---|---|---|---|---|
| Core stuck-at | 40 FIT | Compute path silently produces wrong output | Lockstep dual-core comparison | 99% | 0.4 FIT residual + 39.6 FIT covered (safe-via-mechanism) |
| Core transient (SEU) | 30 FIT | Transient wrong output, recovers next cycle | ECC + lockstep | 99% | 0.3 FIT residual + 29.7 FIT covered |
| RAM bit-flip | 20 FIT (19 single-bit + 1 double-bit) | Wrong data, propagates downstream | ECC (single-bit correct, double-bit detect) | 99% (single), 0% (double) | 1.19 FIT residual (0.19 single-bit + 1.0 double-bit) + 18.81 FIT covered |
| Lockstep monitor stuck-OK | 5 FIT | Monitor reports "compare OK" even when cores diverge — a multi-point latent | Periodic self-test of the comparator at startup | 50% (only catches half) | 2.5 FIT latent + 2.5 FIT covered |
| Power-supply brown-out | 3 FIT | MCU resets cleanly — fail-safe direction | — | — | 3 FIT safe |
| Clock source drift | 2 FIT | Timing errors, wrong outputs | External watchdog with independent clock | 95% | 0.1 FIT residual + 1.9 FIT covered |
Tally per bucket — every failure mode's covered portion lands in the multi-point-detected bucket because "covered" means a diagnostic redirects to safe state, not that the fault itself is intrinsically harmless:
λtotal = 100 FIT
λS (safe) = 3 FIT (power-supply brown-out, fails safe-direction)
λSPF = 0 FIT (no fault entirely without a mechanism)
λRF (residual) = 1.99 FIT (uncovered fractions of partially-covered SPFs)
λMPF,detected = 92.51 FIT (covered by a diagnostic that triggers safe state)
λMPF,latent = 2.5 FIT (lockstep-monitor stuck-OK, half not caught by self-test)
The residual breaks down as 0.4 (core stuck-at) + 0.3 (core SEU) + 0.19 (RAM single-bit) + 1.0 (RAM double-bit, no DC) + 0.1 (clock drift) = 1.99 FIT. The dominant contributor, the RAM double-bit residual at 1 FIT, accounts for about half of the total residual fault rate. This is the kind of detail SPFM in Step 2 will surface; PMHF on its own would lose it in the noise.
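The tally can be reproduced mechanically. The sketch below is one possible encoding, not the output of any FMEDA tool: the `FailureMode` record and the `kind` labels are names invented here, and RAM is entered as a 19 FIT single-bit mode plus a 1 FIT double-bit mode, the split implied by the residual breakdown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    rate_fit: float  # failure rate of this mode, in FIT
    dc: float        # diagnostic coverage for this mode, 0.0 to 1.0
    kind: str        # "safe", "direct" (violates the goal alone) or "multi_point"


# Values transcribed from the Step 1 table.
MCU_MODES = [
    FailureMode("core stuck-at", 40.0, 0.99, "direct"),
    FailureMode("core transient (SEU)", 30.0, 0.99, "direct"),
    FailureMode("RAM single-bit", 19.0, 0.99, "direct"),
    FailureMode("RAM double-bit", 1.0, 0.00, "direct"),
    FailureMode("lockstep monitor stuck-OK", 5.0, 0.50, "multi_point"),
    FailureMode("power-supply brown-out", 3.0, 0.00, "safe"),
    FailureMode("clock source drift", 2.0, 0.95, "direct"),
]


def tally_buckets(modes):
    """Sum failure rates per bucket, following the article's booking rules."""
    buckets = {"safe": 0.0, "spf": 0.0, "rf": 0.0,
               "mpf_detected": 0.0, "mpf_latent": 0.0}
    for m in modes:
        if m.kind == "safe":
            buckets["safe"] += m.rate_fit
            continue
        covered = m.rate_fit * m.dc
        uncovered = m.rate_fit - covered
        # Covered portions always land in the multi-point-detected bucket.
        buckets["mpf_detected"] += covered
        if m.kind == "direct":
            # As in the article's tally, uncovered portions are booked as
            # residual (not single-point), so "spf" stays at zero here.
            buckets["rf"] += uncovered
        else:
            buckets["mpf_latent"] += uncovered
    return buckets


buckets = tally_buckets(MCU_MODES)
for name, rate in buckets.items():
    print(f"{name:13s} {rate:6.2f} FIT")
print(f"{'total':13s} {sum(buckets.values()):6.2f} FIT")
```

Keeping the buckets in one dictionary makes it easy to check that they still sum to the component's total failure rate after any change to the table.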
Step 2. SPFM: Single Point Fault Metric
SPFM measures the fraction of fault-rate that is not a single-point or residual fault, computed across the dangerous (non-safe) population. Equivalently: the fraction of the dangerous fault rate that has a working safety mechanism between it and the safety goal. Formally:
SPFM = 1 − (Σ λSPF + Σ λRF) / (Σ λtotal − Σ λS)
The denominator excludes safe faults because they cannot violate the safety goal. The subtracted fraction is the share of the dangerous rate that can violate the safety goal directly, without a second failure being involved. ISO 26262 Part 5 Table 4 gives the per-ASIL targets:
| ASIL | SPFM target |
|---|---|
| D | ≥ 99% |
| C | ≥ 97% |
| B | ≥ 90% |
| A | No quantitative requirement |
Plugging the AEB MCU bucket totals from Step 1 in:
SPFM = 1 − (0 + 1.99) / (100 − 3)
= 1 − 1.99 / 97
= 1 − 0.0205
= 0.9795 = 97.95%
This MCU passes ASIL C (≥ 97%) but fails ASIL D (≥ 99%) by about a percentage point. The conversation with the design team is exactly the one the bucket breakdown teed up: half the residual budget is consumed by RAM double-bit errors that have no effective safety mechanism. Either the ECC needs upgrading from plain SECDED (single-error-correct, double-error-detect, credited here with 99% on single-bit errors and 0% on double-bit errors) to SECDED with DUE-trap-to-safe (which moves the 1 FIT into λMPF,detected), or the safety goal has to be met with a different SPFM-friendly mitigation. The metric forces the question; it doesn't suggest the answer.
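The same arithmetic as a short sketch, using the SPFM formula exactly as this guide states it and the Step 1 bucket totals. The function and variable names are illustrative choices, not part of any tool or of the standard.

```python
# SPFM targets per ASIL, from the table in this step.
SPFM_TARGETS = {"D": 0.99, "C": 0.97, "B": 0.90}


def spfm(total_fit, safe_fit, spf_fit, rf_fit):
    """SPFM = 1 - (SPF + RF) / (total - safe); all rates in the same unit."""
    dangerous = total_fit - safe_fit
    if dangerous <= 0:
        raise ValueError("no non-safe failure rate to evaluate")
    return 1.0 - (spf_fit + rf_fit) / dangerous


# Bucket totals from Step 1, in FIT.
value = spfm(total_fit=100.0, safe_fit=3.0, spf_fit=0.0, rf_fit=1.99)

print(f"SPFM = {value:.2%}")
for asil, target in SPFM_TARGETS.items():
    margin_pp = (value - target) * 100.0
    status = "pass" if value >= target else "fail"
    print(f"ASIL {asil}: target {target:.0%}, {status}, margin {margin_pp:+.2f} pp")
```

Printing the margin in percentage points next to each target makes the roughly one-point ASIL D shortfall visible alongside the narrow ASIL C pass.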
Step 3. LFM: Latent Fault Metric
LFM measures the fraction of multi-point faults that are detected by a diagnostic, recovered, or otherwise prevented from sitting silently in the system waiting for a second fault. Latent faults are the dangerous category here: a stuck-OK watchdog, an ECC mechanism that has degraded, a redundant channel that has failed but not been noticed. The system looks healthy until the second fault hits.
LFM = 1 − (Σ λMPF,latent) / (Σ λtotal − Σ λS − Σ λSPF − Σ λRF)
The denominator is the multi-point-fault population only, i.e. faults that aren't safe and aren't single-point or residual. The numerator is the latent (undetected) subset. ISO 26262 Part 5 Table 5 thresholds:
| ASIL | LFM target |
|---|---|
| D | ≥ 90% |
| C | ≥ 80% |
| B | ≥ 60% |
| A | No quantitative requirement |
For the AEB MCU:
LFM = 1 − 2.5 / (100 − 3 − 0 − 1.99)
= 1 − 2.5 / 95.01
= 1 − 0.0263
= 0.9737 = 97.37%
This passes ASIL D (≥ 90%) comfortably. The MCU's only meaningful latent is the lockstep-monitor stuck-OK fault (2.5 FIT undetected), and even that is a small fraction of the multi-point pool dominated by the covered core and RAM faults.
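A minimal sketch of the LFM calculation under the formula given above, with the Step 1 bucket totals as inputs. It also prints the SPFM denominator next to the LFM denominator to show that the two metrics divide by different populations. All names are illustrative.

```python
# LFM targets per ASIL, from the table in this step.
LFM_TARGETS = {"D": 0.90, "C": 0.80, "B": 0.60}


def lfm(total_fit, safe_fit, spf_fit, rf_fit, latent_fit):
    """LFM = 1 - latent / (total - safe - SPF - RF); rates in the same unit."""
    multi_point_pool = total_fit - safe_fit - spf_fit - rf_fit
    if multi_point_pool <= 0:
        raise ValueError("no multi-point failure rate to evaluate")
    return 1.0 - latent_fit / multi_point_pool


# Bucket totals from Step 1, in FIT.
total, safe, spf_rate, rf = 100.0, 3.0, 0.0, 1.99
latent = 2.5

value = lfm(total, safe, spf_rate, rf, latent)
spfm_denominator = total - safe                 # dangerous population
lfm_denominator = total - safe - spf_rate - rf  # multi-point population only

print(f"LFM = {value:.2%}")
print(f"SPFM denominator: {spfm_denominator:.2f} FIT")
print(f"LFM denominator:  {lfm_denominator:.2f} FIT")
print("meets ASIL D" if value >= LFM_TARGETS["D"] else "fails ASIL D")
```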
Two structural points about LFM that change how design teams think about it:
- LFM is dominated by the diagnostics on diagnostics. The lockstep-monitor self-test in our example is the LFM-relevant mechanism — it's not protecting against the original core fault, it's protecting against the lockstep monitor itself failing silently. Every safety mechanism in an ASIL D component needs its own diagnostic, otherwise it sits in λMPF,latent at full rate. This is the structural reason ASIL D safety-mechanism design produces nested verification: every monitor has a meta-monitor, recursively, until the rates are low enough.
- LFM uses a different denominator from SPFM. The set of faults that count in each metric is different, which means a high LFM doesn't help SPFM and vice-versa. Teams that compute one from the other — "we have 99% DC on the watchdog, so LFM and SPFM should both be 99%" — miss the structural difference and produce wrong numbers. Always compute both from the bucket totals separately.
Step 4. PMHF: Probabilistic Metric for Hardware Failures
PMHF is the rate per hour at which the safety goal is violated due to random hardware failures, averaged over the operational lifetime of the vehicle. Where SPFM and LFM are dimensionless ratios, PMHF is a rate — comparable directly against numerical targets without further interpretation. The Part 5 Table 6 thresholds:
| ASIL | PMHF target |
|---|---|
| D | ≤ 10⁻⁸/h (10 FIT) |
| C | ≤ 10⁻⁷/h (100 FIT) |
| B | ≤ 10⁻⁷/h (100 FIT) |
| A | No quantitative requirement |
For a single-channel architecture with safety mechanisms — which is what our AEB MCU is, considered standalone — the PMHF contribution decomposes as:
PMHF ≈ Σ (λSPF + λRF) ← single-point and residual faults (λSPF = 0 for this MCU)
+ Σ λMPF,latent · λpartner · Tlife / 2 ← latent + partner during lifetime
The residual term contributes directly — every uncovered fraction of a single-point fault adds to the rate at which the safety goal is violated. The latent term is the probabilistic combination of "the latent fault has already occurred" with "the partner fault then occurs during the remaining lifetime"; the factor of Tlife/2 is the average exposure window for a uniformly-distributed second-fault arrival. ISO 26262 Part 5 §9.4.2.3 gives the full decomposition with refinements for proof-test intervals and detected multi-point faults; the simplified form above captures >95% of the contribution for typical automotive mission profiles.
Plugging the AEB MCU buckets in, with vehicle lifetime Tlife = 10,000 h and the latent's partner-fault rate ≈ 70 FIT (the 40 FIT core stuck-at and 30 FIT core transient modes that the lockstep monitor protects against):
Residual term: Σ λRF = 1.99 FIT = 1.99×10⁻⁹ /h
Latent term: 2.5×10⁻⁹ · 70×10⁻⁹ · 10,000 / 2
≈ 8.75×10⁻¹³ /h ← negligible
PMHF ≈ 1.99×10⁻⁹ + 8.75×10⁻¹³ ≈ 2.0×10⁻⁹ /h
The result is 2×10⁻⁹/h against an ASIL D target of 10⁻⁸/h: a pass with a 5× margin. The latent contribution is more than three orders of magnitude smaller than the residual contribution (8.75×10⁻¹³/h against 1.99×10⁻⁹/h, a factor of roughly 2,300). This is an extreme version of the general pattern: for component-level PMHF the residual fault rate is what matters, and the latent term is a rounding error unless λpartner × Tlife is unusually large.
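The simplified PMHF form used in this step can be sketched as below. It implements only the two-term approximation from this guide, not the full decomposition in the standard, and the function name and argument names are assumptions made for illustration.

```python
FIT = 1e-9  # one FIT is 1e-9 failures per hour

# PMHF targets per ASIL in failures per hour, from the table in this step.
PMHF_TARGETS = {"D": 1e-8, "C": 1e-7, "B": 1e-7}


def pmhf_simplified(rf_fit, latent_fit, partner_fit, t_life_h):
    """Return (residual_term, latent_term) of the simplified PMHF, per hour."""
    residual_term = rf_fit * FIT
    # Latent fault already present, partner fault arriving within the
    # lifetime, with an average exposure window of t_life / 2.
    latent_term = (latent_fit * FIT) * (partner_fit * FIT) * t_life_h / 2.0
    return residual_term, latent_term


# AEB MCU inputs from this step.
residual_term, latent_term = pmhf_simplified(
    rf_fit=1.99, latent_fit=2.5, partner_fit=70.0, t_life_h=10_000.0
)
pmhf = residual_term + latent_term

print(f"residual term: {residual_term:.3e} /h")
print(f"latent term:   {latent_term:.3e} /h")
print(f"PMHF:          {pmhf:.3e} /h")
print(f"residual/latent ratio: {residual_term / latent_term:.0f}")
print("meets ASIL D" if pmhf <= PMHF_TARGETS["D"] else "fails ASIL D")
```

Returning the two terms separately keeps the dominance of the residual term visible, which is easy to lose if only the sum is reported.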
Step 5. Unified verdict on the AEB MCU
The three metrics evaluated against ASIL D targets, side by side:
| Metric | Computed | ASIL D target | Verdict |
|---|---|---|---|
| SPFM | 97.95% | ≥ 99% | FAIL by 1.05 pp |
| LFM | 97.37% | ≥ 90% | PASS by 7.4 pp |
| PMHF | 2.0×10⁻⁹/h | ≤ 10⁻⁸/h | PASS with 5× margin |
The MCU does not meet ASIL D. It meets ASIL C on all three metrics, although the SPFM margin over the 97% target is under one percentage point. The single failing metric is SPFM, and the single dominant contributor inside SPFM is the RAM double-bit residual at 1 FIT. Three credible design responses:
- Upgrade ECC to detect and safely handle double-bit errors. SECDED with DUE-trap-to-safe (the double-bit error triggers a controlled reset to safe state) moves the 1 FIT from λRF into λMPF,detected. New SPFM: 1 − 0.99/97 = 98.98%. Still fails ASIL D, but only by 0.02 pp, close enough that further small improvements (e.g. tightening the 95% clock-drift DC to 99%) push it over.
- Re-architect to eliminate the high-residual component. A different memory controller with intrinsically lower failure rate, or a different memory technology (FRAM, MRAM) where the dominant failure mode isn't bit-flips. Expensive but cleanly removes the gating constraint.
- Apply ASIL decomposition. The MCU hardware stays as designed; the safety goal is decomposed into ASIL B(D) + ASIL B(D) at the system level (cf. Article 6), so the MCU's per-channel ASIL drops to B. The SPFM target relaxes to 90%, which the MCU clears comfortably. The decomposition has its own independence-and-CCF cost, but at the MCU level the metric pressure dissolves.
Notice what didn't make the list: "improve the safety mechanisms we already have". The SPFM shortfall is concentrated in a fault mode that has no diagnostic at all (RAM double-bit). Tightening lockstep coverage from 99% to 99.5% wouldn't shift the answer — it improves a residual that's already small. The metric structure tells the design team where to look; the FMEDA tells them what to fix; the decision is which of the three responses fits the cost / schedule / risk envelope.
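The sensitivity argument can be checked with a small what-if sketch over the directly-violating failure modes from Step 1. Two modelling assumptions are made here and are not stated in the source: the ECC trap is treated as 100% coverage of double-bit errors, and a lockstep improvement to 99.5% is applied to both core failure modes.

```python
DANGEROUS_FIT = 97.0  # 100 FIT total minus 3 FIT safe, from Step 1
ASIL_D_SPFM = 0.99

# (rate in FIT, DC) for each directly-violating failure mode in Step 1.
baseline = {
    "core stuck-at": (40.0, 0.99),
    "core SEU": (30.0, 0.99),
    "RAM single-bit": (19.0, 0.99),
    "RAM double-bit": (1.0, 0.00),
    "clock drift": (2.0, 0.95),
}


def spfm(modes):
    """SPFM from per-mode residuals; the denominator is unchanged by DC tweaks."""
    residual = sum(rate * (1.0 - dc) for rate, dc in modes.values())
    return 1.0 - residual / DANGEROUS_FIT


def with_changes(modes, changes):
    """Return a copy of modes with the DC of the named modes replaced."""
    updated = dict(modes)
    for name, new_dc in changes.items():
        rate, _ = updated[name]
        updated[name] = (rate, new_dc)
    return updated


scenarios = {
    "baseline": {},
    "lockstep 99% -> 99.5%": {"core stuck-at": 0.995, "core SEU": 0.995},
    "ECC traps double-bit": {"RAM double-bit": 1.0},
    "ECC trap + clock DC 99%": {"RAM double-bit": 1.0, "clock drift": 0.99},
}

results = {label: spfm(with_changes(baseline, ch)) for label, ch in scenarios.items()}
for label, value in results.items():
    verdict = "PASS" if value >= ASIL_D_SPFM else "FAIL"
    print(f"{label:26s} SPFM = {value:.2%}  {verdict} (ASIL D)")
```

Under these assumptions only the scenario that removes the double-bit residual and also tightens the clock-drift coverage clears the 99% target; improving lockstep alone does not.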
The other thing the table makes obvious: this MCU passes PMHF and LFM with substantial margin and fails SPFM by about one percentage point. A team that had only computed PMHF would have declared this design ASIL D, and the audit would have caught it. The three-metric structure is what prevents the architecture from passing ISO 26262's hardware integrity claim with a fault profile that the standard's authors knew was dangerous. Compute all three, SPFM, then LFM, then PMHF, every time.
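One way to make "compute all three" hard to skip is a single gate that refuses to report an ASIL unless every metric clears its target. The sketch below uses the targets from the tables in Steps 2 to 4 and the computed values from the verdict table; the function names are hypothetical.

```python
# ASIL: (min SPFM, min LFM, max PMHF per hour), from the tables in Steps 2 to 4.
TARGETS = {
    "D": (0.99, 0.90, 1e-8),
    "C": (0.97, 0.80, 1e-7),
    "B": (0.90, 0.60, 1e-7),
}


def verdict(spfm, lfm, pmhf, asil):
    """Per-metric pass/fail against one ASIL's targets."""
    min_spfm, min_lfm, max_pmhf = TARGETS[asil]
    return {
        "SPFM": spfm >= min_spfm,
        "LFM": lfm >= min_lfm,
        "PMHF": pmhf <= max_pmhf,
    }


def highest_asil_met(spfm, lfm, pmhf):
    """Highest ASIL for which all three metrics pass, or None."""
    for asil in ("D", "C", "B"):
        if all(verdict(spfm, lfm, pmhf, asil).values()):
            return asil
    return None


# Computed values for the AEB MCU from the verdict table.
mcu = {"spfm": 0.9795, "lfm": 0.9737, "pmhf": 2.0e-9}

asil_d = verdict(asil="D", **mcu)
overall = highest_asil_met(**mcu)

for metric, ok in asil_d.items():
    print(f"{metric}: {'PASS' if ok else 'FAIL'} against ASIL D")
print(f"highest ASIL met on all three metrics: {overall}")
```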
Five pitfalls a reviewer will catch
- Over-claiming "safe" faults. A failure mode is in λS only if it provably cannot violate the safety goal across the operational profile — temperature range, voltage range, all input patterns, all internal states. The fail-safe direction has to be proven, not asserted. Reviewers' standard question: "show me the fault-injection test that confirms this fault mode produces a safe state in the worst-case operational corner". Failing to defend a λS claim shrinks the safe pool, which moves rate into λSPF or λRF and tanks SPFM.
- Datasheet DC values used directly. Vendor datasheets publish coverage figures ("ECC catches 99.9% of single-bit errors") under specific assumptions: nominal access pattern, room temperature, typical refresh rate. The actual coverage in your system depends on how the part is used. ISO 26262 Part 5 §B.3 requires DC values to come from FMEDA validation — typically a structured fault-injection campaign (the SAE J3187 framework, or Synopsys/Cadence simulation tools) — not from datasheets. Reviewers ask for the FMEDA validation report; "vendor says 99%" doesn't survive that question.
- Diagnostic test interval forgotten. A watchdog with a 100 ms timeout catches faults within 100 ms. A power-on self-test catches faults at restart only. ISO 26262 Part 5 §B.4.2 reduces the effective DC by the fraction of the operational time the diagnostic isn't actively running — which for periodic diagnostics is meaningful. A "99% DC" diagnostic that only runs once a drive cycle has effective DC under 50% in steady-state operation. Compute test interval explicitly; it shows up as part of the FMEDA.
- Component-level pass ≠ safety-goal-level pass. SPFM, LFM and PMHF are computed per component. The safety goal is enforced by an architecture composed of many components, and ISO 26262 Part 5 §9.4.2.5 requires the metrics to be aggregated up to the safety-goal level. A radar ECU passing all three metrics for ASIL D and a fusion ECU also passing for ASIL D doesn't guarantee that the combined system meets ASIL D — common-cause failures, integration faults, and propagation paths between components can degrade the architecture-level metric. The component-level pass is necessary, not sufficient.
- DC values not re-validated when the operational profile changes. A DC measured at 25 °C, 5.0 V, nominal access pattern doesn't apply at −40 °C, 4.5 V, or atypical access patterns. If the vehicle's operational profile expands (a radar from passenger-car validation now used in a commercial-truck programme), the FMEDA needs re-running. The most common change-impact-analysis miss in production is reusing DC values across programmes without re-validation; reviewers spot it by asking for the operational-profile envelope referenced in the FMEDA.
Where to go next
- Run the FMEDA on your own component. Open FTA Studio — the Failure Rate Database (Enterprise) ships with FIDES, IEC 62380 and Siemens SN 29500 reliability data feeding directly into the FMEDA template. The bucket totals come out as automatically as the metrics.
- Cross-check against ASIL decomposition. Article 6 covers the architecture-level alternative: when component-level ASIL D is too expensive, drop to ASIL B via decomposition and the per-component metric targets relax accordingly.
- For the CCF interaction, Article 5 covers β-factor and MGL — relevant because PMHF for a multi-channel architecture has an explicit β·λ term that often dominates the answer at the architecture level.
- For the underlying λ values, our failure-rate reference page covers the typical FIT ranges for automotive components by industry source. Use validated data, not handbook numbers from outside your domain.