Monte Carlo for FTA — lognormal vs uniform leaf distributions

A point-estimate fault tree gives one number for P(TOP). Real engineering data isn't a point estimate: it's a published median with a confidence interval, or a manufacturer-supplied range, or an extrapolation from a limited fleet sample. Monte Carlo propagates the leaf-event uncertainty through the Boolean structure of the tree and produces a percentile band on the top event, which is what reviewers of NRC PRA submissions, ARP 4761 SSAs at DAL-A, and increasingly ISO 26262 ASIL-D submissions want to see. The leaf distribution you pick (lognormal, uniform, triangular, beta) shapes the band, and picking the wrong one can produce a defensible-looking number that's wrong by an order of magnitude. This guide covers the distribution choice, the sampling mechanics, and a worked SPAD tree end-to-end.

Worked tree: rail / SPAD (Article 1). References: NUREG-CR-6823, ASME/ANS PRA Standard.

Why a point estimate isn't enough

Suppose your fault tree's basic events come from MIL-HDBK-217F. The handbook publishes a failure rate for each component class — say, λ = 5×10⁻⁷/h for a vital signal lamp — but the underlying data has a confidence interval. The actual rate, in the field, on this exact production batch, in this exact climate, is uncertain. Even handbook-published values typically come with an error factor of 3–5: the true rate plausibly lies anywhere in [λ/5, 5λ]. Plug the median into a Boolean cut-set sum and you get a single number for P(TOP) — but that number's own uncertainty is a function of every leaf's uncertainty propagated through the tree's structure.

Three reasons this matters in practice:

  1. The point estimate can be far from the mean of the band. Cut-set sums of lognormal-distributed leaves are themselves heavy-tailed; the mean of P(TOP) is typically larger than the median by a factor that grows with the leaf-distribution spread and with the order of the dominant cut sets. A safety case that quotes only the median can be 2–3× under the actual mean, and it's the mean that determines tolerable-risk comparisons in most regulatory frameworks.
  2. The 95th-percentile band tells the design team where the risk lives if the data is pessimistic. A 5th–95th band that spans two orders of magnitude on P(TOP) is a different kind of safety case from one that spans 30%. The first warrants design margin; the second is "we've nailed it". Both can have the same point estimate.
  3. Regulators increasingly require Monte Carlo bands. NRC PRA submissions have used percentile bands since the 1980s. ASME/ANS RA-Sa-2009 PRA Standard codifies the requirement. ARP 4761 SSAs at DAL-A typically include them. ISO 26262 doesn't mandate them but reviewers ask "what's the uncertainty on this PMHF figure?" routinely at ASIL D.

What "Monte Carlo on a fault tree" actually computes

For each of N samples (typically 10⁴–10⁶): draw a value for every basic event from its declared leaf distribution, evaluate the tree's Boolean expression with those values, and record the resulting P(TOP). After all samples, the empirical distribution of P(TOP) values is the propagated uncertainty. Percentile statistics (5th, 50th, 95th) and moments (mean, variance) are read off directly. Importance measures from Article 2 get their own bands by the same mechanism: F-V importance becomes "F-V at 5th percentile vs at 95th", which is the regulator-friendly way to express ranking robustness.
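
The sampling loop described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the Article 1 tree: the three-event structure `TOP = A OR (B AND C)`, the event names, and the median values are all assumptions made for the example, and every leaf is given an error factor of 3.

```python
import math
import random
import statistics

def sigma_from_ef(ef):
    """Log-space sigma for a lognormal whose 95th/median ratio is ef."""
    return math.log(ef) / 1.645

def top_probability(p):
    """Hypothetical tree: TOP = A OR (B AND C), with independent events."""
    cut1 = p["A"]
    cut2 = p["B"] * p["C"]
    return 1.0 - (1.0 - cut1) * (1.0 - cut2)

def run_monte_carlo(medians, ef, n_samples, seed):
    rng = random.Random(seed)
    sigma = sigma_from_ef(ef)
    results = []
    for _ in range(n_samples):
        # One draw per basic event (capped at 1), then one tree evaluation.
        sample = {name: min(1.0, rng.lognormvariate(math.log(m), sigma))
                  for name, m in medians.items()}
        results.append(top_probability(sample))
    return results

medians = {"A": 1e-3, "B": 2e-2, "C": 5e-2}   # illustrative values only
samples = run_monte_carlo(medians, ef=3.0, n_samples=20_000, seed=1)

cuts = statistics.quantiles(samples, n=20)     # cut points at 5%, 10%, ..., 95%
p05, p50, p95 = cuts[0], cuts[9], cuts[18]
mean = statistics.fmean(samples)
print(f"5th={p05:.3e} median={p50:.3e} mean={mean:.3e} 95th={p95:.3e}")
```

The empirical list `samples` is the propagated uncertainty; everything reported afterwards (percentiles, mean) is read off that list.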

Step 1: Which leaf distribution, and why lognormal is the default

Four distributions cover essentially every basic event in practice. The choice per leaf depends on what the data source actually published.

| Distribution | When to use | Parameters | Tail behaviour |
| --- | --- | --- | --- |
| Lognormal | Default. Most reliability data is published with a median and an error factor (EF), where EF = √(95th/5th). Lognormal is the maximum-entropy distribution given those two summaries. NUREG-CR-6823 §6 makes lognormal the default for nuclear PRA. | μ = ln(median), σ (log-space scale). Often expressed as median + EF: σ = ln(EF)/1.645. | Heavy right tail; mean > median; ratio mean/median = exp(σ²/2). |
| Uniform | When the data source publishes a min and max only ("the rate is between 10⁻⁷ and 10⁻⁵"), with no median or shape information. Maximum-entropy on a bounded interval. Honest about lack of knowledge. | min, max. | Symmetric, bounded; mean = (min + max)/2. |
| Triangular | When the data source gives min / mode / max (commonly from elicited expert opinion in Bayesian PRA). Mode is the engineer's best guess; min/max bracket where the engineer would be surprised. | min, mode, max. | Bounded; mean = (min + mode + max)/3 (not mode-centred). |
| Beta | For per-demand probabilities (sensor-test PFD, human-error rates) when bounded between 0 and 1 and the data source has a posterior from Bayesian update on operational counts. | α, β shape parameters from successes and failures observed. | Bounded [0,1]; flexible shape; conjugate prior to binomial likelihood. |
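
The median-plus-EF parameterisation in the lognormal row can be checked numerically. The sketch below assumes a median of 5×10⁻⁷/h (the signal-lamp figure quoted earlier) and EF = 3; the helper name `lognormal_params` is illustrative. It converts the pair to log-space parameters and compares the closed-form 5th and 95th percentiles against sampled ones.

```python
import math
import random
import statistics

Z95 = 1.645  # standard-normal 95th-percentile quantile

def lognormal_params(median, ef):
    """Return (mu, sigma) in log space for a given median and error factor."""
    return math.log(median), math.log(ef) / Z95

median, ef = 5e-7, 3.0          # illustrative rate per hour, EF = 3
mu, sigma = lognormal_params(median, ef)

# Closed-form summaries of the lognormal.
p05 = median / ef
p95 = median * ef
mean = median * math.exp(sigma ** 2 / 2)

# Empirical cross-check by sampling.
rng = random.Random(42)
draws = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]
cuts = statistics.quantiles(draws, n=20)
print(f"sigma={sigma:.3f}  mean/median={mean / median:.3f}")
print(f"sampled 5th={cuts[0]:.2e} (closed form {p05:.2e})")
print(f"sampled 95th={cuts[18]:.2e} (closed form {p95:.2e})")
```

With EF = 3 the mean sits about 25% above the median; the gap widens quickly as EF grows because the ratio is exp(σ²/2).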

Why lognormal dominates in practice

Two reasons lognormal is the default for component failure rates, and uniform / triangular only get used when lognormal can't be honestly justified:

Uniform and triangular are valid choices but each has a cost: uniform overstates the probability mass at the extremes (the actual rate is rarely uniformly distributed across two orders of magnitude — it's almost always concentrated near the median). Triangular's mode-centring is honest if you genuinely have a mode estimate but produces unnaturally sharp peaks if the elicitation was actually "the rate is somewhere in this range and I'm not sure where". Beta is correct for bounded per-demand probabilities and wrong for failure rates (which can in principle exceed 1/h).

The "default lognormal, justify anything else" convention

A safety-case Monte Carlo specification declaring lognormal for every leaf is what an NRC PRA reviewer expects to see. Specifications declaring uniform or triangular for many leaves trigger "why?" questions per leaf. The defensible pattern: lognormal as the default, uniform when the data source genuinely gave only min/max, triangular only when an explicit elicitation produced a mode estimate, beta only for per-demand probabilities. Document the choice per leaf in the FTA's data-source table; reviewers look there first.
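
To make the four choices concrete, here is a minimal sketch that draws from each distribution with Python's `random` module. All numeric inputs are illustrative assumptions, the `draw_*` helper names are hypothetical, and the beta case assumes a uniform Beta(1, 1) prior updated with binomial counts, which is one common choice rather than the only one.

```python
import math
import random
import statistics

rng = random.Random(7)
N = 50_000

def draw_lognormal(median, ef):
    return rng.lognormvariate(math.log(median), math.log(ef) / 1.645)

def draw_uniform(lo, hi):
    return rng.uniform(lo, hi)

def draw_triangular(lo, mode, hi):
    # Note the argument order of random.triangular: (low, high, mode).
    return rng.triangular(lo, hi, mode)

def draw_beta(failures, demands):
    # Posterior from an assumed Beta(1, 1) prior and binomial counts.
    return rng.betavariate(1 + failures, 1 + demands - failures)

leaves = {
    "lognormal": [draw_lognormal(1e-6, 3.0) for _ in range(N)],
    "uniform": [draw_uniform(1e-7, 1e-5) for _ in range(N)],
    "triangular": [draw_triangular(1e-7, 1e-6, 1e-5) for _ in range(N)],
    "beta": [draw_beta(2, 1000) for _ in range(N)],
}
for name, xs in leaves.items():
    print(f"{name:10s} median={statistics.median(xs):.2e} "
          f"mean={statistics.fmean(xs):.2e}")
```

The uniform leaf on [10⁻⁷, 10⁻⁵] ends up with its median near 5×10⁻⁶, the upper half of the range, which is the "overstates the mass at the extremes" effect described above.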

Step 2: Sample size, convergence, and correlated leaves

Once leaf distributions are chosen, three operational questions determine whether the Monte Carlo run is defensible: how many samples are enough, how do you know the run has converged, and what to do when basic events aren't independent.

Sample size and the 1/√N convergence law

The standard error of any Monte Carlo estimate scales as 1/√N. Halving the standard error costs 4× the samples; reducing it by 10× costs 100×. For typical PRA-scale fault trees the practical sample sizes are:

| N | Use case | Approximate precision |
| --- | --- | --- |
| 10⁴ | First-pass exploration; design-review presentation | ±5–10% on median; ±20% on 5th/95th percentiles |
| 10⁵ | Safety-case standard run; what most regulators expect to see for SIL 3 / DAL B claims | ±2% on median; ±5–10% on 5th/95th percentiles |
| 10⁶ | Tight 99% CI estimation; rare-event quantification (PMHF at ASIL D, THR at SIL 4) | ±0.5% on median; ±2% on 99th percentile |
| 10⁷ | Research-grade convergence verification; rarely needed in production safety cases | ±0.2% on percentile statistics |

A useful pattern: run 10⁵ samples for the official safety-case number, plus a second independent 10⁵ run with a different random seed, and confirm that the reported percentiles agree to the first decimal place of the mantissa. The two runs together act as a convergence diagnostic without committing to 10⁶ samples.
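
The two-seed check can be sketched as follows. To keep the example self-contained, the "tree" is a single lognormal basic event (an assumption for illustration only); in practice `run` would evaluate the full tree. The relative gap between the two seeds should shrink roughly as 1/√N.

```python
import math
import random
import statistics

def run(n_samples, seed, median=1e-3, ef=3.0):
    """Hypothetical single-leaf 'tree': P(TOP) is one lognormal basic event."""
    rng = random.Random(seed)
    mu, sigma = math.log(median), math.log(ef) / 1.645
    return [rng.lognormvariate(mu, sigma) for _ in range(n_samples)]

def summary(samples):
    cuts = statistics.quantiles(samples, n=20)
    return {"p05": cuts[0], "p50": cuts[9], "p95": cuts[18]}

def relative_gap(a, b):
    return abs(a - b) / ((a + b) / 2)

for n in (1_000, 10_000, 100_000):
    s1, s2 = summary(run(n, seed=1)), summary(run(n, seed=2))
    gaps = {k: relative_gap(s1[k], s2[k]) for k in s1}
    print(n, {k: f"{g:.2%}" for k, g in gaps.items()})
```

If the gap at the chosen N is larger than the precision the safety case claims, the run has not converged and N needs to grow.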

Variance reduction — Latin Hypercube Sampling

Naive Monte Carlo draws every leaf value independently at random. Latin Hypercube Sampling (LHS) stratifies each leaf's distribution into N equal-probability bins and draws exactly one value from every bin, pairing the bins across leaves at random, so that each leaf's range is covered evenly rather than relying on randomness to fill it. For typical fault tree structures (50–500 leaves, OR/AND mix), LHS reduces the standard error by roughly 5–10× for the same sample count. NUREG-CR-6823 §6.3 documents the gain on representative PRA trees; the cost is implementation complexity and a slight loss of independence between samples that has to be checked at the analysis end.

For a 10⁵-sample LHS run, the precision is at least comparable to a 10⁶-sample naive Monte Carlo run, while finishing in roughly 10% of that run's wall time. Production PRA tools (SAPHIRE, Riskman, FTA Studio's lognormal Monte Carlo at /tools/lognormal-monte-carlo) use LHS by default; analyst-built scripts often use naive MC and give up the variance-reduction win without realising.
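
A hand-rolled Latin Hypercube design needs only a stratified draw per leaf and a shuffle. The sketch below is an illustration under stated assumptions, not what any of the tools named above implement: it uses a hypothetical OR gate over three lognormal leaves with assumed medians and EF = 3, and compares how much the estimated mean of P(TOP) scatters across repeated small runs under naive and LHS sampling.

```python
import math
import random
import statistics

NORMAL = statistics.NormalDist()
MEDIANS = (1e-3, 5e-4, 2e-4)   # illustrative leaf medians (assumed)
SIGMA = math.log(3.0) / 1.645  # EF = 3 on every leaf

def naive_unit(n, k, rng):
    """k columns of n independent uniform draws."""
    return [[rng.random() for _ in range(n)] for _ in range(k)]

def lhs_unit(n, k, rng):
    """k columns; each column has exactly one point in each of n equal strata."""
    columns = []
    for _ in range(k):
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)  # random pairing of strata across leaves
        columns.append(col)
    return columns

def to_lognormal(u, median):
    """Inverse-CDF transform of a uniform draw to a lognormal leaf value."""
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return median * math.exp(SIGMA * NORMAL.inv_cdf(u))

def mean_top(design):
    """Mean P(TOP) for a hypothetical OR gate over three basic events."""
    n = len(design[0])
    total = 0.0
    for i in range(n):
        q = 1.0
        for j, median in enumerate(MEDIANS):
            q *= 1.0 - to_lognormal(design[j][i], median)
        total += 1.0 - q
    return total / n

rng = random.Random(3)
REPS, N = 200, 200
naive = [mean_top(naive_unit(N, len(MEDIANS), rng)) for _ in range(REPS)]
lhs = [mean_top(lhs_unit(N, len(MEDIANS), rng)) for _ in range(REPS)]
sd_naive, sd_lhs = statistics.stdev(naive), statistics.stdev(lhs)
print(f"spread of the mean estimate: naive={sd_naive:.2e}  LHS={sd_lhs:.2e}")
```

The size of the gain depends on the tree: an OR-dominated structure like this one is close to additive, which is where stratification helps most. Where SciPy is available, `scipy.stats.qmc.LatinHypercube` provides a maintained implementation of the design step.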

Correlated leaves — the trap that wrecks precision claims

Drawing every leaf event independently is the implicit assumption of basic Monte Carlo. It's wrong whenever basic events share a vendor, a manufacturing batch, a firmware build, a maintenance crew, or an environmental zone. In those cases the events are positively correlated: when the true failure rate of one is higher than its median, the other's tends to be higher too. Treating them as independent substantially overstates the precision of P(TOP).

Two ways to model correlation in the Monte Carlo:

For most PRA-scale trees, the common-factor approach is preferred because it integrates with the existing β-factor / CCF analysis. The common-cause basic event already exists in the tree (it was added during CCF modelling); Monte Carlo just samples it normally. The only thing to verify is that the residual independent rates are themselves drawn independently — a sanity-check on the implementation, not a different analysis.

The most common error: claiming independence everywhere

Most safety-tool Monte Carlo implementations sample every leaf independently by default. For a fault tree with declared CCF groups, this silently breaks the correlation that the CCF group was supposed to model. The fix is configuration, not analysis: ensure the Monte Carlo respects the CCF groups by sampling the common-cause basic event jointly with the components it covers. Reviewers ask "does your Monte Carlo respect the CCF declarations from the FTA?"; the right answer is "yes, the common-cause basic events are sampled once per iteration and applied to all components in the group", not "the leaves are sampled independently from their declared lognormals".
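
The effect of ignoring shared uncertainty shows up even in the smallest redundant structure. The sketch below assumes a hypothetical AND gate over two identical channels with an illustrative median of 10⁻² per demand and EF = 3. It compares independent draws per channel against a single draw per iteration applied to both channels, which is the fully correlated limiting case rather than a β-factor model.

```python
import math
import random
import statistics

SIGMA = math.log(3.0) / 1.645      # EF = 3
MEDIAN = 1e-2                      # illustrative per-demand probability
N = 50_000
rng = random.Random(11)

def draw():
    """One lognormal draw of a channel failure probability, capped at 1."""
    return min(1.0, rng.lognormvariate(math.log(MEDIAN), SIGMA))

# Independent: each redundant channel gets its own draw.
independent = [draw() * draw() for _ in range(N)]

# Shared: one draw per iteration, applied to both channels in the group.
shared = []
for _ in range(N):
    p = draw()
    shared.append(p * p)

def p95(xs):
    return statistics.quantiles(xs, n=20)[18]

mean_ind, mean_sh = statistics.fmean(independent), statistics.fmean(shared)
print(f"independent: mean={mean_ind:.2e}  95th={p95(independent):.2e}")
print(f"shared:      mean={mean_sh:.2e}  95th={p95(shared):.2e}")
```

Both runs have the same median for the gate, but the shared-draw run has a larger mean and a higher 95th percentile. The independent run reports a narrower band than the data supports.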

Step 3: Worked SPAD tree with Monte Carlo bands

Pulling the SPAD tree from Article 1 back into view: 8 minimal cut sets, point-estimate P(TOP) = 4.65×10⁻³ per train-year. Now treat each leaf as lognormal rather than a point. Per the standard rail-data-source convention (IEC TR 62380 publishes signal-lamp rates with EF ≈ 3, FIDES 2009 publishes electronic-component rates with EF ≈ 2–4), declare error factor EF = 3 on every basic event:

EF = √(95th / 5th) = 3
σlog = ln(EF) / 1.645 = ln(3) / 1.645 ≈ 0.668

Each leaf's sampling distribution is then lognormal with the article's median rate as the median and σlog = 0.668 as the log-space scale. Run 10⁵ Latin Hypercube samples through the tree's Boolean structure.

Percentile band on P(TOP)

P(TOP) is itself approximately lognormal (Fenton-Wilkinson) because the dominant cut {BE-001} contributes 94% of the median and is itself lognormal. The empirical distribution of the 10⁵ samples produces:

| Statistic | P(TOP) per train-year | Equivalent per train-hour |
| --- | --- | --- |
| 5th percentile | 1.55×10⁻³ | 1.77×10⁻⁷ /h |
| 50th percentile (median) | 4.65×10⁻³ | 5.31×10⁻⁷ /h |
| Mean | 5.81×10⁻³ | 6.63×10⁻⁷ /h |
| 95th percentile | 1.39×10⁻² | 1.59×10⁻⁶ /h |
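
Because P(TOP) is approximately lognormal here, the table can be cross-checked in closed form without re-running the simulation. The sketch assumes the top event inherits EF = 3 from the dominant cut and that the per-hour column uses 8760 hours per year; both are inferences from the figures above rather than stated inputs.

```python
import math

HOURS_PER_YEAR = 8760                # assumed conversion basis
median = 4.65e-3                     # per train-year, from the table
ef = 3.0
sigma = math.log(ef) / 1.645

stats = {
    "5th": median / ef,
    "median": median,
    "mean": median * math.exp(sigma ** 2 / 2),
    "95th": median * ef,
}
for name, per_year in stats.items():
    per_hour = per_year / HOURS_PER_YEAR
    print(f"{name:7s} {per_year:.2e} /train-year  {per_hour:.2e} /h")
```

The closed-form values match the table to rounding, which is a quick sanity check that a simulated band is behaving as the lognormal approximation predicts.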

Three observations the table makes obvious that the point estimate hid:

F-V importance under uncertainty

Importance measures from Article 2 become bands too. Compute F-V per basic event for each Monte Carlo iteration; the empirical distribution of those F-V values across 10⁵ iterations gives the importance band:

| Basic event | F-V 5th | F-V median | F-V 95th | Rank stable? |
| --- | --- | --- | --- | --- |
| BE-001 (lamp wrong-side) | 0.85 | 0.94 | 0.99 | Yes, always rank 1 |
| BE-002 (controller wrong-side) | 0.005 | 0.038 | 0.12 | Mostly, sometimes flips with BE-003 at 5th |
| BE-003 (cable wrong-side) | 0.003 | 0.019 | 0.08 | Mostly, sometimes flips with BE-002 |
| BE-004..BE-009 | < 0.001 each | 0.001–0.003 | 0.005–0.015 | Yes, always near bottom |

The headline result is that BE-001 is robustly the dominant contributor across the entire band — making any reliability-budget allocation toward signalling lamps a defensible architectural decision regardless of leaf-data uncertainty. The BE-002 vs BE-003 ranking flips occasionally at low percentiles, but neither is meaningfully actionable at the 1–4% F-V level. The rank order's structural conclusion ("address the lamp first") survives the uncertainty propagation without modification.

This is the kind of statement Monte Carlo enables that the point estimate cannot. "F-V importance ranks BE-001 first" is a defensible safety-case claim only if the rank holds under the data's plausible variation. The Monte Carlo answers that question explicitly.
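
The per-iteration importance calculation can be sketched as below. This is a stand-in, not the Article 1 SPAD tree: it assumes three hypothetical single-event cut sets (`BE-A`, `BE-B`, `BE-C`) with illustrative medians, EF = 3 on each, and the rare-event approximation, under which F-V for a single-event cut set is simply that event's probability divided by P(TOP).

```python
import math
import random
import statistics

SIGMA = math.log(3.0) / 1.645
# Hypothetical stand-in: three single-event cut sets, illustrative medians.
MEDIANS = {"BE-A": 4e-3, "BE-B": 2e-4, "BE-C": 1e-4}
N = 20_000
rng = random.Random(5)

fv = {name: [] for name in MEDIANS}
rank1_count = 0
for _ in range(N):
    p = {k: rng.lognormvariate(math.log(m), SIGMA) for k, m in MEDIANS.items()}
    top = sum(p.values())                 # rare-event approximation
    for k in p:
        fv[k].append(p[k] / top)          # F-V for a single-event cut set
    if max(p, key=p.get) == "BE-A":
        rank1_count += 1

bands = {}
for k, xs in fv.items():
    cuts = statistics.quantiles(xs, n=20)
    bands[k] = (cuts[0], cuts[9], cuts[18])   # 5th, median, 95th
    print(f"{k}: F-V 5th={cuts[0]:.3f} median={cuts[9]:.3f} 95th={cuts[18]:.3f}")
print(f"BE-A ranked first in {rank1_count / N:.2%} of iterations")
```

The rank-1 fraction is the number to quote when claiming a ranking is robust: it states directly how often the dominant contributor stays dominant across the sampled leaf values.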

What the band tells you about design margin

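If the corporate tolerable threshold is 5×10⁻³ /train-year, the median (4.65×10⁻³) sits below it, while the mean (5.81×10⁻³) and the 95th percentile (1.39×10⁻²) both sit above it.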

The point-estimate analysis would have reported "passes". The Monte Carlo reports "passes by median, fails by mean, fails by 95th". The design conclusion changes: either tighten leaf-data uncertainty (more proof testing, more operational data), tighten the architecture (additional barriers per Article 1's wrong-side correction), or negotiate the metric basis with the regulator (median is sometimes acceptable if the leaf data has an audit-defended low EF).
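
The same comparison can be turned into a single exceedance probability. The sketch below assumes the lognormal approximation for P(TOP) with the median and EF from the worked example, and asks how likely it is that the true P(TOP) lies above the 5×10⁻³ threshold; the resulting figure is only as good as that approximation.

```python
import math
from statistics import NormalDist

median, ef = 4.65e-3, 3.0          # per train-year, from the worked example
threshold = 5e-3                   # corporate tolerable threshold
sigma = math.log(ef) / 1.645

metrics = {
    "median": median,
    "mean": median * math.exp(sigma ** 2 / 2),
    "95th": median * ef,
}
verdicts = {name: ("pass" if value <= threshold else "fail")
            for name, value in metrics.items()}

# P(P(TOP) > threshold) under the lognormal approximation.
z = math.log(threshold / median) / sigma
p_exceed = 1.0 - NormalDist().cdf(z)

for name, value in metrics.items():
    print(f"{name:6s} {value:.2e}  {verdicts[name]}")
print(f"probability of exceeding the threshold: {p_exceed:.0%}")
```

An exceedance probability in the 40–50% range makes the "passes by median" claim look much weaker than the bare point estimate suggests, which is the argument for tightening either the data or the architecture.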

The Monte Carlo result is what makes the safety case actually defensible

A point-estimate FTA that beats target is dismissable: "your input data is uncertain by an order of magnitude, so what does that 4.65×10⁻³ number mean?". The Monte Carlo answers explicitly: median 4.65×10⁻³, 95th 1.39×10⁻², driven by leaf-data error factor EF = 3 declared per IEC TR 62380. The reviewer doesn't have to guess at the uncertainty; the analyst handed them the band. That's the difference between a safety case that survives review and one that doesn't.

Five pitfalls a Monte Carlo reviewer will catch
