13. War-stories: the failure-mode catalogue
Every rule in this book was bought with a bug. The disciplines in the preceding chapters (shift the signal, gate on the lower bound, deflate for the sweep) did not arrive as wisdom; they arrived as post-mortems. This chapter is the ledger: it collects the failures, sanitised, and for each one names the consequence and the rule it forced into the framework.
The reason to keep a catalogue rather than a tidy list of best practices is that the same bug class recurs forever if you only fix instances. A researcher who never saw the same-bar leak will write it again next quarter, in different pandas, and it will look like ordinary code. Best practices are advice; a catalogued failure mode is a gate. And note who caught these: mostly not the author. You do not find your own look-ahead bug by re-reading the code that contains it; you find it with an independent path that disagrees.
We group the catalogue into five families, because the root causes cluster:
graph TD
A[A backtest result you believe] --> LA[Look-ahead<br/>the result is fiction]
A --> VH[Validation honesty<br/>the result is luck]
A --> TR[Tail & cost realism<br/>the result is fragile]
A --> DP[Deployment parity<br/>the result isn't what runs]
A --> SD[Selection discipline<br/>the result was cherry-picked]
How to read the A-codes and V-eras
Two stable labels recur across the book, and they're worth defining once, here, since this catalogue is where they originate. An A-code (A1, A6, A11, …) is a stable identifier for a catalogued audit finding / failure mode — an entry in this chapter — so a fix, a test, or another chapter can cite the exact lesson without re-telling the story. A V-era (V1-era, V2, V3.x) is a methodology version: the framework's gates get stricter over time (a serially-aware bootstrap, deflation-for-N, portfolio-level promotion), so a result is tagged with the framework version that blessed it — a "V1-era" Sharpe predates rules a "V3" result had to clear, which is why the caveats chapter treats un-retagged numbers as unconfirmed. The codes this book actually cites:
| Code | Failure mode → the rule it bought | Enforcement today |
|---|---|---|
| A1 / A2 | Same-bar look-ahead — lag any series before it multiplies a return | CI gate (value-corruption causality test) |
| A3 | Total-return vs price-only confusion — make TR/price-only explicit in every cost & return signature, and write data hypotheses before believing a surprising RETIRE | prose + typed signatures |
| A4 | "OOS" by partition, not provenance — per-fold selection or genuine pre-registration, plus a hard sanctuary / CI_lo veto in the decision function | API (decision-fn veto) + process |
| A5 | Deflation at the survivors' count — DSR's N and trial-variance come from the full screener pool; survivors-only mode self-flags optimistic; family-wise / FDR across a program | prose + API flag |
| A6 | The tail bootstrap resampled the effect — resample the underlying's returns and re-run the strategy on each path | prose / safe-by-construction |
| A7 | A scaling overlay post-multiplied onto P&L — wire the overlay into the engine so it changes the trades, costs, and fills | prose |
| A8 | "Validated against X" with no artifact — require a checked-in artifact and a reproducible test that regenerates it | prose (target: CI) |
| A9 | Strategy guide ≠ deployed config — match parameter-for-parameter, verified | prose (target: CI) |
| A10 | Parity tested against a mirror — compare the live signal against an independent reference, never the research code re-used as its own oracle | parity harness |
| A11 | Parity not run to the integer position — parity across the whole chain (data → feature → signal → size → units), including native-leverage tier scaling | parity harness |
| V1-era / V2 / V3.x | A methodology version: gates get stricter over time, so an un-retagged number is unconfirmed until re-run | verdict tagging |
| V3.1 / V3.2 | Pre-commit the selection rule; select the plateau, not the peak | process |
| V3.4 | Prove which components carry the edge by ablation; judge ballast on p_kill, not a return metric | process |
| V3.5 | Drawdown breakers are failsafes, not the primary control (the continuous size haircut is primary) | design |
| V3.6 | Document every dead end so it isn't silently re-run without fresh pre-registration | process |
The codes are stable: a newly-catalogued failure takes the next number and keeps it, so "A6" means the same thing in A backtest you can trust, in the preflight checklist, and here. The Enforcement column is the same prose → safe-API → CI-gate strength ladder the end of this chapter defines: most lessons are still prose (memory), and maturing the stack means dragging each one rightward to "the build won't let you."
How a bug becomes a catalogue entry
The codes don't assign themselves, so it's worth stating the intake path once — generically, because it transfers to any team keeping its own ledger. An entry is minted when an independent path disagrees: an internal re-derivation that won't reproduce a number, or an external adversarial audit that surfaces a failure the original author's tests missed (the recurring theme of this chapter — you don't catalogue your own blind spot by re-reading the code that has it). The finding takes the next free A-code and keeps it; what gets recorded is the lesson and its enforcement state, not just the incident, so the row can later be dragged from prose toward CI gate. The discipline that matters is not the bureaucracy of who assigns the number — it's that the register is append-only and single-source: one place every chapter, test, and fix cites, so "A6" can never quietly come to mean two things.
Family 1. Look-ahead: the result is fiction
These are the bugs where the backtest measured something that could not have been traded. They are the most dangerous family because the equity curves they produce are not merely good, they are plausible: a same-bar leak on an autocorrelated signal manufactures exactly the smooth, persistent P&L a real edge would.
War-story: the signal that earned the bar it was looking at
A regime-rotation strategy chose its position at the close of bar t using a momentum rank computed from data including bar t, then collected bar t's return on that position. Decision and reward lived in the same time window. Because momentum is serially correlated, the leak didn't add noise; it added a beautiful, fake trend. An audit found the identical pattern in four separate places in one codebase, each a signal series multiplied by a contemporaneous return, none caught in review because each read as routine pandas (ret.where(cols == winner)).
Consequence. The stitched out-of-sample inflation was strongly instrument-dependent: small and negative on a noisy series, materially positive on a trending one (illustrative: order of a few percent per year), enough to flip a flat strategy to a deployable one. The rule it bought (A1/A2): any series that multiplies a return must be .shift(1)'d first, unless you can prove the position was knowable strictly before the return's window opened. Treat same-bar position * return as guilty until proven innocent, and verify with a value-corruption causality test, not by eye; the shift discipline is worked end-to-end in A backtest you can trust.
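A minimal sketch of the lag and of one possible value-corruption test, assuming a hypothetical momentum signal. The function names, the 20-bar lookback, and the crude "multiply the future by 0.01" corruption are illustrative choices, not the framework's actual gate. The idea is that the position earning bar t's return must not move when bar t itself (and everything after it) is corrupted.

```python
import numpy as np
import pandas as pd


def momentum_position(close: pd.Series, lookback: int = 20) -> pd.Series:
    """Hypothetical signal: long when trailing momentum is positive, else flat."""
    momentum = close.pct_change(lookback)
    return (momentum > 0).astype(float)


def applied_position(close: pd.Series, lookback: int = 20, lag: int = 1) -> pd.Series:
    """The position that multiplies bar t's return. lag=1 is the fix; lag=0 is the leak."""
    return momentum_position(close, lookback).shift(lag)


def is_causal(position_fn, close: pd.Series, cut: int) -> bool:
    """Corrupt bar `cut` and every later bar, then check that the position
    applied up to and including bar `cut` is unchanged."""
    corrupted = close.copy()
    corrupted.iloc[cut:] = corrupted.iloc[cut:] * 0.01  # crude corruption, one of many possible
    original = position_fn(close).iloc[: cut + 1]
    perturbed = position_fn(corrupted).iloc[: cut + 1]
    return original.equals(perturbed)


def strategy_returns(close: pd.Series, lookback: int = 20) -> pd.Series:
    """Bar-t return earned on the position decided at the close of bar t-1."""
    return (applied_position(close, lookback, lag=1) * close.pct_change()).fillna(0.0)
```

A single corruption at a single cut point can miss a leak that happens not to flip the signal there, so a real gate would presumably sweep several cut points and corruption shapes.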
The subtler cousin of same-bar collect is the bounded leak. That same audit found the inflation depends on position persistence: a strategy that holds until its prediction flips has few transition bars, so the leak is small; a fresh-decision-every-bar strategy can collapse when fixed. The lesson is not "the leak is sometimes harmless" (fix it regardless) but quantify it before you assume it dominates, because the size of the bias tells you whether there was ever an edge underneath.
A second look-ahead shape hides in feature construction. A z-score over the whole series ((x - x.mean()) / x.std()) lets every historical bar "know" statistics from bars that hadn't happened yet; it never touches a return directly, which is why it survives review. The fix is causal or in-sample-frozen normalisation, and the deeper pattern is to make the dangerous version absent from the API: Titan's metrics module offers no full-series z-score, so it cannot be called by accident. Cross-timeframe aggregation is the same trap in a different coat: forward-filling a higher-timeframe signal onto past lower-timeframe bars once turned a true-zero-edge strategy into a phantom with a Sharpe near +2 (illustrative), all of which evaporated once the fill was made causal. The discipline: .shift(1) then reindex-with-ffill, wrapped in a causality assertion.
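The two safe normalisations named above can be sketched as follows. The function names and the `min_periods` default are assumptions for illustration and are not Titan's API. Both versions use only statistics that were knowable at the bar being normalised.

```python
import pandas as pd


def causal_zscore(x: pd.Series, min_periods: int = 20) -> pd.Series:
    """Expanding-window z-score: bar t uses the mean and std of bars 0..t only."""
    mean = x.expanding(min_periods=min_periods).mean()
    std = x.expanding(min_periods=min_periods).std()
    return (x - mean) / std


def frozen_zscore(x: pd.Series, in_sample_end: int) -> pd.Series:
    """IS-frozen z-score: statistics come from the first `in_sample_end` bars and never update."""
    in_sample = x.iloc[:in_sample_end]
    return (x - in_sample.mean()) / in_sample.std()
```

Note that the expanding version includes bar t in its own statistics, which is knowable at the close of bar t. A feature built this way still needs the usual lag before it multiplies a return.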
| Look-ahead bug | What it looks like | The rule it bought |
|---|---|---|
| Same-bar collect (A1/A2) | position * return with no lag | Lag any series before it multiplies a return; value-corruption causality test in CI |
| Future-normalised feature | (x - x.mean()) / x.std() over all time | Only causal / IS-frozen normalisation exists in the API |
| Cross-TF ffill | higher-TF signal ffilled onto past bars | .shift(1) then reindex; wrap in assert_causal |
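The cross-timeframe row can be sketched as below, assuming the higher-timeframe signal is indexed by bar-open timestamps (so the value labelled day D is only known at the close of day D). The helper name `align_higher_tf` is hypothetical; the source's own wrapper is `assert_causal`, whose signature is not shown here.

```python
import pandas as pd


def align_higher_tf(signal_htf: pd.Series, lower_index: pd.DatetimeIndex) -> pd.Series:
    """Expose a higher-timeframe signal to lower-timeframe bars only after its bar closed.

    Assumes `signal_htf` is labelled by bar-open time. Shifting by one bar first means
    the value seen during day D is the one computed at the close of day D-1.
    """
    lagged = signal_htf.shift(1)
    return lagged.reindex(lower_index, method="ffill")
```

If the higher-timeframe series is labelled by bar-close time instead, the shift may be unnecessary or different, which is one reason to back the alignment with a causality assertion rather than a convention.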
War-story: when 'is it a data bug?' is the right first question
A cross-sectional momentum audit came back uniformly negative, contradicting a published edge. Before blaming the strategy, we asked the cheaper question: is the data wrong? It was: the download kept price-only close and dropped total-return adj_close, systematically under-ranking dividend payers. We re-downloaded with total returns and re-ran; the result got more negative, not less. The rule it bought (A3 + the falsification discipline): total-return vs price-only must be explicit in every cost-model and return signature, and for any surprising RETIRE you write two or three data-construction hypotheses and test the cheapest first. Here the fix confirmed the verdict, but had it flipped the sign, we'd have caught a false retirement. The fifteen "wasted" minutes are the insurance that makes a negative result trustworthy.
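One way to make the return basis explicit in a signature is to carry it as a required, typed field and refuse to mix bases. This is a sketch under assumed names (`ReturnBasis`, `ReturnSeries`, `rank_by_cumulative_return` are all hypothetical), not the framework's actual types.

```python
from dataclasses import dataclass
from enum import Enum
from math import prod


class ReturnBasis(Enum):
    PRICE_ONLY = "price_only"
    TOTAL_RETURN = "total_return"


@dataclass(frozen=True)
class ReturnSeries:
    values: tuple[float, ...]
    basis: ReturnBasis  # deliberately no default: the caller must say which it is


def rank_by_cumulative_return(
    series_by_ticker: dict[str, ReturnSeries], *, basis: ReturnBasis
) -> list[str]:
    """Rank tickers by cumulative return, refusing any series on the wrong basis."""
    for ticker, series in series_by_ticker.items():
        if series.basis is not basis:
            raise ValueError(f"{ticker} is {series.basis.value}, expected {basis.value}")
    cumulative = {t: prod(1.0 + r for r in s.values) for t, s in series_by_ticker.items()}
    return sorted(cumulative, key=cumulative.get, reverse=True)
```

With this shape, a download that silently keeps price-only closes cannot be ranked against total-return series without raising.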
Family 2. Validation honesty: the result is luck
The look-ahead family is about whether the number is real. This family is about whether a real number is durable: whether it survives being one of many tries, a thin sample, or a single lucky regime.
The headline failure is the co-committed pre-registration. A scripted git scan of one program found that almost every pre-registration file was first committed in the same commit as its own verdict: no timestamp evidence that the canonical cell, the fold count, or the thresholds were fixed before the run. That doesn't make the verdicts wrong; it makes the entire deflation defence (multiple-testing correction, cross-selection penalty) unverifiable, in both the RETIRE and DEPLOY directions. A pre-registration you can edit after seeing the result is a diary, not a contract.
War-story: the deflation that deflated nothing
A deploying audit fed its Deflated-Sharpe calculation a hand-picked four-cell "plateau" as both the trial count N and the variance-across-trials. The full sweep had been far larger; the deflation was computed as if only four hypotheses had ever been tried, and the optimism went unflagged because the survivors-only mode defaulted to silent. The strategy carried real weight in the live book.
Consequence. Too-weak deflation on a strategy certifying capital. The rule it bought (A5): for any sweep with more than a handful of cells, the DSR's N and trial-variance come from the full screener pool, not the survivors; survivors-only mode must self-flag as optimistic; and across a research program you need a family-wise / FDR control, because fifty independent audits each tested against a fresh 0.95 gate will eventually deploy a luck-survivor.
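To see why the trial count matters, here is a sketch of the Deflated Sharpe Ratio using the approximation usually attributed to Bailey and López de Prado. It assumes per-period (non-annualised) Sharpe ratios and that `trial_sharpe_variance` is measured across the whole pool. The function names and all numbers in the usage note are illustrative, not taken from the audit.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(trial_sharpe_variance: float, n_trials: int) -> float:
    """Approximate expected best Sharpe among n_trials trials with no true skill."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    return sqrt(trial_sharpe_variance) * (
        (1.0 - EULER_GAMMA) * _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
        + EULER_GAMMA * _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    )


def deflated_sharpe(
    sharpe: float,
    n_obs: int,
    skew: float,
    kurtosis: float,
    trial_sharpe_variance: float,
    n_trials: int,
) -> float:
    """Probability that the observed Sharpe exceeds the luck benchmark for n_trials."""
    benchmark = expected_max_sharpe(trial_sharpe_variance, n_trials)
    denom = sqrt(1.0 - skew * sharpe + (kurtosis - 1.0) / 4.0 * sharpe**2)
    return _NORMAL.cdf((sharpe - benchmark) * sqrt(n_obs - 1) / denom)


# Same strategy, same data: only the declared trial count differs.
survivors_only = deflated_sharpe(0.10, 1000, 0.0, 3.0, 0.002, n_trials=4)
full_pool = deflated_sharpe(0.10, 1000, 0.0, 3.0, 0.002, n_trials=200)
```

In this made-up example the survivors-only figure clears a 0.9 gate while the full-pool figure falls below 0.5, which is the A5 failure in miniature. In practice the variance would also differ between the four survivors and the full pool.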
Two quieter failures sit underneath. First, the lower bound that lied: an IID bootstrap resamples bars independently, destroying the serial correlation that trend and carry strategies live on, which narrows the interval and biases the lower bound upward, the exact number you gate on. A stationary block bootstrap fixes it. Second, the sanctuary that couldn't rescue, and the matrix that let it try: holding out the most recent twelve months is mandatory because that year is often anomalously strong, but a positive sanctuary can never lift a strategy whose walk-forward lower bound is negative. A count-of-best-axes decision matrix has no hard veto: a catastrophic hold-out Sharpe of −0.5 scored identically to a mediocre axis, so a strategy that lost money on the hold-out year still earned a non-RETIRE tier. The rule (A4 + the veto): out-of-sample requires per-fold selection or genuine pre-registration, and any "worst" on the sanctuary or CI-lower-bound axis caps the verdict regardless of the other axes.
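The hard veto can be sketched as a check that runs before any axis counting. The tier names other than RETIRE and DEPLOY, the field names, and the thresholds are hypothetical. Here the cap is simply RETIRE, which is one possible reading of "caps the verdict".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    wf_ci_lo: float          # walk-forward Sharpe, bootstrap CI lower bound
    sanctuary_sharpe: float  # Sharpe on the held-out most recent year
    axes_passed: int         # how many decision-matrix axes scored well
    axes_total: int


def verdict(ev: Evidence) -> str:
    """Hard veto first: a bad sanctuary or CI lower bound cannot be outvoted."""
    if ev.wf_ci_lo <= 0.0 or ev.sanctuary_sharpe <= 0.0:
        return "RETIRE"
    share = ev.axes_passed / ev.axes_total
    if share >= 0.8:      # illustrative threshold
        return "DEPLOY"
    if share >= 0.5:      # illustrative threshold
        return "WATCH"    # hypothetical middle tier
    return "RETIRE"
```

Under a pure count-of-best-axes matrix, a strategy with five good axes and a sanctuary Sharpe of −0.5 would score well; with the veto first it cannot.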
Negative results are output, not waste
A failed audit is data. One research cycle produced dozens of nulls: IC censuses with zero survivors, strategies retired at the plateau gate, confluence tests where gating destroyed the signal, each logged with its failure mechanism and the decision rule applied. The rule it bought (V3.6): document the dead end so the next researcher (or you, in six months) doesn't re-run it without fresh pre-registration. A catalogue of what doesn't work is as load-bearing as the list of what does. Whole families of pre-2014 published edges now get framed as falsification tests, not replication targets, precisely because the catalogue made the decay pattern visible.
Family 3. Tail and cost realism: the result is fragile
A Sharpe can be real, durable, and still describe a strategy you cannot survive. This family is about the numbers that decide livability: drawdown geometry, tail loss, and the costs that quietly eat the edge.
A Sharpe that cleared every statistical gate still hid a double-digit drawdown lasting more than a year, one no committee would hold through. The rule it bought: Calmar lift (not Sharpe lift) is the primary promotion metric. The full telling is in the metric suite and tail risk & ruin.
The cost side is its own graveyard. A pure bps_per_turnover model under-prices reality the moment notional is small, the broker charges a per-fill commission floor, or a vol-target overlay produces many tiny daily rebalances. A live cost audit on a real paper account found the true drag was several times the modelled drag (illustrative: a low-double-digit bps/yr estimate that grew to roughly five times that once the gaps were closed): the model had missed the commission floor, mis-calibrated the ETF leg, and counted sub-threshold rebalances the live class would have skipped. It didn't flip the verdict, but it turned a comfortable margin into a thin one. Continuous futures legs are worse: an always-on position pays nothing in most models for mandatory quarterly rolls (on the order of tens of bps/yr per instrument), and research exits at the clean close while live stops gap through it.
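The gap between a pure bps model and a model with a per-fill floor can be sketched in a few lines. The fee levels, trade sizes, and function names are invented for illustration and do not come from the audit described above.

```python
def modelled_cost(trades_notional: list[float], bps: float = 1.0) -> float:
    """Pure bps-per-turnover model: cost is proportional to traded notional."""
    return sum(abs(n) * bps / 1e4 for n in trades_notional)


def realistic_cost(
    trades_notional: list[float],
    bps: float = 1.0,
    commission_floor: float = 1.0,
    min_trade: float = 0.0,
) -> float:
    """Adds a per-fill commission floor and skips rebalances below `min_trade`,
    mirroring a live class that ignores sub-threshold changes."""
    total = 0.0
    for n in trades_notional:
        if abs(n) < min_trade:
            continue  # the live class would not have sent this order
        total += max(abs(n) * bps / 1e4, commission_floor)
    return total


# A vol-target overlay producing 250 tiny daily rebalances of 500 notional each.
tiny_rebalances = [500.0] * 250
```

On these numbers the bps model charges 0.05 per trade while the floor charges 1.00, so the floor dominates as soon as trades are small and frequent. The skip threshold pulls the other way, which is why both have to be modelled.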
A tail gate resampled the strategy's own returns instead of the underlying's, which only re-confirmed the realised path. The rule it bought (A6): Monte Carlo perturbs inputs, not outputs, so resample the underlying's returns and re-run the strategy on each path. Long-only sleeves also need a relative (vs buy-and-hold) MaxDD gate. The full mechanism is in tail risk & risk of ruin.
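A sketch of "perturb the inputs, re-run the strategy", using only the standard library. The resampler is a circular fixed-length block resample, a simpler stand-in for a stationary block bootstrap, and `trend_strategy` is a hypothetical placeholder for whatever strategy is under test.

```python
import random


def max_drawdown(returns: list[float]) -> float:
    """Most negative peak-to-trough move of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def block_resample(returns: list[float], block: int, rng: random.Random) -> list[float]:
    """Circular block resample of the UNDERLYING's returns, preserving short-range dependence."""
    n, out = len(returns), []
    while len(out) < n:
        start = rng.randrange(n)
        out.extend(returns[(start + i) % n] for i in range(block))
    return out[:n]


def trend_strategy(underlying: list[float], lookback: int = 20) -> list[float]:
    """Hypothetical strategy: long when the trailing return sum (bars before t) is positive."""
    out = []
    for t in range(len(underlying)):
        window = underlying[max(0, t - lookback):t]
        position = 1.0 if window and sum(window) > 0 else 0.0
        out.append(position * underlying[t])
    return out


def tail_max_drawdown(underlying, strategy, n_paths=200, block=10, q=0.05, seed=0) -> float:
    """Re-run the strategy on each resampled path and return the q-quantile drawdown."""
    rng = random.Random(seed)
    drawdowns = sorted(
        max_drawdown(strategy(block_resample(underlying, block, rng))) for _ in range(n_paths)
    )
    return drawdowns[int(q * n_paths)]
```

The point of the design is that the strategy function is called inside the loop: each path can produce trades, and drawdowns, that the realised history never contained.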
A close relative of the MC bug is the overlay applied to the wrong layer. It is tempting to model a position-scaling overlay (vol-targeting, a Kelly haircut, a regime gate) by computing the base strategy's returns and then post-multiplying by a scale series. That is wrong whenever scaling changes turnover, costs, or the interaction with stops, which it almost always does. A 0.5× scale isn't half the return; it's a different trade with different fills.
Position scaling changes the engine, not the output
The rule (A7): a scaling overlay must be wired into the backtest engine so it changes the positions the engine actually trades (and therefore the costs, the fills, and the drawdown path), not bolted on as a multiplier after the P&L is computed. The same discipline catches a related class of fragility: bare-threshold regime gates (signal >= K) flip on noise near the boundary and fail an input-noise robustness test. Prefer percentile gates, ensembles, or continuous (sigmoid) scaling, and remember that input-noise robustness and parameter-plateau robustness are different tests; a strategy can pass one and fail the other.
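The difference between the two layers shows up as soon as costs depend on what is actually traded. The sketch below uses a toy proportional cost and hypothetical function names; `positions` is taken to be the already-lagged position held over each bar.

```python
def pnl_post_multiplied(positions, returns, scale, cost_per_unit):
    """WRONG layer: net base P&L is computed first, then multiplied by the scale."""
    out, prev = [], 0.0
    for p, r, s in zip(positions, returns, scale):
        out.append(s * (p * r - abs(p - prev) * cost_per_unit))
        prev = p
    return out


def pnl_in_engine(positions, returns, scale, cost_per_unit):
    """RIGHT layer: the scale changes the traded position, so turnover and cost change too."""
    out, prev = [], 0.0
    for p, r, s in zip(positions, returns, scale):
        traded = s * p
        out.append(traded * r - abs(traded - prev) * cost_per_unit)
        prev = traded
    return out


# A buy-and-hold base position with an overlay that flips between 1.0x and 0.5x.
positions = [1.0] * 5
returns = [0.0] * 5            # zero returns isolate the cost effect
scale = [1.0, 0.5, 1.0, 0.5, 1.0]
```

With these inputs the base strategy trades once, so the post-multiplied version pays for one unit of turnover. The in-engine version pays for the initial unit plus half a unit on every later bar, because the overlay itself generates trades.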
Family 4. Deployment parity: the result isn't what runs
The most expensive gap in a quant stack is not in the research; it is between research and the live code. You can pass every gate in Part II and still lose money because the thing that trades is not the thing you validated.
War-story: the strategy guide that didn't match the deployed config
A live strategy's user guide documented one set of parameters; the deployed configuration ran another, unreconciled after a tuning pass. The guide, the document an operator reaches for at 2 a.m. during an incident, described a system that wasn't running. The rule it bought (A9): the strategy guide must match the deployed config parameter-for-parameter, verified, not trusted. A guide that disagrees with the live .toml is worse than no guide; it actively misleads the person trying to fix a live position.
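"Verified, not trusted" can be as small as a diff that fails a test. This sketch assumes the guide's parameters and the deployed config have both been loaded into plain dictionaries (for instance the latter parsed from the live .toml). The function names and the sample keys are hypothetical.

```python
MISSING = "<missing>"


def flatten(cfg: dict, prefix: str = "") -> dict:
    """Flatten nested tables into dotted keys so every parameter is compared."""
    flat = {}
    for key, value in cfg.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat


def config_drift(documented: dict, deployed: dict) -> dict:
    """Return {parameter: (documented, deployed)} for every parameter that disagrees."""
    doc, live = flatten(documented), flatten(deployed)
    return {
        key: (doc.get(key, MISSING), live.get(key, MISSING))
        for key in sorted(doc.keys() | live.keys())
        if doc.get(key, MISSING) != live.get(key, MISSING)
    }
```

A CI test would then assert that `config_drift(...)` is empty, which also catches a parameter that exists on only one side.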
The deeper parity problem is that "validated" often means "validated something." An external audit found a live equity sleeve built on best-of-N current index constituents, the exact survivorship-plus-selection construction a previous strategy had been retired for, in production with docstring Sharpes the audit itself called "implausibly high." The validation existed; it just didn't validate the thing that shipped.
War-story: 'validated against X' with no artifact
A claim that a strategy had been "validated against" a reference appeared in a docstring with no checked-in artifact and no reproducible test behind it. Nothing to re-run, nothing to diff, nothing to fail in CI. The claim was load-bearing for a deployment decision and unfalsifiable.
Consequence. A deployment justified by a sentence rather than a test. The rule it bought (A8): "validated against X" requires a checked-in artifact plus a reproducible test that regenerates it. If you cannot point to a file and a command that re-derives the comparison, the validation does not exist. Same instinct as the frozen-ML-artefact rule: a model file is a build artefact of its feature pipeline, so embed the feature names, assert compatibility on load, and put a one-row prediction in CI, or the only warning you get is a crash on the first live bar.
The parity test itself has a precise shape, and getting it wrong gives false comfort. It is not enough to check that the live signal exists; you must check that the live on_bar signal at t equals the vectorised research signal at t, computed by an independent reference down the full chain (data load, feature build, signal, sizing), with a causality test confirming the live path can't see the future either. A parity test that re-uses the research code as its own reference proves only that the code equals itself.
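The shape of such a test, reduced to the signal step only, might look like this. The moving-average crossover, the class name `LiveStrategy`, and the window lengths are hypothetical. What matters is that the bar-by-bar path and the vectorised path share no code, so agreement between them is evidence rather than tautology.

```python
from collections import deque

import pandas as pd


def research_signal(close: pd.Series, fast: int = 3, slow: int = 5) -> pd.Series:
    """Vectorised research path: 1 when the fast mean is above the slow mean."""
    return (close.rolling(fast).mean() > close.rolling(slow).mean()).astype(int)


class LiveStrategy:
    """Independent bar-by-bar path with its own buffer and arithmetic."""

    def __init__(self, fast: int = 3, slow: int = 5) -> None:
        self.fast, self.slow = fast, slow
        self.buffer = deque(maxlen=slow)

    def on_bar(self, close: float) -> int:
        self.buffer.append(close)
        if len(self.buffer) < self.slow:
            return 0  # still warming up
        window = list(self.buffer)
        fast_mean = sum(window[-self.fast:]) / self.fast
        slow_mean = sum(window) / self.slow
        return int(fast_mean > slow_mean)


def parity_mismatches(close: pd.Series) -> list[int]:
    """Bar indices where the live signal differs from the research signal."""
    live = LiveStrategy()
    live_signal = [live.on_bar(c) for c in close]
    reference = research_signal(close).tolist()
    return [t for t, (a, b) in enumerate(zip(live_signal, reference)) if a != b]
```

A full harness would extend the same comparison down the chain (sizing, then integer units) and pair it with a causality test on the live path. Near-ties between the two means are also worth handling deliberately, since the two paths accumulate floating-point error differently.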
War-story: the leveraged instrument sized as if it weren't
A live strategy class sized its position in instrument units without accounting for an instrument that carried native leverage: the contract already embedded a multiplier the research-side sizing had folded in differently. The live class needed to scale by tier to match the validated exposure; it didn't, so the live position didn't match the size the research had risk-checked.
Consequence. Live exposure diverging from validated exposure on a leveraged instrument, the most dangerous place for a sizing error, because leverage multiplies the loss and raises the ruin probability with it. The rule it bought (A10/A11): parity tests run end-to-end through the full chain with an independent reference and an explicit causality test, and the live class must scale by tier whenever the instrument has native leverage. Sizing is not a detail you eyeball; it is a contract the parity test enforces.
War-story: the two safeguards that agreed with each other and were both wrong
To stop the live system from running the wrong strategy set, two structural checks were added: a pre-flight gate and a monitoring dashboard, each carrying its own copy of "what the live bundle contains." A later audit found that copy had drifted from the authoritative registry: both the gate and the dashboard listed strategies that the runner had retired months earlier. Crucially, the two checks were mutually consistent — they validated against each other — so both reported green while describing a portfolio that wasn't running. A safeguard that checks one belief against a second copy of the same belief proves only that you wrote it down twice.
Consequence. The operator-facing tooling (and an audit reading it) asserted strategies were trading that weren't, and could not tell which were live. The rule it bought: every derived view of the deployment — pre-flight gate, dashboard, docs — must validate against the single authoritative source (the registry the runner actually reads), enforced by one test that fails CI when any of them diverge. Two safeguards that agree with each other are one safeguard with redundancy theatre. The same audit found a pre-flight check parsing the wrong schema for the halt file, so an active kill-switch halt passed the check as "no halt": your safety checks are code, and untested safety-check code fails exactly when you need it.
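The "one test against the authoritative source" rule can be sketched as a set comparison. The view names and strategy ids below are invented; the registry is assumed to be loadable as a set of strategy ids from whatever file the runner actually reads.

```python
def stale_views(registry: set[str], views: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Compare every derived view with the registry, never with another view."""
    problems = {}
    for name, listed in views.items():
        extra, missing = listed - registry, registry - listed
        if extra or missing:
            problems[name] = {"not_in_registry": extra, "missing_from_view": missing}
    return problems


# Two views that agree with each other and are both stale.
registry = {"trend_a", "carry_b"}
views = {
    "preflight_gate": {"trend_a", "carry_b", "retired_c"},
    "dashboard": {"trend_a", "carry_b", "retired_c"},
}
```

Checking the gate against the dashboard would pass here. Checking each against the registry flags both, which is the failure the war-story describes.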
War-story: the dormant strategy whose first live act was a crash
A strategy sat in the live bundle but never traded: a warm-up gate it could not satisfy (no historical backfill) kept it short of the bar count its indicators needed, so for roughly a year its on_bar returned early every day. It looked deployed. When the gate was finally fixed and the strategy reached its trading path for the first time, it crashed on the first bar — a method called on a framework type that didn't carry it, a latent AttributeError that every live signal would have hit, sitting undisturbed because no live signal had ever arrived. The validation suite was green throughout; it tested the signal math, not the order path the strategy had never executed.
Consequence. A strategy believed deployed-and-validated whose entire trading path was unproven, primed to fail on activation. The rule it bought: code that never runs is not validated — a strategy that cannot reach its order path is in an unknown state, not a safe one. Treat activation (or un-dormanting) as a deployment event with its own gate, and unit-test the order path directly rather than inferring it from the signal test. A green dashboard is not the same as an exercised code path.
War-story: the close that closed nothing
After a restart, the trading framework reconciled the broker's open positions back into its cache and tagged each one with an internal owner id (a reconciliation placeholder), not the strategy that had opened it. The strategy's exit, halt-flatten, and shutdown paths all called a strategy-scoped "close my positions" helper — which filters by the strategy's own id, found nothing under it, and logged a cheerful "no positions to close" while the inventory rode the market unmanaged. The position was visible everywhere; it just wasn't owned by the code trying to close it. The strategy then saw itself as flat-but-still-long and could neither exit nor re-enter.
Consequence. On every restart with an open position, the kill-switch flatten and the signal exit became silent no-ops — the worst failure mode for a control you are counting on. The rule it bought: claim ownership of reconciled positions explicitly (the framework offered an unused external_order_claims hook that re-tags them to the right strategy on reconciliation), and assert post-restart that your flatten path actually reaches the broker's net position. A close path that filters by owner must be tested against the way the framework labels positions after a restart, not the way your code labelled them before one.
War-story: the guard that reset on every restart
A daily momentum sleeve held each position for a minimum number of bars before it would consider an exit — a hold guard that lived as an in-memory bar counter. On restart the framework correctly re-adopted the open position (the fix from the previous war-story), but seeded that counter to already past the minimum, on the reasoning that a re-adopted position of unknown age should at least be exitable. The reasoning was local and the consequence was global: the system restarted constantly — a watchdog on every wedged data feed, a redeploy on every change — so each restart reset the hold clock to "satisfied." A position entered four bars ago, against a five-bar minimum, became exit-eligible the instant a restart intervened, and the next sub-threshold bar closed it a day early. The guard read as armed in every code review and was a no-op in production — defeated not by a logic error but by the operational reality of uptime.
The deeper version of the same bug lived in a regime-tiered strategy's drawdown circuit breaker. Its state — throttled-or-killed, the consecutive-quiet-bars recovery counter, and the high-water mark the drawdown was measured against — was all in memory, and a fresh process reset it to normal, recovery zero, high-water-mark back to seed capital. A restart while the breaker was throttling exposure in a real drawdown therefore cleared the breaker and blinded it to the loss at once: equity reset to par, drawdown read as zero, and the strategy re-levered straight back into the decline the breaker existed to escape.
Consequence. Two risk controls — a minimum-hold and a drawdown breaker — that looked live and were silently disarmed by the one event that happens most often: a restart. The rule it bought: any guard whose decision depends on elapsed time or a path-dependent counter (minimum-hold, cooldown, re-entry-quiet, circuit-breaker dwell, the high-water mark a breaker measures against) must be persisted and restored across restarts, reconstructed from a durable store rather than re-seeded to a default. An in-memory guard in a system that restarts frequently is not a guard; it is a guard-shaped object that works only as long as the process does. Persist the true state on each bar, restore it on reconciliation, and fall back to the conservative default only when there is genuinely nothing to restore — concretely, on each bar write the guard's state (entry bar, bars held, high-water mark) to a durable store, and read it back during reconciliation before the strategy acts. The persisted halt file and durable hold-state that implement this pattern in Titan live in Layered safety.
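The persist-and-restore pattern can be sketched with a JSON file as the durable store. The field names, the file format, and the helper names are assumptions for illustration; the actual halt file and hold-state in Titan are described in Layered safety and may look different.

```python
import json
import os
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class GuardState:
    bars_held: int = 0
    high_water_mark: float = 0.0
    breaker: str = "normal"  # e.g. "normal", "throttled", "killed"


def save_state(path: Path, state: GuardState) -> None:
    """Write on every bar. Temp file plus os.replace avoids a half-written file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(asdict(state)))
    os.replace(tmp, path)


def load_state(path: Path, conservative_default: GuardState) -> GuardState:
    """Restore during reconciliation; use the default only when nothing was persisted."""
    if not path.exists():
        return conservative_default
    return GuardState(**json.loads(path.read_text()))
```

What counts as the conservative default is a per-guard decision left to the caller here. The important property is that a restart reads back "four bars held, breaker throttled" instead of inventing a fresh state.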
For the full treatment (the contract a strategy class must satisfy, and the replay harness that certifies the live path against the proven-causal research path), see Live equals research and The strategy-class contract.
Family 5. Selection discipline: the result was cherry-picked
The final family is about the choices you make around the backtest: which parameter, which instrument, which control is doing the real work. These are bugs of an honest researcher's optimism rather than a coding mistake, which makes them the hardest to see.
A canonical cell posted a strong Sharpe while a one-step neighbour in the same grid dropped by half: a lucky coordinate, not an edge. The rules it bought (V3.1 + V3.2): pre-commit the selection rule, and select the plateau, not the peak. The plateau check is cheap, so run it before any Monte Carlo. This is the qualitative twin of Beating your own optimiser, the quantitative form of the same concern.
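One simple way to encode "plateau, not peak" is to score each cell by the worst Sharpe in its one-step neighbourhood and pick the best such floor. This is a sketch of that idea on a two-parameter integer grid, not the book's pre-committed rule. The function name and the sample grid are hypothetical.

```python
def plateau_cell(grid: dict[tuple[int, int], float]) -> tuple[int, int]:
    """Pick the cell whose worst value among itself and its one-step neighbours is highest."""

    def neighbourhood_floor(cell: tuple[int, int]) -> float:
        i, j = cell
        neighbours = [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
        return min(grid[c] for c in neighbours if c in grid)

    return max(grid, key=neighbourhood_floor)


# A 5x5 grid: an isolated spike at (0, 0) and a flat region of decent cells.
grid = {(i, j): 0.1 for i in range(5) for j in range(5)}
grid[(0, 0)] = 2.0
for i in (2, 3, 4):
    for j in (2, 3, 4):
        grid[(i, j)] = 1.0
```

Edge cells have fewer neighbours and so get an easier floor, which a real rule would need to handle. The point of the sketch is only that the spike at (0, 0) loses to any cell inside the flat region.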
The single-instrument version is the unrecorded search. When a legacy config names one instrument from a class of plausible candidates (trend on this ticker, carry on that pair), the chosen instrument is almost certainly the survivor of a search nobody wrote down, and its in-sample Sharpe overstates the true cross-sectional edge by an unknown order statistic. The rule: always run a multi-instrument robustness panel, because a single named instrument is a confession of selection bias until proven otherwise.
A V3.4 ablation (turning each component off one at a time) caught a "ballast" piece, a defensive switch assumed to be along for the ride, actually carrying the edge. The lesson: prove which components matter with an ablation rather than assuming, and because ballast earns its keep in the left tail, judge it on p_kill_trip, not a return metric (see tail risk & risk of ruin). The same lens reframed our drawdown breakers: they are failsafes, not primary controls (V3.5). A breaker stacked on a vol-targeted strategy fires rarely and occasionally clips a recovery: useful as a last line, harmful as the main risk system. The primary control is the continuous size haircut; the breaker is the seatbelt, not the steering.
Together this family is the difference between we chose this and this is what survived our suspicion: a pre-committed selection rule (V3.1), a plateau over a peak (V3.2), a recorded instrument search, an ablation to find the load-bearing parts (V3.4), failsafes kept in their place (V3.5), and every dead end documented (V3.6).
The catalogue at a glance
The chapter is a set of stories; a catalogue should also be scannable. Here is every war-story above in one table: symptom, family, the rule it bought, and how strongly that rule is enforced today (the prose → API → CI ladder the next section formalises). IDs reuse the A-code / V-era from the register at the top of this chapter where one exists, and take a local tag (Z1, C1 to C5) otherwise. Scan the Enforcement column for the work that remains: every prose row is a lesson still waiting to become a gate.
| ID | War-story (symptom) | Family | Rule it bought | Enforcement |
|---|---|---|---|---|
| A1/A2 | signal earned the bar it was looking at — a smooth, fake trend | 1 Look-ahead | lag any series before it multiplies a return | CI gate |
| A3 | uniform-negative IC was a price-only data bug, not a dead edge | 1 Look-ahead | TR vs price-only explicit; test the cheap data hypothesis first | prose + signatures |
| Z1 | full-series z-score let every bar see the future | 1 Look-ahead | only causal / IS-frozen normalisation exists in the API | safe-by-construction |
| A5 | DSR fed a hand-picked 4-cell plateau, not the full sweep | 2 Honesty | N and trial-variance from the full pool; survivors-only self-flags | prose + API flag |
| A4 | a −0.5 sanctuary year still earned a non-RETIRE tier | 2 Honesty | per-fold OOS + a hard sanctuary / CI_lo veto | API veto + process |
| V3.6 | dozens of nulls nearly re-run for want of a record | 2 Honesty | document every dead end | process |
| A7 | a scaling overlay post-multiplied onto the P&L | 3 Fragility | wire the overlay into the engine so it changes the trades | prose |
| A6 | tail gate resampled the strategy's own realised returns | 3 Fragility | resample the underlying and re-run the strategy | prose |
| A9 | strategy guide described a config that wasn't running | 4 Parity | guide matches deployed config, verified | prose (→CI) |
| A8 | "validated against X" with no artifact behind it | 4 Parity | artifact + reproducible test, or the validation doesn't exist | prose (→CI) |
| A10/A11 | a leveraged instrument sized as if it weren't | 4 Parity | parity end-to-end to integer units; tier-scale native leverage | parity harness |
| C1 | two safeguards validated each other, both months stale | 4 Parity | every derived view checks the authoritative source; one CI test | prose (→CI) |
| C2 | a dormant strategy crashed on its first-ever live bar | 4 Parity | code that never runs is unvalidated; gate activation; test the order path | prose |
| C3 | strategy-scoped flatten no-oped on reconciled positions | 4 Parity | claim ownership of reconciled positions; assert flatten reaches the broker | prose |
| C4 | in-memory min-hold and breaker reset on every restart | 4 Parity | persist & restore any time- or path-dependent guard state | prose |
| V3.1/V3.2 | a canonical cell beat its one-step grid neighbour | 5 Selection | pre-commit the rule; pick the plateau, not the peak | process |
| C5 | a single named instrument = an unrecorded search | 5 Selection | always run a multi-instrument robustness panel | process |
| V3.4 | a "ballast" component was secretly carrying the edge | 5 Selection | ablate to find the load-bearing parts; judge ballast on p_kill | process |
The shape of the work is the right-hand column: four CI gate / API rows are closed (the bug cannot recur), and a dozen prose rows are open — known, named, but still relying on someone remembering. That gradient is the chapter's real thesis, made literal in the next section.
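To make a closed row concrete, here is a minimal sketch of the Z1 rule and the kind of test that can guard it. The names `expanding_zscore`, `full_series_zscore` and `is_causal` are illustrative assumptions, not the framework's actual API, and the full-series version appears only as the counter-example. The test corrupts every bar after a cut point and demands that no output at or before the cut moves.

```python
import math
from statistics import mean, pstdev


def expanding_zscore(values, min_periods=2):
    """Causal z-score: bar t is normalised with bars 0..t only."""
    out = []
    for t in range(len(values)):
        window = values[: t + 1]
        if len(window) < min_periods:
            out.append(math.nan)
            continue
        sd = pstdev(window)
        out.append((values[t] - mean(window)) / sd if sd > 0 else math.nan)
    return out


def full_series_zscore(values):
    """The Z1 bug: every bar is normalised with the whole series, future included."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def is_causal(transform, values, cut):
    """Value-corruption test: corrupt bars after `cut`; outputs up to `cut` must not move."""
    corrupted = list(values[: cut + 1]) + [v * 1000.0 + 7.0 for v in values[cut + 1:]]
    before = transform(values)[: cut + 1]
    after = transform(corrupted)[: cut + 1]
    return all(
        (math.isnan(a) and math.isnan(b)) or a == b
        for a, b in zip(before, after)
    )


# Hypothetical toy series, for illustration only.
prices = [1.0, 2.0, 4.0, 3.0, 5.0, 8.0, 6.0, 7.0]
causal_ok = is_causal(expanding_zscore, prices, cut=4)
leaky_ok = is_causal(full_series_zscore, prices, cut=4)
```

The expanding version passes and the full-series version fails, because corrupting the future shifts the mean and standard deviation that every earlier bar was scored against.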
How the catalogue becomes a gate
A war-story you only tell at the bar changes nothing. The point of cataloguing each failure is to convert it into something that can't recur silently. The conversions take three forms, in order of strength:
| Form | Strength | Example |
|---|---|---|
| A prose lesson with a mechanism | weak: relies on memory | "remember the carry premium is yield × time-in-market / vol, not yield / vol" |
| A safe-by-construction API | strong: the wrong call doesn't exist | no full-series z-score; periods_per_year has no default |
| An automated CI gate | strongest: the bug fails the build | AST check rejecting sqrt(252); value-corruption causality test |
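The safe-by-construction row can be sketched in a few lines. The function name `annualised_sharpe` is a hypothetical stand-in; only the `periods_per_year` parameter comes from the table above. Making it keyword-only with no default means a caller cannot silently inherit a daily convention for hourly or weekly bars, because the call that omits it raises instead of running.

```python
import math
from statistics import mean, stdev


def annualised_sharpe(returns, *, periods_per_year):
    """Annualise a per-bar Sharpe ratio.

    `periods_per_year` is keyword-only and has no default, so omitting it
    is a TypeError rather than a quietly wrong number.
    """
    if periods_per_year <= 0:
        raise ValueError("periods_per_year must be positive")
    return mean(returns) / stdev(returns) * math.sqrt(periods_per_year)


# Hypothetical per-bar returns, for illustration only.
returns = [0.01, -0.005, 0.02, 0.0, -0.01]
daily = annualised_sharpe(returns, periods_per_year=252)
weekly = annualised_sharpe(returns, periods_per_year=52)

try:
    annualised_sharpe(returns)  # the wrong call does not exist
    omitted_call_rejected = False
except TypeError:
    omitted_call_rejected = True
```

The same returns give different annualised figures depending on the bar frequency, which is exactly why the function refuses to guess.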
The honest assessment, delivered by an external auditor, is that most lessons in a young program are still prose, the weakest form — and that claim is itself backed by a number, not a hand-wave: of the war-stories in the catalogue table above, four are closed by a safe-by-construction API or a CI gate (the bug cannot recur) and roughly a dozen still live only as prose or process. The work of maturing a stack is dragging each catalogued failure up that table: from "we know not to do that" to "you literally cannot do that." Every bug here that became an absent API call or a CI gate cannot reappear; every one still living as a paragraph in a directive is waiting to be rediscovered by whoever didn't read it.
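As an illustration of the strongest form, the AST gate could look something like the sketch below. This is an assumption about its shape, not the framework's actual check: `find_hardcoded_annualisation` is a hypothetical name, and it only catches the direct call form `sqrt(252)` or `<module>.sqrt(252)`, not spellings such as `252 ** 0.5`.

```python
import ast


def find_hardcoded_annualisation(source, banned=(252,)):
    """Return the line numbers of sqrt(<banned literal>) calls in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call) or len(node.args) != 1:
            continue
        func = node.func
        # Matches both `sqrt(...)` and `<anything>.sqrt(...)`, e.g. math.sqrt.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        arg = node.args[0]
        if name == "sqrt" and isinstance(arg, ast.Constant) and arg.value in banned:
            hits.append(node.lineno)
    return sorted(hits)


bad = "import math\nsharpe = mu / sd * math.sqrt(252)\n"
good = "import math\nsharpe = mu / sd * math.sqrt(periods_per_year)\n"

bad_lines = find_hardcoded_annualisation(bad)
good_lines = find_hardcoded_annualisation(good)
```

Wired into CI as a test that fails when the returned list is non-empty, a check like this turns "remember not to hard-code the annualisation factor" into a build failure.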
The discipline is not free, and you cannot build it all at once
Cataloguing every failure as a gate makes the framework strict, and strict gates have a cost the war-stories above understate. Some of them will veto a marginal-but-real edge: a full-screener-pool DSR N deflates an honest result alongside the lucky ones, a hard relative-MaxDD veto can bin a strategy whose tail is ugly but survivable, and the sanctuary cap will RETIRE something that genuinely earned its keep in every other year. That asymmetry is deliberate — a false retirement costs opportunity, a false deployment costs capital — but pretending the gates only ever catch fakes is its own optimism. And the persistence machinery from the parity family is real engineering work, not a one-liner: every path-dependent guard needs a durable store, a restore-on-reconciliation path, and a test that the restore actually fires. A solo builder cannot stand all of this up at once, so adopt it in order of leverage: shift/causality and parity first (the bugs that make the number outright fiction), then deflation and the tail gates (the ones that decide whether a real number is durable and survivable), then the persistence and ownership machinery (the ops failures that only bite once you are live). Each layer earns the next.
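The three pieces of persistence machinery named above (a durable store, a restore path, and a test that the restore fires) can be sketched for a single minimum-hold guard. Everything here is an assumption made for illustration: the class name `MinHoldGuard`, the JSON file standing in for the durable store, and the timestamps. A real stack would more likely use its own database and reconciliation hooks.

```python
import json
import os
import tempfile
import time


class MinHoldGuard:
    """Minimum-hold guard whose entry time survives a restart (hypothetical sketch)."""

    def __init__(self, path, min_hold_seconds):
        self.path = path
        self.min_hold_seconds = min_hold_seconds
        self.entered_at = self._restore()

    def _restore(self):
        # Read the true entry time back from disk; never re-seed it to "now".
        if not os.path.exists(self.path):
            return None  # no entry on record
        with open(self.path) as fh:
            return json.load(fh)["entered_at"]

    def record_entry(self, now=None):
        self.entered_at = time.time() if now is None else now
        # Write to a temp file, then rename, so a crash cannot leave a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"entered_at": self.entered_at}, fh)
        os.replace(tmp, self.path)

    def may_exit(self, now=None):
        if self.entered_at is None:
            return True
        now = time.time() if now is None else now
        return now - self.entered_at >= self.min_hold_seconds


# Simulate a restart: a second instance must see the first instance's entry time.
store = os.path.join(tempfile.mkdtemp(), "min_hold.json")
guard = MinHoldGuard(store, min_hold_seconds=3600)
guard.record_entry(now=1000.0)
restarted = MinHoldGuard(store, min_hold_seconds=3600)
```

The last three lines are the part that tends to get skipped: constructing a fresh instance against the same store is the cheapest available test that the restore path actually runs.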
Exercises
- Why a catalogue, not a checklist. Why does the book keep a catalogue of failure families rather than a tidy list of best practices?

    ??? success "Answer"
        Best practices are advice; a catalogued failure mode is a gate. The same bug class recurs forever if you only fix instances: a researcher who never saw the same-bar leak writes it again next quarter in different pandas. And note who catches them: mostly not the author, because you don't find your own look-ahead by re-reading the code that contains it.

- The bounded leak. A same-bar leak inflated one strategy a lot and another barely. What explains the difference, and what's the lesson?

    ??? success "Answer"
        Position persistence: a hold-until-flip strategy has few transition bars, so the leak is small; a fresh-decision-every-bar strategy collapses when fixed. Lesson: fix the leak regardless, but quantify the bias, because its size tells you whether there was ever an edge underneath.
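The bounded-leak answer can be checked with arithmetic. The sketch below uses hypothetical toy data (returns in integer basis points so the sums are exact) and hypothetical helper names. It relies on one identity: the leaky P&L minus the correctly shifted P&L equals the sum, over bars where the signal changes, of the change in signal times that bar's return. Fewer transitions therefore means a smaller possible bias.

```python
def shift_one_bar(signal):
    """Causal execution: the position held over bar t is the signal known at t-1."""
    return [0] + list(signal[:-1])


def pnl(position, returns):
    return sum(p * r for p, r in zip(position, returns))


def leak_bias(signal, returns):
    """P&L overstatement from applying bar t's signal to bar t's own return."""
    return pnl(signal, returns) - pnl(shift_one_bar(signal), returns)


def transition_bars(signal):
    """Bars where the signal differs from the previous bar (bar 0 leaves flat)."""
    previous = shift_one_bar(signal)
    return sum(1 for s, p in zip(signal, previous) if s != p)


# Hypothetical bar returns in basis points.
returns_bp = [100, -50, 150, 100, -200, -100, 50, -150]

# Fresh decision every bar, peeking at the bar it trades.
fresh = [1 if r > 0 else -1 for r in returns_bp]
# Hold-until-flip: also peeks, but only changes at bar 0 and bar 4.
held = [1, 1, 1, 1, -1, -1, -1, -1]

fresh_bias = leak_bias(fresh, returns_bp)   # 1300 bp across 6 transition bars
held_bias = leak_bias(held, returns_bp)     # 500 bp across 2 transition bars
fresh_fixed = pnl(shift_one_bar(fresh), returns_bp)  # -400 bp once shifted
held_fixed = pnl(shift_one_bar(held), returns_bp)    # 200 bp once shifted
```

On this toy series the every-bar strategy goes from +900 bp to −400 bp once the shift is applied, while the hold-until-flip strategy goes from +700 bp to +200 bp: both were flattered, but only one had an edge underneath.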
Takeaways
- Bugs cluster into five families: look-ahead (fiction), validation honesty (luck), tail/cost realism (fragile), deployment parity (not what runs), selection discipline (cherry-picked). Diagnose by family, not by symptom.
- Every error here flattered the strategy. That asymmetry is why suspicion beats celebration: optimistic versions are the ones that survive your attention and reach production.
- You don't catch your own leak by re-reading your code; you catch it with an independent path that disagrees. Internal re-derivations and external adversarial audits found most of these, not the authors.
- Parity is where research meets capital. A guide that doesn't match the config, a "validation" with no artifact, a leveraged instrument sized as if it weren't: these lose money even when the research was perfect.
- Your safeguards are code too. A pre-flight gate that validates against a second copy of the belief it's checking (instead of the authoritative source), a halt-check that parses the wrong schema, a backstop that exists in the runbook but not the deployment — each fails green exactly when relied upon. Test the safety layer; tie every derived view to one source of truth.
- Code that never runs is not validated. A dormant strategy, a never-exercised order path, a close helper never tested against post-restart position ownership: "green dashboard" is not "exercised path." Activation is a deployment event.
- A guard that lives only in memory dies on the restart. Minimum-holds, cooldowns, and drawdown-breaker state must be persisted and restored, or the most frequent event in your system silently disarms them. Reconstruct the true elapsed state from a durable store; never re-seed it to a default.
- A catalogued failure is only as good as its enforcement. Prose relies on memory; a safe-by-construction API or a CI gate cannot be skipped. Drag every lesson up that table.
The next part turns the strongest of these rules into machinery: The strategy-class contract defines what a strategy must satisfy to deploy at all, and Live equals research builds the parity harness that makes the deployment-parity family impossible to ship. For the statistical machinery behind Families 2 and 3, see Walk-forward that's actually out-of-sample, Beating your own optimiser, and Tail risk & risk of ruin.