9. Beating your own optimiser
You ran a parameter sweep. A hundred-and-something combinations of lookback, threshold, and stop. One of them posted a Sharpe of 1.3, and now you want to deploy it. Here is the uncomfortable question this chapter answers: how much of that 1.3 did you earn, and how much did you simply find by looking at enough cells?
The answer is almost never "all of it." Searching a grid is not a neutral act of measurement. Every cell you try is another draw from a distribution, and the maximum of many draws drifts up even when none of the draws has any edge at all. The best cell in a sweep is selected for being lucky: that is the literal definition of "best." So the headline Sharpe you report is biased high by an amount that grows with the number of things you tried, and that bias has a name and a correction: the Deflated Sharpe Ratio (Bailey & López de Prado, 2014).
This chapter is the multiple-testing layer of Part II. A backtest you can trust made a single Sharpe honest: right units, no peeking, error bars. Walk-forward made sure you weren't scoring on data you fit. DSR closes the last optimism leak: the one that comes not from any single backtest being wrong, but from picking the winner out of many right ones.
The principle: the max of N draws is not the mean
Suppose every parameter cell in your grid is genuinely worthless: true Sharpe zero, every one. Run the backtest on each and you still get a distribution of estimated Sharpes, because each is computed on a finite, noisy sample. Some land at −0.4, some at +0.5, and, purely from sampling variance, one lands highest. Report that one and you've manufactured a positive number from nothing.
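That thought experiment can be run directly. The following is a minimal sketch, not Titan code: every cell draws zero-mean Gaussian daily returns, so any positive Sharpe it reports is sampling noise. The grid size (120), sample length (750 bars), 1% daily volatility, 252-day annualisation, and the helper name `annualised_sharpe` are all illustrative assumptions.

```python
import random
from statistics import fmean, stdev

def annualised_sharpe(returns, periods_per_year=252):
    """Annualised Sharpe of a return series (risk-free rate taken as zero)."""
    return fmean(returns) / stdev(returns) * periods_per_year ** 0.5

rng = random.Random(7)      # fixed seed so the run is repeatable
n_cells, n_bars = 120, 750  # assumed grid size and sample length

# Every cell has TRUE Sharpe zero: its returns are pure zero-mean noise.
cell_sharpes = [
    annualised_sharpe([rng.gauss(0.0, 0.01) for _ in range(n_bars)])
    for _ in range(n_cells)
]

best = max(cell_sharpes)
average = fmean(cell_sharpes)
print(f"mean of cells: {average:+.2f}   best cell: {best:+.2f}")
```

The average across cells stays close to zero, as it should for a dead grid; the best cell does not, and that best cell is the one a sweep would report.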
The size of the manufactured number depends on two things, and only two:
- How many cells you tried (N). More draws, higher max. This is the part everyone underestimates.
- How spread-out the cells' Sharpes are (the cross-trial Sharpe standard deviation, σ_SR). Wider spread, higher max.
Bailey & López de Prado give the expected maximum Sharpe under the null of zero true edge in closed form:
$$ \mathbb{E}[\max SR_N] \approx \sigma_{SR}\Big[(1-\gamma)\,\Phi^{-1}\!\big(1-\tfrac{1}{N}\big) + \gamma\,\Phi^{-1}\!\big(1-\tfrac{1}{N e}\big)\Big] $$
where γ is the Euler-Mascheroni constant and Φ⁻¹ is the inverse normal CDF. You don't need to love the algebra; you need the shape of it. The bracketed term is a slowly-growing function of N, call it the noise ceiling multiplier. Multiply it by your sweep's actual Sharpe spread σ_SR and you get the Sharpe you'd expect to see from the best cell of a totally dead grid.
Here is that multiplier, computed straight from Titan's deflated_sharpe:
| Cells tried N | Noise-ceiling multiplier | E[max SR] if σ_SR ≈ 0.30 (illustrative) |
|---|---|---|
| 6 | 1.30 | ≈ 0.39 |
| 20 | 1.90 | ≈ 0.57 |
| 120 | 2.59 | ≈ 0.78 |
| 240 | 2.82 | ≈ 0.85 |
| 500 | 3.05 | ≈ 0.92 |
Read the N = 120 row. With a perfectly ordinary cross-trial spread, the expected best-of-grid Sharpe is around 0.8, from variance alone, with zero real edge anywhere in the grid. That is the whole intuition of this chapter in one number. If your 120-cell sweep produced a winner at 0.9, you have found essentially nothing; you've found the noise ceiling with a thin coat of paint. A naive bootstrap CI won't save you here either: it asks "is this one series significant?" and never knows it was the survivor of a beauty contest.
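The multiplier column can be reproduced from the closed form alone. This is a standard-library sketch of the formula above, not Titan's `deflated_sharpe`; the function names `noise_ceiling_multiplier` and `expected_max_sharpe` are hypothetical, and σ_SR = 0.30 is the same illustrative spread the table uses.

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def noise_ceiling_multiplier(n_trials):
    """Bracketed term of the expected-max formula; requires n_trials >= 2."""
    inv_cdf = NormalDist().inv_cdf
    return ((1 - EULER_GAMMA) * inv_cdf(1 - 1 / n_trials)
            + EULER_GAMMA * inv_cdf(1 - 1 / (n_trials * e)))

def expected_max_sharpe(sigma_sr, n_trials):
    """E[max SR_N] under the null: cross-trial spread times the multiplier."""
    return sigma_sr * noise_ceiling_multiplier(n_trials)

for n in (6, 20, 120, 240, 500):
    print(n, round(noise_ceiling_multiplier(n), 2),
          round(expected_max_sharpe(0.30, n), 2))
```

Because the ceiling is just the spread times a function of N, the same helper answers "what would a dead grid of my size have produced?" for any sweep before a winner is taken seriously.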
The fifth lie's bigger sibling
The backtest chapter ended on "a Sharpe is an estimate, not a fact" and prescribed a bootstrap confidence interval. DSR is the version of that lesson for a family of estimates. The bootstrap CI controls the error on one number; DSR controls the error on the selection of that number from many. You need both, and that ordering matters: DSR is applied in addition to, never instead of, the CI gate.
DSR is also one of several overfitting controls in the same López de Prado toolkit, and they answer different questions. DSR haircuts a single declared winner into P(true SR > 0) given N trials — the form that maps cleanly onto a per-strategy deploy/no-deploy threshold, which is why Titan gates on it. PBO/CSCV (probability of backtest overfitting, via combinatorially-symmetric cross-validation) instead estimates the probability that your selection procedure overfits, by reshuffling which folds are in- and out-of-sample. The Sharpe haircut reports the same deflation as a shrunk Sharpe rather than a probability. The latter two are complementary diagnostics, not substitutes for the gate.
How DSR deflates
The Deflated Sharpe Ratio converts the gap between your observed Sharpe and the noise ceiling into a probability:
DSR = P(true Sharpe > 0), after accounting for N trials and the return distribution's shape.
Three inputs go in, and getting any of them wrong corrupts the answer:
- The gap, SR_hat − E[max SR_N]. If your winner doesn't clear the noise ceiling, the gap is negative and the deflated probability collapses toward zero. This is the multiple-testing penalty made concrete.
- The trial count N. Drives the ceiling. Honest counting is the entire game (next section).
- The return distribution's skew and kurtosis. The Sharpe estimator's own sampling error is wider for skewed, fat-tailed returns. A strategy whose edge is "win small often, lose big rarely" (negative skew, high kurtosis) has a noisier Sharpe than its point estimate suggests, so the same observed Sharpe earns a lower deflated probability. Titan's implementation reads these moments from the actual return series rather than assuming normality:
from titan.research.framework.dsr import deflated_sharpe, sr_var_from_sweep
# σ_SR² estimated across the FULL sweep, not the survivors (see below).
# all_cell_sharpes is the OOS-Sharpe column the audit harness persists for
# EVERY cell it scored -- survivors AND failures -- precisely so this variance
# is computed over the whole grid. Reconstructing it from the deployed
# survivors alone is the survivors_only trap flagged below.
all_cell_sharpes = sweep_results["oos_sharpe"].to_numpy()
sr_var = sr_var_from_sweep(all_cell_sharpes)
res = deflated_sharpe(
    sr_hat=canonical_sharpe,        # the cell you actually want to deploy
    sr_var_across_trials=sr_var,    # spread of Sharpes across all N cells
    returns=canonical_oos_returns,  # this cell's OOS series -> skew + kurt
    n_trials=N,                     # the HONEST trial count
)
print(res.dsr_prob, res.e_max_sr) # gate: dsr_prob >= 0.95
The result object carries its own diagnostics (e_max_sr, the skew/kurt it used, the variance-stabilised gap z, and a survivors_only flag), so a reviewer can see why a cell passed or failed, not just the verdict. The deployment gate is dsr_prob >= 0.95.
Degenerate inputs fail safe
The implementation refuses to manufacture a pass out of a missing input. With fewer than 2 trials, or a non-positive cross-trial variance, there is nothing to deflate against, so dsr_prob is forced to 0.0 — you cannot clear a DSR gate you never actually ran a sweep for. And below ~30 return observations the skew and kurtosis estimates are too noisy to trust, so the moment-reader falls back to the normal assumption (skew = 0, kurt = 3): a very short sample silently loses the fat-tail penalty. That is one more reason a sparse strategy needs per-trade T (next section) and a hard look before deployment — not a per-bar series long enough to dodge the guard but lying about its cadence.
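The two guards described above can be sketched as follows. This is a hypothetical illustration of the behaviour, not Titan's source: the helper names `dsr_is_computable` and `moments_or_normal` are invented here, and the 30-observation cutoff is taken from the "~30" in the text.

```python
from statistics import fmean

def dsr_is_computable(n_trials, sr_var):
    """False when there is nothing to deflate against (dsr_prob would be 0.0)."""
    return n_trials >= 2 and sr_var > 0.0

def moments_or_normal(returns, min_obs=30):
    """Return (skew, kurtosis); fall back to the normal values on short samples."""
    if len(returns) < min_obs:
        return 0.0, 3.0  # too few observations: moments are too noisy to trust
    mu = fmean(returns)
    m2 = fmean((r - mu) ** 2 for r in returns)
    if m2 == 0.0:
        return 0.0, 3.0  # constant series: moments are undefined
    m3 = fmean((r - mu) ** 3 for r in returns)
    m4 = fmean((r - mu) ** 4 for r in returns)
    return m3 / m2 ** 1.5, m4 / m2 ** 2

print(dsr_is_computable(n_trials=1, sr_var=0.09))    # no sweep was run
print(moments_or_normal([0.01, -0.02, 0.03]))        # short sample: fallback
```

The second guard is the one to watch: it keeps the computation defined, but in doing so it removes the fat-tail penalty exactly where the sample is least informative.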
Report e_max_SR next to every sweep winner
The single most clarifying habit: print the noise ceiling beside the headline number. "Best cell SR = 0.9, e_max_SR = 0.85" tells the whole story at a glance: the winner is barely above what an empty grid would have produced. A reader who only sees "0.9" thinks you have a strategy. A reader who sees both knows you have a coin flip.
One DSR, end to end (illustrative)
The z the result object reports is the gap divided by the estimator's own noise (Bailey–López de Prado's variance-stabilised statistic):
$$ z = \frac{\hat{SR} - \mathbb{E}[\max SR_N]}{\sqrt{\big(1 - \hat\gamma_3\,\hat{SR} + \tfrac{\hat\gamma_4 - 1}{4}\,\hat{SR}^2\big)\,/\,(T-1)}}, \qquad \texttt{dsr\_prob} = \Phi(z) $$
where γ₃ is skew, γ₄ kurtosis, and T the number of return observations. Negative skew and high kurtosis inflate the denominator — that is mechanically how "fat tails earn a lower dsr_prob," and it's the one piece of the formula the chapter otherwise leaves as a black box. Walk one to a verdict: a winner with SR_hat = 0.85, drawn from a pool of N = 120 cells whose Sharpes spread at σ_SR = 0.30, has a noise ceiling E[max SR_N] ≈ 0.78 — so the gap is only 0.07: the best of 120 barely clears what an empty grid would have produced. With T = 750 daily bars, skew −0.4, kurtosis 6, the denominator is sqrt((1 + 0.34 + 0.90)/749) ≈ 0.055, so z ≈ 0.07 / 0.055 ≈ 1.28 and dsr_prob = Φ(1.28) ≈ 0.90 — it fails the 0.95 gate. Had the same returns been Gaussian (skew 0, kurt 3), z ≈ 1.64 and dsr_prob ≈ 0.95, a borderline pass: the fat tails are exactly what tip a coin-flip winner from "barely" to "no."
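The arithmetic of that walk-through can be checked in a few lines. This sketch implements only the z formula above with the standard library; it is not Titan's `deflated_sharpe`, and the function name `dsr_probability` is hypothetical. It takes the chapter's illustrative numbers as given, including the rounded ceiling of 0.78, and assumes the observed Sharpe, the ceiling, and T are all expressed at the same sampling cadence.

```python
from math import sqrt
from statistics import NormalDist

def dsr_probability(sr_hat, e_max_sr, skew, kurt, n_obs):
    """Return (z, Phi(z)) for the variance-stabilised gap defined above."""
    sr_variance = (1 - skew * sr_hat + (kurt - 1) / 4 * sr_hat ** 2) / (n_obs - 1)
    z = (sr_hat - e_max_sr) / sqrt(sr_variance)
    return z, NormalDist().cdf(z)

# Same winner and ceiling, two assumptions about the return distribution.
z_fat, p_fat = dsr_probability(0.85, 0.78, skew=-0.4, kurt=6.0, n_obs=750)
z_norm, p_norm = dsr_probability(0.85, 0.78, skew=0.0, kurt=3.0, n_obs=750)

print(f"fat tails: z={z_fat:.2f}, dsr_prob={p_fat:.3f}")
print(f"gaussian:  z={z_norm:.2f}, dsr_prob={p_norm:.3f}")
```

Only the denominator differs between the two calls, which isolates how much of the verdict comes from the shape of the returns rather than from the gap itself.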
The most subjective input is σ_SR, and it matters most: E[max SR_N] is directly proportional to it, so a 2× mis-estimate doubles the ceiling and can flip the verdict on its own. A measured cross-trial spread also conflates genuine dispersion of edges with pure estimation noise, so anchor on a small, stable value (≈0.1–0.4 for nearby grid cells) and don't let a noisy estimate quietly inflate your ceiling.
Counting N honestly is the hard part
The formula is mechanical. The judgement is in N, and it is almost always larger than the number you instinctively report.
| What you tried | Naive N | Honest N |
|---|---|---|
| A 5×4×6 grid, run once | 120 | 120 |
| "I only swept 6 cells", after 4 earlier exploratory grids you abandoned | 6 | the cumulative total across all the grids you looked at |
| Screened a universe of ~500 instruments, kept the handful that looked good | a handful | ≈ 500 |
| Tweaked the stop "by hand a few times" until it looked right | 1 | every variant your eyes evaluated |
| Searched 30 candidate features, kept the 5 that predicted | 5 | every feature tried, times the cells per feature |
Feature search is the dominant, most-often-omitted N-inflator. A machine-learning or multi-factor workflow that screens dozens of candidate predictors and keeps the few that "worked" has run a selection at least as wide as any parameter grid — usually wider, because each feature is itself swept over lookbacks and transforms — yet feature selection is almost never counted in N. It is the same sin as keeping the screener's survivors: the selection pressure was applied across every feature you evaluated, so they all belong in the pool. A forecast built from selected features (combining forecasts) inherits this inflated N, and deflating against the survivors alone is how an overfit feature set passes a significance test it should fail.
The rule Titan enforces: N is the size of the candidate pool you selected from, not the number of survivors. If a strategy passed through a several-hundred-name screener, N is the whole pool (even if you only carried a handful of names forward) because the selection pressure was applied across every name. Using the survivor count understates the ceiling and inflates dsr_prob. Worse, the variance term has the same trap: estimate σ_SR from the survivors only and you get a too-small spread (survivors are, by construction, the cells that clustered high), which also biases the probability optimistic. The framework lets you do it when full-pool data is unavailable, but it forces you to admit it:
# Only have the survivors' Sharpes? You may pass them, but the result
# is FLAGGED optimistic -- the true ceiling is higher than this.
res = deflated_sharpe(..., survivors_only=True)
assert res.survivors_only # documented as a LOWER BOUND on the penalty
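Why the survivors-only spread is too small can be seen in a toy pool. This is a sketch under assumed numbers (500 candidates, estimated Sharpes drawn as pure noise with spread 0.30, the top ten kept); it does not call `sr_var_from_sweep` and makes no claim about how Titan computes the variance.

```python
import random
from statistics import stdev

rng = random.Random(11)  # fixed seed so the run is repeatable

# Assumed toy pool: 500 candidates whose estimated Sharpes are pure noise.
pool = [rng.gauss(0.0, 0.30) for _ in range(500)]
survivors = sorted(pool)[-10:]  # the ten that "looked good"

sigma_full = stdev(pool)
sigma_survivors = stdev(survivors)
print(f"sigma_SR, full pool:      {sigma_full:.3f}")
print(f"sigma_SR, survivors only: {sigma_survivors:.3f}")
```

The survivors all sit in the same upper tail, so their spread is a fraction of the pool's, and a ceiling built on it is correspondingly too low.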
There is one force pushing the other way, and honesty means naming it. The closed form assumes N independent draws, but adjacent grid cells are anything but: a lookback of 19 and 20 trade almost the same bars, so the effective number of independent trials in a smooth sweep is well below the nominal cell count. Feed the nominal N and you therefore over-deflate — the ceiling is too high, the penalty too harsh. We keep nominal N anyway, because a conservative ceiling is the safe error: it can only reject a real edge, never wave through a fake one. But two cases break the symmetry and are worth holding in mind. A feature or instrument screen is far closer to independent than a parameter grid — distinct instruments are not near-duplicates of each other — so count those at full nominal N without apology. And a winner sitting on a broad plateau of correlated neighbours is precisely the regime where nominal N most overstates the true penalty; that is no coincidence, it is why the plateau check and DSR are complementary rather than redundant — one rewards exactly the correlation the other conservatively double-counts.
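The over-deflation from correlated cells can be illustrated with a small Monte Carlo. The model is an assumption made for this sketch: every cell shares one common component, giving all pairs the same correlation `rho`, which is far cruder than a real parameter grid. The function name `mean_best_of_n` and the values 120, 0.8, and 2000 are illustrative.

```python
import random
from statistics import fmean

def mean_best_of_n(n_cells, rho, n_sims, rng):
    """Average maximum of n_cells unit-variance draws with pairwise correlation rho."""
    shared_w, own_w = rho ** 0.5, (1 - rho) ** 0.5
    maxima = []
    for _ in range(n_sims):
        shared = rng.gauss(0.0, 1.0)  # component common to every cell
        maxima.append(max(shared_w * shared + own_w * rng.gauss(0.0, 1.0)
                          for _ in range(n_cells)))
    return fmean(maxima)

rng = random.Random(3)
independent = mean_best_of_n(120, rho=0.0, n_sims=2000, rng=rng)
correlated = mean_best_of_n(120, rho=0.8, n_sims=2000, rng=rng)
print(f"independent cells: {independent:.2f}   correlated cells: {correlated:.2f}")
```

With independent cells the average maximum lands near the closed-form multiplier for N = 120; with strongly correlated cells it is much lower, which is the sense in which nominal N sets a conservative ceiling for a smooth grid.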
The deepest version of honest counting is pre-registration: write down the grid and the trial count before you run it, commit it to git, and let the audit read N from that committed manifest. Titan's audit wrapper refuses to run against an uncommitted pre-reg precisely so that N cannot be quietly shrunk after the winner is known. That machinery lives in the sanctuary & decision matrix chapter; here the point is narrower: the trial count is an input to a significance test, so inventing it after the fact is the same sin as p-hacking.
War-story: the screener winner that was indistinguishable from noise
A breakout strategy was run across a large universe (a few hundred instruments) and the handful that survived posted eye-catching annualised Sharpes, some in the high single digits on short, sparse samples. The instinct was to deploy the survivors. Then we ran DSR at the true trial count (the full screener pool, not the survivors) and the picture inverted. Because the per-instrument samples were short and the cross-trial Sharpe spread was large, the noise ceiling e_max_SR came out enormous, well above every survivor's observed Sharpe. Even using the deliberately optimistic survivors-only variance, every surviving cell failed the DSR gate. The "winners" were exactly what a dead universe of that size and sample length produces by chance. Two compounding mistakes had hidden it: scoring sparse-trade strategies on a per-bar Sharpe annualised by a huge factor (which inflated the raw numbers), and never deflating by the real N. The rule it bought: a screener's N is the pool, not the podium; and a sweep with more than a handful of cells is unconfirmed until DSR is run at that N, on top of the bootstrap CI.
Where DSR sits in the gate
DSR is not a standalone verdict. In Titan it is one of five axes in the decision matrix, alongside the bootstrap CI lower bound, a Monte-Carlo drawdown test, a held-out sanctuary year, and a noise-robustness check. A cell must clear dsr_prob >= 0.95 to score "best" on its axis; below ~0.5 it scores "worst." No single axis deploys a strategy, but a failing DSR caps the verdict hard.
flowchart LR
A[Sweep: N cells] --> B[Per-cell OOS Sharpe]
B --> C{Bootstrap CI_lo > 0?}
C -- no --> X[unconfirmed]
C -- yes --> D[DSR at honest N<br/>skew + kurt from returns]
D --> E{dsr_prob >= 0.95?}
E -- no --> X
E -- yes --> F[MC + sanctuary + noise axes]
F --> G[5-axis verdict]
The ordering is deliberate and cheap. CI and DSR are seconds of compute; the Monte-Carlo and sanctuary axes are minutes-to-hours. So the sweep gate runs CI_lo, then DSR, first, and a strategy that can't clear the noise ceiling never reaches the expensive machinery. Many decayed published edges die right here, in under a minute, because their best-of-grid cell sits below e_max_SR.
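The cheap-first ordering amounts to a short-circuit. The function below is a hypothetical sketch of the first two decision nodes in the flowchart, with invented names and return strings; it is not Titan's gate code, and the expensive axes are represented only by the final return value.

```python
def sweep_gate(ci_lo, dsr_prob, dsr_threshold=0.95):
    """Cheap pre-filter: bootstrap CI first, then DSR, before any expensive axis."""
    if ci_lo <= 0.0:
        return "unconfirmed"  # CI lower bound does not exclude zero
    if dsr_prob < dsr_threshold:
        return "unconfirmed"  # winner does not clear the noise ceiling
    return "run MC + sanctuary + noise axes"

print(sweep_gate(ci_lo=0.12, dsr_prob=0.90))  # stops at the DSR check
print(sweep_gate(ci_lo=0.12, dsr_prob=0.97))  # proceeds to the expensive axes
```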
War-story: the plateau that was a single lucky spike
A defensive overlay was being tuned by sweeping its kill-and-re-entry thresholds. One specific cell posted a beautiful drawdown profile: P(MaxDD > 50%) of half a percent in Monte Carlo. The team nearly committed it. But the neighbouring cells told a different story: nudge the kill threshold one step and the same metric jumped to 25 to 30%. The "safe" cell wasn't a robust region of the parameter space; it was a single spike surrounded by cliffs, the textbook signature of fitting noise, not signal. We added a plateau pre-flight that runs before the audit: take the winner and its grid neighbours, and if their relative Sharpe spread is too wide, abort; there is no robust region to deploy. DSR is the quantitative form of this concern (the max of N draws); the plateau check is its structural form (a real edge survives small parameter perturbations, a fitted one doesn't). The rule: a sweep winner must be a plateau, not a peak; and it must clear DSR at the full N before any compute is spent past the gate. A lone lucky spike fails both tests, and it should.
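A plateau pre-flight of the kind described can be as small as the sketch below. The definition of relative spread (range of the winner and its neighbours, divided by the winner's Sharpe) and the 0.25 threshold are assumptions chosen for illustration, and `is_plateau` is a hypothetical name; the source does not state how Titan measures the spread.

```python
def is_plateau(winner_sr, neighbour_srs, max_rel_spread=0.25):
    """True when winner and neighbours agree to within max_rel_spread of the winner."""
    if winner_sr <= 0 or not neighbour_srs:
        return False  # nothing to deploy, or no neighbours to compare against
    cells = [winner_sr, *neighbour_srs]
    rel_spread = (max(cells) - min(cells)) / winner_sr
    return rel_spread <= max_rel_spread

print(is_plateau(1.5, [1.4, 1.45, 1.38]))  # neighbours close to the winner
print(is_plateau(1.5, [0.2, 0.3, 0.1]))    # winner surrounded by cliffs
```

Both winners post the same point estimate; only the neighbourhood separates them, which is information DSR on its own never sees.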
DSR is a selection control, not a quality stamp
One caveat, because it is the way DSR gets over-trusted. A high dsr_prob says "this Sharpe is unlikely to be a multiple-testing artifact." It does not say the strategy is deployable, or even good. A signal can clear DSR and still die at the next gate, most commonly when realistic costs eat the edge, or when the held-out sanctuary year diverges from the WFO. DSR also can't see selection it wasn't told about: the abandoned grids, the eyeballed stop tweaks, the four earlier experiments you don't count. It deflates the trials you declare. Declaring them honestly is on you.
And like every number in Part II, the inputs to DSR obey the same measurement rules. A dsr_prob computed on a look-ahead equity curve, or on a per-bar Sharpe that should have been per-trade, is exactly as worthless as the Sharpe that fed it. The affirmative rule at the API: feed deflated_sharpe(returns=...) the series at the cadence the strategy actually trades. For a sparse strategy — a handful of round-trips a year — pass the per-trade return series and set T to the trade count, never a per-bar series annualised by a huge factor, which both inflates sr_hat and lies about T. Use a daily or per-bar series only when the strategy is genuinely active every bar. DSR is a correction on top of a trustworthy Sharpe, never a substitute for one. The full battery of livability metrics (Sortino, Calmar, CVaR/CDaR, and a formal risk of ruin at deployed size) is the subject of the metric suite and position sizing; DSR just makes sure the Sharpe you carry into them was earned, not mined.
Exercises
- Count N honestly. You screened 500 instruments, swept a 5×4 grid on the survivors, and hand-tweaked a stop "a few times." A colleague reports N = 20. What should N be, and why does under-counting flatter the result?

    ??? success "Answer"
        N is the whole candidate pool the selection pressure acted across (the ~500-name screen, times the grid cells, plus every hand-tweak your eyes evaluated): easily in the thousands, not 20. A too-small N understates the expected null maximum Sharpe, so the deflated probability comes out optimistically high.

- Plateau vs spike. Two winning cells both post Sharpe 1.5. One's grid neighbours post ~1.4; the other's post ~0.2. Which do you trust, and what does the other one indicate?

    ??? success "Answer"
        Trust the plateau (neighbours ~1.4): a real edge is robust to small parameter changes. The lone spike (neighbours ~0.2) is a knife-edge in the noise, a fitted artefact, and should be rejected even at the same point estimate.
Takeaways
- Searching a grid inflates your best result. The maximum of N noisy estimates drifts up even when every cell has zero true edge; the more cells, the higher the drift.
- Deflate it. The Deflated Sharpe Ratio subtracts the expected null max (driven by N and the cross-trial Sharpe spread) and returns P(true Sharpe > 0). Gate at dsr_prob >= 0.95. Any sweep beyond a handful of cells needs it.
- Count N honestly. It's the candidate pool, not the survivors; it includes abandoned grids and hand-tweaks; pre-register it so it can't shrink after the winner is known. Survivors-only variance and N both bias the probability optimistic; flag them when you must use them.
- Use the actual distribution's skew and kurtosis. Fat-tailed, skewed returns make the Sharpe noisier than a normal assumption implies; DSR should penalise them, and Titan's does.
- DSR sits on top of the bootstrap CI, not instead of it: CI controls the error on one estimate, DSR controls the error from selecting it. Run both, cheaply, before any expensive gate.
- A high dsr_prob is a selection clearance, not a deployment stamp. Costs, drawdown paths, and the sanctuary year still get a vote.
DSR closes the multiple-testing leak in a single candidate's evaluation. Two chapters take the rigour further: Tail risk & risk of ruin turns drawdown geometry into a survival probability at deployed size, and the Sanctuary decision matrix shows how DSR becomes one binding axis among five, including the pre-registration discipline that keeps N honest in the first place.