6. A backtest you can trust

A backtest is a measurement. Like any measurement, it is worthless (worse than worthless, because it's confident) unless you know its units, whether the instrument was wired up correctly, and how big the error bars are. Most backtests fail all three checks silently, and the failure always points the same direction: it makes the strategy look better than it is.

This chapter is about closing those leaks before any of the heavier machinery (walk-forward, deflation, Monte Carlo) even runs. Get this layer wrong and everything downstream inherits the lie. Get it right and the rest of Part II is just turning up the rigour.

We'll go through the five ways a backtest lies to you, the single discipline that fixes each, why each fix works, and how Titan bakes those disciplines into one shared module so they can't be quietly skipped. Then, because Sharpe is only the first number, we'll look at the rest of the suite you actually need.

The five lies

| # | The lie | What it looks like | The fix |
|---|---|---|---|
| 1 | Wrong units | A Sharpe computed on a daily P&L series, annualised as if hourly | An explicit `periods_per_year` on every call |
| 2 | Survivor math | Dropping flat bars before annualising | Never filter `returns != 0` before a Sharpe |
| 3 | Peeking | The signal at bar t uses information from bar t | Shift discipline: trade t's signal on t+1's return |
| 4 | Future-normalised | A z-score computed over the whole series | Causal (rolling / expanding) or IS-frozen normalisation |
| 5 | No error bars | A single Sharpe number, reported to two decimals | A bootstrap confidence interval, gated on the lower bound |

None of these is exotic. Every one of them has shipped to production in real systems, including Titan, and every one is invisible unless you go looking. Let's take them in order, and notice that each lie flatters the strategy. That asymmetry is the whole reason for suspicion: errors in a backtest are not random, they are biased toward optimism, because the optimistic versions are the ones that survive your attention and get deployed.

A sixth lie lives one layer out: the fill

The five lies here all corrupt the measurement of a return series that already exists. There is a sixth that corrupts the series itself: the backtest booked a fill — at a price, a bar, and a cost — the market would never have given you, so every Sharpe, Calmar, DSR and ruin number downstream is measuring an edge that lived in the fills, not the signal. It belongs to execution rather than measurement, so it gets its own treatment in The execution layer — next-bar-open vs same-bar-close, the half-spread you cross each side, stops that fill past the trigger, and a constructed cost model you re-gate the edge against. Keep it in mind as the sixth member of this list: the most expensive lie, because it sits upstream of every honest number you are about to learn to compute.

Lie 1, Units: annualisation is a correctness property, not a convention

The Sharpe ratio is a per-period quantity scaled to a year. The scaling factor is the number of bars in a trading year, and it depends entirely on your bar size:

BARS_PER_YEAR = {
    "D":  252,            # daily
    "H4": 252 * 6,        # 6 four-hour bars per 24h FX day
    "H1": 252 * 24,       # = 6048
    "M5": 252 * 24 * 12,  # = 72576
}

Why does the factor matter so much? Because Sharpe scales with the square root of the number of periods. Annualising per-bar Sharpe means multiplying by sqrt(periods_per_year). Get the period count wrong by a factor of 24 and your annualised number is off by sqrt(24) ≈ 4.9×: annualising a daily series with the hourly factor inflates it by that much, and annualising an hourly series with the daily factor deflates it. Only the inflating direction tends to survive, because a suspiciously low number gets investigated and an impressive one gets deployed. The dangerous version is subtle: a pipeline that resamples mid-way and then annualises with the wrong frequency's factor.

War-story: the H1 strategy that was secretly daily

A backtest aggregated hourly signals into a once-a-day position, producing a P&L series that was, in substance, daily. But the Sharpe helper was still being handed the H1 annualisation factor (252 * 24). The strategy reported a Sharpe near 4 and looked like a licence to print money. The real, daily-annualised figure was around 0.8: a fine-but-ordinary strategy. The sqrt(24) ≈ 4.9× phantom came entirely from annualising a daily series as if it were hourly. The fix wasn't cleverness; it was making the frequency impossible to leave implicit.

The positive rule for a mixed-frequency pipeline: annualise with the factor of the P&L's actual decision frequency, not the raw bar frequency. When signals fire on H1 but a position is held for days, the deciding frequency is daily — confirm it by counting distinct position changes (or collapsing the series to one return per held position) and use that bar count, here 252, not 252 * 24.
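One way to run that check, sketched in pandas. The series names (`position_h1`, `pnl_h1`) and the synthetic data are assumptions for illustration, not part of Titan: the position is decided once a day but recorded on hourly bars.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs: 20 days of hourly bars, one position decision per day.
n_days, bars_per_day = 20, 24
index = pd.date_range("2024-01-01", periods=n_days * bars_per_day,
                      freq=pd.Timedelta(hours=1))
rng = np.random.default_rng(0)
daily_side = rng.choice([-1.0, 1.0], size=n_days)
position_h1 = pd.Series(np.repeat(daily_side, bars_per_day), index=index)
asset_ret_h1 = pd.Series(rng.normal(0.0, 0.001, len(index)), index=index)
pnl_h1 = asset_ret_h1 * position_h1.shift(1).fillna(0.0)

# Check 1: count distinct position changes (the first bar counts as one).
changes = int((position_h1 != position_h1.shift(1)).sum())
changes_per_day = changes / n_days

# Check 2: collapse the hourly P&L to one return per calendar day
# (a sum of simple returns, adequate for small hourly moves).
pnl_daily = pnl_h1.groupby(pnl_h1.index.normalize()).sum()

# At most one change per day, so the decision frequency is daily:
# the collapsed series has one return per day and is annualised with 252.
periods_per_year = 252 if changes_per_day <= 1.0 else 252 * 24
```

The point of the sketch is that the factor is derived from the series you actually hand to the metric, rather than remembered from the raw data feed.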

The fix is a rule, not vigilance:

Make the annualisation factor a required argument

Titan's shared metrics module refuses to compute a Sharpe without being told the frequency. There is no default.

def sharpe(returns, periods_per_year: int) -> float: ...

A missing default looks like a papercut. It's actually the cheapest correctness gate in the system: it forces every caller to state, at the call site, what frequency the P&L is, which means a reviewer can verify it in one glance. The first line of any backtest output should be the bar timeframe. If you can't state it, stop.
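A minimal sketch of what such a helper could look like, using only the standard library. This is not Titan's implementation: the keyword-only argument and the choice to return 0.0 for degenerate input are assumptions made for illustration.

```python
import math
import statistics

def sharpe(returns, *, periods_per_year: int) -> float:
    """Annualised Sharpe. The frequency has no default on purpose."""
    r = list(returns)
    if len(r) < 2:
        return 0.0                      # assumed convention for degenerate input
    sd = statistics.stdev(r)
    if sd == 0.0:
        return 0.0
    return statistics.fmean(r) / sd * math.sqrt(periods_per_year)

daily_pnl = [0.004, -0.002, 0.003, 0.001, -0.001, 0.002]
honest = sharpe(daily_pnl, periods_per_year=252)
phantom = sharpe(daily_pnl, periods_per_year=252 * 24)   # daily data, hourly factor
inflation = phantom / honest                             # sqrt(24), about 4.9

try:
    sharpe(daily_pnl)                  # forgetting the frequency fails loudly
    missing_is_error = False
except TypeError:
    missing_is_error = True
```

The same daily series produces a number about 4.9 times larger under the hourly factor, and the call without a frequency never returns a number at all.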

Lie 2, Survivor math: don't filter the flat bars

A tempting "cleanup" looks innocent:

# WRONG: makes a selective strategy look far better than it is
active = returns[returns != 0.0]
sr = sharpe(active, periods_per_year=252)

A strategy that's in the market only a fraction a of the time has its Sharpe inflated by roughly 1/sqrt(a) when you drop the flat bars and then annualise with the full-year factor. The intuition: removing the zeros multiplies the mean per bar by 1/a but the standard deviation per bar by only about 1/sqrt(a), so their ratio rises by 1/sqrt(a), and you still scale by the full year's bar count as if every bar were an active one. A strategy that trades one day in four (a = 0.25) gets a free 2× on its reported Sharpe. The flat days are not noise to be cleaned; they are information about the strategy's selectivity, and the annualisation factor already accounts for them.
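The size of the inflation can be checked on a synthetic series. The numbers below are made up for illustration: a strategy active one bar in four, with a fixed pattern of active-bar returns.

```python
import math
import statistics

def per_bar_sharpe(r):
    return statistics.fmean(r) / statistics.pstdev(r)

# Synthetic series: active one bar in four (a = 0.25).
active_fraction = 0.25
active = [0.011, -0.009] * 500          # mean 0.001, std 0.01 per active bar
full = []
for x in active:
    full.extend([x, 0.0, 0.0, 0.0])     # three flat bars after each active bar

honest = per_bar_sharpe(full) * math.sqrt(252)

# Dropping zeros raises the per-bar mean by 1/a but the per-bar std by only
# about 1/sqrt(a), and the full-year factor is still applied.
filtered = per_bar_sharpe([x for x in full if x != 0.0]) * math.sqrt(252)

inflation = filtered / honest           # close to 1 / sqrt(0.25) = 2
```

The ratio lands slightly above 2 rather than exactly on it, because the 1/sqrt(a) rule ignores a small term in the squared mean.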

If you genuinely want per-trade statistics (and for sparse strategies you often should), that's a different measurement with its own helper (trade_sharpe), fed a per-trade P&L series, not a zero-filtered bar series. The point is to make the choice explicit, never to silently delete inconvenient zeros.

This is the single most common Sharpe inflation

It survives code review because the filter reads as hygiene. Titan's rule: no Sharpe function filters returns != 0 internally, and reviewers reject any caller that does it by hand. If the strategy is sparse, that's a fact the Sharpe should reflect, not one the metric should hide.

Lie 3, Peeking: shift discipline

This is the one that has cost the most, across the most systems. The signal you compute at the close of bar t can only be acted on after bar t, so it earns the return from t to t+1, not the return into t. The mental model: a decision and the return it earns must live in disjoint time windows, decision first. Written as code, the only safe pattern is:

# A position decided at the close of t earns t -> t+1:
strat_returns = asset_returns * position.shift(1).fillna(0.0)

The bug, "same-bar collect", looks like this, and it is everywhere in regime and cross-sectional code:

# WRONG: `winner` is decided using close[t]; `ret` is also the return into close[t].
winner = momentum.idxmax(axis=1)          # uses information available only AT close t
strat  = ret.where(columns == winner)     # but `ret` already happened by close t

The position at t was chosen with information that includes bar t itself, then "earns" bar t's return. It is pure look-ahead, and because momentum signals are autocorrelated it manufactures a gorgeous, completely fake equity curve: the worst kind, because it looks plausible. The fix is mechanical: lag the decision before it touches a return.

winner_lag = winner.shift(1).fillna("CASH")   # decide on yesterday's info
strat = ret.where(columns == winner_lag)      # earn today's return
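A runnable version of the two snippets above, on pure noise where no strategy should earn anything. The `earn` helper and the 1-bar "momentum" definition are assumptions for illustration, not code from the codebase under discussion.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Pure noise: three assets, 2000 bars, zero expected return.
ret = pd.DataFrame(rng.normal(0.0, 0.01, size=(2000, 3)), columns=["A", "B", "C"])

momentum = ret                      # 1-bar "momentum": includes bar t itself
winner = momentum.idxmax(axis=1)    # decided with information from close t

def earn(ret, held):
    """Return of the column named in `held` on each row; 0.0 if none matches."""
    mask = ret.columns.to_numpy()[None, :] == held.to_numpy()[:, None]
    return ret.where(mask, 0.0).sum(axis=1)

same_bar = earn(ret, winner)                          # WRONG: look-ahead
lagged = earn(ret, winner.shift(1).fillna("CASH"))    # decide first, then earn

same_bar_mean = same_bar.mean()     # large and positive, on noise
lagged_mean = lagged.mean()         # statistically indistinguishable from zero
```

The same-bar version simply collects the best return on every row, which is why it looks so good; one shift removes the whole effect.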

War-story: four leaks in one codebase

One audit of one regime-driven codebase found the same-bar pattern in four separate places, each one a signal series multiplied by a contemporaneous return. None had been caught in review, because each looked like ordinary pandas. Collectively they turned a flat strategy into a stellar one. The rule Titan now enforces: any series that multiplies a return must be .shift(1)'d first unless you can prove the position was knowable strictly before the return's window opened. Treat same-bar position * return as guilty until proven innocent.

The shift knife cuts both ways: don't over-lag

Lie 3 says lag the decision before it earns a return — but "shift everything that touches a return," applied literally, introduces the opposite bug. Double-lagging (shifting a signal that was already lagged, or .shift(1)-ing a position and also filling at next-bar-open) separates the decision and the return by a gap, so the position decided at t earns the return from t+1→t+2: it throws away a bar of real edge and biases the result pessimistic, making you discard strategies that actually work. The invariant is tighter than "shift it": the decision window and the return window must be adjacent and disjoint — back-to-back, no gap, no overlap. One lag, not zero (look-ahead, which flatters) and not two (lost edge, which buries). Both errors are real; only the adjacent-and-disjoint version is correct.
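The cost of a second lag can be seen on a synthetic series built so that the position decided at the close of t fully determines the return from t to t+1. The construction is an assumption chosen to make the arithmetic exact, not a realistic market.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
n = 1000
# position[t] is decided at close t; the return into bar t is 1% in the
# direction of the previous bar's decision.
position = pd.Series(rng.choice([-1.0, 1.0], size=n))
asset_returns = 0.01 * position.shift(1).fillna(0.0)

# One lag: decision window and return window are adjacent and disjoint.
one_lag = asset_returns * position.shift(1).fillna(0.0)

# Two lags: a one-bar gap opens up and the edge is thrown away.
two_lag = asset_returns * position.shift(2).fillna(0.0)

one_lag_mean = one_lag.iloc[1:].mean()   # the full 1% per bar
two_lag_mean = two_lag.mean()            # roughly zero
```

A real edge decays more gently than this, but the direction is the same: each extra lag moves the result toward zero.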

Lie 4, Future-normalised features

Standardising a feature is routine, and the routine version is look-ahead by construction:

# WRONG: mean and std are computed over the ENTIRE series, including the future
z = (x - x.mean()) / x.std()

At every historical bar, that z-score "knows" the mean and standard deviation of data that hadn't happened yet. It's a subtle leak because it doesn't touch returns directly; it poisons the feature, and the damage flows downstream into whatever signal the feature feeds. The fix is to normalise causally, using only data available at each point, or, inside a walk-forward, to freeze the normalisation statistics on the in-sample window and apply them unchanged out-of-sample:

z_causal    = rolling_zscore(x, window=252)          # only past data at each bar
z_expanding = expanding_zscore(x, min_periods=252)   # all past data, growing window
z_wfo       = is_frozen_zscore(x, is_end=fold.is_end) # IS stats, applied to OOS

Titan's metrics module deliberately does not offer a full-series z-score. It can't be called by accident because it doesn't exist; the only z-scores available are the causal and IS-frozen ones. That's a recurring pattern in this book: the safest API is one where the dangerous operation is simply absent.
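The three helpers named above could be sketched in pandas as follows. The function names come from the text, but the bodies are assumptions, as is treating `is_end` as an index label; Titan's actual implementations may differ.

```python
import numpy as np
import pandas as pd

def rolling_zscore(x: pd.Series, window: int) -> pd.Series:
    # Statistics over bars t-window+1 .. t: nothing after bar t is used.
    return (x - x.rolling(window).mean()) / x.rolling(window).std()

def expanding_zscore(x: pd.Series, min_periods: int) -> pd.Series:
    grow = x.expanding(min_periods=min_periods)
    return (x - grow.mean()) / grow.std()

def is_frozen_zscore(x: pd.Series, is_end) -> pd.Series:
    in_sample = x.loc[:is_end]          # label slice, inclusive of is_end
    return (x - in_sample.mean()) / in_sample.std()

rng = np.random.default_rng(3)
x = pd.Series(rng.normal(size=600)).cumsum()

# Causality check: rewrite the future and see whether the past moves.
x_alt = x.copy()
x_alt.iloc[400:] += 100.0

def past_unchanged(f):
    return bool(np.allclose(f(x).iloc[:400], f(x_alt).iloc[:400], equal_nan=True))

rolling_ok = past_unchanged(lambda s: rolling_zscore(s, 50))
expanding_ok = past_unchanged(lambda s: expanding_zscore(s, 50))
frozen_ok = past_unchanged(lambda s: is_frozen_zscore(s, 299))
full_series_ok = past_unchanged(lambda s: (s - s.mean()) / s.std())   # leaks
```

The perturbation test is a cheap guard worth keeping next to any feature code: a causal transform passes it, and the full-series z-score fails it.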

The warmup prefix has to be dropped from both series

A windowed z-score is NaN for its first window (or min_periods) bars — there isn't enough history yet to standardise. That warmup prefix has to be dropped from the signal and the aligned P&L series together. Drop it from one but not the other and you have silently re-introduced a 1-bar offset, which is Lie 3 wearing a different hat, or a length mismatch that pandas will quietly broadcast around. The safe idiom is to compute the z-score, then slice the signal and the returns to their common, non-NaN index before anything multiplies a return.
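A sketch of that idiom, with a hypothetical feature `x` and a sign-of-z position rule chosen only for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n, window = 300, 20
x = pd.Series(rng.normal(size=n)).cumsum()                 # hypothetical feature
asset_returns = pd.Series(rng.normal(0.0, 0.01, size=n))

# NaN for the first window-1 bars: not enough history yet.
z = (x - x.rolling(window).mean()) / x.rolling(window).std()
position = np.sign(z).shift(1)                             # one lag; NaN stays NaN

# Align on the common non-NaN index BEFORE anything multiplies a return.
frame = pd.concat({"position": position, "ret": asset_returns}, axis=1).dropna()
strat_returns = frame["position"] * frame["ret"]
```

Putting both series in one frame and dropping rows together means the warmup is removed from signal and returns in a single step, so they cannot drift out of alignment.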

Feature discipline beyond leakage: what you should be z-scoring

Lie 4 keeps the future out of a feature's normalisation, but two feature pitfalls remain that aren't look-ahead, so the causal check sails right past them.

Stationarity vs memory (over-differencing): a model wants stationary inputs, and the reflex is to difference a series until it is. But each difference discards memory, the slow-moving level a forecast often lives on, so over-differencing strips the signal while making the statistics look clean. Difference only as far as a stationarity test demands (fractional differencing keeps more memory at the same stationarity), and z-score that, not the raw level or the over-differenced noise.

Collinearity: two near-duplicate features don't add information, they add instability. A regression or tree splits the shared signal between them arbitrarily and each one's "importance" becomes noise. Check feature correlation and either drop the redundant one or orthogonalise the predictors before they reach the model, the same residualisation the forecast-combination layer uses.

Both are engineering failures, not leakage, but they corrupt the forecast just as surely. Feature search across many candidates is also itself a multiple-testing inflator that must feed the deflated-Sharpe trial count.

Lie 5, No error bars: a Sharpe is an estimate, not a fact

A backtest Sharpe of, say, 1.1 is a point estimate from one particular history, one draw from the distribution of histories the world could have produced. Reported alone, to two decimals, it implies a precision it does not have. The honest version is an interval, and the cheapest honest interval is a bootstrap: resample the return series many times, recompute the Sharpe on each resample, and read the 2.5th and 97.5th percentiles.

lo, hi = bootstrap_sharpe_ci(
    returns,
    periods_per_year=252,
    n_resamples=1000,   # more resamples -> tighter percentile estimates
    block_size=block,   # <- see the warning below
    seed=42,            # reproducible: same input, same interval
)

Then the deployment rule is brutally simple and applied everywhere in Titan:

If the 95% lower bound of the Sharpe is ≤ 0, the strategy is unconfirmed and cannot enter a default deployment registry, regardless of how good the point estimate looks.

A point estimate of 1.1 with a 95% interval of [-0.2, 2.4] is not a 1.1-Sharpe strategy; it's a coin flip with a good story. The lower bound is the number that decides capital, because it is the honest answer to "how bad could this plausibly be?"

And because it decides capital, resample it harder than you would the point estimate. The n_resamples=1000 above is fine for the centre of the distribution but light for the 2.5th-percentile tail you actually gate on — that quantile is the noisiest part of any bootstrap. When the lower bound decides deployment, push to 5,000–10,000 resamples so the gate itself isn't a coin flip.

Gate the sleeve to admit it; gate the book to size it

The rule above runs on one return series — but a real book is several sleeves at once, so apply it at both levels. Each sleeve must clear CI_lo > 0 on its own returns to enter the registry, and the aggregate book equity must clear it too. The two checks are not redundant: correlated sleeves can each pass individually yet stack into an aggregate whose lower bound is worse than any single sleeve's, because correlation concentrates the tail rather than diversifying it away. Gate the sleeve to decide what is allowed in; gate the book to decide how much of it you can hold.

War-story: the bootstrap that lied about itself

The naive (IID) bootstrap resamples individual bars independently, which destroys the autocorrelation that trend and carry strategies live on. That narrows the interval and biases the lower bound upward: exactly the optimism you're trying to avoid, applied to the exact number you gate on. An external audit of Titan flagged this: strategies were passing the CI_lo > 0 gate partly because the CI was artificially tight. The fix is a stationary block bootstrap (Politis & Romano, 1994): resample blocks of consecutive bars (block lengths drawn from a geometric distribution whose mean is matched to the strategy's autocorrelation) so serial dependence survives the resampling, and the lower bound tells the truth. Pass a block_size; don't accept the IID default for a serially-correlated strategy.

How long a block? Span the decorrelation horizon

Pick the block so one block covers roughly the strategy's decorrelation horizon: to an order of magnitude, its holding period, or the lag at which the return autocorrelation first falls below ~0.05. (Illustrative: a daily trend sleeve holding ~2–4 weeks wants a block of ~10–20 trading days.) Too short and serial dependence still leaks out across block boundaries, putting you back where the IID bootstrap left you; too long and you have too few independent blocks to resample, so the interval gets noisy. The stationary bootstrap then randomises each block's length geometrically around that mean, so the answer isn't brittle to one exact number.
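A minimal sketch of a stationary block bootstrap for the Sharpe interval, in NumPy. The function names and the synthetic returns are assumptions; this is not Titan's `bootstrap_sharpe_ci`, only an illustration of the technique with a similar argument list.

```python
import numpy as np

def sharpe(r, periods_per_year):
    sd = r.std(ddof=1)
    return 0.0 if sd == 0.0 else r.mean() / sd * np.sqrt(periods_per_year)

def stationary_indices(n, mean_block, rng):
    """Index path made of runs of consecutive bars. Run lengths are
    geometric with mean `mean_block`; runs wrap around the end."""
    restart = rng.random(n) < 1.0 / mean_block     # start a new block here?
    fresh = rng.integers(0, n, size=n)             # candidate block starts
    idx = np.empty(n, dtype=np.int64)
    idx[0] = fresh[0]
    for t in range(1, n):
        idx[t] = fresh[t] if restart[t] else (idx[t - 1] + 1) % n
    return idx

def block_bootstrap_sharpe_ci(returns, periods_per_year, n_resamples=1000,
                              block_size=10, seed=42, level=0.95):
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    stats = np.array([
        sharpe(r[stationary_indices(len(r), block_size, rng)], periods_per_year)
        for _ in range(n_resamples)
    ])
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(stats, [tail, 1.0 - tail])
    return float(lo), float(hi)

# Illustrative use on synthetic returns with one-bar serial dependence.
rng = np.random.default_rng(0)
noise = rng.normal(0.0005, 0.01, size=750)
returns = noise + 0.3 * np.concatenate([[0.0], noise[:-1]])
lo, hi = block_bootstrap_sharpe_ci(returns, 252, n_resamples=300, block_size=15)
confirmed = lo > 0.0        # the deployment gate from the text
```

With `block_size=1` every bar restarts a block, so the sketch degenerates to the IID bootstrap; that makes it easy to compare the two intervals on the same series.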

Sharpe is necessary, not sufficient

Everything above makes Sharpe trustworthy. But Sharpe answers exactly one question, return per unit of volatility, and it is blind to the questions that actually decide whether you can live with a strategy:

It treats upside and downside volatility identically. A strategy that occasionally spikes up is penalised the same as one that occasionally craters. Sortino fixes this by dividing by downside deviation only.
It says nothing about the path. Two strategies with identical Sharpe can have wildly different worst drawdowns and time-to-recovery. Calmar (CAGR over max drawdown) and the max-drawdown geometry capture what a human actually experiences holding the thing.
It is a whole-distribution average, so it under-weights the tail that ends you. CVaR / CDaR (the average loss in the worst slice, not just a single quantile) and a formal risk of ruin speak to survival, not smoothness.
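The three families above can be sketched in a few lines of NumPy. These are common textbook definitions under stated assumptions (a zero target for downside deviation, compounded returns for drawdown, the worst `alpha` fraction of bars for CVaR); the chapter's own metric suite may define them differently.

```python
import numpy as np

def sortino(returns, periods_per_year, target=0.0):
    r = np.asarray(returns, dtype=float)
    # Downside deviation: only returns below the target contribute.
    downside = np.sqrt(np.mean(np.minimum(r - target, 0.0) ** 2))
    if downside == 0.0:
        return 0.0
    return float((r.mean() - target) / downside * np.sqrt(periods_per_year))

def max_drawdown(returns):
    r = np.asarray(returns, dtype=float)
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])
    return float((equity / np.maximum.accumulate(equity) - 1.0).min())

def calmar(returns, periods_per_year):
    r = np.asarray(returns, dtype=float)
    cagr = np.prod(1.0 + r) ** (periods_per_year / len(r)) - 1.0
    mdd = max_drawdown(r)
    return 0.0 if mdd == 0.0 else float(cagr / abs(mdd))

def cvar(returns, alpha=0.05):
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))   # size of the worst slice
    return float(r[:k].mean())

r = [0.10, -0.20, 0.05, 0.10]      # toy series, one "year" of four periods
toy_sortino = sortino(r, periods_per_year=1)
toy_mdd = max_drawdown(r)
toy_calmar = calmar(r, periods_per_year=4)
toy_cvar = cvar(r, alpha=0.25)
```

Each of these takes the same return series a Sharpe would, so every discipline in this chapter applies to them unchanged.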

War-story: the 1.4-Sharpe strategy nobody could hold

A candidate posted a Sharpe around 1.4, past every gate above, yet its drawdown ran deep into the double digits and took over a year to recover: the kind of trough a strategy gets switched off at the bottom of. That is why Calmar lift, not Sharpe lift, is the primary promotion metric; the full telling is in Beyond Sharpe: the metric suite.

Crucially, all of these metrics are subject to the same five lies: wrong units, survivor math, peeking, future-normalisation, and no error bars. A Calmar computed on a look-ahead equity curve is exactly as worthless as a Sharpe. So the disciplines in this chapter are not "Sharpe rules"; they are measurement rules, and they apply to every number you report.

The full battery (Sortino, Calmar, geometric CAGR, CVaR/CDaR) gets its own treatment in Beyond Sharpe: the metric suite, and turning tail risk into a survival probability at deployed size is the subject of Tail risk & risk of ruin.

Why Titan puts all of this in one module

Each fix is a one-liner. The reason they hold over hundreds of research scripts is that none of them is reimplemented locally. Every Sharpe, Sortino, Calmar, volatility, z-score, and annualisation in the codebase routes through a single shared metrics module, and the module is written so the wrong thing is hard or impossible:

sharpe(...), ewm_vol(...), calmar(...), sortino(...) require periods_per_year: no default.
No Sharpe filters zeros internally.
Only causal and IS-frozen z-scores exist; the full-series version is absent.
Edge cases (empty, constant, NaN series) return 0.0/NaN rather than raising, so a guardrail never crashes a batch; it just refuses to flatter you.

The alternative, every researcher writing def _sharpe(r): return r.mean()/r.std()*np.sqrt(252) at the top of their notebook, guarantees that the lies reappear, independently, forever. A shared module turns "remember to be careful" into "you literally cannot call it the unsafe way." That trade, a tiny bit of ceremony at the call site for a class of bugs that can't recur, is the whole philosophy of this book in miniature.

What 'stating the timeframe' buys you

Put the bar frequency at the top of every backtest report. It sounds trivial. But the act of writing # P&L frequency: H1 forces you to confirm the annualisation factor (252 * 24), which is also the moment you'd notice a mid-pipeline resample, a zero-filter, or a daily factor on hourly data. One disciplined comment catches three distinct mistakes, spanning the first two of the five lies, at once.

Exercises

Spot the lie. A colleague reports a Sharpe of 4.0 on a strategy that trades once a day, annualised with periods_per_year=252 * 24, after filtering out the flat bars. Which two of the five lies are present, and roughly how inflated is the number?

??? success "Answer"
    Lie 1 (wrong units): a daily P&L series annualised as hourly inflates Sharpe by sqrt(24) ≈ 4.9×. Lie 2 (survivor math): dropping flat bars inflates by 1/sqrt(active fraction). Both compound and both flatter, so the real figure is a small fraction of 4.0.

The sixth lie. The five lies are all about measuring a return series. Name the sixth, which corrupts the series before any metric runs, and one of its faces.

??? success "Answer"
    The fill: the backtest booked a price/timing/cost the market wouldn't have given you. Faces include next-bar-open vs same-bar-close, the half-spread crossed each side, and stops filling ~1.5R past the trigger; see the execution layer.

Takeaways

A backtest is a measurement: it needs units (periods_per_year), causality (shift discipline + causal normalisation), and error bars (a bootstrap CI you gate on).
The lies all point the same way: they flatter the strategy. Assume any un-audited number is inflated until you've checked all five.
Gate on the lower bound, not the point estimate. CI_lo ≤ 0 ⇒ unconfirmed, full stop. And use a serially-aware bootstrap, or the lower bound lies too.
Sharpe is the entry point, not the verdict. Drawdown geometry (Calmar), downside risk (Sortino), and the tail (CVaR/CDaR, risk of ruin) decide whether a strategy is livable, and they obey the same measurement rules.
Centralise the metrics so the unsafe version can't be written. The safest API is one where the dangerous operation simply doesn't exist.

This chapter fixed the measurement. The next chapters harden the experiment: Beyond Sharpe: the metric suite builds out the battery of numbers; Walk-forward that's actually out-of-sample makes sure you're not testing on data you trained on; and Beating your own optimiser corrects for the fact that the more parameters you try, the better your best result looks by pure chance.