[WITH CODE] Evaluation: Validation framework
Third layer for multi-testing strategy validation
Table of contents:
Introduction.
Risks and limitations of these techniques.
The Null Hypothesis framework in strategy validation.
Market state-conditioning.
The Timing Control Group Method.
The Stationary Block Bootstrap.
The Stratified Synthetic Data test.
Quantifying alpha with conditional superiority.
The multi-test validation framework.
Introduction
Look, anyone can show you a backtest with a nice Sharpe. I've seen a thousand of them. The question I always ask is simple: Is this real, or did you just get lucky? Did you find a genuine edge, or did you just curve-fit the hell out of the last ten years of data? A pretty equity curve from the past tells you what happened. It tells you nothing about whether it will keep happening.
So, we stop admiring the track record and start trying to break it. We're not here to confirm our genius; we're here to find the holes before the market does it for us, with real money. I run every single promising idea through a gauntlet. Think of it as a three-stage interrogation.
First, we break the market down by regime. Markets aren't one continuous story. They have chapters: quiet, trending periods; vicious, choppy periods; full-blown panic. A strategy that only makes money when vol is low and everything is grinding up is a time bomb. It's useless. I need to see the P&L stratified by volatility quintiles. If the strategy can't make money (or at least not lose its shirt) in different environments, it's fragile. It means the "alpha" is just a symptom of a specific market condition that won't last.
Second, we create fake histories to see if we can break it. The past only happened one way, but it could have been different. We need to know if the strategy was skillful or just fortunate to be born on a friendly path.
We test timing luck. We take your exact same trades but shuffle the entry days. If the performance disappears, it means your logic wasn't the source of profit; you just happened to place bets on lucky days.
We test path dependency. We use a block bootstrap, stitching together random chunks of real market history (a week here, a month there) to create a new, plausible timeline. This preserves the market's statistical DNA, like vol clustering, but scrambles the specific sequence of events. If your strategy dies here, it means it was too dependent on the specific path history took, like needing a big drop in March followed by a rally in April. That's not a robust edge.
Finally, we generate entirely new market data from scratch, based on the statistical properties of the regimes we identified earlier. This is the real acid test. We're asking if the strategy understands the underlying rules of the market, not just the one game we happened to witness.
Third, we redefine what passing means. A p-value that says your results aren't random is the lowest possible bar. I couldn't care less. The real question is: is it meaningfully better than luck?
After running thousands of those synthetic and shuffled histories, we get a distribution of what random luck looks like. I don't care if your strategy beats the average random outcome. I want to know if it consistently beats the top 5% or top 10% of the lucky monkeys. I call this a Conditional Superiority metric. If your alpha can be easily replicated by one of the luckier random processes, it's not alpha. It's an illusion, and we're not allocating capital to illusions.
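To make that bar concrete, here is a minimal sketch of the comparison, not the exact Conditional Superiority metric defined later in the framework; the function name and the placeholder luck distribution are purely illustrative:
import numpy as np

def beats_lucky_monkeys(simulated_pnls: np.ndarray, actual_pnl: float, quantile: float = 0.95) -> bool:
    """Checks whether the actual terminal PnL beats a high quantile of the luck distribution."""
    # The bar set by the luckiest random processes, e.g. the top 5%.
    luck_threshold = np.quantile(simulated_pnls, quantile)
    return bool(actual_pnl > luck_threshold)

# Placeholder luck distribution: 10,000 random terminal PnLs (illustrative only).
rng = np.random.default_rng(42)
simulated = rng.normal(loc=0.0, scale=0.15, size=10_000)
print(beats_lucky_monkeys(simulated, actual_pnl=0.32))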
Putting it all together, this isn't a checklist. It's a process of elimination. Our job isn't to prove a strategy works; it's to try and kill it from every conceivable angle. The computational cost is non-trivial, but it's pocket change compared to the cost of deploying a beautifully backtested, overfit model that blows up six months later.
The few ideas that survive this gauntlet (the ones that show they're not just lucky timing, aren't brittle to the path history took, and still generate an edge against the very best random outcomes) are the ones we can start talking about.
If you want to understand this approach in more depth, I recommend taking a look at what Randomized Controlled Trials are:
In quantitative research, you can apply Randomized Controlled Trials by treating each new signal, factor, or algorithm as the "treatment" and comparing it against a control, typically your existing benchmark model or a naïve allocation rule. You'd randomly assign subsets of securities, time periods, or capital to either the new model or to the control, ensuring that any market regime effects or structural biases are, on average, balanced across both groups.
To reduce analyst bias, you can blind yourself to which allocation is which by having someone else, or an automated wrapper, label datasets only as Group A and Group B, then only unmask after performance metrics (such as Sharpe ratios, maximum drawdowns, and hit rates) have been computed on held-out data. This way, any statistical test, e.g., paired t-tests on daily PnL differences, truly reflects the edge of your new factor rather than inadvertent overfitting or selection bias.
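A minimal sketch of that final, unblinded comparison, assuming you already have two aligned arrays of daily PnL for the blinded allocations; pnl_group_a and pnl_group_b are hypothetical names filled here with random placeholders, and the paired t-test uses scipy.stats.ttest_rel:
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

# Placeholder daily PnL series for the two blinded allocations (one trading year).
pnl_group_a = rng.normal(0.0004, 0.01, size=252)
pnl_group_b = rng.normal(0.0001, 0.01, size=252)

# Metrics are computed before anyone knows which group is the new factor.
sharpe_a = np.sqrt(252) * pnl_group_a.mean() / pnl_group_a.std(ddof=1)
sharpe_b = np.sqrt(252) * pnl_group_b.mean() / pnl_group_b.std(ddof=1)

# Paired t-test on the daily PnL differences between the two groups.
t_stat, p_value = ttest_rel(pnl_group_a, pnl_group_b)
print(f"Sharpe A: {sharpe_a:.2f} | Sharpe B: {sharpe_b:.2f} | t: {t_stat:.2f} | p: {p_value:.4f}")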
Sounds familiar, right? Anything less is just noise.
Risks and limitations of these techniques
While the multi-test framework presented here is a defense against deploying overfit and fragile strategies, it is not a silver bullet. The very tools we use to test for randomness and luck have their own assumptions, parameters, and limitations. A critical quant must be aware of these weaknesses to interpret the results wisely and avoid a false sense of security. Ignoring them is merely trading one form of statistical naivety for another.
Model-dependence and parameter sensitivity:
The entire validation framework is built upon models, and these models have parameters that the researcher must choose. These choices are subjective and can significantly influence the outcome of the tests.
Regime Definition: The market state-conditioning, which underpins the TCG and Synthetic Data tests, depends on the arbitrary choice of a window for calculating rolling volatility and the number of n_regimes. A strategy might appear robust when tested against 5 regimes defined by a 20-day volatility window but fragile when tested against 3 regimes on a 60-day window. This introduces a risk of meta-overfitting, where one might unconsciously select the validation parameters that make their strategy look best (a sensitivity sweep like the sketch after this list is one way to check).
Block Bootstrap parameterization: The Stationary Block Bootstrap's effectiveness hinges on the mean block length. If the blocks are too short, the test will destroy the very autocorrelation it's meant to preserve, unfairly penalizing strategies that rely on short-term momentum. If the blocks are too long, the bootstrapped series will be too similar to the original, making the test trivially easy to pass. The optimal block size is data-dependent and not known a priori.
The Synthetic Data "Difficulty Knob": The
correlation_factor
(Ξ±) in the Stratified Synthetic Data test directly controls the test's difficulty. A value close to 1.0 creates synthetic histories that are barely different from the original, making the test weak. A value close to 0.0 creates paths that might be unrealistically random, potentially causing a genuinely good strategy to fail. The choice of Ξ± is a trade-off between realism and stress-testing, with no single correct answer.
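The sensitivity sweep referenced above is easy to sketch. The snippet below only checks how much the regime labels themselves move when the window and n_regimes knobs change; returns is a hypothetical series of daily returns and the pandas-based labeling is an assumption for illustration, not the framework's own implementation:
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
returns = pd.Series(rng.normal(0, 0.01, size=2_000))  # hypothetical daily returns

def regime_labels(ret: pd.Series, window: int, n_regimes: int) -> pd.Series:
    """Label each day by the quantile bucket of its rolling volatility."""
    vol = ret.rolling(window).std()
    return pd.qcut(vol, q=n_regimes, labels=False)

# Two plausible parameterizations of the same regime model.
labels_a = regime_labels(returns, window=20, n_regimes=5)
labels_b = regime_labels(returns, window=60, n_regimes=3)

# How often do the two settings even agree on what counts as the highest-volatility regime?
high_a = labels_a == labels_a.max()
high_b = labels_b == labels_b.max()
overlap = (high_a & high_b).sum() / max(high_a.sum(), 1)
print(f"Share of 20d/5-regime 'high vol' days also flagged by 60d/3-regime: {overlap:.1%}")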
The "stationarity within regimes" assumption:
A foundational assumption, particularly for the Stratified Synthetic Data test, is that the statistical properties of returns are stationary within a given regime. We calculate a single mean and standard deviation for all data points labeled Regime 3 and assume they all come from the same distribution, N(μ₃, σ₃²).
This is a convenient simplification, but it is not strictly true. A high volatility period is not a monolith. It contains its own internal dynamics, trends, and changing correlations that are not captured by a simple normal distribution. By generating synthetic data this way, we might be testing for robustness against a simplified, cartoon version of the market, potentially missing how the strategy would react to more complex patterns.
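To see exactly what gets lost, here is a minimal sketch of per-regime sampling under that assumption; returns and regime_labels are hypothetical placeholder inputs rather than the framework's actual variables:
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical inputs: one return and one integer regime label per day.
returns = rng.normal(0, 0.01, size=1_000)
regime_labels = rng.integers(0, 5, size=1_000)

synthetic = np.empty_like(returns)
for r in np.unique(regime_labels):
    mask = regime_labels == r
    mu, sigma = returns[mask].mean(), returns[mask].std(ddof=1)
    # Every day in regime r is drawn from the same N(mu_r, sigma_r^2):
    # intra-regime trends, autocorrelation and drifting correlations are all discarded.
    synthetic[mask] = rng.normal(mu, sigma, size=mask.sum())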
Incomplete Null Hypotheses:
Each test is only as good as the null hypothesis (H_0) it is designed to challenge. These null hypotheses are, by definition, simplified models of luck and may not cover all plausible sources of random success.
TCG's H_0: It assumes all valid start times within a regime are equally plausible alternatives. This might not hold. There could be subtle, systematic reasons (intraday patterns, proximity to news events) not captured by our regime definition that make certain moments within a regime far better entry points than others.
Block Bootstrap's H_0: It preserves short-term dependencies but explicitly breaks long-term ones. A strategy designed to exploit multi-quarter or annual cycles would be systematically and unfairly invalidated by this test. Furthermore, it only ever rearranges the past; it cannot create novel market dynamics that a truly robust strategy should survive.
Synthetic Data's H_0: It assumes the specified regime-based model is a complete description of market behavior. It cannot generate black swan events or structural breaks that fall outside the historical distribution of regime parameters. A strategy could pass this test with flying colors and still be wiped out by a true market paradigm shift.
Interpretation risk and the search for thresholds:
The framework produces a dashboard of p-values and Conditional Superiority scores, but it does not provide an infallible rule for making a decision.
Subjective thresholds: What is a good result? Is a p-value of 0.11 acceptable? Is a combined PCS(0.8) of 67% good enough to risk capital? The thresholds for rejecting a null hypothesis or deeming a strategy superior are ultimately subjective; my default is a PCS(0.95) of 98%.
Ignoring economic rationale: A strategy might pass every statistical test but be based on a spurious correlation or a nonsensical economic premise. Statistical robustness is a necessary, but not sufficient, condition. Without a sound underlying theory for why the strategy should work, its statistical success remains suspect and vulnerable to failure when market conditions inevitably change. These tests can tell you if a strategy worked, but they can't tell you why.
These techniques must be applied with a healthy dose of skepticism and a deep understanding of their own inherent limitations.
The Null Hypothesis framework in strategy validation
Before we can validate a strategy, we must first define what we are testing against. The common objective, to see if the strategy makes money, is statistically naive. A more rigorous question is: Does the strategy's performance significantly exceed what could be expected from a plausible null hypothesis?
The null hypothesis, H_0, posits that the observed profits are the result of chance, given the strategy's structural characteristics. Our task is to quantify the evidence against this hypothesis.
The primary tool for this is the p-value. In this context, the p-value represents the probability of observing a performance metric (final Profit and Loss) at least as extreme as the one achieved by the actual strategy, assuming the null hypothesis is true. A low p-value (typically < 0.05 or < 0.10) suggests that the observed performance is unlikely to be a random fluke, allowing us to reject the null hypothesis in favor of the alternative, which is that the strategy possesses genuine predictive skill.
We have already gone deeper into this in another article.
The calculation itself is straightforward. We generate a large number, N, of synthetic performance histories under a specific null hypothesis. Let the terminal PnL of the actual strategy be P_actual and the set of terminal PnLs from the N simulations be P_1, P_2, ..., P_N. The one-sided p-value is then calculated as:

p = \frac{1}{N} \left( \sum_{i=1}^{N} \mathbb{I}(P_i > P_{\text{actual}}) + \frac{1}{2} \sum_{i=1}^{N} \mathbb{I}(P_i = P_{\text{actual}}) \right)

where \mathbb{I}(\cdot) is the indicator function. The term for equality handles ties by distributing their probability mass, providing a more precise estimate.
In our framework, this is implemented with a simple, reusable function that takes the array of simulated profit paths and the actual final profit.
def _calculate_pvalue(self, paths: np.ndarray, actual_pnl_end: float) -> float:
    """
    Calculates the p-value for the actual performance against simulated paths.

    Args:
        paths: An (N, T) array where N is the number of simulations and T is time.
        actual_pnl_end: The final cumulative profit of the actual strategy.

    Returns:
        The p-value.
    """
    if paths.shape[0] == 0:
        return 1.0  # If no paths were generated, we cannot reject the null.
    # Count how many simulated paths ended with a higher PnL.
    greater = np.sum(paths[:, -1] > actual_pnl_end)
    # Count how many simulated paths ended with the exact same PnL.
    equal = np.sum(paths[:, -1] == actual_pnl_end)
    # Apply the formula.
    return (greater + 0.5 * equal) / len(paths)
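For context, a quick usage sketch of the same formula on placeholder data; in the real framework the (N, T) array of paths would come from the null-hypothesis generators described below, not from white noise:
import numpy as np

rng = np.random.default_rng(3)

# 5,000 placeholder cumulative-PnL paths over 252 steps.
paths = rng.normal(0, 0.01, size=(5_000, 252)).cumsum(axis=1)
actual_pnl_end = 0.25

# Same calculation as _calculate_pvalue, applied to the final column.
greater = np.sum(paths[:, -1] > actual_pnl_end)
equal = np.sum(paths[:, -1] == actual_pnl_end)
print((greater + 0.5 * equal) / len(paths))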
The crucial part of this framework is not the calculation itself, but the generation of the synthetic paths. The definition of the null hypothesis dictates how these paths are created. A poorly chosen null hypothesis can lead to misleading p-values.
For instance, assuming returns are independent and identically distributed is a weak null hypothesis, as it ignores known market phenomena like volatility clustering and autocorrelation. The remainder of this article is dedicated to constructing and implementing more realistic and challenging null hypotheses to rigorously test our strategies.
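To illustrate why that null is weak, here is a minimal sketch of one common way to implement it: each synthetic history is just a random permutation of the observed returns, which wipes out volatility clustering and autocorrelation entirely (the returns array is a placeholder):
import numpy as np

rng = np.random.default_rng(5)
returns = rng.standard_t(df=4, size=1_000) * 0.01  # placeholder daily returns

# An i.i.d.-style null: permute the observed returns, destroying any memory in the series.
n_sims = 1_000
iid_paths = np.empty((n_sims, returns.size))
for i in range(n_sims):
    iid_paths[i] = rng.permutation(returns).cumsum()
# These paths could then be fed to a p-value calculation like the one above,
# but the strategy only has to beat a market with no memory at all.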
Market state-conditioning
Financial market returns are not stationary; their statistical properties change over time. One of the most prominent features of market data is volatility clustering, where periods of high volatility are followed by more high volatility, and tranquil periods are followed by more tranquility. A sophisticated trading strategy should, implicitly or explicitly, adapt to these changing conditions. Consequently, a robust validation framework must account for them.
We introduce the concept of market regimes, defined here by the prevailing level of market volatility. By stratifying the time series into distinct volatility regimes, we can construct more plausible null hypotheses. For example, a test can be designed to assess whether a trade's profitability was due to skill or merely due to it being placed in a high-volatility environment where large price swings are common.
The first step is to compute a rolling measure of volatility. We use the rolling standard deviation of returns. An efficient, vectorized way to compute this uses convolution.
Rolling volatility calculation
The sample standard deviation over a window of size w is

s = \sqrt{\frac{1}{w-1} \sum_{i=1}^{w} (x_i - \bar{x})^2}

This can be rewritten in terms of means of powers:

s^2 = \frac{w}{w-1} \left( \overline{x^2} - \bar{x}^2 \right)

We can compute the rolling mean E[X] and the rolling mean of squares E[X²] efficiently using np.convolve.
def _rolling_std(self, arr: np.ndarray, window: int) -> np.ndarray:
    """
    Calculates the rolling standard deviation using convolution for efficiency.
    This applies Bessel's correction (ddof=1).
    """
    if window < 2:
        return np.zeros_like(arr)
    # A flat kernel to compute the moving average.
    kernel = np.ones(window) / window
    # Compute the rolling mean of the array and the array of squares.
    mean = np.convolve(arr, kernel, mode='valid')
    mean_sq = np.convolve(arr**2, kernel, mode='valid')
    # Calculate the sample variance. The term (window / (window - 1)) is Bessel's correction.
    var = (mean_sq - mean**2) * (window / (window - 1))
    # Ensure variance is non-negative due to potential floating point errors.
    std = np.sqrt(np.maximum(var, 0))
    # Pad the result to match the original array's length.
    result = np.zeros_like(arr)
    result[window - 1:] = std
    return result
Regime labeling
Once we have the time series of rolling volatility, we can partition it into quantiles to define our regimes. For example, we can define 5 regimes where Regime 0 is the lowest volatility quintile, and Regime 4 is the highest.
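A minimal sketch of that labeling step, assuming the rolling volatility has already been computed (for instance with _rolling_std above); np.quantile plus np.digitize is one straightforward way to do it, though the framework's own implementation may differ in its details:
import numpy as np

def label_regimes(rolling_vol: np.ndarray, n_regimes: int = 5, window: int = 20) -> np.ndarray:
    """Assigns an integer regime label (0 = lowest vol, n_regimes - 1 = highest) to each time step."""
    # Ignore the warm-up period where the rolling window is not yet full.
    valid = rolling_vol[window - 1:]
    # Interior quantile edges, e.g. the 20/40/60/80th percentiles for quintiles.
    edges = np.quantile(valid, np.linspace(0, 1, n_regimes + 1)[1:-1])
    labels = np.digitize(rolling_vol, edges)
    # Mark the warm-up period as -1 (regime undefined).
    labels[:window - 1] = -1
    return labels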