Trading the Breaking
Alpha Lab

[WITH CODE] Evaluation: Validation framework

Third layer for multi-testing strategy validation

Quant Beckman
Jul 21, 2025

Table of contents:

  1. Introduction.

  2. Risks and limitations of these techniques.

  3. The Null Hypothesis framework in strategy validation.

  4. Market state-conditioning.

  5. The Timing Control Group Method.

  6. The stationary Block Bootstrap.

  7. The Stratified Synthetic data test.

  8. Quantifying alpha with conditional superiority.

  9. The multi-test validation framework.


Introduction

Look, anyone can show you a backtest with a nice Sharpe. I've seen a thousand of them. The question I always ask is simple: Is this real, or did you just get lucky? Did you find a genuine edge, or did you just curve-fit the hell out of the last ten years of data? A pretty equity curve from the past tells you what happened. It tells you nothing about whether it will keep happening.

So, we stop admiring the track record and start trying to break it. We’re not here to confirm our genius; we’re here to find the holes before the market does it for us, with real money. I run every single promising idea through a gauntlet. Think of it as a three-stage interrogation.

First, we break the market down by regime. Markets aren't one continuous story. They have chapters: quiet, trending periods; vicious, choppy periods; full-blown panic. A strategy that only makes money when vol is low and everything is grinding up is a time bomb. It's useless. I need to see the P&L stratified by volatility quintiles. If the strategy can't make money, or at least not lose its shirt, in different environments, it's fragile. It means the "alpha" is just a symptom of a specific market condition that won't last.

Second, we create fake histories to see if we can break it. The past only happened one way, but it could have been different. We need to know if the strategy was skillful or just fortunate to be born on a friendly path.

  1. We test timing luck. We take your exact same trades but shuffle the entry days. If the performance disappears, it means your logic wasn't the source of profit; you just happened to place bets on lucky days.

  2. We test path dependency. We use a block bootstrap, stitching together random chunks of real market history (a week here, a month there) to create a new, plausible timeline. This preserves the market's statistical DNA, like vol clustering, but scrambles the specific sequence of events. If your strategy dies here, it means it was too dependent on the specific path history took, like needing a big drop in March followed by a rally in April. That's not a robust edge. A minimal sketch of this resampling appears right after this list.

  3. Finally, we generate entirely new market data from scratch, based on the statistical properties of the regimes we identified earlier. This is the real acid test. We're asking if the strategy understands the underlying rules of the market, not just the one game we happened to witness.
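To make the block bootstrap concrete, here is a minimal sketch of the resampling idea. It is an illustration under my own assumptions (the function name and parameters such as mean_block are hypothetical), not the exact implementation used later in this framework, and for simplicity it truncates blocks at the end of the sample rather than wrapping around circularly.

import numpy as np

def stationary_block_bootstrap(returns: np.ndarray, mean_block: int = 20,
                               n_paths: int = 1000, seed: int = 0) -> np.ndarray:
    """
    Build synthetic return histories by stitching together random blocks of
    real returns. Block lengths are geometrically distributed with mean
    `mean_block`, which preserves short-range structure such as volatility
    clustering while scrambling the long-run sequence of events.
    """
    rng = np.random.default_rng(seed)
    n = len(returns)
    paths = np.empty((n_paths, n))
    for k in range(n_paths):
        pieces, total = [], 0
        while total < n:
            start = int(rng.integers(0, n))                 # random block start
            length = int(rng.geometric(1.0 / mean_block))   # random block length
            block = returns[start:start + length]           # may be shorter near the end
            pieces.append(block)
            total += len(block)
        paths[k] = np.concatenate(pieces)[:n]
    return paths

Each synthetic path can then be fed back to the strategy to build the distribution of lucky outcomes discussed below.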

Third, we redefine what passing means. A p-value that says your results aren't random is the lowest possible bar. I couldn’t care less. The real question is: is it meaningfully better than luck?

After running thousands of those synthetic and shuffled histories, we get a distribution of what random luck looks like. I don't care if your strategy beats the average random outcome. I want to know if it consistently beats the top 5% or top 10% of the lucky monkeys. I call this a Conditional Superiority metric. If your alpha can be easily replicated by one of the luckier random processes, it's not alpha. It's an illusion, and we're not allocating capital to illusions.
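To give a sense of what such a metric could look like in code, the sketch below scores how often the strategy's actual final PnL beats the q-th quantile of randomly drawn batches of simulated null outcomes. This is only an illustration: the function name, the batching scheme, and the parameters are my own assumptions, and the precise P_CS definition used in this framework is discussed later in the article.

import numpy as np

def conditional_superiority(actual_pnl: float, simulated_pnls: np.ndarray,
                            q: float = 0.95, n_batches: int = 500,
                            batch_size: int = 200, seed: int = 0) -> float:
    """
    Fraction of resampled batches of null outcomes whose q-th quantile the
    actual strategy still beats. Values near 1.0 mean the edge survives even
    the luckiest tail of the null; values near 0.0 mean luck alone can
    replicate it.
    """
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_batches):
        # Draw a random batch of "lucky monkey" outcomes with replacement.
        batch = rng.choice(simulated_pnls, size=batch_size, replace=True)
        if actual_pnl > np.quantile(batch, q):
            wins += 1
    return wins / n_batches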

Putting it all together, this isn't a checklist. It’s a process of elimination. Our job isn't to prove a strategy works; it's to try and kill it from every conceivable angle. The computational cost is non-trivial, but it's pocket change compared to the cost of deploying a beautifully backtested, overfit model that blows up six months later.

The few ideas that survive this gauntlet (the ones that show they're not just lucky timing, aren't brittle to the path history took, and still generate an edge against the very best random outcomes) are the ones we can start talking about.

If you want to understand this approach in more depth, I recommend taking a look at what Randomized Controlled Trials are:

Understanding And Misunderstanding Randomized Controlled Trials (PDF)

In quantitative research, you can apply Randomized Controlled Trials by treating each new signal, factor, or algorithm as the "treatment" and comparing it against a control, typically your existing benchmark model or a naive allocation rule. You'd randomly assign subsets of securities, time periods, or capital to either the new model or the control, ensuring that any market regime effects or structural biases are, on average, balanced across both groups.

To reduce analyst bias, you can blind yourself to which allocation is which by having someone else, or an automated wrapper, label the datasets only as Group A and Group B, then unmask only after performance metrics such as Sharpe ratios, maximum drawdowns, and hit rates have been computed on held-out data. This way, any statistical test (e.g., paired t-tests on daily PnL differences) truly reflects the edge of your new factor rather than inadvertent overfitting or selection bias.
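As a minimal sketch of that blinding step, assuming you already have the daily PnL series of the new model and of the control aligned on the same dates (the function name is hypothetical; scipy's paired t-test is used for the daily differences):

import numpy as np
import pandas as pd
from scipy import stats

def blinded_ab_test(pnl_new: pd.Series, pnl_control: pd.Series, seed: int = 7):
    """
    Hide the two daily-PnL series behind the labels 'A' and 'B', compute
    metrics per blinded group, run a paired t-test on the daily differences,
    and return the unblinding key last so it is only consulted afterwards.
    """
    rng = np.random.default_rng(seed)
    # Randomly decide which blinded label maps to the new model.
    key = {"A": "new", "B": "control"} if rng.random() < 0.5 else {"A": "control", "B": "new"}
    blinded = {label: (pnl_new if source == "new" else pnl_control)
               for label, source in key.items()}

    # Metrics are computed on blinded labels only (annualized Sharpe, hit rate).
    metrics = {label: {"sharpe": float(s.mean() / s.std() * np.sqrt(252)),
                       "hit_rate": float((s > 0).mean())}
               for label, s in blinded.items()}

    # Paired t-test on daily PnL differences (A minus B).
    t_stat, p_value = stats.ttest_rel(blinded["A"], blinded["B"])
    return metrics, (float(t_stat), float(p_value)), key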

Sounds familiar, right? Anything less is just noise.

Risks and limitations of these techniques

While the multi-test framework presented here is a defense against deploying overfit and fragile strategies, it is not a silver bullet. The very tools we use to test for randomness and luck have their own assumptions, parameters, and limitations. A critical quant must be aware of these weaknesses to interpret the results wisely and avoid a false sense of security. Ignoring them is merely trading one form of statistical naivety for another.

  1. Model-dependence and parameter sensitivity:

    The entire validation framework is built upon models, and these models have parameters that the researcher must choose. These choices are subjective and can significantly influence the outcome of the tests.

    1. Regime Definition: The market state-conditioning, which underpins the TCG and Synthetic Data tests, depends on the arbitrary choice of a window for calculating rolling volatility and of the number of regimes (n_regimes). A strategy might appear robust when tested against 5 regimes defined by a 20-day volatility window but fragile when tested against 3 regimes on a 60-day window. This introduces a risk of meta-overfitting, where one might unconsciously select the validation parameters that make their strategy look best.

    2. Block Bootstrap parameterization: The Stationary Block Bootstrap's effectiveness hinges on the mean block length. If the blocks are too short, the test will destroy the very autocorrelation it's meant to preserve, unfairly penalizing strategies that rely on short-term momentum. If the blocks are too long, the bootstrapped series will be too similar to the original, making the test trivially easy to pass. The optimal block size is data-dependent and not known a priori.

    3. The Synthetic Data "Difficulty Knob": The correlation_factor (α) in the Stratified Synthetic Data test directly controls the test's difficulty. A value close to 1.0 creates synthetic histories that are barely different from the original, making the test weak. A value close to 0.0 creates paths that might be unrealistically random, potentially causing a genuinely good strategy to fail. The choice of α is a trade-off between realism and stress-testing, with no single correct answer.

  2. The "stationarity within regimes" assumption:

    A foundational assumption, particularly for the Stratified Synthetic Data test, is that the statistical properties of returns are stationary within a given regime. We calculate a single mean and standard deviation for all data points labeled Regime 3 and assume they all come from the same distribution, \(\mathcal{N}(\mu_3, \sigma_3^2)\).

    This is a convenient simplification, but it is not strictly correct. A high volatility period is not a monolith. It contains its own internal dynamics, trends, and changing correlations that are not captured by a simple normal distribution. By generating synthetic data this way, we might be testing for robustness against a simplified, cartoon version of the market, potentially missing how the strategy would react to more complex patterns.

  3. Incomplete Null Hypotheses:

    Each test is only as good as the null hypothesis (H_0) it is designed to challenge. These null hypotheses are, by definition, simplified models of luck and may not cover all plausible sources of random success.

    1. TCG's H_0: It assumes all valid start times within a regime are equally plausible alternatives. This might not hold. There could be subtle, systematic reasons (intraday patterns, proximity to news events) not captured by our regime definition that make certain moments within a regime far better entry points than others.

    2. Block Bootstrap's H_0: It preserves short-term dependencies but explicitly breaks long-term ones. A strategy designed to exploit multi-quarter or annual cycles would be systematically and unfairly invalidated by this test. Furthermore, it only ever rearranges the past; it cannot create novel market dynamics that a truly robust strategy should survive.

    3. Synthetic Data's H_0: It assumes the specified regime-based model is a complete description of market behavior. It cannot generate black swan events or structural breaks that fall outside the historical distribution of regime parameters. A strategy could pass this test with flying colors and still be wiped out by a true market paradigm shift.

  4. Interpretation risk and the search for thresholds:

    The framework produces a dashboard of p-values and Conditional Superiority scores, but it does not provide an infallible rule for making a decision.

    1. Subjective thresholds: What is a good result? Is a p-value of 0.11 acceptable? Is a combined P_CS(0.8) of 67% good enough to risk capital? The thresholds for rejecting a null hypothesis or deeming a strategy superior are ultimately subjective; my default is to require a P_CS(0.95) of 98%.

    2. Ignoring economic rationale: A strategy might pass every statistical test but be based on a spurious correlation or a nonsensical economic premise. Statistical robustness is a necessary, but not sufficient, condition. Without a sound underlying theory for why the strategy should work, its statistical success remains suspect and vulnerable to failure when market conditions inevitably change. These tests can tell you if a strategy worked, but they can't tell you why.

This approach must be applied with a healthy dose of skepticism and a deep understanding of the inherent limitations of each test.

The Null Hypothesis framework in strategy validation

Before we can validate a strategy, we must first define what we are testing against. The common objective, to see if the strategy makes money, is statistically naive. A more rigorous question is: Does the strategy's performance significantly exceed what could be expected from a plausible null hypothesis?

The null hypothesis, H_0, posits that the observed profits are the result of chance, given the strategy's structural characteristics. Our task is to quantify the evidence against this hypothesis.

The primary tool for this is the p-value. In this context, the p-value represents the probability of observing a performance metric (here, final Profit and Loss) at least as extreme as the one achieved by the actual strategy, assuming the null hypothesis is true. A low p-value (typically < 0.05 or < 0.10) suggests that the observed performance is unlikely to be a random fluke, allowing us to reject the null hypothesis in favor of the alternative, which is that the strategy possesses genuine predictive skill.

We already went deeper into this in an earlier article:

Trading the Breaking: [WITH CODE] Backtesting

The calculation itself is straightforward. We generate a large number, N, of synthetic performance histories under a specific null hypothesis. Let the terminal PnL of the actual strategy be \(P_{\mathrm{actual}}\) and the set of terminal PnLs from the N simulations be \(P_1, P_2, \ldots, P_N\). The one-sided p-value is then calculated as:

\(p \;=\; \frac{ \displaystyle\sum_{i=1}^N \mathbb{I}\bigl(P_i > P_{\mathrm{actual}}\bigr) \;+\; \tfrac12 \displaystyle\sum_{i=1}^N \mathbb{I}\bigl(P_i = P_{\mathrm{actual}}\bigr) }{N}\)

where \(\mathbb{I}(\cdot)\) is the indicator function. The term for equality handles ties by distributing their probability mass, providing a more precise estimate.

In our framework, this is implemented with a simple, reusable function that takes the array of simulated profit paths and the actual final profit.

def _calculate_pvalue(self, paths: np.ndarray, actual_pnl_end: float) -> float:
    """
    Calculates the p-value for the actual performance against simulated paths.

    Args:
        paths: An (N, T) array where N is the number of simulations and T is time.
        actual_pnl_end: The final cumulative profit of the actual strategy.

    Returns:
        The p-value.
    """
    if paths.shape[0] == 0:
        return 1.0 # If no paths were generated, we cannot reject the null.
    
    # Count how many simulated paths ended with a higher PnL.
    greater = np.sum(paths[:, -1] > actual_pnl_end)
    
    # Count how many simulated paths ended with the exact same PnL.
    equal = np.sum(paths[:, -1] == actual_pnl_end)
    
    # Apply the formula.
    return (greater + 0.5 * equal) / len(paths)

The crucial part of this framework is not the calculation itself, but the generation of the synthetic paths. The definition of the null hypothesis dictates how these paths are created. A poorly chosen null hypothesis can lead to misleading p-values.

For instance, assuming returns are independent and identically distributed is a weak null hypothesis, as it ignores known market phenomena like volatility clustering and autocorrelation. The remainder of this article is dedicated to constructing and implementing more realistic and challenging null hypotheses to rigorously test our strategies.
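For contrast, this is roughly what that weak i.i.d. null looks like in code: each simulation permutes the return series against a fixed daily position vector, destroying volatility clustering and autocorrelation in the process. The sketch is only meant to illustrate the baseline being criticized (the names and the position-vector convention are my own assumptions); the resulting paths array has exactly the shape that _calculate_pvalue expects.

import numpy as np

def iid_shuffle_null(positions: np.ndarray, returns: np.ndarray,
                     n_sims: int = 1000, seed: int = 0) -> np.ndarray:
    """
    Naive i.i.d. null: independently permute the return series for each
    simulation and recompute the cumulative PnL against the same daily
    positions. Because permutation destroys volatility clustering and
    autocorrelation, this null is easy to beat and tends to flatter a strategy.
    """
    rng = np.random.default_rng(seed)
    paths = np.empty((n_sims, len(returns)))
    for i in range(n_sims):
        shuffled = rng.permutation(returns)
        paths[i] = np.cumsum(positions * shuffled)
    return paths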

Market state-conditioning

Financial market returns are not stationary; their statistical properties change over time. One of the most prominent features of market data is volatility clustering, where periods of high volatility are followed by more high volatility, and tranquil periods are followed by more tranquility. A sophisticated trading strategy should, implicitly or explicitly, adapt to these changing conditions. Consequently, a robust validation framework must account for them.

We introduce the concept of market regimes, defined here by the prevailing level of market volatility. By stratifying the time series into distinct volatility regimes, we can construct more plausible null hypotheses. For example, a test can be designed to assess whether a trade's profitability was due to skill or merely due to it being placed in a high-volatility environment where large price swings are common.

The first step is to compute a rolling measure of volatility. We use the rolling standard deviation of returns. A numerically stable and efficient method to compute this utilizes convolution.

Rolling volatility calculation

The sample standard deviation over a window of size w is

\(s_w \;=\; \sqrt{\frac{1}{w-1}\sum_{i=1}^w \bigl(x_i - \bar{x}_w\bigr)^{2}}.\)

This can be rewritten in terms of means of powers:

\(s_w \;=\; \sqrt{\frac{w}{w-1}\Bigl(\mathbb{E}[X^2] \;-\;\bigl(\mathbb{E}[X]\bigr)^{2}\Bigr)}.\)

We can compute the rolling mean \(\mathbb{E}[X]\) and the rolling mean of squares \(\mathbb{E}[X^2]\) efficiently using np.convolve.

def _rolling_std(self, arr: np.ndarray, window: int) -> np.ndarray:
    """
    Calculates the rolling standard deviation using convolution for efficiency.
    This applies Bessel's correction (ddof=1).
    """
    if window < 2:
        return np.zeros_like(arr)
    
    # A flat kernel to compute the moving average.
    kernel = np.ones(window) / window
    
    # Compute the rolling mean of the array and the array of squares.
    mean = np.convolve(arr, kernel, mode='valid')
    mean_sq = np.convolve(arr**2, kernel, mode='valid')
    
    # Calculate the sample variance. The term (window / (window - 1)) is Bessel's correction.
    var = (mean_sq - mean**2) * (window / (window - 1))
    
    # Ensure variance is non-negative due to potential floating point errors.
    std = np.sqrt(np.maximum(var, 0))
    
    # Pad the result to match the original array's length.
    result = np.zeros_like(arr)
    result[window - 1:] = std
    return result

Regime labeling

Once we have the time series of rolling volatility, we can partition it into quantiles to define our regimes. For example, we can define 5 regimes where Regime 0 is the lowest volatility quintile, and Regime 4 is the highest.
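The partitioning itself is straightforward. As a minimal, hedged sketch (the function name and the handling of the rolling window's warm-up zeros are my own assumptions), the quantile edges can be estimated from the rolling-volatility series and each time step bucketed with np.digitize:

import numpy as np

def label_regimes(rolling_vol: np.ndarray, n_regimes: int = 5) -> np.ndarray:
    """
    Label each time step with a volatility regime in {0, ..., n_regimes - 1},
    where 0 is the calmest bucket and n_regimes - 1 the most volatile.
    """
    # Exclude the warm-up zeros produced by the rolling window when
    # estimating the quantile edges.
    valid = rolling_vol[rolling_vol > 0]
    # Interior quantile edges, e.g. the 20/40/60/80% points for quintiles.
    edges = np.quantile(valid, np.linspace(0, 1, n_regimes + 1)[1:-1])
    # np.digitize maps values below the first edge to 0 and above the last
    # edge to n_regimes - 1; warm-up zeros therefore land in regime 0.
    return np.digitize(rolling_vol, edges)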
