Trading the Breaking

[WITH CODE] Data: The criteria you need for market data

Choose data that aligns with your trading strategy and time horizon

Mar 20, 2025

Table of contents:

  1. Introduction.

  2. The market data supply chain and data quality criteria.

  3. Quantifying data accuracy.

  4. Assessing granularity.

  5. Latency optimization.

  6. Data completeness and integration.

  7. Matrix-based optimization.



Introduction

Let’s face it: most trading strategies fail not because the math was wrong, but because the data was garbage. Imagine spending weeks building a Ferrari of an algorithm, only to fuel it with yesterday’s lawnmower gas. It’ll sputter, stall, and leave you stranded on the highway of regret.

Choosing data that aligns with your trading strategy and time horizon, while meeting quality, granularity, and low-latency criteria, is crucial for optimizing algorithm performance in dynamic markets.

Let's dive into the world of data-driven algorithmic trading!

The market data supply chain and data quality criteria

The journey of market data from its genesis to its utilization can be broken down into several stages. Each stage adds value and transforms raw data into a more refined product ready for consumption by traders and algorithmic systems.

  1. Exchanges: The primary source of market data, where every tick, trade, and order book update is generated. Here, the data is raw and unprocessed—akin to unrefined ore that must be mined for gold.

  2. Hosting providers & ticker plants: These intermediaries collect data from various exchanges and perform normalization. Normalization standardizes different data formats so that downstream systems can process them uniformly. This stage reduces the engineering burden on quants, who otherwise might need to wrangle disparate data formats.

  3. Feed providers: After normalization, feed providers distribute the data via APIs. These feeds supply real-time or near-real-time market data to traders, thus enabling rapid decision-making in high-speed trading environments.

  4. OMS/EMS software providers: Finally, the data is used by Order Management Systems and Execution Management Systems to assist in executing trades and managing orders. At this stage, the data is processed, user-friendly, and often enriched with additional analytics.

The transformation process can be summarized as:

\(\text{Raw Data} \rightarrow \text{Normalized data} \rightarrow \text{API feed} \rightarrow \text{Processed for trading systems}\)

This journey ensures that the data fed into trading algorithms is accurate, timely, and easily digestible.
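To make the normalization stage concrete, here is a minimal sketch of what a ticker plant does conceptually. The two venue message layouts, their field names (`px_cents`, `ts_ms`, `ticker`, `time_us`), and the `Tick` schema are all hypothetical and chosen only for illustration; real exchange protocols differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """Common schema that downstream systems consume (illustrative)."""
    symbol: str
    price: float
    size: float
    ts_ns: int  # event time in nanoseconds since the epoch


def normalize_venue_a(msg: dict) -> Tick:
    # Hypothetical venue A: integer cents, millisecond timestamps
    return Tick(
        symbol=msg["sym"],
        price=msg["px_cents"] / 100.0,
        size=float(msg["qty"]),
        ts_ns=msg["ts_ms"] * 1_000_000,
    )


def normalize_venue_b(msg: dict) -> Tick:
    # Hypothetical venue B: decimal strings, microsecond timestamps
    return Tick(
        symbol=msg["ticker"].upper(),
        price=float(msg["price"]),
        size=float(msg["volume"]),
        ts_ns=msg["time_us"] * 1_000,
    )


# Made-up raw messages in the two venue formats
raw_a = {"sym": "ABC", "px_cents": 10025, "qty": 300, "ts_ms": 1_700_000_000_000}
raw_b = {"ticker": "abc", "price": "100.27", "volume": "150",
         "time_us": 1_700_000_000_000_500}

# After normalization both messages share one schema and one clock unit,
# so they can be merged into a single time-ordered stream.
ticks = sorted([normalize_venue_b(raw_b), normalize_venue_a(raw_a)],
               key=lambda t: t.ts_ns)
for t in ticks:
    print(t)
```

Once every venue maps to the same schema, the strategy code only has to handle one format, which is the engineering burden the normalization stage removes.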

But what about the quality? Do we need some kind of criteria?

The answer is yes. The following criteria must be met:

  • Accuracy: The data must be free of errors. Mathematically, if we let D represent the dataset and ϵ be the error term, we require

    \(\epsilon = 0 \quad \text{or at least} \quad |\epsilon| < \delta\)

    where δ is a small tolerance level.

  • Granularity: Data should provide sufficient detail for the strategy. For example, a high-frequency trading algorithm requires tick-level granularity. We can express granularity G in terms of the number of data points per unit time:

    \(G=\frac{N}{T}\)

    where N is the number of data points and T is the time period.

  • Latency: The delay between data generation and reception must be minimized. If L denotes latency, our goal is to achieve L→0.

  • Field completeness: The dataset must include all relevant fields—price, volume, bid-ask spreads, etc. We define a completeness vector c where each element represents the availability of a required field:

    \(\mathbf{c} \in \{0,1\}^k,\)

    and we require that the sum of elements equals k—the total number of fields.

  • Integration and compatibility: The data should integrate smoothly with the trading system’s architecture. This can be modeled as a function I(D,S) that measures the integration quality between data D and system S. Our target is to maximize I.
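The criteria above can be turned into a simple pass/fail checklist. The sketch below is one possible way to do it, not a standard. The `FeedStats` container, the threshold names (`delta`, `g_min`, `l_max`), and the sample numbers are assumptions made for illustration. Since L→0 is a goal rather than something you can test, the sketch uses a latency budget `l_max` instead, and it leaves out the integration function I(D,S) because that depends on your own system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeedStats:
    """Summary statistics for a candidate feed (hypothetical container)."""
    max_abs_error: float       # |epsilon| measured against a trusted reference
    n_points: int              # N
    period_seconds: float      # T
    latency_seconds: float     # L
    fields_present: List[int]  # completeness vector c, entries are 0 or 1


def check_feed(stats: FeedStats, delta: float, g_min: float, l_max: float) -> dict:
    """Return a pass/fail flag for each measurable criterion."""
    granularity = stats.n_points / stats.period_seconds  # G = N / T
    k = len(stats.fields_present)
    return {
        "accuracy": stats.max_abs_error < delta,          # |epsilon| < delta
        "granularity": granularity >= g_min,              # G >= G_min
        "latency": stats.latency_seconds <= l_max,        # L within budget
        "completeness": sum(stats.fields_present) == k,   # sum(c) == k
    }


# Made-up numbers: 12,000 points in 60 s, 15 ms latency, one field missing
stats = FeedStats(max_abs_error=0.004, n_points=12_000, period_seconds=60.0,
                  latency_seconds=0.015, fields_present=[1, 1, 1, 0])
report = check_feed(stats, delta=0.01, g_min=100.0, l_max=0.05)
print(report)
```

In this example the feed passes on accuracy, granularity, and latency but fails on completeness, because one of the four required fields is absent.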

Most retail traders don't have specific criteria. They drift, settling for whatever's available and free. Even many small and medium-sized firms, at best, purchase data from a recognized data provider, and whatever happens... a little cleaning and off they go!

Quantifying data accuracy

Suppose we have a dataset D consisting of n observations. Each observation is represented as a vector \(\mathbf{d}_i \in \mathbb{R}^m\), where m is the number of features. The overall dataset can be represented by a matrix:

\(\mathbf{D} = \begin{bmatrix} \mathbf{d}_1 \\ \mathbf{d}_2 \\ \vdots \\ \mathbf{d}_n \end{bmatrix} \in \mathbb{R}^{n \times m}.\)

Let E be the error matrix such that:

\(\mathbf{E} = \mathbf{D} - \mathbf{D}_{\text{true}},\)

where \(\mathbf{D}_{\text{true}}\) is the ideal dataset. The Frobenius norm \(\|\mathbf{E}\|_F\) gives a measure of the overall error:

\(\|\mathbf{E}\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} E_{ij}^2}.\)

A small value of \(\|\mathbf{E}\|_F\) indicates high accuracy. In practical terms, our goal is to have:

\(\|\mathbf{E}\|_F \ll \|\mathbf{D}_{\text{true}}\|_F.\)

If your error norm is larger than your true data norm, you might as well be trading based on random guesses!

Let’s see an example of this.

import numpy as np
import matplotlib.pyplot as plt

def compute_error_norm(D_true, D_candidate):
    """
    Quantifying Data Accuracy

    This snippet computes the Frobenius norm error between a true dataset and candidate datasets 
    with varying noise levels. In our paper, we described this metric as:
        ‖E‖₍F₎ = √(Σᵢ Σⱼ (D_candidate - D_true)²)
    which is a key indicator of data accuracy.
    """
    error_matrix = D_candidate - D_true
    return np.linalg.norm(error_matrix, 'fro')

# Simulate a true asset price series (e.g., a linear trend from 100 to 110)
np.random.seed(42)
n_points = 100
true_prices = np.linspace(100, 110, n_points)  # true trend
D_true = true_prices.reshape(n_points, 1)  # shape: (100, 1)

# Evaluate error for candidate datasets with increasing noise
noise_levels = np.linspace(0, 0.5, 20)  # Noise standard deviation from 0 to 0.5
error_norms = []
for noise in noise_levels:
    candidate_prices = true_prices + np.random.normal(0, noise, n_points)
    D_candidate = candidate_prices.reshape(n_points, 1)
    error = compute_error_norm(D_true, D_candidate)
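    error_norms.append(error)

# Plot the error norm against the noise level
plt.plot(noise_levels, error_norms, marker="o")
plt.xlabel("Noise standard deviation")
plt.ylabel("Frobenius norm of the error")
plt.title("Data accuracy vs. noise level")
plt.show()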

The output would be:

As the noise level increases, the error norm grows, indicating lower data accuracy. This behavior directly reflects our theoretical derivation that higher noise leads to higher error.
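The condition that the error norm should be much smaller than the norm of the true data can also be checked directly as a ratio. The sketch below is only an illustration: the helper name `relative_error`, the noise level, and the "garbage" dataset (random numbers unrelated to the true prices) are assumptions, not part of the original example.

```python
import numpy as np


def relative_error(D_true, D_candidate):
    """Ratio ||E||_F / ||D_true||_F; values far below 1 indicate accurate data."""
    error_norm = np.linalg.norm(D_candidate - D_true, "fro")
    return error_norm / np.linalg.norm(D_true, "fro")


rng = np.random.default_rng(0)
D_true = np.linspace(100, 110, 100).reshape(-1, 1)

# A mildly noisy copy of the true series
D_noisy = D_true + rng.normal(0, 0.5, D_true.shape)

# A dataset that has nothing to do with the true prices
D_garbage = rng.normal(0, 150, D_true.shape)

rel_noisy = relative_error(D_true, D_noisy)
rel_garbage = relative_error(D_true, D_garbage)
print("Relative error, noisy feed:  ", rel_noisy)
print("Relative error, garbage feed:", rel_garbage)
```

A ratio has the advantage of being scale-free: the same threshold can be applied to a 100-dollar stock and a 5-dollar one, whereas the raw Frobenius norm grows with price level and sample size.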

Assessing granularity

Granularity is simply the ratio of N observations to the period T introduced earlier, so no complex formula is needed. For an algorithm that operates on tick data, a high value of G is essential. If G is too low, the algorithm may miss important microstructure details of the market. To ensure adequate granularity, we can set a minimum threshold \(G_{\text{min}}\):

\( G \geq G_{\text{min}}.\)

Let’s use a toy example to illustrate this. The details will differ on your own infrastructure, but it should give you the general picture.

import numpy as np
import matplotlib.pyplot as plt

"""
Assessing Granularity

In the context of algorithmic trading, granularity (observations per unit time) is crucial.
This snippet simulates trade tick timestamps using an exponential distribution to mimic high-frequency data.
"""

# Simulate trade tick timestamps
np.random.seed(42)
n_ticks = 500  # number of ticks in a trading session
# Assume interarrival times follow an exponential distribution (mean = 0.2 seconds)
interarrival_times = np.random.exponential(scale=0.2, size=n_ticks)
timestamps = np.cumsum(interarrival_times)

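# Calculate granularity as the number of ticks per second.
# The span from the first to the last tick covers n_ticks - 1 intervals.
total_duration = timestamps[-1] - timestamps[0]
granularity = (n_ticks - 1) / total_duration
print("Estimated granularity (ticks per second):", granularity)

# Plot the distribution of time intervals between ticks
plt.hist(interarrival_times, bins=30, edgecolor="black")
plt.xlabel("Interarrival time (seconds)")
plt.ylabel("Number of ticks")
plt.title("Distribution of time between ticks")
plt.show()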

You would get a histogram like this one:

Basically, the histogram shows the distribution of time intervals between trade ticks. The average granularity indicates the data resolution, which is key for capturing market microstructure.
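An average can hide quiet stretches, so it may help to test the threshold condition G ≥ Gmin window by window rather than once over the whole session. The sketch below does this with one-second windows. The threshold `g_min = 2.0` and the window length are arbitrary values chosen for illustration, and the tick timestamps are simulated in the same way as above.

```python
import numpy as np

# Simulated tick timestamps (seconds), exponential interarrival times
rng = np.random.default_rng(42)
timestamps = np.cumsum(rng.exponential(scale=0.2, size=500))

g_min = 2.0    # hypothetical minimum ticks per second
window = 1.0   # window length in seconds

# Count ticks in each complete one-second window
n_windows = int(np.floor(timestamps[-1] / window))
edges = np.arange(0, n_windows + 1) * window
counts, _ = np.histogram(timestamps, bins=edges)
per_window_g = counts / window

# Windows where the feed is too sparse for the strategy
below = np.flatnonzero(per_window_g < g_min)
print("Windows checked:", n_windows)
print("Windows below G_min:", len(below))
print("Share of session below G_min:", len(below) / n_windows)
```

A feed whose average granularity looks fine can still fail this check in a meaningful share of windows, which is exactly when a tick-driven strategy would be trading half-blind.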

Latency optimization

Latency L is a critical metric in algorithmic trading. If we model the data delivery system as a network with delays, we can use the following expression:

\(L = \sum_{i=1}^{p} \Delta t_i,\)

where \(\Delta t_i\) represents the delay at each stage i of the market data supply chain. The objective is to minimize L such that:

\(L \to \min.\)

This snippet simulates the impact of decreasing latency on an overall quality score. The quality function aggregates inverse error (accuracy), granularity, inverse latency, and completeness, using the criteria defined previously.

import numpy as np
import matplotlib.pyplot as plt

def compute_quality(D_true, D_candidate, latency, granularity, completeness, weights):
    """
    Latency Optimization

    This snippet simulates a reduction in data feed latency and evaluates the impact on the overall quality score.
    The quality function used here is:
        Q = w1*(1/‖E‖₍F₎) + w2*G + w3*(1/L) + w4*C
    which highlights the benefit of lower latency.
    """
    eps = 1e-6  # small constant to avoid division by zero
    error_norm = np.linalg.norm(D_candidate - D_true, 'fro')
    q_accuracy = 1 / (error_norm + eps)
    q_latency = 1 / (latency + eps)
    Q = weights[0]*q_accuracy + weights[1]*granularity + weights[2]*q_latency + weights[3]*completeness
    return Q

# Simulate a true asset price series (a slight upward trend)
np.random.seed(42)
n_points = 100
D_true = np.linspace(100, 105, n_points).reshape(n_points, 1)
# Candidate dataset with slight noise
D_candidate = D_true + np.random.normal(0, 0.02, (n_points, 1))

# Set constant granularity and completeness as in our supply chain discussion
granularity = n_points / 1.0  # e.g., 100 ticks per second
completeness = 1.0           # full completeness

# Define weights emphasizing latency (as discussed in our paper)
weights = [0.4, 0.2, 0.3, 0.1]

# Simulate latency reduction (from 50ms to 1ms)
latency_values = np.linspace(0.05, 0.001, 50)
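quality_scores = [compute_quality(D_true, D_candidate, lat, granularity, completeness, weights)
                  for lat in latency_values]

# Plot the quality score against latency (in milliseconds)
plt.plot(latency_values * 1000, quality_scores)
plt.xlabel("Latency (ms)")
plt.ylabel("Quality score Q")
plt.title("Quality score vs. latency")
plt.show()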

The output would be:

The plot clearly shows that as latency decreases, the quality score improves markedly. The inverse relationship—1/latency—in the quality function underlines the importance of low-latency data feeds.
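Since total latency is the sum of the per-stage delays, a useful first step before optimizing anything is a latency budget that shows which stage dominates. The sketch below assumes four stages that mirror the supply chain described earlier. The delay figures are invented for illustration and are not measurements of any real feed.

```python
# Hypothetical per-stage delays in milliseconds (illustrative values only)
stage_delays_ms = {
    "exchange_to_ticker_plant": 0.8,
    "normalization": 0.3,
    "feed_api": 2.5,
    "oms_ems_processing": 1.4,
}

# L = sum of the delays at each stage
total_latency_ms = sum(stage_delays_ms.values())

# The stage with the largest delay is the first candidate for optimization
worst_stage = max(stage_delays_ms, key=stage_delays_ms.get)
worst_share = stage_delays_ms[worst_stage] / total_latency_ms

print(f"Total latency: {total_latency_ms:.2f} ms")
for stage, delay in stage_delays_ms.items():
    print(f"  {stage:<26} {delay:5.2f} ms  ({delay / total_latency_ms:.0%})")
print(f"Largest contributor: {worst_stage} ({worst_share:.0%} of total)")
```

Because L is additive, shaving time from a stage that contributes only a small share barely moves the total; the breakdown tells you where the effort pays off.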

Data completeness and integration

Let the completeness vector for an observation \(\mathbf{d}_i\) be defined as:

\(\mathbf{c}_i = \begin{bmatrix} c_{i,1} \\ c_{i,2} \\ \vdots \\ c_{i,k} \end{bmatrix},\)

where \(c_{i,j} = 1\) if field j is present and 0 otherwise. The overall completeness of the dataset is:

\(C = \frac{1}{n}\sum_{i=1}^{n} \sum_{j=1}^{k} c_{i,j}.\)

A perfect dataset would have C=k. If the integration function I(D,S) is linear with respect to completeness, then we can express:

\(I(D,S) = \alpha C,\)

where α is a scaling factor that depends on system compatibility. Our goal is to maximize I.

This snippet simulates trade record datasets where required fields might be missing. It computes the average completeness ratio and visualizes how it degrades as the probability of missing a field increases.

import numpy as np
import matplotlib.pyplot as plt
import random

def compute_completeness(dataset, required_fields):
    """
    Compute the average completeness ratio for a dataset.
    """
    total_score = 0.0
    for record in dataset:
        present = sum(1 for field in required_fields if field in record)
        total_score += present / len(required_fields)
    return total_score / len(dataset) if dataset else 0

# Define required fields (as per our discussion on data integration)
required_fields = ["price", "volume", "timestamp"]

# Simulate dataset completeness over a range of missing field probabilities
missing_probs = np.linspace(0, 0.9, 10)  # from 0 (all fields present) to 0.9 (high missing probability)
completeness_ratios = []
num_records = 200

for p in missing_probs:
    dataset = []
    for _ in range(num_records):
        record = {}
        # Each required field is included with probability (1 - p)
        if np.random.rand() > p:
            record["price"] = round(random.uniform(100, 150), 2)
        if np.random.rand() > p:
            record["volume"] = random.randint(10, 1000)
        if np.random.rand() > p:
            record["timestamp"] = random.randint(1600000000, 1700000000)
        dataset.append(record)
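    completeness_ratios.append(compute_completeness(dataset, required_fields))

# Plot the average completeness ratio against the missing-field probability
plt.plot(missing_probs, completeness_ratios, marker="o")
plt.xlabel("Probability that a field is missing")
plt.ylabel("Average completeness ratio")
plt.title("Completeness vs. missing-field probability")
plt.show()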

The output would be:

As the probability of missing a field increases, fewer fields are present in each record on average, so the completeness ratio drops. Essentially, the more likely each field is to be missing, the less complete your overall dataset becomes.
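The completeness score C and the linear integration model I(D,S) = αC can also be computed directly from a small binary matrix. In the sketch below the matrix entries and the value `alpha = 0.8` are made up for illustration. Note that C as defined in the formula ranges from 0 to k, while the simulation above reports the ratio C/k, which ranges from 0 to 1.

```python
import numpy as np

# Rows are observations, columns are fields (price, volume, timestamp).
# 1 means the field is present, 0 means it is missing.
c = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 1],
])
n, k = c.shape

# C = (1/n) * sum over i and j of c_ij: 10 present fields / 4 records = 2.5
C = c.sum() / n

# Normalized version, comparable to the completeness ratio in the simulation
completeness_ratio = C / k

# Linear integration model I(D, S) = alpha * C, with a hypothetical alpha
alpha = 0.8
integration_quality = alpha * C

# Per-field availability shows which field is responsible for the gaps
per_field = c.mean(axis=0)

print("C =", C, "out of a maximum of", k)
print("Completeness ratio C / k =", completeness_ratio)
print("I(D, S) =", integration_quality)
print("Availability per field:", per_field)
```

The per-field breakdown is often more actionable than the single number: a feed that is missing volume in a quarter of its records calls for a different fix than one that drops whole records at random.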

Matrix-based optimization
