[WITH CODE] Data: The criteria you need for market data
Choose data that aligns with your trading strategy and time horizon
Table of contents:
Introduction.
The market data supply chain and data quality criteria.
Quantifying data accuracy.
Assessing granularity.
Latency optimization.
Data completeness and integration.
Matrix-based optimization.
Introduction
Let’s face it: most trading strategies fail not because the math was wrong, but because the data was garbage. Imagine spending weeks building a Ferrari of an algorithm, only to fuel it with yesterday’s lawnmower gas. It’ll sputter, stall, and leave you stranded on the highway of regret.
Choosing data that aligns with your trading strategy and time horizon, while meeting quality, granularity, and low-latency criteria, is crucial for optimizing algorithm performance in dynamic markets.
Let's dive into the world of data-driven algorithmic trading!
The market data supply chain and data quality criteria
The journey of market data from its genesis to its utilization can be broken down into several stages. Each stage adds value and transforms raw data into a more refined product ready for consumption by traders and algorithmic systems.
Exchanges: The primary source of market data, where every tick, trade, and order book update is generated. Here, the data is raw and unprocessed—akin to unrefined ore that must be mined for gold.
Hosting providers & ticker plants: These intermediaries collect data from various exchanges and perform normalization. Normalization standardizes different data formats so that downstream systems can process them uniformly. This stage reduces the engineering burden on quants, who otherwise might need to wrangle disparate data formats.
Feed providers: After normalization, feed providers distribute the data via APIs. These feeds supply real-time or near-real-time market data to traders, thus enabling rapid decision-making in high-speed trading environments.
OMS/EMS software providers: Finally, the data is used by Order Management Systems and Execution Management Systems to assist in executing trades and managing orders. At this stage, the data is processed, user-friendly, and often enriched with additional analytics.
The transformation process can be summarized as:
Exchanges → Hosting providers & ticker plants → Feed providers → OMS/EMS software providers
This journey ensures that the data fed into trading algorithms is accurate, timely, and easily digestible.
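To make the normalization stage a bit more tangible, here is a minimal sketch of what a ticker plant might do: translate venue-specific tick formats into one common schema that downstream systems can consume. The message layouts and field names below are hypothetical, purely for illustration.
# Minimal normalization sketch: map venue-specific tick formats onto a common schema.
# The raw message layouts below are hypothetical examples, not real exchange formats.
RAW_TICKS = [
    {"venue": "EXCH_A", "sym": "ABC", "px": 101.25, "qty": 300, "ts": 1700000000123},
    {"venue": "EXCH_B", "ticker": "ABC", "p": 101.26, "v": 150, "time_ns": 1700000000456000000},
]

def normalize(raw: dict) -> dict:
    """Translate a venue-specific tick into the common schema used downstream."""
    if raw["venue"] == "EXCH_A":
        return {"symbol": raw["sym"], "price": raw["px"], "volume": raw["qty"],
                "timestamp_ms": raw["ts"]}
    if raw["venue"] == "EXCH_B":
        return {"symbol": raw["ticker"], "price": raw["p"], "volume": raw["v"],
                "timestamp_ms": raw["time_ns"] // 1_000_000}
    raise ValueError(f"Unknown venue: {raw['venue']}")

normalized = [normalize(t) for t in RAW_TICKS]
print(normalized)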
But what about the quality? Do we need some kind of criteria?
The answer is yes. Indeed, the following criteria must be met:
Accuracy: The data must be free of errors. Mathematically, if we let D represent the dataset and ϵ be the error term, we require
\(\epsilon = 0 \quad \text{or at least} \quad |\epsilon| < \delta,\)
where δ is a small tolerance level.
Granularity: Data should provide sufficient detail for the strategy. For example, a high-frequency trading algorithm requires tick-level granularity. We can express granularity G in terms of the number of data points per unit time:
\(G=\frac{N}{T},\)
where N is the number of data points and T is the time period.
Latency: The delay between data generation and reception must be minimized. If L denotes latency, our goal is to achieve L→0.
Field completeness: The dataset must include all relevant fields—price, volume, bid-ask spreads, etc. We define a completeness vector c where each element represents the availability of a required field:
\(\mathbf{c} \in \{0,1\}^k,\)
and we require that the sum of its elements equals k, the total number of fields.
Integration and compatibility: The data should integrate smoothly with the trading system’s architecture. This can be modeled as a function I(D,S) that measures the integration quality between data D and system S. Our target is to maximize I.
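As a quick illustration of the completeness criterion, here is a tiny sketch that builds the vector c for a single record; the required fields and the sample values are assumptions for the example, not a recommendation.
import numpy as np

# Hypothetical required fields and a sample tick record (illustrative values only)
REQUIRED_FIELDS = ["price", "volume", "bid", "ask", "timestamp"]
record = {"price": 101.25, "volume": 300, "bid": 101.24, "ask": 101.26, "timestamp": 1700000000}

# Completeness vector c in {0,1}^k: 1 if the field is present, 0 otherwise
c = np.array([1 if f in record else 0 for f in REQUIRED_FIELDS])
is_complete = c.sum() == len(REQUIRED_FIELDS)
print("c =", c, "| complete:", is_complete)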
Most retail traders don't have specific criteria. They drift, settling for whatever's available and free. Even many small and medium-sized firms, at best, purchase data from a recognized provider, give it a quick clean, and off they go, whatever its quality!
Quantifying data accuracy
Suppose we have a dataset D consisting of n observations. Each observation is represented as a vector \(d_i \in \mathbb{R}^m\), where m is the number of features. The overall dataset can be represented by a matrix:
\(D = [d_1, d_2, \dots, d_n]^\top \in \mathbb{R}^{n \times m}\)
Let E be the error matrix such that:
\(E = D - D_{\text{true}},\)
where \(D_{\text{true}}\) is the ideal dataset. The Frobenius norm \(\|E\|_F\) gives a measure of the overall error:
\(\|E\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m} E_{ij}^{2}}\)
A small value of \(\|E\|_F\) indicates high accuracy. In practical terms, our goal is to have:
\(\|E\|_F \ll \|D_{\text{true}}\|_F\)
If your error norm is larger than your true data norm, you might as well be trading based on random guesses!
Let’s see an example of this.
import numpy as np
import matplotlib.pyplot as plt
def compute_error_norm(D_true, D_candidate):
"""
Quantifying Data Accuracy
This snippet computes the Frobenius norm error between a true dataset and candidate datasets
with varying noise levels. In our paper, we described this metric as:
‖E‖₍F₎ = √(Σᵢ Σⱼ (D_candidate - D_true)²)
which is a key indicator of data accuracy.
"""
error_matrix = D_candidate - D_true
return np.linalg.norm(error_matrix, 'fro')
# Simulate a true asset price series (e.g., a linear trend from 100 to 110)
np.random.seed(42)
n_points = 100
true_prices = np.linspace(100, 110, n_points) # true trend
D_true = true_prices.reshape(n_points, 1) # shape: (100, 1)
# Evaluate error for candidate datasets with increasing noise
noise_levels = np.linspace(0, 0.5, 20) # Noise standard deviation from 0 to 0.5
error_norms = []
for noise in noise_levels:
candidate_prices = true_prices + np.random.normal(0, noise, n_points)
D_candidate = candidate_prices.reshape(n_points, 1)
error = compute_error_norm(D_true, D_candidate)
error_norms.append(error)
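# Plot the error norm as a function of noise level (the figure referenced below)
plt.plot(noise_levels, error_norms, marker='o')
plt.xlabel('Noise standard deviation')
plt.ylabel('Frobenius norm of error ||E||_F')
plt.title('Data accuracy vs. noise level')
plt.show()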
The output would be:
As the noise level increases, the error norm grows, indicating lower data accuracy. This behavior directly reflects our theoretical derivation that higher noise leads to higher error.
Assessing granularity
We already know about the basic ratio of N observations over a period T; it doesn't require any complex formula. For an algorithm that operates on tick data, a high value of G is essential. If G is too low, the algorithm may miss important microstructure details of the market. To ensure adequate granularity, we can set a minimum threshold \(G_{\min}\):
\(G \ge G_{\min}\)
Let's use a toy example to illustrate this aspect. The details will change with your infrastructure, but I think it helps to paint a general picture.
import numpy as np
import matplotlib.pyplot as plt
"""
Assessing Granularity
In the context of algorithmic trading, granularity (observations per unit time) is crucial.
This snippet simulates trade tick timestamps using an exponential distribution to mimic high-frequency data.
"""
# Simulate trade tick timestamps
np.random.seed(42)
n_ticks = 500 # number of ticks in a trading session
# Assume interarrival times follow an exponential distribution (mean = 0.2 seconds)
interarrival_times = np.random.exponential(scale=0.2, size=n_ticks)
timestamps = np.cumsum(interarrival_times)
# Calculate granularity as the number of ticks per second
total_duration = timestamps[-1] - timestamps[0]
granularity = n_ticks / total_duration
print("Estimated granularity (ticks per second):", granularity)
You would get a histogram like this one:
Basically, the histogram shows the distribution of time intervals between trade ticks. The average granularity indicates the data resolution, which is key for capturing market microstructure.
Latency optimization
Latency L is a critical metric in algorithmic trading. If we model the data delivery system as a network with delays, we can use the following expression:
\(L = \sum_{i} \Delta t_i,\)
where \(\Delta t_i\) represents the delay at each stage i of the market data supply chain. The objective is to minimize L such that:
\(L \rightarrow 0\)
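Before the quality-score simulation, here is a minimal sketch of the decomposition above, summing per-stage delays along the supply chain; the stage names and delay values are made up for illustration.
# Latency as the sum of per-stage delays (illustrative, made-up values in seconds)
stage_delays = {
    "exchange_to_ticker_plant": 0.0004,
    "normalization": 0.0002,
    "feed_api_delivery": 0.0009,
    "oms_ems_processing": 0.0005,
}
L = sum(stage_delays.values())
print(f"Total latency L = {L * 1000:.2f} ms")
# Identify the bottleneck stage
bottleneck = max(stage_delays, key=stage_delays.get)
print("Largest contributor:", bottleneck)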
This snippet simulates the impact of decreasing latency on an overall quality score. The quality function aggregates inverse error (accuracy), granularity, inverse latency, and completeness, as defined previously.
import numpy as np
import matplotlib.pyplot as plt
def compute_quality(D_true, D_candidate, latency, granularity, completeness, weights):
"""
Latency Optimization
This snippet simulates a reduction in data feed latency and evaluates the impact on the overall quality score.
The quality function used here is:
Q = w1*(1/‖E‖₍F₎) + w2*G + w3*(1/L) + w4*C
which highlights the benefit of lower latency.
"""
eps = 1e-6 # small constant to avoid division by zero
error_norm = np.linalg.norm(D_candidate - D_true, 'fro')
q_accuracy = 1 / (error_norm + eps)
q_latency = 1 / (latency + eps)
Q = weights[0]*q_accuracy + weights[1]*granularity + weights[2]*q_latency + weights[3]*completeness
return Q
# Simulate a true asset price series (a slight upward trend)
np.random.seed(42)
n_points = 100
D_true = np.linspace(100, 105, n_points).reshape(n_points, 1)
# Candidate dataset with slight noise
D_candidate = D_true + np.random.normal(0, 0.02, (n_points, 1))
# Set constant granularity and completeness as in our supply chain discussion
granularity = n_points / 1.0 # e.g., 100 ticks per second
completeness = 1.0 # full completeness
# Define weights emphasizing latency (as discussed in our paper)
weights = [0.4, 0.2, 0.3, 0.1]
# Simulate latency reduction (from 50ms to 1ms)
latency_values = np.linspace(0.05, 0.001, 50)
quality_scores = [compute_quality(D_true, D_candidate, lat, granularity, completeness, weights)
for lat in latency_values]
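# Plot quality score against latency (the figure referenced below)
plt.plot(latency_values * 1000, quality_scores, marker='.')
plt.xlabel('Latency (ms)')
plt.ylabel('Quality score Q')
plt.title('Quality score vs. data feed latency')
plt.show()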
The output would be:
The plot clearly shows that as latency decreases, the quality score improves markedly. The inverse relationship—1/latency—in the quality function underlines the importance of low-latency data feeds.
Data completeness and integration
Let the completeness vector for an observation \(d_i\) be defined as:
\(\mathbf{c}_i = (c_{i,1}, c_{i,2}, \dots, c_{i,k}),\)
where \(c_{i,j} = 1\) if field j is present and 0 otherwise. The overall completeness of the dataset is:
\(C = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} c_{i,j}\)
A perfect dataset would have C = k. If the integration function I(D,S) is linear with respect to completeness, then we can express:
\(I(D, S) = \alpha C,\)
where α is a scaling factor that depends on system compatibility. Our goal is to maximize I.
This snippet simulates trade record datasets where required fields might be missing. It computes the average completeness ratio and visualizes how it degrades as the probability of missing a field increases.
import numpy as np
import matplotlib.pyplot as plt
import random
def compute_completeness(dataset, required_fields):
"""
Compute the average completeness ratio for a dataset.
"""
total_score = 0.0
for record in dataset:
present = sum(1 for field in required_fields if field in record)
total_score += present / len(required_fields)
return total_score / len(dataset) if dataset else 0
# Define required fields (as per our discussion on data integration)
required_fields = ["price", "volume", "timestamp"]
# Simulate dataset completeness over a range of missing field probabilities
missing_probs = np.linspace(0, 0.9, 10) # from 0 (all fields present) to 0.9 (high missing probability)
completeness_ratios = []
num_records = 200
for p in missing_probs:
dataset = []
for _ in range(num_records):
record = {}
# Each required field is included with probability (1 - p)
if np.random.rand() > p:
record["price"] = round(random.uniform(100, 150), 2)
if np.random.rand() > p:
record["volume"] = random.randint(10, 1000)
if np.random.rand() > p:
record["timestamp"] = random.randint(1600000000, 1700000000)
dataset.append(record)
completeness_ratios.append(compute_completeness(dataset, required_fields))
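# Plot average completeness vs. probability of a missing field (the figure referenced below)
plt.plot(missing_probs, completeness_ratios, marker='o')
plt.xlabel('Probability that a field is missing')
plt.ylabel('Average completeness ratio')
plt.title('Dataset completeness vs. missing-field probability')
plt.show()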
The output would be:
As the probability of missing a field increases, fewer fields are present in each record on average, so the completeness ratio drops. Essentially, the more likely each field is to be missing, the less complete your overall dataset becomes.
Matrix-based optimization
Okay! With everything above in place, I can now introduce a matrix-based optimization problem to select the best dataset D* from a set of candidate datasets {D1, D2, …, DM}. This is the metric that really matters.
Let Q(D) be a quality function defined as:
\(Q(D) = w_1 \cdot \frac{1}{\|E\|_F} + w_2 \cdot G + w_3 \cdot \frac{1}{L} + w_4 \cdot C,\)
where \(w_1, w_2, w_3, w_4\) are weights that reflect the relative importance of each criterion. The optimal dataset is then given by:
\(D^* = \arg\max_{D_i \in \{D_1, \dots, D_M\}} Q(D_i)\)
This optimization problem is a combinatorial selection problem that can be solved using standard techniques such as integer programming, or with heuristic methods if the candidate set is very large. And if you ever get stuck, remember that even the best algorithms sometimes need to “pivot”, just not in a way that involves too many pivots!
import numpy as np
import matplotlib.pyplot as plt
def quality_function(error_norm, granularity, latency, completeness, weights):
"""
Compute the overall quality score.
"""
eps = 1e-6 # to avoid division by zero
q_accuracy = 1 / (error_norm + eps)
q_latency = 1 / (latency + eps)
return weights[0]*q_accuracy + weights[1]*granularity + weights[2]*q_latency + weights[3]*completeness
# Simulate a true asset price series (from 100 to 105 over 100 time points)
np.random.seed(42)
n_points = 100
D_true = np.linspace(100, 105, n_points).reshape(n_points, 1)
# Fixed simulation parameters reflecting our earlier discussion
initial_latency = 0.05 # seconds
granularity = n_points / 1.0 # e.g., 100 ticks per second
completeness = 1.0 # full completeness
weights = [0.4, 0.2, 0.3, 0.1]
# Evaluate candidate datasets with various noise levels
noise_levels = [0.005, 0.01, 0.02, 0.03, 0.04]
quality_scores = []
for noise in noise_levels:
D_candidate = D_true + np.random.normal(0, noise, (n_points, 1))
error_norm = np.linalg.norm(D_candidate - D_true, 'fro')
Q = quality_function(error_norm, granularity, initial_latency, completeness, weights)
quality_scores.append(Q)
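# Select the candidate with the highest quality score: D* = argmax Q(D)
best_idx = int(np.argmax(quality_scores))
print("Best candidate noise level:", noise_levels[best_idx])
print("Best quality score:", quality_scores[best_idx])

# Plot quality score against candidate noise level (the figure referenced below)
plt.plot(noise_levels, quality_scores, marker='o')
plt.xlabel('Noise standard deviation')
plt.ylabel('Quality score Q')
plt.title('Quality score across candidate datasets')
plt.show()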
The output would be:
Lower noise levels yield higher quality scores, emphasizing the importance of data accuracy. The quality function aggregates multiple key metrics—accuracy, granularity, latency, completeness—thereby guiding the selection of the best data source for algorithmic trading.
Okay! The results of these simulations have several practical implications:
Trading firms and retail algorithmic traders should prioritize data sources that offer low latency and high granularity, as these attributes directly impact the quality score of the data. The market data supply chain must be scrutinized to identify any bottlenecks that might increase latency.
Efficient data structures and algorithms are critical for trading systems. Selecting the right data structure (e.g., dictionaries for constant-time lookups) can reduce processing delays, ultimately improving PnL; see the sketch below.
Integration between data feeds and trading algorithms must be seamless. This means that the system architecture should be designed to handle varying data rates and formats without compromising performance.
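To illustrate the point about data structures, here is a small timing sketch comparing a dictionary lookup with a linear scan over a list of quotes; the symbols and sizes are arbitrary synthetic data.
import time

# Synthetic universe of symbols with a last-price store (arbitrary data)
symbols = [f"SYM{i:05d}" for i in range(100_000)]
quotes_list = [(s, 100.0 + i * 0.01) for i, s in enumerate(symbols)]
quotes_dict = dict(quotes_list)

target = "SYM09999"
n_lookups = 1_000

# Linear scan over the list: O(n) per lookup
t0 = time.perf_counter()
for _ in range(n_lookups):
    price = next(p for s, p in quotes_list if s == target)
t_scan = time.perf_counter() - t0

# Dictionary lookup: O(1) on average per lookup
t0 = time.perf_counter()
for _ in range(n_lookups):
    price = quotes_dict[target]
t_dict = time.perf_counter() - t0

print(f"List scan: {t_scan:.4f} s | Dict lookup: {t_dict:.6f} s for {n_lookups} lookups")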
I'm sure none of this was taught to you in your Data Science degree, am I right? Alright, that’s a wrap for today! Hope you found this one insightful. Until next time—may your trades ride the waves like a seasoned surfer at sunrise, your strategies slice through uncertainty like a master’s brushstroke, and your returns dance to the rhythm of calculated precision 🪙
PS: How often do you share your approaches or insights with peers?