[WITH CODE] Data: The criteria you need for market data
Choose data that aligns with your trading strategy and time horizon
Table of contents:
Introduction.
The market data supply chain and data quality criteria.
Quantifying data accuracy.
Assessing granularity.
Latency optimization.
Data completeness and integration.
Matrix-based optimization.
Introduction
Let’s face it: most trading strategies fail not because the math was wrong, but because the data was garbage. Imagine spending weeks building a Ferrari of an algorithm, only to fuel it with yesterday’s lawnmower gas. It’ll sputter, stall, and leave you stranded on the highway of regret.
Choosing data that aligns with your trading strategy and time horizon, while meeting quality, granularity, and low-latency criteria, is crucial for optimizing algorithm performance in dynamic markets.
Let's dive into the world of data-driven algorithmic trading!
The market data supply chain and data quality criteria
The journey of market data from its genesis to its utilization can be broken down into several stages. Each stage adds value and transforms raw data into a more refined product ready for consumption by traders and algorithmic systems.
Exchanges: The primary source of market data, where every tick, trade, and order book update is generated. Here, the data is raw and unprocessed—akin to unrefined ore waiting to be smelted into gold.
Hosting providers & ticker plants: These intermediaries collect data from various exchanges and perform normalization. Normalization standardizes different data formats so that downstream systems can process them uniformly. This stage reduces the engineering burden on quants, who otherwise might need to wrangle disparate data formats.
Feed providers: After normalization, feed providers distribute the data via APIs. These feeds supply real-time or near-real-time market data to traders, thus enabling rapid decision-making in high-speed trading environments.
OMS/EMS software providers: Finally, the data is used by Order Management Systems and Execution Management Systems to assist in executing trades and managing orders. At this stage, the data is processed, user-friendly, and often enriched with additional analytics.
The transformation process can be summarized as:

Exchange (raw ticks) → hosting provider/ticker plant (normalization) → feed provider (API distribution) → OMS/EMS (execution-ready, enriched data).
This journey ensures that the data fed into trading algorithms is accurate, timely, and easily digestible.
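To make the normalization stage concrete, here is a minimal sketch in Python. The two raw message formats and every field name are invented for illustration; real exchange protocols differ, but the idea of mapping them onto one canonical record is the same:

```python
from dataclasses import dataclass

@dataclass
class Tick:
    """Canonical record that every downstream system consumes."""
    symbol: str
    price: float
    size: int
    ts_ns: int  # event timestamp, nanoseconds since epoch

def normalize_exchange_a(msg: dict) -> Tick:
    # Hypothetical format A: plain float prices, nanosecond timestamps.
    return Tick(msg["sym"], float(msg["px"]), int(msg["qty"]), int(msg["t"]))

def normalize_exchange_b(msg: dict) -> Tick:
    # Hypothetical format B: integer prices in 1e-4 units, microsecond timestamps.
    return Tick(msg["ticker"], msg["price_e4"] / 1e4, int(msg["shares"]), int(msg["us"]) * 1_000)

raw_a = {"sym": "AAPL", "px": 187.30, "qty": 100, "t": 1_700_000_000_000_000_000}
raw_b = {"ticker": "AAPL", "price_e4": 1_873_000, "shares": 100, "us": 1_700_000_000_000_000}

print(normalize_exchange_a(raw_a))
print(normalize_exchange_b(raw_b))
```

Both prints yield the same canonical Tick, which is exactly the point: downstream code never needs to know which venue a message came from.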
But what about the quality? Do we need some kind of criteria?
The answer is yes. Indeed, the following criteria must be met:
Accuracy: The data must be free of errors. Mathematically, if we let D represent the dataset and ε be the error term, we require

\(\epsilon = 0 \quad \text{or at least} \quad |\epsilon| < \delta,\)

where δ is a small tolerance level.
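One way to put this criterion to work is to cross-check each print against a reference feed and flag anything whose deviation breaches δ. A minimal sketch, where the tolerance and the sample arrays are assumptions for illustration:

```python
import numpy as np

def accuracy_violations(prices: np.ndarray, reference: np.ndarray, delta: float = 0.01) -> np.ndarray:
    """Indices where |epsilon| = |price - reference| breaches the tolerance delta."""
    eps = np.abs(prices - reference)
    return np.flatnonzero(eps >= delta)

prices = np.array([100.00, 100.005, 99.50, 100.01])
reference = np.array([100.00, 100.00, 100.00, 100.01])
print(accuracy_violations(prices, reference))  # [2]: a 0.50 deviation, far beyond delta
```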
Granularity: Data should provide sufficient detail for the strategy. For example, a high-frequency trading algorithm requires tick-level granularity. We can express granularity G in terms of the number of data points per unit time:

\(G = \frac{N}{T},\)

where N is the number of data points and T is the time period.
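In code, G is trivial to estimate from a stream of timestamps; a small sketch with made-up tick times:

```python
import numpy as np

def granularity(ts_seconds: np.ndarray) -> float:
    """G = N / T: number of data points per second over the observed window."""
    n = ts_seconds.size
    t = ts_seconds.max() - ts_seconds.min()
    return float(n) / t if t > 0 else float("inf")

ts = np.array([0.0, 0.5, 1.2, 2.0])  # four updates over two seconds
print(granularity(ts))  # 2.0 data points per second
```

If your strategy needs tick-level detail and the feed delivers one bar per minute, no amount of modeling downstream will recover what was never captured.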
Latency: The delay between data generation and reception must be minimized. If L denotes latency, our goal is to achieve L→0.
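Latency is only measurable if every message carries an origin timestamp; the gap between that and your receive time is L. A minimal sketch with hypothetical event pairs:

```python
import statistics

def latency_stats(events):
    """events: iterable of (exchange_ts_ns, receive_ts_ns). Returns (mean, worst) in microseconds."""
    lat_us = [(recv - orig) / 1_000 for orig, recv in events]
    return statistics.mean(lat_us), max(lat_us)

# Hypothetical (origin, arrival) nanosecond timestamps
events = [(1_000_000, 1_250_000), (2_000_000, 2_180_000), (3_000_000, 3_900_000)]
mean_us, worst_us = latency_stats(events)
print(f"mean={mean_us:.1f}us worst={worst_us:.1f}us")  # watch the tail, not just the mean
```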
Field completeness: The dataset must include all relevant fields: price, volume, bid-ask spreads, and so on. We define a completeness vector c where each element represents the availability of a required field:

\(\mathbf{c} \in \{0,1\}^k, \qquad \sum_{i=1}^{k} c_i = k,\)

where k is the total number of required fields.
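Checking c in practice is a one-liner per record; a small sketch where the required field list is an assumption tied to your own strategy:

```python
REQUIRED_FIELDS = ("price", "volume", "bid", "ask")  # the k fields your strategy needs

def completeness_vector(record: dict) -> list:
    """c in {0,1}^k: 1 if the field is present and non-null, else 0."""
    return [int(record.get(field) is not None) for field in REQUIRED_FIELDS]

record = {"price": 100.01, "volume": 500, "bid": 100.00, "ask": None}
c = completeness_vector(record)
print(c, "complete" if sum(c) == len(REQUIRED_FIELDS) else "incomplete")  # [1, 1, 1, 0] incomplete
```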
Integration and compatibility: The data should integrate smoothly with the trading system’s architecture. This can be modeled as a function I(D,S) that measures the integration quality between data D and system S. Our target is to maximize I.
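There is no universal formula for I(D, S), but a crude, useful proxy is the share of fields the system expects that the feed actually delivers with the expected type. A sketch, with both schemas invented for illustration:

```python
def integration_score(feed_schema: dict, system_schema: dict) -> float:
    """Proxy for I(D, S): fraction of fields the system expects that the feed
    supplies with a matching type. 1.0 means a drop-in fit."""
    matches = sum(
        1 for field, expected_type in system_schema.items()
        if feed_schema.get(field) == expected_type
    )
    return matches / len(system_schema)

feed = {"price": float, "volume": int, "ts_ns": int}
oms = {"price": float, "volume": int, "ts_ns": int, "venue": str}
print(integration_score(feed, oms))  # 0.75: the feed lacks a venue field
```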
Most retail traders don't have specific criteria. They drift, settling for whatever data is available and free. Even many small and medium-sized firms, at best, purchase data from a recognized provider, apply a little cleaning to whatever arrives, and off they go!