[WITH CODE] Features: Early feature selection
Eliminate the noise before it eliminates your profits
Table of contents:
Introduction.
Let's start with a little debate.
Why does early feature selection win?
Enhancing stability.
The pitfalls of delayed feature selection.
Introduction
In the quantamental world of systematic investing and algorithmic trading, feature selection is like preparing a gourmet meal from a giant buffet: you must carefully choose only the tastiest ingredients to create a dish that not only looks good on paper but also satisfies your computational appetite.
Today I’ll explain why picking your features, like toppings for a pizza, during your exploratory research—not while building your final model—can save you from unnecessary complexity and help your algorithm perform with clarity and speed.
Let's start with a little debate
Feature selection is a critical part of building any predictive model, yet it often gets treated like an afterthought—like adding sprinkles after baking a cake. The debate in trading is not whether to use feature selection, but when to do it. Should you choose your features during the exploratory research phase or let the model decide which features to keep when it’s already built?
In many cases, delaying the selection process is akin to creating a dish with every ingredient in your pantry, only to find out later that most of them just muddle the flavor. In algorithmic trading, having too many extraneous features can lead to a model that is overly complicated and computationally expensive.
Let’s first take a closer look at the two primary philosophies of feature selection, comparing the kitchen-sink approach with the less-is-more strategy.
The kitchen-sink approach to feature selection involves starting with all available features. It is the quant equivalent of dumping everything from your pantry into a bowl—whether it’s your grandma’s secret cookie recipe or a peculiar ingredient like Mars humidity. After assembling this vast array, you apply algorithms to sift through the noise and select the features that appear to be useful.
Let’s examine a simple code snippet that demonstrates this approach—keep in mind that there are many more ways to do this:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
# Step 1: Generate random data
# X will have 100 features, and y will be the target variable.
# We'll specify that only 5 of these features are actually informative.
X, y = make_regression(n_samples=1000, n_features=100, n_informative=5, noise=0.1, random_state=42)
# Step 2: Create feature names for interpretability
feature_names = [f"Feature_{i}" for i in range(100)]
# Add some fun feature names to simulate real-world scenarios
feature_names[3] = "Price"
feature_names[7] = "Volume"
feature_names[15] = "I would be rich with this one"
feature_names[42] = "100M feature"
feature_names[99] = "Meme stock score"
# Step 3: Perform feature selection using SelectKBest
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
# Step 4: Get the selected feature names
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = [feature_names[i] for i in selected_feature_indices]
# Print the selected features
print("Selected features:", selected_feature_names)
# output: Selected features: ['Feature_9', 'Feature_22', 'Feature_36', 'Feature_86', 'Feature_96']
We have generated a dataset with 100 features, where only 5 of them are informative—i.e., they have a direct relationship with the target variable y. The noise parameter adds randomness to the target variable. Once this is done, we use the SelectKBest method to select the top k features based on their scores, calculated using the f_regression scoring function.
The problem with this approach is that not all features contribute equally to the performance of your model. Including too many unnecessary features can lead your model into a labyrinth of complexity, much like trying to find your favorite pizza in a pile of assorted leftovers.
In contrast, the minimalist zen approach is about being selective from the outset. Instead of including every available feature, you focus on those that have a clear, theoretically justified relationship with your target variable. For example, you might decide to predict performance relative to a benchmark—say, the S&P 500—using only a handful of carefully chosen features.
Let’s simplify our modeling process to a binary decision:
Signal = 1: When the selected feature indicates a strong chance of outperforming the benchmark.
Signal = 0: Otherwise.
Mathematically, we might model the signal as follows:
$$\text{Signal} = \begin{cases} 1 & \text{if } R_f > R_b \\ 0 & \text{otherwise} \end{cases}$$
where:
$R_f$ is the return of the feature, or of the strategy based on that feature.
$R_b$ is the return of the benchmark.
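As a minimal sketch of this rule in code (the return series below are simulated placeholders, not real strategy or benchmark data):
import numpy as np
# Hypothetical daily returns, simulated only to illustrate the rule
rng = np.random.default_rng(42)
r_f = rng.normal(0.0005, 0.01, 252)  # returns of the feature-based strategy
r_b = rng.normal(0.0003, 0.01, 252)  # returns of the benchmark, e.g. the S&P 500
# Signal = 1 when the strategy return beats the benchmark return, 0 otherwise
signal = (r_f > r_b).astype(int)
print("Share of periods with Signal = 1:", signal.mean())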
By focusing on the core ingredients that matter, you ensure that your model’s predictions are grounded in reality—much like ensuring that your pizza has just the right amount of cheese and toppings.
With these two philosophies in mind, let’s explore why early feature selection—following the minimalist approach—often leads to more robust and understandable models.
Why does early feature selection win?
When you perform feature selection during the research phase rather than after building your model, you are effectively reducing the complexity of the problem. A model trained on a limited set of well-chosen features is easier to understand, faster to compute, and more likely to deliver stable predictions.
Consider a scenario where you are trying to explain a complex phenomenon with too many variables. It’s a bit like trying to describe a masterpiece by listing every single brushstroke—the overall picture gets lost in the details. In a similar vein, a model with too many features might struggle to grasp the underlying patterns in the data.
A useful mathematical concept in this context is Mutual Information. MI quantifies the amount of information obtained about one random variable through another random variable. In simple terms, a high MI between a feature and the target indicates that the feature is very informative—like a well-cooked pizza with the perfect cheese-to-crust ratio.
The formula for Mutual Information is:
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$
where:
$I(X;Y)$ measures the shared information between feature $X$ and target $Y$.
$p(x,y)$ is the joint probability distribution of $X$ and $Y$, while $p(x)$ and $p(y)$ are the marginal distributions.
A higher MI means that knowing X gives you a lot of information about Y.
Let’s see how this might look in code:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
# Step 1: Generate random data for classification
# X will have 20 features, and y_benchmark will be the target variable.
# We'll specify that only a few of these features are actually informative.
X, y_benchmark = make_classification(
n_samples=1000,
n_features=20,
n_informative=5,
n_redundant=5,
n_clusters_per_class=1,
random_state=42
)
# Convert X to a DataFrame for better interpretability
feature_names = [f"Feature_{i}" for i in range(20)]
# Add some fun feature names to simulate real-world scenarios
feature_names[3] = "P/E ratio"
feature_names[7] = "Interest rates"
feature_names[11] = "Pizza consumption at HQ"
feature_names[15] = "Moon phase"
feature_names[19] = "Stock color preference"
X = pd.DataFrame(X, columns=feature_names)
# Step 2: Calculate Mutual Information scores
mi_scores = mutual_info_classif(X, y_benchmark, random_state=42)
# Step 3: Identify top features with MI score > 0.1
threshold = 0.1
top_features = X.columns[mi_scores > threshold]
# Print the top features
print("Top features:", list(top_features))
# output: Top features: ['Feature_0', 'Feature_10', 'Pizza consumption at HQ', 'Feature_17', 'Feature_18']
Look at this! If 'Pizza consumption at HQ' turns out to be a top feature, you might need to re-examine your data sources! 🤣
The rest looks reasonable: we keep the features with MI scores above a threshold—in this example, 0.1. By selecting only the features that share a significant amount of information with your target, you remove the unnecessary noise that can cloud your model’s performance. Think of it as creating a dish where every ingredient plays a purposeful role.
Enhancing stability
A model built with research-phase feature selection is like a well-planned road trip: you know your destination, and your route is clear and efficient. In contrast, if you decide which features to use after the model has been constructed, you might end up with a model that behaves unpredictably—similar to taking random detours on a road trip without a map.
To visualize this concept, consider two performance curves:
Early selection: A smooth, stable curve that suggests consistent performance.
Late selection: An erratic curve that resembles a heart rate monitor during a horror movie.
You can get an idea by simulating these two scenarios.
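Below is a quick sketch; the drift and volatility numbers are made up and chosen only to exaggerate the contrast between the two regimes:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
n_days = 252
# Toy assumption: early selection behaves like a low-volatility equity curve,
# late selection like a high-volatility one
early_returns = rng.normal(0.0006, 0.005, n_days)
late_returns = rng.normal(0.0006, 0.02, n_days)
early_curve = np.cumprod(1 + early_returns)
late_curve = np.cumprod(1 + late_returns)
plt.plot(early_curve, label="Early selection (stable)")
plt.plot(late_curve, label="Late selection (erratic)")
plt.xlabel("Trading day")
plt.ylabel("Cumulative growth")
plt.legend()
plt.show()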
The smooth curve shows why early feature selection results in a stable model, while the erratic curve illustrates how delaying feature selection can lead to unpredictability. This visualization reinforces our claim that early pruning of the feature set leads to better and more stable model performance.
With these plots reinforcing the stability advantage, let’s now move on to discussing the pitfalls of delayed feature selection.
The pitfalls of delayed feature selection
As we already know, delaying feature selection until after your model is built is like inviting every ingredient to the party and then trying to kick out the uninvited guests mid-dinner. While some modern algorithms might shrink the influence of irrelevant features, this approach can still lead to a model that is unnecessarily complex.
One popular method for automatic feature selection is LASSO regression. LASSO adds a penalty term to the regression objective, effectively shrinking some of the feature coefficients toward zero. While this can help in reducing the number of active features, it is not a substitute for thoughtful, pre-model selection.
The LASSO objective function is:
$$\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
where:
$\|y - X\beta\|_2^2$ measures how well the model’s predictions match the actual data.
$\lambda \|\beta\|_1$ is the penalty term, with $\lambda$ controlling the amount of shrinkage.
The L1 norm ($\|\beta\|_1$) promotes sparsity by encouraging some coefficients to be exactly zero.
Let’s look at the corresponding code:
import numpy as np
from sklearn.linear_model import Lasso
import pandas as pd
# Generate random data for X_train (features) and y_train (target)
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples
n_features = 5 # Number of features
# Randomly generate feature matrix X_train
X_train = np.random.rand(n_samples, n_features)
# Randomly generate target vector y_train
y_train = np.random.rand(n_samples)
# Convert X_train to a DataFrame for better interpretability (optional)
feature_names = ['Inflation', 'Twitter sentiment', 'Number of Elon Musk memes', 'Unemployment rate', 'Stock market index']
X = pd.DataFrame(X_train, columns=feature_names)
# Set up a Lasso regression model with a moderate regularization parameter
lasso = Lasso(alpha=0.01) # Alpha here is equivalent to lambda
# Fit the model
lasso.fit(X_train, y_train)
# Identify surviving features (non-zero coefficients)
surviving_features = X.columns[lasso.coef_ != 0]
print("Surviving features:", list(surviving_features))
# output: Surviving features: ['Inflation', 'Unemployment rate']
We instantiate a Lasso model with alpha=0.01—the regularization strength. After training, only features with non-zero coefficients are retained.
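To get a feel for how sensitive this post-hoc pruning is, you can continue from the snippet above and sweep a few arbitrary values of alpha; the surviving set typically shrinks, and can shift, as the penalty grows:
# Continuing from the snippet above: sweep a few arbitrary alpha values
for alpha in [0.001, 0.01, 0.05, 0.1]:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    kept = list(X.columns[model.coef_ != 0])
    print(f"alpha={alpha}: {kept}")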
When feature selection is delayed, your model ends up grappling with a surfeit of inputs, many of which might not have a clear relationship with the target. This situation is reminiscent of a busy restaurant kitchen where too many ingredients clutter the counter, leading to confusion rather than culinary excellence.
Without a pre-selection strategy, you might find that:
Your model’s performance becomes unstable, as it is sensitive to small fluctuations in irrelevant features.
Bias in feature selection → Unstable models.
Feature interaction blindness → Complex interactions.
Overcomplicating the model development process → Complicated and inefficient.
Delayed discovery of key features → Overlook important features.
Model generalization is compromised → Fast loss of predictive power.
This gives rise to two opposing views, leading to completely different results in terms of processes and objectives:
N features with N labels → Multi-feature, multi-target.
1 feature with 0 labels → Single feature, no prediction.
After several mistakes I realized that the most logical approach in this environment is to go further than the minimalist approach: from N features to 1 feature, and from N labels to 0. So, what are you left with? Basically, rules. These rules are tied to a specific benchmark for a particular algorithm:
Instead of: Features ⟷ Labels.
You use: Rules ⟷ Benchmark.
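As a rough illustration of the rules-versus-benchmark idea, here is a hypothetical single rule (a 50-day moving-average filter on simulated prices; the lookback and the data are placeholders) evaluated directly against the benchmark:
import numpy as np
import pandas as pd
# Hypothetical benchmark price series, simulated for illustration only
rng = np.random.default_rng(42)
benchmark = pd.Series(100 * np.cumprod(1 + rng.normal(0.0003, 0.01, 500)))
# One rule, no labels: be long only when the benchmark trades above its 50-day moving average
rule = (benchmark > benchmark.rolling(50).mean()).astype(int)
# Evaluate the rule against the benchmark itself rather than against predicted labels
benchmark_returns = benchmark.pct_change().fillna(0)
rule_returns = rule.shift(1).fillna(0) * benchmark_returns
print("Benchmark cumulative return:", (1 + benchmark_returns).prod() - 1)
print("Rule cumulative return:", (1 + rule_returns).prod() - 1)
The point is not this particular rule, but that its quality is judged directly against the benchmark rather than against a stack of labels.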
As you move forward in your research and trading endeavors, remember that every feature you add should have a clear purpose. When in doubt, ask yourself: Does this ingredient truly enhance the flavor of my model? If the answer is no, it might be best to leave it out of the recipe.
Until tomorrow!—may your data always be clean and your models highly accurate! 🚀📈
P.S. You don't mind helping me to improve the newsletter, do you? Let me know which ones you prefer!