[WITH CODE] Features: Early feature selection
Eliminate the noise before it eliminates your profits
Table of contents:
Introduction.
Let's start with a little debate.
Why does early feature selection win?
Enhancing stability.
The pitfalls of delayed feature selection.
Introduction
In the quantamental world of systematic investing and algorithmic trading, feature selection is like preparing a gourmet meal from a giant buffet: you must carefully choose only the tastiest ingredients to create a dish that not only looks good on paper but also satisfies your computational appetite.
Today I’ll explain why picking your features, like toppings for a pizza, during your exploratory research—not while building your final model—can save you from unnecessary complexity and help your algorithm perform with clarity and speed.
Let's start with a little debate
Feature selection is a critical part of building any predictive model, yet it often gets treated like an afterthought—like adding sprinkles after baking a cake. The debate in trading is not whether to use feature selection, but when to do it. Should you choose your features during the exploratory research phase or let the model decide which features to keep when it’s already built?
In many cases, delaying the selection process is akin to creating a dish with every ingredient in your pantry, only to find out later that most of them just muddle the flavor. In algorithmic trading, having too many extraneous features can lead to a model that is overly complicated and computationally expensive.
Let’s first take a closer look at the two primary philosophies of feature selection, comparing the kitchen-sink approach with the less-is-more strategy.
The kitchen-sink approach to feature selection involves starting with all available features. It is the quant equivalent of dumping everything from your pantry into a bowl—whether it’s your grandma’s secret cookie recipe or a peculiar ingredient like Mars humidity. After assembling this vast array, you apply algorithms to sift through the noise and select the features that appear to be useful.
Let’s examine a simple code snippet that demonstrates this approach—keep in mind that there are many more ways to do this:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
# Step 1: Generate random data
# X will have 100 features, and y will be the target variable.
# We'll specify that only 5 of these features are actually informative.
X, y = make_regression(n_samples=1000, n_features=100, n_informative=5, noise=0.1, random_state=42)
# Step 2: Create feature names for interpretability
feature_names = [f"Feature_{i}" for i in range(100)]
# Add some fun feature names to simulate real-world scenarios
feature_names[3] = "Price"
feature_names[7] = "Volume"
feature_names[15] = "I would be rich with this one"
feature_names[42] = "100M feature"
feature_names[99] = "Meme stock score"
# Step 3: Perform feature selection using SelectKBest
selector = SelectKBest(score_func=f_regression, k=5)
X_selected = selector.fit_transform(X, y)
# Step 4: Get the selected feature names
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = [feature_names[i] for i in selected_feature_indices]
# Print the selected features
print("Selected features:", selected_feature_names)
# output: Selected features: ['Feature_9', 'Feature_22', 'Feature_36', 'Feature_86', 'Feature_96']
We have generated a dataset with 100 features, where only 5 of them are informative—i.e., they have a direct relationship with the target variable y. The noise parameter adds randomness to the target variable. Once this is done, we use the SelectKBest method to select the top k features based on their scores, calculated using the f_regression scoring function.
The problem with this approach is that not all features contribute equally to the performance of your model. Including too many unnecessary features can lead your model into a labyrinth of complexity, much like trying to find your favorite pizza in a pile of assorted leftovers.
In contrast, the minimalist zen approach is about being selective from the outset. Instead of including every available feature, you focus on those that have a clear, theoretically justified relationship with your target variable. For example, you might decide to predict performance relative to a benchmark—say, the S&P 500—using only a handful of carefully chosen features.
Let’s simplify our modeling process to a binary decision:
Signal = 1: When the selected feature indicates a strong chance of outperforming the benchmark.
Signal = 0: Otherwise.
Mathematically, we might model the signal as follows:
$$\text{Signal} = \begin{cases} 1 & \text{if } R_f > R_b \\ 0 & \text{otherwise} \end{cases}$$
where:
$R_f$ is the return of the feature, or of the strategy based on that feature.
$R_b$ is the return of the benchmark.
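As a minimal sketch of this rule in code (the return series below are simulated placeholders, not real strategy or benchmark data):
import numpy as np
# Hypothetical daily returns, simulated only to illustrate the rule
rng = np.random.default_rng(42)
r_f = rng.normal(0.0005, 0.01, 252)  # returns of the feature-based strategy
r_b = rng.normal(0.0003, 0.01, 252)  # returns of the benchmark, e.g. the S&P 500
# Signal = 1 when the strategy return beats the benchmark return, 0 otherwise
signal = (r_f > r_b).astype(int)
print("Share of periods with Signal = 1:", signal.mean())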
By focusing on the core ingredients that matter, you ensure that your model’s predictions are grounded in reality—much like ensuring that your pizza has just the right amount of cheese and toppings.
With these two philosophies in mind, let’s explore why early feature selection—following the minimalist approach—often leads to more robust and understandable models.
Why does early feature selection win?
When you perform feature selection during the research phase rather than after building your model, you are effectively reducing the complexity of the problem. A model trained on a limited set of well-chosen features is easier to understand, faster to compute, and more likely to deliver stable predictions.
Consider a scenario where you are trying to explain a complex phenomenon with too many variables. It’s a bit like trying to describe a masterpiece by listing every single brushstroke—the overall picture gets lost in the details. In a similar vein, a model with too many features might struggle to grasp the underlying patterns in the data.
A useful mathematical concept in this context is Mutual Information. MI quantifies the amount of information obtained about one random variable through another random variable. In simple terms, a high MI between a feature and the target indicates that the feature is very informative—like a well-cooked pizza with the perfect cheese-to-crust ratio.
The formula for Mutual Information is:
$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y)\,\log\!\left(\frac{p(x,y)}{p(x)\,p(y)}\right)$$
where:
$I(X;Y)$ measures the shared information between feature $X$ and target $Y$.
$p(x,y)$ is the joint probability distribution of $X$ and $Y$, while $p(x)$ and $p(y)$ are the marginal distributions.
A higher MI means that knowing X gives you a lot of information about Y.
Let’s see how this might look in code:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
# Step 1: Generate random data for classification
# X will have 20 features, and y_benchmark will be the target variable.
# We'll specify that only a few of these features are actually informative.
X, y_benchmark = make_classification(
n_samples=1000,
n_features=20,
n_informative=5,
n_redundant=5,
n_clusters_per_class=1,
random_state=42
)
# Convert X to a DataFrame for better interpretability
feature_names = [f"Feature_{i}" for i in range(20)]
# Add some fun feature names to simulate real-world scenarios
feature_names[3] = "P/E ratio"
feature_names[7] = "Interest rates"
feature_names[11] = "Pizza consumption at HQ"
feature_names[15] = "Moon phase"
feature_names[19] = "Stock color preference"
X = pd.DataFrame(X, columns=feature_names)
# Step 2: Calculate Mutual Information scores
mi_scores = mutual_info_classif(X, y_benchmark, random_state=42)
# Step 3: Identify top features with MI score > 0.1
threshold = 0.1
top_features = X.columns[mi_scores > threshold]
# Print the top features
print("Top features:", list(top_features))
# output: Top features: ['Feature_0', 'Feature_10', 'Pizza consumption at HQ', 'Feature_17', 'Feature_18']
Look at this! If 'Pizza consumption at HQ' turns out to be a top feature, you might need to re-examine your data sources! 🤣
The rest looks reasonable: we keep the features with MI scores above a threshold—in this example, 0.1. By selecting only the features that share a significant amount of information with your target, you remove the unnecessary noise that can cloud your model’s performance. Think of it as creating a dish where every ingredient plays a purposeful role.
Enhancing stability
A model built with research-phase feature selection is like a well-planned road trip: you know your destination, and your route is clear and efficient. In contrast, if you decide which features to use after the model has been constructed, you might end up with a model that behaves unpredictably—similar to taking random detours on a road trip without a map.
To visualize this concept, consider two performance curves:
Early selection: A smooth, stable curve that suggests consistent performance.
Late selection: An erratic curve that resembles a heart rate monitor during a horror movie.
You can get an idea by simulating these two scenarios.
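Below is a quick sketch; the drift and volatility numbers are made up and chosen only to exaggerate the contrast between the two regimes:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.default_rng(42)
n_days = 252
# Toy assumption: early selection behaves like a low-volatility equity curve,
# late selection like a high-volatility one
early_returns = rng.normal(0.0006, 0.005, n_days)
late_returns = rng.normal(0.0006, 0.02, n_days)
early_curve = np.cumprod(1 + early_returns)
late_curve = np.cumprod(1 + late_returns)
plt.plot(early_curve, label="Early selection (stable)")
plt.plot(late_curve, label="Late selection (erratic)")
plt.xlabel("Trading day")
plt.ylabel("Cumulative growth")
plt.legend()
plt.show()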
The smooth curve shows why early feature selection results in a stable model, while the erratic curve illustrates how delaying feature selection can lead to unpredictability. This visualization reinforces our claim that early pruning of the feature set leads to better and more stable model performance.
With these plots reinforcing the stability advantage, let’s now move on to discussing the pitfalls of delayed feature selection.
The pitfalls of delayed feature selection
As we already know, delaying feature selection until after your model is built is like inviting every ingredient to the party and then trying to kick out the uninvited guests mid-dinner. While some modern algorithms might shrink the influence of irrelevant features, this approach can still lead to a model that is unnecessarily complex.
One popular method for automatic feature selection is LASSO regression. LASSO adds a penalty term to the regression objective, effectively shrinking some of the feature coefficients toward zero. While this can help in reducing the number of active features, it is not a substitute for thoughtful, pre-model selection.
The LASSO objective function is:
$$\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$$
where:
$\|y - X\beta\|_2^2$ measures how well the model’s predictions match the actual data.
$\lambda \|\beta\|_1$ is the penalty term, with $\lambda$ controlling the amount of shrinkage.
The L1 norm ($\|\beta\|_1$) promotes sparsity by encouraging some coefficients to be exactly zero.
Let’s look at the corresponding code:
import numpy as np
from sklearn.linear_model import Lasso
import pandas as pd
# Generate random data for X_train (features) and y_train (target)
np.random.seed(42) # For reproducibility
n_samples = 100 # Number of samples
n_features = 5 # Number of features
# Randomly generate feature matrix X_train
X_train = np.random.rand(n_samples, n_features)
# Randomly generate target vector y_train
y_train = np.random.rand(n_samples)
# Convert X_train to a DataFrame for better interpretability (optional)
feature_names = ['Inflation', 'Twitter sentiment', 'Number of Elon Musk memes', 'Unemployment rate', 'Stock market index']
X = pd.DataFrame(X_train, columns=feature_names)
# Set up a Lasso regression model with a moderate regularization parameter
lasso = Lasso(alpha=0.01) # Alpha here is equivalent to lambda
# Fit the model
lasso.fit(X_train, y_train)
# Identify surviving features (non-zero coefficients)
surviving_features = X.columns[lasso.coef_ != 0]
print("Surviving features:", list(surviving_features))
# output: Surviving features: ['Inflation', 'Unemployment rate']
We instantiate a Lasso model with alpha=0.01—the regularization strength. After training, only features with non-zero coefficients are retained.
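To get a feel for how sensitive this post-hoc pruning is, you can continue from the snippet above and sweep a few arbitrary values of alpha; the surviving set typically shrinks, and can shift, as the penalty grows:
# Continuing from the snippet above: sweep a few arbitrary alpha values
for alpha in [0.001, 0.01, 0.05, 0.1]:
    model = Lasso(alpha=alpha).fit(X_train, y_train)
    kept = list(X.columns[model.coef_ != 0])
    print(f"alpha={alpha}: {kept}")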
When feature selection is delayed, your model ends up grappling with a surfeit of inputs, many of which might not have a clear relationship with the target. This situation is reminiscent of a busy restaurant kitchen where too many ingredients clutter the counter, leading to confusion rather than culinary excellence.
Without a pre-selection strategy, you might find that:
Your model’s performance becomes unstable, as it is sensitive to small fluctuations in irrelevant features.
Bias in feature selection → Unstable models.
Feature interaction blindness → Complex interactions.
Overcomplicating the model development process → Complicated and inefficient.
Delayed discovery of key features → Overlook important features.
Model generalization is compromised → Fast loss of predictive power.
This gives rise to two opposing views, leading to completely different results in terms of processes and objectives:
N features with N labels → Multi-feature, multi-target.
1 feature with 0 labels → Single feature, no prediction.
After several mistakes I realized that the most logical approach in this environment is to go further than the minimalist approach: from N features to 1 feature, and from N labels to 0. So, what are you left with? Basically, rules. These rules are tied to a specific benchmark for a particular algorithm:
Instead of: Features ⟷ Labels.
You use: Rules ⟷ Benchmark.
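As a rough illustration of the rules-versus-benchmark idea, here is a hypothetical single rule (a 50-day moving-average filter on simulated prices; the lookback and the data are placeholders) evaluated directly against the benchmark:
import numpy as np
import pandas as pd
# Hypothetical benchmark price series, simulated for illustration only
rng = np.random.default_rng(42)
benchmark = pd.Series(100 * np.cumprod(1 + rng.normal(0.0003, 0.01, 500)))
# One rule, no labels: be long only when the benchmark trades above its 50-day moving average
rule = (benchmark > benchmark.rolling(50).mean()).astype(int)
# Evaluate the rule against the benchmark itself rather than against predicted labels
benchmark_returns = benchmark.pct_change().fillna(0)
rule_returns = rule.shift(1).fillna(0) * benchmark_returns
print("Benchmark cumulative return:", (1 + benchmark_returns).prod() - 1)
print("Rule cumulative return:", (1 + rule_returns).prod() - 1)
The point is not this particular rule, but that its quality is judged directly against the benchmark rather than against a stack of labels.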
As you move forward in your research and trading endeavors, remember that every feature you add should have a clear purpose. When in doubt, ask yourself: Does this ingredient truly enhance the flavor of my model? If the answer is no, it might be best to leave it out of the recipe.
Until tomorrow!—may your data always be clean and your models highly accurate! 🚀📈
P.S. You don't mind helping me to improve the newsletter, do you? Let me know which ones you prefer!