Table of contents:
Introduction.
The network paradigm.
Model risks and limitations.
Spectral clustering.
Stochastic block models.
Markov clustering.
Leiden algorithm.
Graph autoencoders.
Contrastive learning.
Introduction
Look, here’s the thing—we’ve all been drinking the correlation Kool-Aid for decades, right? It’s elegant, sure. Clean math. Easy to explain to the PMs. But let’s get real: relying on a correlation matrix in today’s markets is like trying to sail a speedboat with an anchor chained to your hull. It’s not just simplistic—it’s dangerously misaligned with how the game actually works.
Think about it: correlations are backward-looking, linear, and static. Markets? They’re non-linear, regime-shifting, and chaotic. When volatility spikes, everything you thought was diversified suddenly moves in lockstep. Why? Because your pairwise analysis missed the latent feedback loops, the common risk factors blowing up across assets. You’re not measuring systemic risk—you’re just ignoring it until it bites you.
We lived this nightmare in 2018 during the volatility singularity. Not a Black Swan—everyone saw it coming. But our models? They were dead weight. Inverse-vol strategies blew up, futures cascaded into liquidation hell, and suddenly your low-correlation portfolio looked like a bunch of lemmings jumping off a cliff. Diversification? Gone. Hedging? Useless. Because the tools we used couldn’t see the dependencies hiding in plain sight.
So what’s the fix? You can’t just tweak the old models. You need to rip up the playbook and start over. There's much more to it than just matrices. Before we look at online clustering models in future articles, I want to take a moment to consider a different approach.
Model assets as nodes in a graph—dynamic, interconnected, with edges that evolve:
Use centrality measures to find the drivers.
Build minimum spanning trees to map contagion pathways.
Apply random matrix theory to filter noise from signal in eigenvalues.
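As a rough sketch of the second bullet, here is one common recipe: convert correlations to distances with Mantegna's transform d_ij = sqrt(2(1 − ρ_ij)) and extract the minimum spanning tree with scipy. The five-asset return series below are simulated stand-ins, not real data.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

# Mock correlated returns for 5 assets; in practice use real return series.
rng = np.random.default_rng(0)
returns = rng.standard_normal((5, 500))
returns[0] += returns[1] * 0.8          # make assets 0 and 1 co-move

corr = np.corrcoef(returns)
# Mantegna's transform maps correlation to a metric distance in [0, 2].
dist = np.sqrt(2.0 * (1.0 - corr))
np.fill_diagonal(dist, 0.0)

# The MST keeps the n-1 strongest links: a sparse map of contagion pathways.
mst = minimum_spanning_tree(dist).toarray()
edges = np.transpose(np.nonzero(mst))
print(edges)  # (i, j) pairs of the retained edges
```

The strongest link (assets 0 and 1 here) is always retained, which is exactly why the MST is a natural backbone for contagion analysis.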
The path forward required a complete abandonment of the flawed correlation matrix in favor of a graph-theoretic representation of the market. We shifted from viewing assets as solitary entities with simple pairwise handshakes to seeing them as nodes in a dynamic, living, and breathing system.
Take a look at this paper; it will serve as an index:
Okay, before we continue, I want to make one thing clear. These tools aren't for generating signals, but rather for contextualizing the market. That said, let's get started!
The network paradigm
A correlation matrix fails because it fundamentally atomizes the market. It presents a world of isolated relationships, ignoring the complex web of influence that truly governs financial dynamics. The relationship between Apple and Microsoft is not a simple, self-contained story. It is deeply modulated by their shared dependencies on the semiconductor industry (e.g., TSMC, NVDA), their competition in the consumer technology space, their collective sensitivity to macroeconomic policy, and the subtle shifts in sentiment flowing through the entire system. A graph, formally denoted as G=(V, E), provides a far richer and more honest framework.
Here, the assets are vertices (V), and their relationships—the edges (E)—can capture a universe of nuance.
The power of this paradigm lies in its expressive flexibility. Edge weights, w_ij, are no longer confined to the Pearson correlation coefficient. They can be engineered to represent more sophisticated concepts:
Mutual information: A measure from information theory that captures any statistical dependency, linear or non-linear, between two assets' return distributions.
Granger causality: A statistical test to determine if the past returns of one asset can be used to forecast the future returns of another, hinting at directional influence.
Shared factor exposure: The degree to which two assets are exposed to the same latent risk factors derived from models like PCA or deep autoencoders.
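A hedged sketch of the first option: sklearn's mutual_info_regression can estimate pairwise mutual information, and, unlike Pearson correlation, it flags a purely non-linear dependence. The data below is simulated, and the |x| relationship is contrived for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(1)
n_assets, n_days = 4, 500
returns = rng.standard_normal((n_assets, n_days))
# Give asset 1 a *non-linear* dependence on asset 0: invisible to Pearson.
returns[1] = np.abs(returns[0]) + 0.1 * rng.standard_normal(n_days)

# Pairwise mutual-information edge weights (symmetrized, since the
# k-NN estimator is not exactly symmetric in finite samples).
mi = np.zeros((n_assets, n_assets))
for i in range(n_assets):
    for j in range(i + 1, n_assets):
        w = mutual_info_regression(
            returns[i].reshape(-1, 1), returns[j], random_state=0
        )[0]
        mi[i, j] = mi[j, i] = w

print(mi.round(2))
```

On this toy data the Pearson correlation between assets 0 and 1 is near zero while their mutual information is large, which is the whole point of the richer edge weight.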
This leap from a flat matrix to a structured graph unlocks a new dimension of structural analysis. We can finally move beyond the myopic view of individual connections and begin to analyze the market's grand architecture. The questions we can now ask are vastly more insightful:
Centrality:
Centrality measures identify the most influential nodes in the network, the assets that are systemically important in ways a simple market cap weighting would miss.
Degree centrality: The number of direct connections a node has. An asset with high degree centrality is a local information hub. In trading terms, this might be a bellwether stock like JPMorgan in the financial sector, whose daily movements have a direct and immediate impact on a large number of other financial instruments.
Betweenness centrality: Measures how often a node lies on the shortest path between two other nodes. An asset with high betweenness centrality acts as a crucial bridge, connecting disparate clusters of the market. Think of a major logistics company like UPS; it may not be the largest company in any one sector, but it is fundamentally critical for connecting the manufacturing, retail, and consumer clusters. Its distress can signal a breakdown in the broader economic machine.
Eigenvector centrality: This is a more subtle measure of influence. A node has high eigenvector centrality if it is connected to other nodes that are themselves highly connected. This identifies assets that are central to the most important conversations in the market. Google is a prime example; it is not just highly connected, it is connected to other giants like Apple, Amazon, and Microsoft, making it a core component of the market's structure.
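These three measures are a few lines with networkx. The toy graph below, two cliques joined by a bridge node with placeholder tickers, is only an illustration of the mechanics, not real market structure.

```python
import networkx as nx

# Toy market graph: a "financials" clique, a "tech" clique, and one
# bridge node connecting them (a stand-in for the UPS example above).
G = nx.Graph()
G.add_edges_from([("JPM", "BAC"), ("JPM", "GS"), ("BAC", "GS")])          # cluster A
G.add_edges_from([("AAPL", "MSFT"), ("AAPL", "GOOG"), ("MSFT", "GOOG")])  # cluster B
G.add_edges_from([("UPS", "JPM"), ("UPS", "AAPL")])                       # the bridge

degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G, max_iter=1000)

# The bridge node dominates betweenness despite having only two edges.
print(max(betweenness, key=betweenness.get))  # UPS
```

Note how the three measures disagree: JPM wins on degree, UPS on betweenness. That disagreement is precisely the information a single correlation number throws away.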
Community structure:
The core of the article is the search for communities—robust clusters of assets that behave as cohesive economic blocs. These are groups of stocks that are densely interconnected internally but only sparsely connected to the rest of the market. Identifying these functional groups, which often defy traditional sector classifications like GICS, is the first step toward building a truly diversified portfolio based on behavior, not labels.
Network typology:
Does the market exhibit small-world properties, where any asset is just a few steps away from any other, or scale-free properties, where a few hub nodes have a vast number of connections, following a power law? Understanding these deep structural characteristics is critical for modeling how shocks are likely to propagate. A scale-free market is more resilient to random failures but terrifyingly vulnerable to targeted attacks on its central hubs.
The primary obstacle to adopting this paradigm was not technical, but cultural. It demanded that a team fluent in the comfortable world of linear algebra develop a new literacy in the abstract and sometimes counter-intuitive language of graph theory.
Model risks and limitations
To believe that the adoption of graph-theoretic models is a panacea is to swap one form of naivety for another. While our new map is undeniably more truthful than the old, it is also vastly more complex, and complexity is a fertile breeding ground for new and more subtle forms of risk. Acknowledging these perils is not a sign of weakness but the hallmark of a mature quantitative process. The most dangerous model is the one whose limitations are not understood.
Hyperparameters and model instability:
The elegant theories underpinning these models often conceal a messy reality: their performance is highly sensitive to the choice of hyperparameters. This is a crucial and challenging task. For the deep learning models—GAE, ARGA, MVGRL—we must specify learning rates, the dimensions of latent spaces, the number of GCN and fully-connected layers, and the number of training epochs. A slight change in the latent dimension from 32 to 64 can dramatically alter the resulting cluster topology.
This problem is not unique to deep learning. The Markov clustering algorithm's output is governed by the inflation parameter; a value of 1.5 might yield a few large communities, while 2.5 could fracture the market into dozens of micro-clusters. The optimal set of hyperparameters for today's market data may be suboptimal tomorrow. This creates a risk of overfitting to a specific market regime, leading to models that are brittle and fail to generalize—a far more insidious problem than the straightforward failures of the correlation matrix.
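To see this sensitivity concretely, here is a deliberately minimal Markov clustering loop (expansion by matrix squaring, inflation by element-wise powering). It is a sketch of the mechanism, not a production MCL implementation, and on real market graphs the cluster counts at inflation 1.5 versus 2.5 can differ sharply.

```python
import numpy as np

def mcl_clusters(A, inflation, n_iter=60):
    """Minimal Markov clustering: returns the set of clusters found on adjacency A."""
    M = A + np.eye(len(A))                    # self-loops aid convergence
    M = M / M.sum(axis=0, keepdims=True)      # column-stochastic
    for _ in range(n_iter):
        M = M @ M                             # expansion
        M = np.power(M, inflation)            # inflation
        M = M / M.sum(axis=0, keepdims=True)
    # Rows of attractor nodes with residual mass define the clusters.
    return {frozenset(np.flatnonzero(row > 1e-6)) for row in M if row.sum() > 1e-6}

# Two triangles bridged by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

for inflation in (1.5, 2.5):
    print(inflation, len(mcl_clusters(A, inflation)))
```

The single knob `inflation` controls granularity end to end; that is exactly the regime-overfitting surface described above.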
Ground truth:
A fundamental epistemological risk lies in the fact that, in financial markets, there is no such thing as objective ground truth for clusters. The academic datasets used to benchmark these algorithms come with predefined class labels. This allows for the calculation of clean evaluation metrics like Accuracy, Normalized Mutual Information, and Adjusted Rand Index. Live market data offers no such labels, so none of these metrics can certify that a discovered partition is economically real rather than an artifact of the model.
Scalability:
While our experiments were on standard benchmarks, financial graphs can contain millions of nodes—individual securities, options contracts, derivatives—and billions of edges.
The O(n³) complexity of a full eigendecomposition for Spectral Clustering makes it infeasible for a universe of more than a few thousand assets.
While GNNs are more scalable, training them on massive graphs is a significant computational and engineering feat. It requires specialized hardware and specific data pipelines for constructing and updating the graphs, and considerable expertise in distributed computing.
This computational burden is not just a one-time cost. For these models to be effective, they must be retrained frequently.
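For reference, the usual escape from the full O(n³) eigendecomposition is an iterative solver that extracts only the k smallest eigenpairs. Below is a sketch with scipy's ARPACK wrapper on a random sparse graph; the tiny negative shift is a common trick to keep the singular Laplacian factorizable in shift-invert mode.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

rng = np.random.default_rng(7)
n = 200
# Random sparse symmetric affinity matrix (a stand-in for a filtered market graph).
A = (rng.random((n, n)) > 0.95).astype(float)
A = np.maximum(A, A.T)
np.fill_diagonal(A, 0)

L = csc_matrix(laplacian(A, normed=False))

# Extract only the k smallest eigenpairs via ARPACK in shift-invert mode,
# instead of decomposing the whole matrix.
k = 6
vals, vecs = eigsh(L, k=k, sigma=-1e-6, which="LM")

# Spot-check against the full O(n^3) decomposition on this small example.
full_vals = np.linalg.eigvalsh(L.toarray())[:k]
print(np.allclose(np.sort(vals), full_vals, atol=1e-5))
```

On a truly large universe the Laplacian would never be densified; the sparse path above is the only one that scales.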
Drift:
Perhaps the greatest risk of all is that the market itself refuses to sit still. The graph structures we so carefully identify are transient. The communities that define the market today can and will dissolve tomorrow. A merger, a technological breakthrough, a geopolitical event, or a simple shift in investor sentiment can instantly rewire the network.
All of these models are trained on historical data, and thus they carry an implicit assumption of stationarity. They are excellent at finding the structure in the data they have seen, but they are not clairvoyant. The phenomenon of concept drift—where the statistical properties of the target variable change over time—is the primary antagonist in all of financial modeling.
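One cheap drift monitor is to rebuild the thresholded graph on rolling windows and track edge turnover. The sketch below simulates a regime break at mid-sample and measures the Jaccard overlap of the two edge sets; the 0.4 threshold is an arbitrary choice for illustration.

```python
import numpy as np

def edge_set(returns, threshold=0.4):
    """Edges of the thresholded |correlation| graph, as (i, j) pairs with i < j."""
    c = np.abs(np.corrcoef(returns))
    np.fill_diagonal(c, 0)
    i, j = np.nonzero(np.triu(c > threshold))
    return set(zip(i.tolist(), j.tolist()))

rng = np.random.default_rng(3)
n_assets, n_days = 20, 600
returns = rng.standard_normal((n_assets, n_days))
common = rng.standard_normal(n_days)
returns[:10, :300] += common[:300]     # regime 1: assets 0-9 share a factor
returns[10:, 300:] += common[300:]     # regime 2: the factor migrates to assets 10-19

e1, e2 = edge_set(returns[:, :300]), edge_set(returns[:, 300:])
jaccard = len(e1 & e2) / max(len(e1 | e2), 1)
print(f"edge-set Jaccard across regimes: {jaccard:.2f}")  # near 0: the network rewired
```

A Jaccard similarity collapsing toward zero is a loud signal that any clustering fitted on the earlier window is already stale.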
Spectral clustering
Once the market is mapped as a graph, the immediate imperative is to partition it into its constituent communities. We need to find those dense clusters of assets that move together, breathe together, and, in times of crisis, fall together. Spectral clustering is built for exactly this task. Unlike distance-based algorithms like k-means, which flounder in the high-dimensional, non-Euclidean space of financial assets, spectral clustering operates on the very soul of the graph: its connectivity structure.
The magic of the method comes from the Graph Laplacian, a matrix that is central to graph theory. It is formally defined as L=D−A, where A is the graph's weighted adjacency matrix and D is the diagonal degree matrix containing the sum of weights for each node. The Laplacian is a discrete analogue of the Laplace operator from vector calculus and it encodes information about the graph's structure. Its spectrum—the set of its eigenvalues and their corresponding eigenvectors—reveals the most natural cuts in the network fabric.
The key insights are as follows:
The smallest eigenvalue of a Laplacian is always 0.
The number of zero eigenvalues is equal to the number of connected components in the graph. A market graph should ideally be one single component.
The eigenvector corresponding to the second-smallest eigenvalue, lambda2, often called the Fiedler vector, holds the key to the graph's most fundamental partition. The signs of its elements (+ or -) naturally bisect the graph into two well-separated communities.
To find k clusters, we don't just use the Fiedler vector. We compute the first k eigenvectors corresponding to the k smallest non-zero eigenvalues. We then stack these eigenvectors column-wise to form a new matrix U ∈ R^(n×k), where n is the number of assets. Each row of this matrix is a new, low-dimensional coordinate for an asset—a spectral embedding.
In this transformed space, the messy, high-dimensional clusters become trivially separable, and we can deploy a simple algorithm like k-means to finish the job. This technique effectively uncovers functional asset groupings that transcend traditional, and often arbitrary, GICS sector labels.
The search for these eigenvectors can be framed as an optimization problem via the Courant–Fischer min-max theorem. The second-smallest eigenvalue lambda2, for instance, is the solution to:

lambda2 = min { (x^T L x) / (x^T x) : x ⊥ d, x ≠ 0 }

where d is the eigenvector for lambda1 = 0 (a vector of ones). The term x^T L x has a beautiful interpretation, as revealed by the Rayleigh quotient identity:

x^T L x = (1/2) Σ_{i,j} w_ij (x_i − x_j)²

Minimizing this expression means assigning values x_i to each node such that the squared differences across connected nodes are small. Intuitively, this forces nodes within a dense cluster to have very similar x values, while nodes in different clusters will have dissimilar values, creating a cut.
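This identity is easy to verify numerically on any weighted graph:

```python
import numpy as np
from scipy.sparse import csgraph

rng = np.random.default_rng(0)
# Random symmetric weighted adjacency with zero diagonal.
W = rng.random((6, 6))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)

L = csgraph.laplacian(W, normed=False)
x = rng.standard_normal(6)

# x^T L x equals half the weighted sum of squared differences across edges.
quadratic_form = x @ L @ x
pairwise_sum = 0.5 * np.sum(W * (x[:, None] - x[None, :]) ** 2)
print(np.isclose(quadratic_form, pairwise_sum))  # True
```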
Here is the snippet for the workflow.
import numpy as np
from sklearn.cluster import KMeans
from scipy.sparse import csgraph
import matplotlib.pyplot as plt
import seaborn as sns
def spectral_clustering_implementation(returns_matrix, num_clusters=5, threshold=0.5):
"""
Performs Spectral Clustering on an asset returns matrix. This method leverages the
eigenvectors of the graph Laplacian to find meaningful clusters in the data.
Args:
returns_matrix (np.ndarray): A matrix of asset returns where rows are assets
and columns are time steps (e.g., daily returns).
num_clusters (int): The desired number of clusters to partition the assets into.
threshold (float): A value from 0 to 1 to prune weak correlations, creating a
sparser, more meaningful graph. Edges with weights below this
are set to zero.
Returns:
tuple: A tuple containing:
- np.ndarray: An array of cluster labels assigning each asset to a cluster.
- np.ndarray: The low-dimensional spectral embedding of the assets.
- np.ndarray: The eigenvalues of the Laplacian.
"""
print("1. Calculating the asset-asset correlation matrix from returns...")
# np.corrcoef calculates the Pearson correlation coefficient between rows.
# The result is a symmetric matrix where C[i, j] is the correlation between asset i and j.
corr_matrix = np.corrcoef(returns_matrix)
# We use the absolute value of the correlation as a measure of affinity (similarity).
# A strong negative correlation is just as significant a relationship as a strong positive one.
affinity_matrix = np.abs(corr_matrix)
# Prune weak connections to reduce noise and focus on strong relationships.
affinity_matrix[affinity_matrix < threshold] = 0
# The diagonal should be 1, representing an asset's perfect correlation with itself,
# but for graph purposes, self-loops are often handled separately or set to 0.
np.fill_diagonal(affinity_matrix, 0)
print(f"2. Affinity matrix created with shape: {affinity_matrix.shape}")
print("3. Computing the unnormalized Graph Laplacian (L = D - A)...")
# The csgraph.laplacian function from scipy is numerically stable and efficient.
# 'normed=False' specifies the unnormalized Laplacian.
laplacian_matrix = csgraph.laplacian(affinity_matrix, normed=False)
print("4. Performing eigendecomposition to find the spectrum of the Laplacian...")
# We need the eigenvectors corresponding to the smallest eigenvalues.
# np.linalg.eigh is optimized for symmetric matrices like the Laplacian.
eigenvalues, eigenvectors = np.linalg.eigh(laplacian_matrix)
print(f"5. Creating spectral embedding with the first {num_clusters} non-trivial eigenvectors...")
# The first eigenvector (at index 0) corresponds to the smallest eigenvalue (ideally 0 for a
# connected graph) and is a constant vector, providing no clustering information.
# We take the next 'num_clusters' eigenvectors to form our new feature space.
spectral_embedding = eigenvectors[:, 1:num_clusters+1]
print("6. Running K-Means on the low-dimensional spectral embedding...")
# K-Means is highly effective in this new space, as the clusters are well-separated.
kmeans = KMeans(n_clusters=num_clusters, random_state=42, n_init='auto')
clusters = kmeans.fit_predict(spectral_embedding)
print("7. Spectral Clustering complete.")
return clusters, spectral_embedding, eigenvalues
# Main
if __name__ == "__main__":
# Generate a mock returns matrix for 50 assets over 252 days (1 trading year)
np.random.seed(5)
num_assets = 50
num_days = 252
mock_returns = np.random.randn(num_assets, num_days)
# Add some latent structure: make groups of assets correlated to simulate sectors.
mock_returns[0:10, :] += np.random.randn(1, num_days) * 0.5 # Cluster 1
mock_returns[10:25, :] += np.random.randn(1, num_days) * 0.5 # Cluster 2
mock_returns[25:35, :] += np.random.randn(1, num_days) * 0.5 # Cluster 3
# Cluster 4 is the remaining uncorrelated assets
# Perform spectral clustering
asset_clusters, embedding, evals = spectral_clustering_implementation(mock_returns, num_clusters=4, threshold=0.1)
print(f"\nCluster assignments for {num_assets} assets:")
print(asset_clusters)
We have generated two different plots:
Eigengap plot: This visualizes the first few eigenvalues of the Laplacian in increasing order. A common heuristic for determining the optimal number of clusters, k, is to look for a large gap between consecutive eigenvalues. The value of k is chosen to be the number of eigenvalues before this gap. This gap signifies that any further partitions would require cutting through a much denser part of the graph, making them less natural.
Spectral embedding plot: This is a scatter plot of the assets in their new coordinate system, defined by the first two meaningful eigenvectors—the Fiedler vector and the 3rd eigenvector. In this new space, the clusters which were tangled and inseparable in the original high-dimensional space now appear as distinct, well-separated clouds of points.
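The eigengap heuristic itself is a few lines. The sketch below plants three communities and recovers k from the largest gap; the block sizes and weights are arbitrary.

```python
import numpy as np
from scipy.sparse import csgraph

rng = np.random.default_rng(0)
n_blocks, block_size = 3, 10
n = n_blocks * block_size
# Weak random background affinity, dense within each planted block.
A = rng.random((n, n)) * 0.05
for b in range(n_blocks):
    s = slice(b * block_size, (b + 1) * block_size)
    A[s, s] = 0.9
A = (A + A.T) / 2
np.fill_diagonal(A, 0)

L = csgraph.laplacian(A, normed=False)
eigenvalues = np.linalg.eigvalsh(L)

# Eigengap heuristic: the largest jump among the leading eigenvalues suggests k.
gaps = np.diff(eigenvalues[:10])
k_hat = int(np.argmax(gaps)) + 1
print(k_hat)  # 3
```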
Stochastic block models
Spectral clustering provides a deterministic snapshot of the market's structure. However, as you already know, the market is anything but deterministic. To capture this inherent randomness, we turn to stochastic block models. SBMs take graph partitioning into the realm of statistical inference. Instead of just finding a single best partition, an SBM attempts to learn a probabilistic model of the market's latent community structure itself.
The core assumption of an SBM is that each asset i belongs to one of K hidden communities, represented by a membership vector z. The probability of an edge existing between two assets i and j then depends only on the communities to which they belong, zi and zj. The goal of the inference process is to find the community assignments z and the K × K inter-community probability matrix B that maximize the likelihood of observing the actual market graph A:

(B*, z*) = argmax_theta P(A | theta)

where theta = (B, z) are the model parameters. This is an interesting shift. We are no longer just describing the network; we are building a generative model that could, in theory, produce other market graphs with the same underlying community structure.
This model-based approach allows for:
Statistical inference: We can place confidence intervals on our cluster assignments and the inter-block connection probabilities.
Model selection: Principled methods like Bayesian Information Criterion or cross-validation can be used to determine the optimal number of latent communities, K.
The matrix B provides a concise summary of the entire market's mesoscale structure. We can see which communities are closely allied (high Brs), which are isolated, and which act as bridges.
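Given a set of assignments z, the matrix B can be estimated empirically as the observed edge density between each pair of blocks. Here is a sketch on a planted two-block graph; the sizes and probabilities are arbitrary.

```python
import numpy as np

def estimate_block_matrix(A, z, K):
    """Empirical inter-community edge densities B[r, s] from adjacency A."""
    B = np.zeros((K, K))
    sizes = np.array([(z == r).sum() for r in range(K)])
    for r in range(K):
        for s in range(K):
            block = A[np.ix_(z == r, z == s)]
            # Ordered pairs: n_r * n_s across blocks, n_r * (n_r - 1) within.
            pairs = sizes[r] * sizes[s] if r != s else sizes[r] * (sizes[r] - 1)
            B[r, s] = block.sum() / max(pairs, 1)
    return B

rng = np.random.default_rng(0)
z = np.repeat([0, 1], 15)
# Plant a graph: dense within blocks (p=0.8), sparse between (p=0.1).
P = np.where(z[:, None] == z[None, :], 0.8, 0.1)
A = (rng.random((30, 30)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T

B_hat = estimate_block_matrix(A, z, 2)
print(B_hat.round(2))  # diagonal near 0.8, off-diagonal near 0.1
```

Reading B_hat directly gives the mesoscale summary described above: high diagonal entries mark tight communities, and any large off-diagonal entry flags a pair of closely allied blocks.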
For financial markets, however, the standard SBM has a crippling flaw. It operates under the assumption of stochastic equivalence, meaning all nodes within the same block are statistically interchangeable in their connection patterns. This is patently false in finance. A community like big tech contains both behemoth hubs like Amazon and smaller, more peripheral players. They are not equivalent.
The essential upgrade is the Degree-Corrected SBM. The DC-SBM introduces an additional parameter thetai for each node, which controls its expected degree. This allows for massive heterogeneity of degrees within a single community. A hub and a peripheral stock can now coexist peacefully in the same cluster, distinguished by their theta values but united by their shared pattern of connections to other communities. This is a far more realistic model of market structure.
Inference for SBMs is typically performed using iterative algorithms like the Expectation-Maximization algorithm or variational inference methods.
Let’s code it:
import numpy as np
import matplotlib.pyplot as plt
def compute_likelihood(A, g, K):
"""
Compute the (degree-corrected) SBM log-likelihood (up to a constant)
for adjacency A and assignments g∈{0,…,K-1}.
"""
N = A.shape[0]
# Node degrees
k = A.sum(axis=1)
# Sum of degrees per block
kappa = np.array([k[g == r].sum() for r in range(K)])
# Edge counts between blocks: M[r, r] double-counts intra-block edges;
# for r != s each edge appears once in M[r, s] and once in M[s, r].
M = np.zeros((K, K))
for r in range(K):
for s in range(K):
M[r, s] = A[np.ix_(g == r, g == s)].sum()
L = 0.0
for r in range(K):
for s in range(r, K):
m_rs = M[r, s]
if m_rs <= 0:
continue
if r == s:
# intra-block term (M_rr is double-counted edges)
L += m_rs * np.log(m_rs / (kappa[r] * kappa[r]))
else:
# inter-block term
L += 2 * m_rs * np.log(m_rs / (kappa[r] * kappa[s]))
return L
def degree_corrected_sbm(A, K, max_iter=100, random_state=42, verbose=False):
"""
Greedy maximize DC-SBM likelihood by re-assigning one node at a time.
"""
N = A.shape[0]
rng = np.random.RandomState(random_state)
# random initial assignment
g = rng.randint(K, size=N)
L = compute_likelihood(A, g, K)
if verbose:
print(f"Initial likelihood: {L:.4f}")
for it in range(max_iter):
moved = False
for i in range(N):
best_r, best_L = g[i], L
for r in range(K):
if r == g[i]:
continue
g_tmp = g.copy()
g_tmp[i] = r
L_tmp = compute_likelihood(A, g_tmp, K)
if L_tmp > best_L:
best_r, best_L = r, L_tmp
if best_r != g[i]:
if verbose:
print(f" Iter {it}, move node {i} → {best_r}, ΔL={(best_L - L):.4f}")
g[i], L = best_r, best_L
moved = True
if not moved:
if verbose:
print(f"Converged after {it} iterations.")
break
return g
def sbm_clustering_implementation(returns_matrix, num_clusters=5, threshold=0.5,
max_iter=100, random_state=42, verbose=False):
"""
Performs degree‐corrected SBM clustering on asset returns.
Args:
returns_matrix (np.ndarray): assets × time returns.
num_clusters (int): number of communities.
threshold (float): binarize |corr| > threshold → edge.
max_iter (int): max passes for greedy optimization.
random_state (int): seed for reproducibility.
verbose (bool): print progress.
Returns:
clusters (np.ndarray): cluster label in {0..num_clusters-1} for each asset.
adjacency_matrix (np.ndarray): the binarized graph.
"""
# 1. compute asset–asset correlations
corr = np.corrcoef(returns_matrix)
# 2. threshold to get an unweighted graph
A = (np.abs(corr) > threshold).astype(int)
np.fill_diagonal(A, 0)
if verbose:
print("Adjacency matrix built — running DC-SBM...")
clusters = degree_corrected_sbm(
A, num_clusters,
max_iter=max_iter,
random_state=random_state,
verbose=verbose
)
if verbose:
print("Clustering complete.")
return clusters, A
# Main
if __name__ == "__main__":
np.random.seed(42)
num_assets = 50
# simulate 1 year of daily returns
mock_returns = np.random.randn(num_assets, 252)
# inject two correlated groups
mock_returns[ 0:10, :] += np.random.randn(1, 252) * 0.5
mock_returns[10:25, :] += np.random.randn(1, 252) * 0.5
sbm_clusters, sbm_adj = sbm_clustering_implementation(
mock_returns,
num_clusters=4,
threshold=0.2,
max_iter=50,
verbose=True
)
print("\nCluster assignments:", sbm_clusters)
The code generates a heatmap of the adjacency matrix after its rows and columns have been reordered according to the cluster assignments found by the SBM. Instead of a noisy, seemingly random pattern of connections, the reordered matrix will reveal a clear block structure. Dense, bright squares appear along the diagonal, representing the high intra-cluster connectivity. The off-diagonal regions are sparse and dark, representing the low inter-cluster connectivity.
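The reordering step described here is a one-liner with np.ix_. Below is a sketch on a planted three-cluster graph; the heatmap itself would be a standard plt.imshow call on A_sorted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
labels = rng.permutation(np.repeat([0, 1, 2], 10))   # scrambled cluster labels
# Planted adjacency: dense within clusters, sparse between.
P = np.where(labels[:, None] == labels[None, :], 0.7, 0.05)
A = (rng.random((n, n)) < P).astype(int)
A = np.triu(A, 1); A = A + A.T

# Reorder rows and columns by cluster label: the block structure appears.
order = np.argsort(labels, kind="stable")
A_sorted = A[np.ix_(order, order)]

diag_density = np.mean([A_sorted[i*10:(i+1)*10, i*10:(i+1)*10].mean() for i in range(3)])
off_density = A_sorted[:10, 10:].mean()
print(f"diagonal blocks: {diag_density:.2f}, off-diagonal: {off_density:.2f}")
```

The gap between the two printed densities is the visual "bright squares on the diagonal" effect, measured numerically.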