QR decomposition, a classical linear algebra technique, can be used to detect and eliminate linearly dependent and near-dependent columns in the feature matrix.
In machine learning, real-world datasets often suffer from multicollinearity, a phenomenon where feature vectors are linearly dependent or nearly dependent. This redundancy can lead to several critical issues:
- Numerical instability: Ill-conditioned matrices make matrix inversion unreliable.
- High variance: Regression coefficients become unstable and highly sensitive to small changes in the data.
- Overfitting: Model generalisation suffers because of noise-sensitive features.
While traditional methods like correlation matrices identify pairwise relationships, they often fail to detect dependence that involves three or more features at once. This article proposes a systematic approach that uses QR decomposition to detect and eliminate linearly dependent and near-dependent columns in the feature matrix.
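To make the limitation of pairwise correlation concrete, here is a minimal sketch on synthetic data (the matrix, the seed, and the variable names are illustrative assumptions, not part of any real dataset). A fifth feature is built as the exact sum of four independent ones, so no single pair looks alarming, yet the matrix has lost a rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500

# Four independent synthetic features
base = rng.standard_normal((n_samples, 4))
# A fifth feature that is an exact linear combination of the other four
total = base.sum(axis=1, keepdims=True)
X = np.hstack([base, total])

# Largest absolute pairwise correlation, ignoring the diagonal
corr = np.corrcoef(X, rowvar=False)
off_diagonal = np.abs(corr[~np.eye(X.shape[1], dtype=bool)])
max_abs_corr = off_diagonal.max()

rank = np.linalg.matrix_rank(X)
print("Largest pairwise |correlation|:", round(float(max_abs_corr), 2))
print("Rank:", rank, "of", X.shape[1], "columns")
```

The largest pairwise correlation stays moderate (around 0.5 in expectation for this construction), while the rank is one less than the number of columns.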
Mathematical foundation
Let $X \in \mathbb{R}^{m \times n}$ be a feature matrix, where:
- m = number of samples
- n = number of features
We now introduce QR decomposition as a principled alternative. Any such matrix X can be factored as

$$X = QR$$

where:
- Q is an m×m orthogonal matrix, and
- R is an m×n upper triangular matrix.
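The diagonal entries of the upper triangular factor R reveal linear dependence among the columns of X: roughly, |r_kk| measures how much of column k remains after removing its projection onto the preceding columns. If a diagonal entry is zero, or negligibly small relative to the largest one, X is rank deficient or nearly so.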
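As a quick check of the shapes and properties just described, the following sketch factors a small random matrix (synthetic, chosen only for illustration). Note that `np.linalg.qr` returns the reduced factorisation by default; `mode="complete"` gives the m×m Q and m×n R used in the definition above.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))  # m = 6 samples, n = 3 features (synthetic)

# Complete factorisation: Q is m x m, R is m x n
Q_c, R_c = np.linalg.qr(X, mode="complete")
# Reduced factorisation (NumPy's default): Q is m x n, R is n x n
Q_r, R_r = np.linalg.qr(X)

print(Q_c.shape, R_c.shape)  # (6, 6) (6, 3)
print(Q_r.shape, R_r.shape)  # (6, 3) (3, 3)

# Q is orthogonal, R is upper triangular, and Q @ R reproduces X
is_orthogonal = np.allclose(Q_c.T @ Q_c, np.eye(6))
is_upper = np.allclose(np.tril(R_c, -1), 0.0)
reconstructs = np.allclose(Q_c @ R_c, X)
print(is_orthogonal, is_upper, reconstructs)
```

Both modes produce the same diagonal of R (the complete R simply carries extra zero rows), so either can be used for the dependency check.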
Practical implementation in Python
Case A: Full rank matrix
In the case of a full rank matrix all the diagonal entries of R will be non-zero. Let’s take an example:
```python
import numpy as np

# Full rank 3x3 matrix
X_full = np.array([
    [1, 2, 3],
    [0, 1, 4],
    [5, 6, 0]
])

Q_full, R_full = np.linalg.qr(X_full)

print("Full Rank Matrix R:")
print(R_full)
print("Rank:", np.linalg.matrix_rank(X_full))
```

Output:

```
Full Rank Matrix R:
[[-5.09 -6.27 -0.58]
 [ 0.   -1.27 -4.96]
 [ 0.    0.    0.15]]
Rank: 3
```
Observation: The rank is 3, and all diagonal values are significantly different from zero.
Case B: Rank deficient matrix
```python
import numpy as np

# Rank deficient 3x3 matrix: columns 2 and 3 are multiples of column 1
X_def = np.array([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9]
])

Q_def, R_def = np.linalg.qr(X_def)

print("\nRank Deficient Matrix R:")
print(R_def)
print("Rank:", np.linalg.matrix_rank(X_def))
```

Output:

```
Rank Deficient Matrix R:
[[-3.74      -7.48      -11.22    ]
 [ 0.         1.98e-15   3.97e-15 ]
 [ 0.         0.        -3.94e-31 ]]
Rank: 1
```
Observation: Two of the diagonal elements are almost 0 (1.98×10⁻¹⁵ and -3.94×10⁻³¹), so the rank is 1.
- r₁₁ = -3.74 ≠ 0
- r₂₂ ≈ 0
- r₃₃ ≈ 0

In floating-point arithmetic, exact zeros rarely appear. Diagonal entries at the level of rounding error (around 10⁻¹⁵ here, against a largest entry of 3.74) are treated as numerical zeros; what matters is their size relative to the largest entry, not their absolute value.
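One way to turn "numerically zero" into an explicit test is a relative threshold on the diagonal of R. The sketch below is an illustration only: the helper name `small_diagonal_mask` and the default `rtol=1e-10` are assumptions chosen for this example, and a suitable tolerance depends on the scale and noise level of the data.

```python
import numpy as np

def small_diagonal_mask(R, rtol=1e-10):
    """Flag diagonal entries of R that are negligible relative to the largest one.

    rtol is an assumed, adjustable threshold, not a universal constant.
    """
    d = np.abs(np.diag(R))
    return d <= rtol * d.max()

X_def = np.array([[1, 2, 3],
                  [2, 4, 6],
                  [3, 6, 9]], dtype=float)
_, R_def = np.linalg.qr(X_def)

mask = small_diagonal_mask(R_def)
print("Numerically zero diagonal entries:", mask)
print("Count:", int(mask.sum()))
```

For the rank deficient matrix of Case B, the second and third diagonal entries are flagged and the first is not.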
Since only the first diagonal entry of R is significantly non-zero, the matrix has rank 1. This implies that the columns of X span a single direction: the second and third columns are scalar multiples of the first (2 and 3 times it, respectively).
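The dependence can also be confirmed directly by regressing the remaining columns on the first one. This is a small sketch using `np.linalg.lstsq`; the variable names are illustrative.

```python
import numpy as np

X_def = np.array([[1, 2, 3],
                  [2, 4, 6],
                  [3, 6, 9]], dtype=float)

first = X_def[:, [0]]   # first column, kept 2-D with shape (3, 1)
others = X_def[:, 1:]   # second and third columns, shape (3, 2)

# Least-squares coefficients c such that first @ c approximates others
coeffs, _, _, _ = np.linalg.lstsq(first, others, rcond=None)
max_residual = np.abs(first @ coeffs - others).max()

print("Coefficients:", coeffs)
print("Largest residual:", max_residual)
```

The coefficients come out as approximately 2 and 3 with a residual at rounding-error level, which is exactly what "linear combination of the first column" means here.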
Case C: Real-world application: NASA Turbofan dataset
To demonstrate this at scale, we apply QR decomposition to the NASA C-MAPSS Turbofan Engine Degradation Dataset. The training file train_FD001.txt is available on Kaggle at
https://www.kaggle.com/datasets/bishals098/nasa-turbofan-engine-degradation-simulation?resource=download&select=train_FD001.txt:
```python
import numpy as np
import pandas as pd

# Load NASA dataset (space-separated, no header row)
df = pd.read_csv("train_FD001.txt", sep=" ", header=None)

# Keep the first 26 columns (0 to 25): unit id, cycle, 3 operational
# settings and 21 sensor readings; any extra empty columns are dropped
X = df.iloc[:, :26].values
print("Shape:", X.shape)

# QR decomposition
Q, R = np.linalg.qr(X)
rank = np.linalg.matrix_rank(X)

print("Rank:", rank)
print("Number of columns:", X.shape[1])
print("Redundant columns:", X.shape[1] - rank)
print("\nAbsolute diagonal of R:")
print(np.abs(np.diag(R)))
```

Output:

```
Shape: (20631, 26)
Rank: 20
Number of columns: 26
Redundant columns: 6

Absolute diagonal of R:
[8.50622114e+03 1.21149103e+04 3.14141179e-01 4.20835556e-02
 5.77042252e+03 3.97731246e-10 5.99454189e+01 6.65820338e+02
 7.72601912e+02 3.36614676e-13 1.97019588e-01 7.08089878e+01
 5.97296214e+00 2.45961341e+03 4.41360662e-14 1.67406676e+01
 4.74133747e+01 4.84852689e+00 6.59181616e+02 2.98356370e+00
 1.91411652e-16 1.39192962e+02 3.10864101e-11 6.71705315e-13
 1.48371815e+01 8.80322651e+00]
```
Results and interpretation: The output reveals that out of 26 columns, 6 are redundant. On the diagonal of R, six values lie between 10⁻¹⁰ and 10⁻¹⁶, while every other entry is larger than 10⁻².
These near-zero values indicate features that carry no information beyond what the other columns already provide linearly. By identifying them via QR decomposition, data scientists can perform feature selection on a principled numerical basis, leading to more stable and efficient machine learning pipelines.
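To go from detection to elimination, one needs the indices of the columns to keep. The sketch below is one possible way to do this using NumPy only: a Gram-Schmidt style sweep that is equivalent in spirit to building QR column by column and skipping any column whose residual is negligible. The function name `independent_columns`, the `rtol` default, and the synthetic test matrix are assumptions for illustration; in production, a column-pivoted QR (for example `scipy.linalg.qr` with `pivoting=True`) is the more robust standard choice.

```python
import numpy as np

def independent_columns(X, rtol=1e-10):
    """Return indices of columns that are not (numerically) linear
    combinations of the columns kept before them.

    rtol is an assumed threshold relative to each column's own norm.
    """
    X = np.asarray(X, dtype=float)
    basis = []  # orthonormal vectors spanning the columns kept so far
    keep = []
    for j in range(X.shape[1]):
        v = X[:, j].copy()
        original_norm = np.linalg.norm(v)
        # Two orthogonalisation passes improve numerical robustness
        for _ in range(2):
            for q in basis:
                v -= (q @ v) * q
        residual_norm = np.linalg.norm(v)
        if original_norm > 0 and residual_norm > rtol * original_norm:
            basis.append(v / residual_norm)
            keep.append(j)
    return keep

# Synthetic example: columns 2 and 4 are built from earlier columns
rng = np.random.default_rng(2)
A = rng.standard_normal((200, 3))
X = np.column_stack([A[:, 0], A[:, 1], A[:, 0] + A[:, 1], A[:, 2], 2.0 * A[:, 2]])

keep = independent_columns(X)
X_reduced = X[:, keep]
print("Columns kept:", keep)
print("Reduced shape:", X_reduced.shape)
```

The sweep keeps the first occurrence of each independent direction, so which redundant column is dropped depends on column order; reordering features by importance before the sweep is one way to control that choice.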