Detecting Redundant and Near-Dependent Features in Real Datasets


QR decomposition, a classical linear algebra technique, can be used to detect and eliminate linearly dependent and near-dependent columns in the feature matrix.

In machine learning, real-world datasets often suffer from multicollinearity—a phenomenon where feature vectors are linearly dependent or nearly dependent. This redundancy can lead to several critical issues:

Numerical instability: Ill-conditioned matrices make matrix inversion unreliable.

High variance: Unstable regression coefficients that are highly sensitive to small changes in data.

Overfitting: Reduced model generalisation due to noise-sensitive features.

While traditional methods like correlation matrices identify pairwise relationships, they often fail to detect multivariate dependence. This article proposes a systematic approach using QR decomposition, a classical linear algebra technique, to detect and eliminate linearly dependent and near-dependent columns in the feature matrix.
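To make the limitation of pairwise correlation concrete, here is a minimal sketch on synthetic data (the random data, the seed and the variable names are assumptions for illustration, not part of the original study). A fifth feature is built as the exact sum of four independent features, so the matrix is rank deficient, yet no pairwise correlation comes close to 1.

```python
import numpy as np

# Synthetic data (assumed for illustration): four independent features
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 4))

# Fifth feature is an exact linear combination of the other four
total = base.sum(axis=1, keepdims=True)
X = np.hstack([base, total])

# Largest absolute pairwise correlation, ignoring the diagonal
corr = np.corrcoef(X, rowvar=False)
off_diagonal = ~np.eye(X.shape[1], dtype=bool)
max_corr = np.abs(corr[off_diagonal]).max()

# Numerical rank of the feature matrix
rank = np.linalg.matrix_rank(X)

print("Largest pairwise |correlation|:", round(float(max_corr), 2))
print("Columns:", X.shape[1], "Rank:", rank)
```

A correlation screen with a typical cutoff such as 0.9 would pass all five columns here, even though one of them carries no independent information.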

Mathematical foundation

Let X ∈ ℝ^(m×n) be a feature matrix, where:

  • m = number of samples
  • n = number of features

We now introduce QR decomposition as a principled alternative. For a given matrix X, it can be factored as

X = QR

where:

  • Q is an m×m orthogonal matrix, and
  • R is an m×n upper triangular matrix.

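The diagonal elements of R reveal linear dependencies among the columns of X. The absolute value of the k-th diagonal entry is the distance from column k of X to the span of the columns before it. If a diagonal element is 0 or near 0 relative to the scale of the data, that column is a linear combination (or nearly one) of the preceding columns.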
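The link between the diagonal of R and column dependence can be checked numerically. The sketch below uses an assumed random test matrix and compares each |r_kk| with the least-squares distance from column k to the span of the earlier columns. Note that np.linalg.qr returns the reduced factorisation by default (Q is m×n and R is n×n when m ≥ n), which has the same diagonal as the full form described above.

```python
import numpy as np

# Random test matrix (assumed for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

# Reduced QR: R is 4x4 upper triangular
_, R = np.linalg.qr(X)
diag_abs = np.abs(np.diag(R))

# Distance from each column to the span of the preceding columns
distances = []
for k in range(X.shape[1]):
    col = X[:, k]
    if k == 0:
        residual = col  # nothing precedes the first column
    else:
        coef, *_ = np.linalg.lstsq(X[:, :k], col, rcond=None)
        residual = col - X[:, :k] @ coef
    distances.append(np.linalg.norm(residual))
distances = np.array(distances)

print("abs(diag(R)):", diag_abs)
print("distances:   ", distances)
```

The two printed vectors agree up to rounding, which is why a near-zero diagonal entry flags a column that is almost reproduced by the ones before it.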

Practical implementation in Python

Case A: Full rank matrix

For a full-rank matrix, all the diagonal entries of R are non-zero. Let's take an example:

import numpy as np

# Full rank 3x3 matrix
X_full = np.array([
    [1, 2, 3],
    [0, 1, 4],
    [5, 6, 0]
])

# Factor X_full into Q_full (orthogonal) and R_full (upper triangular)
Q_full, R_full = np.linalg.qr(X_full)

print("Full Rank Matrix R:")
print(R_full)
print("Rank:", np.linalg.matrix_rank(X_full))

Full Rank Matrix R:
[[-5.09 -6.27 -0.58]
 [ 0.   -1.27 -4.96]
 [ 0.    0.    0.15]]
Rank: 3

Observation: The rank is 3, and all diagonal values are significantly different from zero.
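As a sanity check on this factorisation, the following sketch verifies the defining properties on the same 3×3 matrix. The variable names are illustrative. For a square matrix, the product of the absolute diagonal entries of R equals |det X|, so a diagonal that stays away from zero goes hand in hand with a non-singular matrix.

```python
import numpy as np

X_full = np.array([
    [1, 2, 3],
    [0, 1, 4],
    [5, 6, 0]
], dtype=float)

Q, R = np.linalg.qr(X_full)

# Q has orthonormal columns, R is upper triangular, and Q @ R rebuilds X
q_is_orthonormal = np.allclose(Q.T @ Q, np.eye(3))
r_is_upper = np.allclose(R, np.triu(R))
reconstructs = np.allclose(Q @ R, X_full)

# For a square matrix, |det X| equals the product of |r_kk|
diag_product = np.prod(np.abs(np.diag(R)))
abs_det = abs(np.linalg.det(X_full))

print(q_is_orthonormal, r_is_upper, reconstructs)
print("prod |r_kk|:", diag_product, "|det X|:", abs_det)
```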

Case B: Rank deficient matrix

import numpy as np

# Rank deficient 3x3 matrix: every row is a multiple of [1, 2, 3]
X_def = np.array([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9]
])

Q_def, R_def = np.linalg.qr(X_def)

print("\nRank Deficient Matrix R:")
print(R_def)
print("Rank:", np.linalg.matrix_rank(X_def))

Rank Deficient Matrix R:
[[-3.74      -7.48      -11.22    ]
 [ 0.         1.98e-15   3.97e-15 ]
 [ 0.         0.        -3.94e-31 ]]
Rank: 1

Observation: Two of the diagonal elements are almost 0 (1.98×10⁻¹⁵ and -3.94×10⁻³¹), so the rank is 1.

r11 ≈ -3.74 ≠ 0
r22 ≈ 0
r33 ≈ 0

In floating-point arithmetic, exact zero rarely appears. Diagonal values that are tiny relative to the largest diagonal entry, on the order of machine epsilon (about 2.2×10⁻¹⁶ for 64-bit floats) times the scale of the matrix, are treated as numerical zeros.

Since only the first diagonal entry of R is significantly non-zero, the matrix has rank 1. Only one column of X is linearly independent: the second and third columns are scalar multiples (2× and 3×) of the first.
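The idea of a numerical zero can be made explicit with a relative threshold. In the sketch below, the cutoff rtol = 1e-10 is an assumed, deliberately loose value chosen for illustration; it is not a NumPy default and should be tuned to the data.

```python
import numpy as np

X_def = np.array([
    [1, 2, 3],
    [2, 4, 6],
    [3, 6, 9]
], dtype=float)

_, R = np.linalg.qr(X_def)
diag_abs = np.abs(np.diag(R))

# Machine epsilon for float64, shown for reference
eps = np.finfo(X_def.dtype).eps

# Assumed relative cutoff: entries below rtol * largest |r_kk| count as zero
rtol = 1e-10
threshold = rtol * diag_abs.max()
is_numerical_zero = diag_abs <= threshold
numerical_rank = int((~is_numerical_zero).sum())

print("machine epsilon:", eps)
print("threshold:", threshold)
print("numerical zeros on diagonal:", is_numerical_zero)
print("numerical rank:", numerical_rank)
```

Scaling the threshold by the largest diagonal entry keeps the test meaningful when features are measured in very different units.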

Case C: Real-world application: NASA Turbofan dataset

To demonstrate this at scale, we apply QR decomposition to the NASA C-MAPSS Turbofan Engine Degradation Dataset.

The training file used here, train_FD001.txt, can be downloaded from Kaggle:
https://www.kaggle.com/datasets/bishals098/nasa-turbofan-engine-degradation-simulation?resource=download&select=train_FD001.txt

import numpy as np
import pandas as pd

# Load NASA dataset (space-delimited text file, no header row)
df = pd.read_csv("train_FD001.txt", sep=" ", header=None)

# Keep the first 26 columns (0 to 25): unit id, cycle,
# 3 operational settings and 21 sensor readings
X = df.iloc[:, :26].values

print("Shape:", X.shape)

# QR decomposition
Q, R = np.linalg.qr(X)

# Numerical rank (computed by NumPy from the singular values)
rank = np.linalg.matrix_rank(X)

print("Rank:", rank)
print("Number of columns:", X.shape[1])
print("Redundant columns:", X.shape[1] - rank)

print("\nAbsolute diagonal of R:")
print(np.abs(np.diag(R)))

Shape: (20631, 26)
Rank: 20
Number of columns: 26
Redundant columns: 6

Absolute diagonal of R:
[8.50622114e+03 1.21149103e+04 3.14141179e-01 4.20835556e-02
5.77042252e+03 3.97731246e-10 5.99454189e+01 6.65820338e+02
7.72601912e+02 3.36614676e-13 1.97019588e-01 7.08089878e+01
5.97296214e+00 2.45961341e+03 4.41360662e-14 1.67406676e+01
4.74133747e+01 4.84852689e+00 6.59181616e+02 2.98356370e+00
1.91411652e-16 1.39192962e+02 3.10864101e-11 6.71705315e-13
1.48371815e+01 8.80322651e+00]

Results and interpretation: The output reveals that 6 of the 26 columns are redundant. Six diagonal entries of R fall in the range 10⁻¹⁰ to 10⁻¹⁶, many orders of magnitude below the rest.

These near-zero values mark columns that are, to numerical precision, linear combinations of the columns before them, so they give a linear model no new information. Identifying them via QR decomposition puts feature selection on a principled linear-algebra footing and supports more stable and efficient machine learning pipelines.
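The selection step itself can be sketched as a small helper. The function name select_independent_columns, the rtol value and the synthetic matrix below are all assumptions for illustration; the synthetic data stands in for a dataset with constant and derived columns and is not the NASA file.

```python
import numpy as np

def select_independent_columns(X, rtol=1e-10):
    """Return indices of columns whose |r_kk| is not numerically zero.

    Hypothetical helper: rtol is an assumed relative cutoff, not a
    library default.
    """
    _, R = np.linalg.qr(X)
    diag_abs = np.abs(np.diag(R))
    keep = diag_abs > rtol * diag_abs.max()
    return np.flatnonzero(keep).tolist()

# Synthetic stand-in data (assumed): 200 samples, 7 features
rng = np.random.default_rng(0)
a, b, c, d = rng.normal(size=(4, 200))
const_1 = np.full(200, 5.0)    # constant reading
const_2 = np.full(200, 10.0)   # second constant, a multiple of the first
X = np.column_stack([a, b, const_1, c, const_2, a + 2 * b, d])

kept = select_independent_columns(X)
X_reduced = X[:, kept]

print("Kept columns:", kept)
print("Rank before:", np.linalg.matrix_rank(X))
print("Rank after: ", np.linalg.matrix_rank(X_reduced))
```

Because the unpivoted factorisation processes columns from left to right, the first column of any dependent group is kept and the later ones are dropped, so the column order determines which features survive. It is worth putting the preferred features first.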
