
Understanding Underfitting And Overfitting In Machine Learning


Machine learning models must balance the concepts of underfitting and overfitting.

In machine learning, a model’s goal is not just to fit the given data, but to generalise well to unseen data. Two common failure modes prevent this:

  • Underfitting – the model is too simple to capture the true pattern.
  • Overfitting – the model is too complex and memorises the training data.

I will demonstrate both concepts using simple, executable Python code, and a dataset generated from a known mathematical formula. This makes it easy to compare predicted values with true values.

Dataset with known ground truth

To clearly understand model behaviour, we generate data using a known formula:

y = x² + 2x + 1

This quadratic relationship allows us to check:

  • How well a model fits training data
  • How it behaves on unseen values
import numpy as np

# Inputs 1..10 and targets computed from y = x^2 + 2x + 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

Underfitting example: Linear regression

Linear regression assumes a straight-line relationship between input and output. Since our data is quadratic, this assumption is incorrect. Here’s the code:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Train
model.fit(X, y)

# Predict
print("Underfitting (Linear Regression)")
print("x | predicted | true")
print("--+-----------+------")
print("Seen (training range)")
for v in [1, 3, 5, 7, 9]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

print("\nUnseen (outside training range)")
for v in [11, 12, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

The output is:

Underfitting (Linear Regression)
x | predicted | true
--+-----------+------
Seen (training range)
 1 |     -8.00 |    4
 3 |     18.00 |   16
 5 |     44.00 |   36
 7 |     70.00 |   64
 9 |     96.00 |  100

Unseen (outside training range)
11 |    122.00 |  144
12 |    135.00 |  169
15 |    174.00 |  256

This is what is observed:

  • The model makes noticeable errors even on training data.
  • Errors increase smoothly for unseen values.
  • The model cannot capture curvature.

This is a classic case of underfitting: the model is too simple, and it fails to learn the true data pattern.
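The gap can also be quantified rather than eyeballed. The sketch below (my addition, using scikit-learn's `mean_squared_error`; the printed values follow from the exact least-squares fit on this noiseless data) compares the error on the training range with the error on the unseen points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same dataset as above: y = x^2 + 2x + 1 for x = 1..10
X = np.array([[x] for x in range(1, 11)])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

model = LinearRegression()
model.fit(X, y)

# Error on the training range
train_mse = mean_squared_error(y, model.predict(X))

# Error on unseen points outside the training range
X_new = np.array([[11], [12], [15]])
y_new = np.array([v[0]**2 + 2*v[0] + 1 for v in X_new])
test_mse = mean_squared_error(y_new, model.predict(X_new))

print(f"Training MSE: {train_mse:.2f}")   # ≈ 52.80
print(f"Unseen MSE:   {test_mse:.2f}")    # ≈ 2788.00
```

Even on the training range the error is far from zero, and it explodes outside it, which is the numeric signature of underfitting.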

Overfitting example: K-Nearest Neighbors (K = 1)

K-Nearest Neighbors (KNN) makes predictions based on nearby data points. With K = 1, the model simply copies the nearest training value. The code is:

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=1)

# Train
model.fit(X, y)

# Predict
print("Overfitting (KNN, K=1)")
print("x | predicted | true")
print("--+-----------+------")
print("Seen (training range)")
for v in [1, 3, 5, 7, 9]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

print("\nUnseen (outside training range)")
for v in [11, 12, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

The output is:

Overfitting (KNN, K=1)
x | predicted | true
--+-----------+------
Seen (training range)
 1 |      4.00 |    4
 3 |     16.00 |   16
 5 |     36.00 |   36
 7 |     64.00 |   64
 9 |    100.00 |  100

Unseen (outside training range)
11 |    121.00 |  144
12 |    121.00 |  169
15 |    121.00 |  256

It can be observed that:

  • Predictions are perfect on training data.
  • For unseen values, the model repeats the nearest known output.
  • It completely fails to follow the quadratic growth.

This demonstrates overfitting: the model memorises the training data rather than learning the underlying function.

The bias–variance interpretation is:

  • Underfitting corresponds to high bias and low variance. The model makes strong assumptions and is consistently wrong.
  • Overfitting corresponds to low bias and high variance. The model fits training data perfectly but is unstable on new data.
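The variance side of this trade-off can be made concrete by refitting both models on repeatedly re-noised copies of the data and measuring how much their predictions move. This is an illustrative sketch of mine, not code from the article; the noise level, seed, and query point 5.4 are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.array([[x] for x in range(1, 11)])
y_clean = np.array([x[0]**2 + 2*x[0] + 1 for x in X], dtype=float)

lin_preds, knn_preds = [], []
for _ in range(200):
    # A fresh noisy sample from the same quadratic ground truth
    y_noisy = y_clean + rng.normal(0, 5, size=len(X))

    lin = LinearRegression().fit(X, y_noisy)
    knn = KNeighborsRegressor(n_neighbors=1).fit(X, y_noisy)

    # Prediction at one fixed query point
    lin_preds.append(lin.predict([[5.4]])[0])
    knn_preds.append(knn.predict([[5.4]])[0])

# Linear regression averages over all points, so its prediction is stable;
# KNN(K=1) copies a single noisy neighbour, so its prediction jumps around.
print(f"Linear regression: std of predictions = {np.std(lin_preds):.2f}")
print(f"KNN (K=1):         std of predictions = {np.std(knn_preds):.2f}")
```

The high-bias linear model gives nearly the same (wrong) answer every time, while the low-bias KNN answer swings with every resampled dataset.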

The goal of machine learning is to find a balance between these two extremes, as the table below indicates.

Model               Training error   Test error   Behaviour
Linear regression   High             High         Underfit
KNN (K=1)           Zero             Very high    Overfit
Ideal model         Low              Low          Balanced
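For this dataset, the "ideal" row of the table is easy to reach, because the ground truth is quadratic: a degree-2 polynomial regression has just enough capacity. This is a sketch of mine using scikit-learn's `PolynomialFeatures` in a pipeline, not code from the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same dataset as above: y = x^2 + 2x + 1 for x = 1..10
X = np.array([[x] for x in range(1, 11)])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

# Degree-2 features match the true quadratic form exactly
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("x | predicted | true")
for v in [5, 11, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")
```

Because the model family contains the true function, predictions match the ground truth both inside and well outside the training range, which is the balance the table describes.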


To conclude, underfitting means not learning enough, while overfitting means learning too much. Good models learn the pattern, not the data points.
