
Understanding Underfitting And Overfitting In Machine Learning


Machine learning models must balance the concepts of underfitting and overfitting.

In machine learning, a model’s goal is not just to fit the given data, but to generalise well to unseen data. Two common failure modes prevent this:

  • Underfitting – the model is too simple to capture the true pattern.
  • Overfitting – the model is too complex and memorises the training data.

I will demonstrate both concepts using simple, executable Python code, and a dataset generated from a known mathematical formula. This makes it easy to compare predicted values with true values.

Dataset with known ground truth

To clearly understand model behaviour, we generate data using a known formula:

y = x² + 2x + 1

This quadratic relationship allows us to check:

  • How well a model fits training data
  • How it behaves on unseen values
import numpy as np

# Inputs 1..10 and targets computed from y = x^2 + 2x + 1
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

Underfitting example: Linear regression

Linear regression assumes a straight-line relationship between input and output. Since our data is quadratic, this assumption is incorrect. Here’s the code:

from sklearn.linear_model import LinearRegression

model = LinearRegression()

# Train
model.fit(X, y)

# Predict
print("Underfitting (Linear Regression)")
print("x | predicted | true")
print("--+-----------+------")
print("Seen (training range)")
for v in [1, 3, 5, 7, 9]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

print("\nUnseen (outside training range)")
for v in [11, 12, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

The output is:

Underfitting (Linear Regression)
x | predicted | true
--+-----------+------
Seen (training range)
 1 |     -8.00 |    4
 3 |     18.00 |   16
 5 |     44.00 |   36
 7 |     70.00 |   64
 9 |     96.00 |  100

Unseen (outside training range)
11 |    122.00 |  144
12 |    135.00 |  169
15 |    174.00 |  256

This is what is observed:

  • The model makes noticeable errors even on training data.
  • Errors increase smoothly for unseen values.
  • The model cannot capture curvature.

This is a classic case of underfitting: the model is too simple, and it fails to learn the true data pattern.
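The gap can also be quantified rather than eyeballed. The sketch below (my addition, using scikit-learn's `mean_squared_error`; the printed values follow from the exact least-squares fit on this noiseless data) compares the error on the training range with the error on the unseen points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Same dataset as above: y = x^2 + 2x + 1 for x = 1..10
X = np.array([[x] for x in range(1, 11)])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

model = LinearRegression()
model.fit(X, y)

# Error on the training range
train_mse = mean_squared_error(y, model.predict(X))

# Error on unseen points outside the training range
X_new = np.array([[11], [12], [15]])
y_new = np.array([v[0]**2 + 2*v[0] + 1 for v in X_new])
test_mse = mean_squared_error(y_new, model.predict(X_new))

print(f"Training MSE: {train_mse:.2f}")   # ≈ 52.80
print(f"Unseen MSE:   {test_mse:.2f}")    # ≈ 2788.00
```

Even on the training range the error is far from zero, and it explodes outside it, which is the numeric signature of underfitting.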

Overfitting example: K-Nearest Neighbors (K = 1)

K-Nearest Neighbors (KNN) makes predictions based on nearby data points. With K = 1, the model simply copies the nearest training value. The code is:

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=1)

# Train
model.fit(X, y)

# Predict
print("Overfitting (KNN, K=1)")
print("x | predicted | true")
print("--+-----------+------")
print("Seen (training range)")
for v in [1, 3, 5, 7, 9]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

print("\nUnseen (outside training range)")
for v in [11, 12, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")

The output is:

Overfitting (KNN, K=1)
x | predicted | true
--+-----------+------
Seen (training range)
 1 |      4.00 |    4
 3 |     16.00 |   16
 5 |     36.00 |   36
 7 |     64.00 |   64
 9 |    100.00 |  100

Unseen (outside training range)
11 |    121.00 |  144
12 |    121.00 |  169
15 |    121.00 |  256

It can be observed that:

  • Predictions are perfect on training data.
  • For unseen values, the model repeats the nearest known output.
  • It completely fails to follow the quadratic growth.

This demonstrates overfitting: the model memorises the training data rather than learning the underlying function.

The bias–variance interpretation is:

  • Underfitting corresponds to high bias and low variance. The model makes strong assumptions and is consistently wrong.
  • Overfitting corresponds to low bias and high variance. The model fits training data perfectly but is unstable on new data.
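The variance side of this trade-off can be made concrete by refitting both models on repeatedly re-noised copies of the data and measuring how much their predictions move. This is an illustrative sketch of mine, not code from the article; the noise level, seed, and query point 5.4 are arbitrary choices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.array([[x] for x in range(1, 11)])
y_clean = np.array([x[0]**2 + 2*x[0] + 1 for x in X], dtype=float)

lin_preds, knn_preds = [], []
for _ in range(200):
    # A fresh noisy sample from the same quadratic ground truth
    y_noisy = y_clean + rng.normal(0, 5, size=len(X))

    lin = LinearRegression().fit(X, y_noisy)
    knn = KNeighborsRegressor(n_neighbors=1).fit(X, y_noisy)

    # Prediction at one fixed query point
    lin_preds.append(lin.predict([[5.4]])[0])
    knn_preds.append(knn.predict([[5.4]])[0])

# Linear regression averages over all points, so its prediction is stable;
# KNN(K=1) copies a single noisy neighbour, so its prediction jumps around.
print(f"Linear regression: std of predictions = {np.std(lin_preds):.2f}")
print(f"KNN (K=1):         std of predictions = {np.std(knn_preds):.2f}")
```

The high-bias linear model gives nearly the same (wrong) answer every time, while the low-bias KNN answer swings with every resampled dataset.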

The goal of machine learning is to find a balance between these two extremes, as the table below indicates.

Model               Training error   Test error   Behaviour
Linear regression   High             High         Underfit
KNN (K=1)           Zero             Very high    Overfit
Ideal model         Low              Low          Balanced
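For this dataset, the "ideal" row of the table is easy to reach, because the ground truth is quadratic: a degree-2 polynomial regression has just enough capacity. This is a sketch of mine using scikit-learn's `PolynomialFeatures` in a pipeline, not code from the article:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same dataset as above: y = x^2 + 2x + 1 for x = 1..10
X = np.array([[x] for x in range(1, 11)])
y = np.array([x[0]**2 + 2*x[0] + 1 for x in X])

# Degree-2 features match the true quadratic form exactly
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

print("x | predicted | true")
for v in [5, 11, 15]:
    y_pred = model.predict([[v]])[0]
    y_true = v*v + 2*v + 1
    print(f"{v:2d} | {y_pred:9.2f} | {y_true:4d}")
```

Because the model family contains the true function, predictions match the ground truth both inside and well outside the training range, which is the balance the table describes.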


To conclude, underfitting means not learning enough, while overfitting means learning too much. Good models learn the pattern, not the data points.
