Using Reinforcement Learning To Run A Cleaner Bot

Reinforcement learning can be used to program a cleaner bot to clean the floor. The example given here uses the Markov Decision Process and code in Python.

Reinforcement learning (RL) interacts with a dynamic environment and learns from the consequences of its actions. This is different from supervised learning, where the model is trained on a labelled dataset, or unsupervised learning, where the model looks at unlabelled data and tries to find hidden patterns or structures on its own.

A cleaner bot is a classic ‘toy problem’ used to show how an autonomous agent should make decisions under uncertainty. This simplified model of an automated robotic vacuum cleaner balances its functions between two conflicting goals:

  • Maximising work done (cleaning floors)
  • Minimising risk (running out of battery)

As the actions of the cleaner bot have uncertain outcomes, the Markov Decision Process (MDP) is used to describe the environment. The outcomes of the actions are as follows:

  • The reward: It gets points for every square metre cleaned.
  • The cost: Every movement consumes battery.
  • The risk: If the battery hits zero while it’s far from the dock, it dies (a large negative reward).

An example: The office vacuum cleaner

Imagine a robot vacuum cleaner is assigned to clean a wide lobby in an office. It has the following three possible states and three possible actions.

The three states are:

  • High battery: The robot is fully charged.
  • Low battery: The robot is at 15% battery and the warning light is on.
  • Charging: The robot is docked at the charging point.

The three actions are:

  • Search: Drive around and clean (high reward, high battery consumption).
  • Wait: Stay still and save energy (no reward, no battery use).
  • Recharge: Move back to the charging point (no reward, restores battery).

In this example, the robot must decide what to do in the low battery state, with the environment modelled as a Markov Decision Process. This is where RL comes into play. There are two possible scenarios.

Risk-take: The bot chooses to clean more floors (reward, say +10), but there is a 20% probability that the battery will die before it finishes the job, in which case the reward is -50. These probabilities are the Markov transition probabilities.

Conservative: Due to low battery, the bot chooses to recharge. Then it gets 0 reward for cleaning, but it is 100% certain to reach the high battery state for the next cleaning phase.

The logic of reinforcement learning

To handle the two situations the cleaner bot uses a value function. It calculates the expected future reward for both scenarios:

Scenario 1 (risk-take): Expected value = 0.8 × (+10) + 0.2 × (-50) = 8 - 10 = -2

Scenario 2 (conservative): Expected value = 0, with a guaranteed transition to the high battery state.

The expected value of the risky path (-2) is worse than the guaranteed 0 of recharging, so the RL agent learns that the optimal policy is to head to the charger as soon as the battery is low.
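This trade-off can be checked with a few lines of Python, using the figures from the two scenarios above (+10 for cleaning, -50 for a dead battery, a 20% chance of failure):

```python
# Expected value of each scenario, using the figures from the text
p_die = 0.2            # probability the battery dies mid-job
r_clean, r_dead = 10, -50

# Risk-take: search on a low battery
ev_risk = (1 - p_die) * r_clean + p_die * r_dead
# Conservative: recharge (no reward, no risk)
ev_safe = 0

print(ev_risk)   # -2.0
print(ev_safe)   # 0
```

The agent should prefer whichever action has the higher expected value, which here is recharging.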

In the real world, the same logic is used for:

Mars Rovers

Deciding whether to explore an unsafe area or stay in the sun to charge solar panels.

Drone delivery

Calculating if a drone has enough battery to reach a customer and return to base.

Industrial sensors

Deciding how often to communicate data to a server to save internal battery life.

Cleaner bot components

Component      Example in a cleaner bot
Agent          The vacuum’s software
Environment    The lobby area and charging point
State          Battery level (High/Low)
Action         Search, wait, or recharge
Reward         +10 for cleaning; -50 for a dead battery

A Markov Chain simply describes a sequence of states where the future state depends only on the present one. An MDP adds actions and rewards to that chain.

A Markov Chain-based cleaner bot mechanism decides whether to continue working or return to its charging station based on its battery state.
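As a minimal sketch of that idea (the probabilities below are illustrative, not taken from the article’s model), a two-state battery chain can be simulated with nothing more than a transition matrix:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Two states: 0 = High battery, 1 = Low battery (illustrative probabilities)
P = np.array([
    [0.8, 0.2],   # from High: stay High 80%, drop to Low 20%
    [0.3, 0.7],   # from Low: recover 30%, stay Low 70%
])

state = 0
trajectory = [state]
for _ in range(10):
    # The next state depends only on the current one (the Markov property)
    state = int(rng.choice(2, p=P[state]))
    trajectory.append(state)

print(trajectory)
```

Adding actions (which matrix to use) and rewards to this chain is exactly what turns it into an MDP.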

The concept: From chain to decision

Extending the chain into an MDP, the cleaner bot’s model has:

  • States (S): High battery, low battery.
  • Actions (A): Clean, wait, recharge.
  • Transition probabilities (P): The probability that the battery level drops after an action.
  • Rewards (R): Positive for cleaning, negative for a dead battery.

Here’s an implementation in Python.

Define the transition matrices for each action and solve for the best strategy:

import numpy as np

# Define states
# 0: High battery, 1: Low battery
states = ["High", "Low"]

# Define rewards for actions in specific states
# Format: [Action_Clean, Action_Wait, Action_Recharge]
rewards = {
    "High": [10, 2, -1],   # Best to search when high
    "Low": [-5, 1, 0]      # Searching when low risks a 'dead' battery (-5)
}

# Transition matrices (the Markov part)
# Rows = current state, cols = next state
# If we search
P_search = np.array([
    [0.8, 0.2],   # High -> High (80%), High -> Low (20%)
    [0.0, 1.0]    # Low -> High (0%), Low -> Low (100% - stayed low)
])

# If we recharge
P_recharge = np.array([
    [1.0, 0.0],   # High -> High (100%)
    [1.0, 0.0]    # Low -> High (100% - battery recovered)
])

def simulate(current_state_idx, action):
    if action == "search":
        transition_probs = P_search[current_state_idx]
        reward = rewards[states[current_state_idx]][0]
    elif action == "recharge":
        transition_probs = P_recharge[current_state_idx]
        reward = rewards[states[current_state_idx]][2]
    # Randomly pick the next state based on the Markov probabilities
    next_state_idx = np.random.choice([0, 1], p=transition_probs)
    return next_state_idx, reward

# Simple simulation
current_state = 0   # Start at "High"
total_reward = 0

print(f"Starting in state: {states[current_state]}")
for day in range(5):
    # Policy: if battery is low, recharge; else, search
    action = "recharge" if current_state == 1 else "search"
    next_state, reward = simulate(current_state, action)
    print(f"Day {day+1}: State={states[current_state]}, Action={action}, Reward={reward}")
    current_state = next_state
    total_reward += reward

print(f"Total Reward Earned: {total_reward}")

A sample run produces the following output (exact values vary between runs, since the state transitions are random):

Starting in state: High
Day 1: State=High, Action=search, Reward=10
Day 2: State=Low, Action=recharge, Reward=0
Day 3: State=High, Action=search, Reward=10
Day 4: State=Low, Action=recharge, Reward=0
Day 5: State=High, Action=search, Reward=10

Total Reward Earned: 30

Key Markov properties in this code are:

Memoryless property

The simulate() function only looks at current_state_idx. It does not care about past states, and the probability of the battery dropping from high to low remains a constant 0.2.

Transition matrix

Since the model is an MDP, the P_search and P_recharge arrays are stochastic matrices (rows sum to 1.0).
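When building such matrices by hand, a quick sanity check is to confirm that each row sums to 1 and contains no negative entries; the helper below is a hypothetical addition, not part of the original code:

```python
import numpy as np

P_search = np.array([[0.8, 0.2],
                     [0.0, 1.0]])
P_recharge = np.array([[1.0, 0.0],
                       [1.0, 0.0]])

def is_row_stochastic(P):
    # Every row of a valid transition matrix must sum to 1,
    # and every entry must be a non-negative probability
    return np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all()

print(is_row_stochastic(P_search))     # True
print(is_row_stochastic(P_recharge))   # True
```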

State space

The problem treats the robot’s states and transitions as a finite set, which keeps the model tractable.

Summary of logic and dataflow

State space

The agent looks at its current battery conditions:

states = ["High", "Low"]

Consult policy

Based on the battery level, the agent picks an action (search or recharge). The rewards for each state-action pair are defined using a dictionary, where each state (High, Low) has a list of rewards corresponding to the actions (Search, Wait, Recharge).

rewards = {
    "High": [10, 2, -1],   # High reward for searching
    "Low": [-5, 1, 0]      # Penalty (-5) if searching fails on low battery
}

The -1 discourages the robot from recharging when the battery is already full, and the -5 discourages searching when the battery is low.

Consult Markov matrix

A Markov transition matrix defines the probability of moving from State A to State B given a specific action:

P_search = np.array([
    [0.8, 0.2],   # High -> High (80%), High -> Low (20%)
    [0.0, 1.0]    # Low -> High (0%), Low -> Low (100%)
])

Search: From the high battery state, there is a 20% chance the bot will transition to state 1 (low).

Recharge: There is a 100% chance that state 1 (low) will transition back to state 0 (high).

Receive reward

The agent collects reward points based on the state-action pair. In the code, this is handled by simulate(current_state_idx, action).

Update

The next_state becomes the current_state for the next loop.

The decision logic (The policy)

In this simple example, the hardcoded policy is: ‘If battery is low, recharge; otherwise, search’.

action = "recharge" if current_state == 1 else "search"

This Python code will simulate day-wise cleaning, charging and reward calculation for five days.
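In a full RL setting, this rule would not be hardcoded but learned. The sketch below applies value iteration to the same two states, reusing the rewards and transition matrices from the simulation (the wait action is omitted for brevity, and the discount factor gamma = 0.9 is an assumed value). It converges to the same policy: search when the battery is high, recharge when it is low.

```python
import numpy as np

states = ["High", "Low"]
actions = ["search", "recharge"]
# Rewards and transitions match the simulation code
rewards = {"High": {"search": 10, "recharge": -1},
           "Low":  {"search": -5, "recharge": 0}}
P = {"search":   np.array([[0.8, 0.2], [0.0, 1.0]]),
     "recharge": np.array([[1.0, 0.0], [1.0, 0.0]])}

gamma = 0.9        # discount factor (assumed)
V = np.zeros(2)    # estimated long-term value of each state

# Value iteration: repeatedly back up the best one-step return
for _ in range(100):
    V = np.array([
        max(rewards[states[s]][a] + gamma * P[a][s] @ V for a in actions)
        for s in range(2)
    ])

# The greedy policy with respect to the converged values
policy = {states[s]: max(actions,
                         key=lambda a: rewards[states[s]][a] + gamma * P[a][s] @ V)
          for s in range(2)}
print(policy)   # {'High': 'search', 'Low': 'recharge'}
```

Note that the agent derives the "recharge when low" rule purely from the rewards and transition probabilities, without it being spelled out anywhere.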

Reinforcement learning is a branch of machine learning in which an agent acts based on the state of its environment and learns from the rewards its actions produce. Unlike supervised and unsupervised learning, which are driven by fixed datasets in a known domain, RL learns through interaction and can function in an uncertain environment.
