Recommender Systems: Using a Live Data Set for Implementation

This two-part series of articles explains and demonstrates how to implement a recommender system for an online retail store using Python. The first part of the series, published in the May 2022 issue of Open Source For You, focused on the theory behind such a system. This second part explains how to implement that system.

The data set used for this work is publicly accessible. Its details are given in the box below.

Solution methodology

So how does one create a system that recommends a specific number of products to users on an e-commerce website, based on their previous purchase history? To answer this, we employed the skip-gram modelling technique of the word2vec algorithm that we saw in Part 1. However, we took a customer-focused approach to it.

Data set information

Source: Dr Daqing Chen, Director: Public Analytics Group (chend ‘@’), School of Engineering, London South Bank University, London.

This is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

Attributes used in the data set

InvoiceNo: Invoice number. Nominal — a 6-digit integral number uniquely assigned to each transaction. If this code starts with letter ‘c’, it indicates a cancellation.
StockCode: Product (item) code. Nominal — a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantities of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric — the day and time when each transaction was generated.
UnitPrice: Unit price. Numeric — product price per unit in sterling pound.
CustomerID: Customer number. Nominal — a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal — the name of the country where each customer resides.
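
As a minimal sketch of how the cancellation rule in the attribute list can be checked in code, the invoice number can be tested for a leading ‘C’ in either case. The helper name is_cancellation and the sample invoice numbers are hypothetical and only serve as an illustration.

```python
def is_cancellation(invoice_no):
    """Return True if an invoice number marks a cancelled transaction.

    Assumption: cancellations are flagged by a leading 'c' or 'C',
    as described in the attribute list.
    """
    return str(invoice_no).strip().upper().startswith("C")


# Invoice numbers may arrive as integers or as strings (made-up samples).
sample_invoices = [536365, "C536379", "c536383", " 536390 "]
flags = [is_cancellation(x) for x in sample_invoices]
print(flags)
```

Casting to a string first keeps the check safe when a spreadsheet reader returns purely numeric invoice numbers as integers.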

The suggested system incorporates consumer purchasing behaviour as a feature. Word2vec is used to create a vector representation of the products. These embeddings are low-dimensional (typically 50 to 500 dimensions) and dense (not sparse; most values are non-zero). As a result, if a user is browsing a product online, we can immediately suggest related items based on the similarity score between product vectors.
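
To make the idea of a vector similarity score concrete, here is a small sketch that computes cosine similarity, a common choice for comparing embeddings, between toy product vectors. The three-dimensional vectors and product names are invented for illustration; real word2vec vectors have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (invented): two jewellery items and one unrelated item.
bracelet = [0.9, 0.1, 0.2]
earrings = [0.8, 0.2, 0.1]
lunch_bag = [0.1, 0.9, 0.3]

sim_jewellery = cosine_similarity(bracelet, earrings)
sim_mixed = cosine_similarity(bracelet, lunch_bag)
print(sim_jewellery > sim_mixed)
```

Products bought in similar contexts end up with vectors that point in similar directions, so their score is closer to 1.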

After gathering the necessary data and performing a train-test split on it, we built the equivalents of the ‘words’ and ‘sentences’ of a standard word2vec model (product codes and customer purchase histories, respectively), which were then fed to the model for training.
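
The following sketch illustrates that analogy under simple assumptions: one customer's purchase history plays the role of a sentence, each product code is a word, and skip-gram training looks at (target, context) pairs taken from a window around each product. The function name skipgram_pairs and the stock codes are illustrative only; the actual pair generation happens inside the word2vec library.

```python
def skipgram_pairs(sequence, window=2):
    """Return (target, context) pairs from one purchase sequence."""
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a product is not its own context
                pairs.append((target, sequence[j]))
    return pairs


# One customer's history = one "sentence"; product codes = "words".
history = ["85123A", "71053", "84406B"]
pairs = skipgram_pairs(history, window=1)
print(pairs)
```

With a window of 1, each product is paired only with its immediate neighbours in the purchase sequence; a larger window links products bought further apart.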

We will explain the implementation by dividing it into three parts:

  • Data pre-processing
  • Exploratory data analysis
  • Model building

Data pre-processing

In this segment, we first take a quick look at the data set we are going to evaluate, to see how it can be made cleaner and more useful for further analysis.

Let’s get started by importing the necessary libraries and the data set we’ll be working with. The ‘InvoiceDate’ attribute is not in a convenient format, so we convert it to a datetime type. The strings also contain stray punctuation and excess white space that we would like to remove, and we convert them to upper case for the next steps.

We will install openpyxl using the command given below:

!pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
|████████████████████████████████| 242 kB 4.2 MB/s            
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9

WARNING: Running pip as the ‘root’ user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead.

Next, we will import the necessary libraries and the data set:

import numpy as np, pandas as pd, re, scipy as sp, scipy.stats

#Importing Dataset
pd.options.mode.chained_assignment = None
datasetURL = ''  # set this to the location of the data set's Excel file
df1 = pd.read_excel(datasetURL)

Figure 1 shows the output of the above command.

Figure 1: Imported data set

#Formatting Date/Time
df1['InvoiceDate'] = pd.to_datetime(df1['InvoiceDate'], format = '%m/%d/%Y %H:%M')

Checking the strings

# regex=False treats '.' as a literal full stop rather than "any character"
df1['Description'] = df1['Description'].str.replace('.', '', regex=False).str.upper().str.strip()
df1['Description'] = df1['Description'].replace(r'\s+', ' ', regex = True)
df1['InvoiceNo'] = df1['InvoiceNo'].astype(str).str.upper()
# StockCode mixes numbers and strings, so cast to str before upper-casing
df1['StockCode'] = df1['StockCode'].astype(str).str.upper()
df1['Country'] = df1['Country'].str.upper()

Intuitive as the variable (column) names may sound, let’s go a step further and spell out what each one means.

InvoiceNo (invoice_num): A number assigned to each transaction

StockCode (stock_code): Product code

Description (description): Product name

Quantity (quantity): Number of products purchased for each transaction

InvoiceDate (invoice_date): Time stamp for each transaction

UnitPrice (unit_price): Product price per unit

CustomerID (cust_id): Unique identifier for each customer

Country (country): Country name

Note: Product price per unit is assumed to follow the same currency throughout our analysis.

A descriptive summary of the data reveals two things:

  • Quantity has negative values
  • Unit price has zero values (are these free items?)

Figure 2: Output after string formatting
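
A quick way to confirm such observations is to count the offending rows directly. The sketch below uses a tiny, made-up DataFrame with the same column names as the data set; the values are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Toy data invented for illustration; column names follow the data set.
toy = pd.DataFrame({
    "InvoiceNo": ["536365", "C536379", "536414", "536544"],
    "Quantity": [6, -1, 56, 1],
    "UnitPrice": [2.55, 27.50, 0.0, 2.51],
})

# Count rows with a negative quantity and rows with a zero unit price.
negative_quantity = int((toy["Quantity"] < 0).sum())
zero_price = int((toy["UnitPrice"] == 0).sum())

# describe() gives the min/max summary in which such values stand out.
print(toy[["Quantity", "UnitPrice"]].describe())
print(negative_quantity, zero_price)
```

On the real data, the same two counts (computed on df1) indicate how many rows the clean-up step will have to deal with.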

As the summary of the data set shows, there are some odd and irregular values in the ‘UnitPrice’ and ‘Quantity’ columns, which we will locate and eliminate so that they do not distort our study. Some of the entries in the ‘StockCode’ variable are not actually products, but rather postage costs, bank fees or other transactions that we do not need in our data. Some transactions are returns; in those, the ‘InvoiceNo’ begins with the character ‘C’ and the ‘Quantity’ should be negative. However, the data also contains purchases with a negative ‘Quantity’ and returns with a positive one, which we need to remove. There are also rows in which the ‘UnitPrice’ is zero, and many missing or inaccurate values in the ‘Description’ attribute. To resolve the latter, we check the ‘Description’ against the product ‘StockCode’ and fill each missing value with the ‘Description’ available from other transactions with the same ‘StockCode’; transactions for which no description can be found are deleted.

Next, we will look for the missing and incorrect values:

# drop returns with a positive quantity and purchases with a negative one
df1.drop(df1[(df1.Quantity>0) & (df1.InvoiceNo.str.contains('C') == True)].index, inplace = True)
df1.drop(df1[(df1.Quantity<0) & (df1.InvoiceNo.str.contains('C') == False)].index, inplace = True)
# drop rows with a questionable description or a zero unit price
df1.drop(df1[df1.Description.str.contains('?',regex=False) == True].index, inplace = True)
df1.drop(df1[df1.UnitPrice == 0].index, inplace = True)

# fill missing descriptions from other rows with the same StockCode;
# drop the row when no description is known for that StockCode
for index,value in df1.StockCode[df1.Description.isna()].items():
    known = df1.Description[df1.StockCode == value].dropna()
    if not known.empty:
        df1.loc[index, 'Description'] = known.mode()[0]
    else:
        df1.drop(index = index, inplace = True)

df1['Description'] = df1['Description'].astype(str)
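
The row-by-row loop can be slow on several hundred thousand rows. As a possible alternative (a sketch on invented toy data, not the method used for the results in this article), the same fill-then-drop logic can be expressed with a lookup table of the most frequent description per stock code:

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "StockCode": ["A1", "A1", "A1", "B2", "B2", "C3"],
    "Description": ["RED MUG", None, "RED MUG", "BLUE BAG", None, None],
})

# Most frequent known description for each stock code.
known = toy.dropna(subset=["Description"])
lookup = known.groupby("StockCode")["Description"].agg(lambda s: s.mode().iloc[0])

# Fill missing descriptions via the lookup; codes without any known
# description stay missing and are dropped afterwards.
toy["Description"] = toy["Description"].fillna(toy["StockCode"].map(lookup))
toy = toy.dropna(subset=["Description"]).reset_index(drop=True)
print(toy)
```

Here the two gaps for ‘A1’ and ‘B2’ are filled, while the ‘C3’ row is removed because no other transaction supplies a description for it.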

#Adding Desired Features
df1['FinalPrice'] = df1['Quantity']*df1['UnitPrice']
df1['InvoiceMonth'] = df1['InvoiceDate'].apply(lambda x: x.strftime('%B'))
df1['Day of week'] = df1['InvoiceDate'].dt.day_name()

After these steps, the shape of the data set (df1.shape) is:

(406789, 11)

Exploratory data analysis

In this section, we will visualise the data to get a clearer picture of it and gain some insights.

We take the cleaned data and index it by date and time, so that it can be treated as a time series. We start with the best-selling items worldwide. Two metrics show how much each product has contributed: the total quantity sold and the total sales value. The two plots below show the top 20 products by each of these metrics.

#importing necessary libraries and the cleaned dataset
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
%matplotlib inline
# index the cleaned data by invoice date (already a datetime column)
Cleaned_Data = df1.set_index('InvoiceDate')

#top 20 products by quantity and finalprice
TopTwenty = Cleaned_Data.groupby('Description')['Quantity'].agg('sum').sort_values(ascending=False)[0:20]

Top20Price = Cleaned_Data.groupby('Description')['FinalPrice'].agg('sum').sort_values(ascending=False)[0:20]
#creating the subplot
fig,axs = plt.subplots(nrows=2, ncols=1, figsize = (12,12))
plt.subplots_adjust(hspace = 0.3)
fig.suptitle('Best Selling Products by Amount and Value', fontsize=15, x = 0.4, y = 0.98)
sns.barplot(x=TopTwenty.values, y=TopTwenty.index, ax= axs[0]).set(xlabel='Total amount of sales')
axs[0].set_title('By Amount', size=12, fontweight = 'bold')
sns.barplot(x=Top20Price.values, y=Top20Price.index, ax= axs[1]).set(xlabel='Total value of sales')
axs[1].set_title('By Value', size=12, fontweight = 'bold')

We can see the output in Figure 3.

Figure 3: Top 20 products by quantity and price

Now let us find out which items were returned the most, and which customers (with their countries) returned the most:

ReturnedItems = Cleaned_Data[Cleaned_Data.Quantity<0].groupby('Description')['Quantity'].sum()
ReturnedItems = ReturnedItems.abs().sort_values(ascending=False)[0:10]
ReturnCust = Cleaned_Data[Cleaned_Data.Quantity<0].groupby(['CustomerID','Country'])['Quantity'].sum()
ReturnCust = ReturnCust.abs().sort_values(ascending=False)[0:10]

#creating the subplot
fig, [ax1, ax2] = plt.subplots(nrows=2, ncols=1, figsize=(12,10))
ReturnedItems.sort_values().plot(kind='barh', ax=ax1).set_title('Most Returned Items', fontsize=15)
ReturnCust.sort_values().plot(kind='barh', ax=ax2).set_title('Customers with Most Returns', fontsize=15)

Figure 4 shows the graph plotted for the above criteria.

Figure 4: Most returned items and the customers with the corresponding country

Since we have recorded the day of the week on which each item was sold, we can look at the sales value for each day of the week. The resulting pie chart is shown in Figure 5.

#creating the pie chart
Cleaned_Data.groupby('Day of week')['FinalPrice'].sum().plot(kind = 'pie', autopct = '%.3f%%', figsize=(7,7)).set(ylabel='')
plt.title('Percentages of Sales Value by Day of Week', fontsize = 17)

Figure 5: Sales value by day of the week

Model building

Let’s first save the cleaned data set to an Excel file, so that we can reload it whenever we need it.

#Creating an Excel File for the Cleaned Data
# NOTE: the file name below is a placeholder (assumption); use your own
df1.to_excel('cleaned_data.xlsx', index=False)

#Loading the excel file cleaned data into a dataframe
final_data = pd.read_excel('cleaned_data.xlsx')

# convert the StockCode to string datatype
final_data['StockCode']= final_data['StockCode'].astype(str)

# List of unique customers
customers = final_data["CustomerID"].unique().tolist()
len(customers)

The output will be:

4371

Our data set contains 4,371 customers. We will extract the purchasing history of each of these customers. In other words, we will have 4,371 different purchase sequences.

To construct the word2vec embeddings, we’ll use the data from 90 per cent of our customers. The remaining 10 per cent will be used for validation.

Let’s split the data set.

Train – test data preparation

import random

# shuffle customer ID's
random.shuffle(customers)

# extract 90% of customer ID's
customers_train = [customers[i] for i in range(round(0.9*len(customers)))]

# split data into train and validation set
train_df = final_data[final_data['CustomerID'].isin(customers_train)]
validation_df = final_data[~final_data['CustomerID'].isin(customers_train)]
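
Because the split depends on a random shuffle, the train and validation sets change from run to run unless the shuffle is seeded. The sketch below shows one way to make a customer-level split reproducible; the function name split_customers, the seed value and the toy IDs are assumptions for illustration.

```python
import random


def split_customers(customer_ids, train_fraction=0.9, seed=14):
    """Shuffle customer IDs reproducibly and split them into two lists."""
    ids = list(customer_ids)   # copy, so the caller's list is untouched
    rng = random.Random(seed)  # local generator with a fixed seed
    rng.shuffle(ids)
    cut = round(train_fraction * len(ids))
    return ids[:cut], ids[cut:]


# Toy customer IDs (invented).
toy_customers = list(range(1000, 1020))
train_ids, val_ids = split_customers(toy_customers)
print(len(train_ids), len(val_ids))
```

Splitting by customer rather than by row keeps each purchase history entirely in one set, which is what the sequence-building step needs.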

Now, for both the train and validation sets, we will build sequences of purchases made by customers in the data set.

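from tqdm import tqdm

# purchase sequences (one list of StockCodes per customer) for the training set
purchases_train = []

for i in tqdm(customers_train):
    temp = train_df[train_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_train.append(temp)

# purchase sequences for the validation set
purchases_val = []

for i in tqdm(validation_df['CustomerID'].unique()):
    temp = validation_df[validation_df["CustomerID"] == i]["StockCode"].tolist()
    purchases_val.append(temp)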
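
Filtering the whole DataFrame once per customer works, but it scans the data thousands of times. A possible alternative, sketched here on invented toy data, is to let pandas group the rows by customer in a single pass:

```python
import pandas as pd

# Toy transactions invented for illustration.
toy = pd.DataFrame({
    "CustomerID": [17850, 13047, 17850, 13047, 17850],
    "StockCode": ["85123A", "84879", "71053", "22745", "84406B"],
})

# One list of stock codes per customer; sort=False keeps customers in
# order of first appearance, and row order is preserved within a group.
sequences = toy.groupby("CustomerID", sort=False)["StockCode"].apply(list)
purchases = sequences.tolist()
print(purchases)
```

Note that groupby silently skips rows whose CustomerID is missing, so such rows should be handled beforehand if they matter.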

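With the purchase sequences ready, we train the word2vec model:

from gensim.models import Word2Vec

# sg = 1 selects the skip-gram architecture
model = Word2Vec(window = 10, sg = 1, hs = 0,
                 negative = 10, # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed = 14)

model.build_vocab(purchases_train, progress_per=200)

model.train(purchases_train, total_examples = model.corpus_count, 
            epochs=10, report_delay=1)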

Recommending products

To easily match a product’s description to its ID, let’s create a product-ID to product-description dictionary.

products = train_df[["StockCode", "Description"]]

# remove duplicates
products = products.drop_duplicates(subset='StockCode', keep="last")

# create product-ID and product-description dictionary
products_dict = products.groupby('StockCode')['Description'].apply(list).to_dict()

Figure 6: Recommendations

The output will be:


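Let’s create a function that takes a product’s vector (v) as input and returns the top six similar products.

def get_similar_item(v, n = 6):
    # most similar products for the input vector; the first hit is skipped
    # because it is the input product itself when v is a product's own vector
    ms = model.wv.similar_by_vector(v, topn= n+1)[1:]

    # collect the name and similarity score of each similar product
    my_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        my_ms.append(pair)

    return my_ms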

Let’s try out our function by passing the vector of the product ‘90019A’ (‘SILVER M.O.P ORBIT BRACELET’):

get_similar_item(model.wv['90019A'])

The output will be:

[(‘TURQUOISE GLASS TASSLE BAG CHARM’, 0.9264411926269531),
 (‘GREEN ENAMEL FLOWER RING’, 0.9179285764694214),
 (‘RED GLASS TASSLE BAG CHARM’, 0.9122185707092285),
 (‘BLUE GLASS CHUNKY CHARM BRACELET’, 0.9082530736923218),
 (‘AMBER DROP EARRINGS W LONG BEADS’, 0.9068467617034912),
 (‘GREEN MURANO TWIST BRACELET’, 0.9062117338180542)]

The above results are quite relevant and match the input product well, but they are based on a single product’s vector. Let’s now build a system that recommends products based on a consumer’s previous purchases. We’ll do this by averaging the vectors of all the products the consumer has purchased so far, and using the resulting vector to find similar products. The function below takes a list of product IDs and returns a 100-dimensional vector, which is the mean of the vectors of the products in the input list.

def my_aggr_vec(products):
    p_vec = []
    for i in products:
        try:
            p_vec.append(model.wv[i])
        except KeyError:
            # skip products that are not in the model's vocabulary
            continue

    return np.mean(p_vec, axis=0)


The output will be:

[(‘YELLOW DRAGONFLY HELICOPTER’, 0.6904580593109131),
 (‘JUMBO BAG RED RETROSPOT’, 0.6636316776275635),
 (‘CREAM HANGING HEART T-LIGHT HOLDER’, 0.6594936847686768),
 (‘DISCOUNT’, 0.6384902596473694),
 (‘JUMBO BAG STRAWBERRY’, 0.6319284439086914),
 (“WRAP 50’S CHRISTMAS”, 0.6274842023849487)]


[(‘YELLOW DRAGONFLY HELICOPTER’, 0.681573748588562),
 (‘CREAM HANGING HEART T-LIGHT HOLDER’, 0.6654664278030396),
 (‘JUMBO BAG RED RETROSPOT’, 0.6553223133087158),
 (‘WOODEN FRAME ANTIQUE WHITE’, 0.6327764391899109),
 (‘DISCOUNT’, 0.6313808560371399),
 (‘JUMBO BAG STRAWBERRY’, 0.6252866983413696)]
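
To make the averaging step easier to follow without a trained model, here is a self-contained sketch with invented two-dimensional vectors. The product names, the vectors and the helper names profile_vector and rank_by_similarity are all hypothetical; unlike get_similar_item, this sketch explicitly excludes products that are already in the history, which is a design choice of the example and not something the article's code does.

```python
import numpy as np

# Toy 2-d embeddings (invented): three mugs and two bags.
vectors = {
    "MUG_RED": np.array([1.0, 0.0]),
    "MUG_BLUE": np.array([0.9, 0.1]),
    "MUG_GREEN": np.array([0.8, 0.2]),
    "BAG_JUMBO": np.array([0.0, 1.0]),
    "BAG_LUNCH": np.array([0.1, 0.9]),
}


def profile_vector(codes):
    """Mean of the known product vectors; unknown codes are skipped."""
    known = [vectors[c] for c in codes if c in vectors]
    return np.mean(known, axis=0)


def rank_by_similarity(query, exclude=(), topn=2):
    """Rank products by cosine similarity to the query vector."""
    scores = []
    for code, vec in vectors.items():
        if code in exclude:
            continue
        cos = float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        scores.append((code, cos))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


history = ["MUG_RED", "MUG_BLUE", "UNKNOWN_CODE"]
profile = profile_vector(history)
ranked = rank_by_similarity(profile, exclude=history)
print(ranked)
```

A customer who has bought two mugs gets a profile vector that lies between them, so the third mug ranks well ahead of the bags.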

In the end, our system recommended six products based on a user’s complete buying history. The model can also be used to make product suggestions based on recent purchases. Let’s try it with only the last ten purchases as input:


The output will be:

[(‘YELLOW DRAGONFLY HELICOPTER’, 0.6946737170219421),
 (‘JUMBO BAG RED RETROSPOT’, 0.6675459742546082),
 (‘CREAM HANGING HEART T-LIGHT HOLDER’, 0.6548881530761719),
 (‘DISCOUNT’, 0.6417276859283447),
 (“WRAP 50’S CHRISTMAS”, 0.6357795000076294),
 (‘JUMBO BAG STRAWBERRY’, 0.6348972320556641)]
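
Restricting the input to recent purchases amounts to slicing the purchase sequence before it is averaged. A minimal sketch, with a hypothetical helper name and made-up product codes:

```python
def recent_purchases(history, n=10):
    """Return the last n items of a purchase sequence (all of it if shorter)."""
    # history[-0:] would return the whole list, so n <= 0 is handled explicitly
    return history[-n:] if n > 0 else []


# Made-up purchase sequences of different lengths.
long_history = [f"P{i:02d}" for i in range(25)]
short_history = ["P00", "P01", "P02"]

print(recent_purchases(long_history))
print(recent_purchases(short_history))
```

The resulting list can then be passed to an averaging function such as my_aggr_vec in place of the full history.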

Summing it up

Word2vec is a fast and powerful method for learning a meaningful representation of words from context. In this article it has been used to solve a shopping-related problem. It enables us to precisely model each product using a vector of coordinates that captures the context in which the product was purchased. This allows us to quickly find products that are comparable for each user.

Suggesting the appropriate products/items to users can enhance not just interaction rates but also product purchases, resulting in increased income. Using machine learning to find similar products also raises product visibility.

