NLP: A Quick Look At Term Frequency-Inverse Document Frequency


Natural language processing (NLP) helps computers understand and generate written and spoken words. NLP uses Term Frequency-Inverse Document Frequency to analyse the importance of a word within a document by assigning a numerical score to each word. Let’s find out how…

Term Frequency-Inverse Document Frequency (TF-IDF) is a method to convert text into numerical features. It assigns higher weightage to the important words in a corpus and lower weightage to general words like ‘the’, ‘a’ and ‘if’.

To establish the importance of a word, this approach first determines ‘how often the word appears in a document’; then it searches all the documents to determine ‘how rare the word is across all documents’. The first exercise is known as Term Frequency (TF) and the second is known as Inverse Document Frequency (IDF). The multiplication of these two parameters, i.e., TF x IDF, provides the weightage of that word. If a word is frequent in a specific document but rare across others, then the word is more important and gets a higher weightage.

This strategy helps us remove the bias introduced by frequently used but uninformative words (e.g., ‘the’, ‘is’, ‘are’). It also surfaces the keywords of a document, which helps us classify documents or identify their topics by retrieving meaningful keywords from the target documents.

A simple example of TF-IDF

Assume we have three documents:
Doc1: “this magazine is great”
Doc2: “this magazine is bad”
Doc3: “The quality of this magazine is great”

Step 1: Vocabulary

Unique terms = {this, magazine, is, great, bad, quality, of} (note that the stopword ‘the’ from Doc3 is not included in this vocabulary)
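
As a quick illustration, this vocabulary can be built in plain Python by lowercasing the documents, splitting them into words and dropping the stopword (a minimal sketch; a real pipeline would use a proper tokeniser and a full stopword list):

# Build the vocabulary for the three example documents
docs = [
    "this magazine is great",
    "this magazine is bad",
    "The quality of this magazine is great",
]
stopwords = {"the"}   # assumption: only 'the' is treated as a stopword here
vocabulary = {w for d in docs for w in d.lower().split()} - stopwords
print(sorted(vocabulary))
# ['bad', 'great', 'is', 'magazine', 'of', 'quality', 'this']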

Intuitively, we can assign each term a rough weightage in each document:

Term      Doc1  Doc2  Doc3
this      low   low   low
magazine  high  high  high
is        low   low   low
great     high  0     high
bad       0     high  0
quality   0     0     high
of        0     0     high

Across the corpus, this gives:

Weightage = {this: low, magazine: high, is: low, great: high, bad: high, quality: high, of: low}

(Although ‘of’ appears only in Doc3, it is an uninformative stopword, so its overall weightage is still low.)

The corresponding term-document count matrix is:

Term      Doc1  Doc2  Doc3
this      1     1     1
magazine  1     1     1
is        1     1     1
great     1     0     1
bad       0     1     0
quality   0     0     1
of        0     0     1

At first glance, words like ‘magazine’, ‘great’, ‘bad’ and ‘quality’ look like the important words of the individual documents; the computation below will show that ‘magazine’, which occurs in every document, is actually down-weighted by the IDF step.

Step 2: Compute TF-IDF matrix

The components of TF-IDF are:

1. TF (Term Frequency): It measures the frequency of a term within a document.

TF(t, d) = (number of times term ‘t’ appears in document ‘d’) / (total number of terms in document ‘d’)

2. IDF (Inverse Document Frequency): It measures the uniqueness of a word across the documents. In this worked example it is computed with a base-10 logarithm and add-one smoothing:

IDF(t) = log10(N / (1 + df(t)))

where

N = total number of documents

df(t) = number of documents containing the term ‘t’

(The +1 in the denominator is why a term that occurs in every document, such as ‘magazine’, ends up with a small negative IDF. Scikit-learn uses a slightly different smoothed formula, so its values will not match these hand-computed ones exactly.)

3. TF-IDF score

Combining TF and IDF, the final TF-IDF formula is:

TF-IDF(t, d) = TF(t, d) x IDF(t)

4. TF (Term Frequency) table

Term      Doc1 (TF)  Doc2 (TF)  Doc3 (TF)
magazine  1/4        1/4        1/7
great     1/4        0          1/7
bad       0          1/4        0
quality   0          0          1/7

5. IDF table

Term      Appears in documents  IDF
magazine  d1, d2, d3            -0.1249
great     d1, d3                0
bad       d2                    0.1761
quality   d3                    0.1761


6. TF-IDF table

Term      Doc1   Doc2   Doc3
magazine  -0.03  -0.03  -0.02
great     0.00   0.00   0.00
bad       0.00   0.04   0.00
quality   0.00   0.00   0.03

 

‘bad’ and ‘quality’ have higher TF-IDF values because they are rare across the corpus and carry information for classifying these documents. ‘great’ appears in two of the three documents, so its IDF (and hence its TF-IDF score) falls to zero, while ‘magazine’, which appears in every document, even gets a negative IDF and is the least useful for classifying these sentiments.
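
A quick way to verify this arithmetic is to recompute the tables in plain Python. Here is a minimal sketch using the base-10, add-one-smoothed IDF formula from the worked example above:

import math

docs = [
    "this magazine is great".split(),
    "this magazine is bad".split(),
    "the quality of this magazine is great".split(),
]
N = len(docs)

def tf(term, doc):
    # Term frequency: occurrences of the term divided by the document length
    return doc.count(term) / len(doc)

def idf(term):
    # IDF as used in this example: log10(N / (1 + df))
    df = sum(term in doc for doc in docs)
    return math.log10(N / (1 + df))

for term in ["magazine", "great", "bad", "quality"]:
    scores = [round(tf(term, doc) * idf(term), 2) for doc in docs]
    print(term, round(idf(term), 3), scores)

# Output:
# magazine -0.125 [-0.03, -0.03, -0.02]
# great 0.0 [0.0, 0.0, 0.0]
# bad 0.176 [0.0, 0.04, 0.0]
# quality 0.176 [0.0, 0.0, 0.03]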

Implementation

Here is a Python implementation of TF-IDF with four documents. The code prints the TF, IDF and TF x IDF tables, which makes the technique easy to follow. Scikit-learn provides a ready-made implementation through its TfidfVectorizer class: one instance (with use_idf=False and norm='l1') produces the plain term-frequency table, while a standard instance supplies the IDF values.

from sklearn.feature_extraction.text import TfidfVectorizer
from tabulate import tabulate
import pandas as pd

pd.set_option('display.max_columns', None)

# Sample documents
corpus = [
    "this magazine is great",                      # Doc1
    "this magazine is bad",                        # Doc2
    "the quality of this magazine is great",       # Doc3
    "the name of this excellent magazine is OSFY", # Doc4
]

# Create the TF-IDF vectorizer object and fit it to the corpus
vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
vectorizer.fit(corpus)

# Vocabulary learnt from the corpus
feature_names = vectorizer.get_feature_names_out()

# TF table: term counts normalised by document length (no IDF weighting)
tf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=False, norm='l1')
tf_df = pd.DataFrame(tf_vectorizer.fit_transform(corpus).toarray(),
                     columns=tf_vectorizer.get_feature_names_out())

# IDF table: the IDF value learnt for each term
idf_values = vectorizer.idf_
idf_df = pd.DataFrame({'term': feature_names, 'idf': idf_values})

# Multiply TF × IDF
manual_tfidf = (tf_df * idf_df.set_index('term').T.loc['idf']).round(3)

# Display all
print(" \u2732 TF Table:\n", tabulate(tf_df.round(3), "keys", tablefmt="psql"))
print("\n \u2732 IDF Table:\n", tabulate(idf_df.round(3), "keys", tablefmt="psql"))
print("\n \u2732 TF × IDF (manually recomputed):\n", tabulate(manual_tfidf, "keys", tablefmt="psql"))

The final TF x IDF table displays the TF-IDF value of each feature word in documents 0, 1, 2 and 3. A feature word that occurs in every document (‘magazine’) gets lower scores and is therefore less significant, while feature words that appear rarely get higher TF-IDF values. The key terms of these documents are ‘bad’, ‘excellent’, ‘great’, ‘osfy’ and ‘quality’.
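
Building on this, the keywords of each document can be read straight off the TF-IDF matrix by picking its highest-weighted terms. Here is a minimal sketch (the variable names are illustrative and not part of the code above):

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    "this magazine is great",
    "this magazine is bad",
    "the quality of this magazine is great",
    "the name of this excellent magazine is OSFY",
]

vectorizer = TfidfVectorizer(stop_words='english')
tfidf = pd.DataFrame(vectorizer.fit_transform(corpus).toarray(),
                     columns=vectorizer.get_feature_names_out())

# The highest-weighted term in each document (ties broken by column order)
for i, row in tfidf.iterrows():
    print(f"Doc{i + 1}: {row.idxmax()}")

# Expected output: great, bad, quality and excellent for Doc1 to Doc4 respectively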

N-grams

The idea of n-grams is widely used in a variety of NLP applications, such as text prediction, information retrieval and language modelling, where capturing the statistical characteristics of text is essential to understanding the underlying model. In natural language processing, an n-gram is a contiguous sequence of ‘n’ items from a given text or speech; the items may be words, syllables, characters, and so on. A ‘unigram’ is an n-gram of size 1, a ‘bigram’ is of size 2, and a ‘trigram’ is of size 3.
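
As an illustration, word-level n-grams can be generated with just a few lines of Python (a minimal sketch; libraries such as NLTK and scikit-learn provide equivalent utilities):

def ngrams(text, n):
    # Return all contiguous word-level n-grams in the text
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "this magazine is great"
print(ngrams(sentence, 1))   # ['this', 'magazine', 'is', 'great']
print(ngrams(sentence, 2))   # ['this magazine', 'magazine is', 'is great']
print(ngrams(sentence, 3))   # ['this magazine is', 'magazine is great']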

TF-IDF vectorization supports n-grams through the ngram_range parameter, a tuple (min_n, max_n). min_n and max_n set the lower and upper bounds of the n-gram sizes, so n-grams of every size within this range are extracted. For example, if ngram_range is set to (1, 3), the vectorizer will extract unigrams, bigrams and trigrams, adding wider context to the feature set used for machine learning models.

Selecting an appropriate n-gram range can have a considerable impact on the analysis of NLP tasks. Unigrams, for example, may not successfully convey context (e.g., ‘not excellent’ versus just ‘excellent’), whereas bigrams, trigrams, or even higher-level n-grams may.
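
To see this in code, consider two hypothetical review snippets that differ only by a negation (these sentences are illustrative and not part of the corpus above). With ngram_range=(1, 2) and no stop-word removal, the vectorizer produces a ‘not excellent’ feature that a pure unigram model would miss:

from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "this magazine is excellent",
    "this magazine is not excellent",
]

# Unigrams and bigrams together; no stop-word list, so 'not' is kept
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(reviews)
print(vectorizer.get_feature_names_out())
# The feature list now contains bigrams such as 'not excellent' and 'is excellent',
# which let a classifier separate the two reviews.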

Figure 1: TF, IDF, and TF x IDF tables

Unigram TF-IDF

The sklearn package makes summarising a corpus an easy task. The following code is an illustration of unigram analysis.

from sklearn.feature_extraction.text import TfidfVectorizer
from tabulate import tabulate
import pandas as pd

# Sample documents
corpus = [
    "this magazine is great",                      # Doc1
    "this magazine is bad",                        # Doc2
    "the quality of this magazine is great",       # Doc3
    "the name of this excellent magazine is OSFY", # Doc4
]

# Create TF-IDF vectorizer object (default settings: unigrams, no stop-word removal)
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
vector = vectorizer.fit_transform(corpus)

# Vocabulary
feature_names = vectorizer.get_feature_names_out()

# Convert to a DataFrame and print
df = pd.DataFrame(vector.toarray(), columns=feature_names)
print(round(df, 3))
Figure 2: Unigram TF-IDF

Bigram TF-IDF

Here is an example using bigrams with the same corpus.

# Create TF-IDF vectorizer object for bigrams
vectorizer = TfidfVectorizer(
    max_features=500,
    stop_words='english',
    ngram_range=(2, 2),  # bigrams only
    min_df=1,            # keep terms that appear in at least one document
    max_df=0.85          # ignore terms that appear in more than 85% of documents
)

# Fit and transform the corpus
vector = vectorizer.fit_transform(corpus)

# Vocabulary
feature_names = vectorizer.get_feature_names_out()

# Convert to a DataFrame and print
df = pd.DataFrame(vector.toarray(), columns=feature_names)
print(round(df, 3))
Figure 3: Bigram TF-IDF

The bigram model gives us meaningful contextual phrases such as ‘quality magazine’, ‘excellent magazine’, ‘bad magazine’ and ‘osfy magazine’, together with the matching TF-IDF values for each of the four documents. These bigrams also serve as indicative keywords for the corpus and point to OSFY magazine as a high-quality publication.
