An Introduction To Topic Modelling In NLP


Topic modelling in natural language processing is used to categorise information, organise large volumes of text, summarise a large corpus, and improve recommendation systems by identifying commonalities within the corpus. Let’s explore the LDA technique for topic modelling of a corpus.


In natural language processing (NLP), topic modelling is an automatic, unsupervised machine learning technique that discovers the abstract topics present in a large corpus. Words with analogous meanings are grouped into topics, which helps in finding patterns in a textual corpus without a training dataset.

Topic modelling techniques

Among the common topic modelling techniques, Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF) and Latent Semantic Analysis (LSA) are used the most. Of the three, LDA is the most popular. It is a generative probabilistic model that treats each document as a mixture of topics and each topic as a distribution over words, and it uses Dirichlet priors to control how concentrated these distributions are. We shall focus on the LDA technique to implement the topic modelling of a corpus.
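For comparison, NMF and LSA are readily available in scikit-learn. The minimal sketch below is only illustrative; the sample sentences and parameter values are assumptions rather than part of this article’s example:

from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative only)
docs = ["natural language processing helps computers understand text",
        "topic models group words that often occur together",
        "matrix factorisation finds hidden structure in word counts"]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)                 # document-term matrix
terms = tfidf.get_feature_names_out()

nmf = NMF(n_components=2, random_state=0).fit(X)            # NMF topics
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)   # LSA topics

for name, model in (("NMF", nmf), ("LSA", lsa)):
    for i, component in enumerate(model.components_):
        top_words = [terms[j] for j in component.argsort()[-3:][::-1]]  # top 3 words per topic
        print(name, "topic", i, ":", top_words)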

When implementing topic modelling, choosing the right number of topics is essential, since too few topics oversimplify the corpus and too many lead to overlapping topics. Stopwords and noisy data need special care to ensure meaningful topics. Most importantly, human interpretation is needed to turn the output word clusters into proper topic labels.
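A common aid for choosing the number of topics is a coherence score. The sketch below is a rough illustration with a toy tokenised corpus (an assumption for demonstration); it trains a Gensim LDA model for a few candidate topic counts and compares their ‘u_mass’ coherence scores. On a real corpus, the ‘c_v’ coherence computed over the tokenised texts is another popular choice.

from gensim import corpora, models
from gensim.models import CoherenceModel

# Toy tokenised documents (illustrative only)
texts = [["natural", "language", "processing", "helps", "computers"],
         ["topic", "models", "group", "similar", "words"],
         ["language", "models", "learn", "word", "patterns"],
         ["computers", "process", "natural", "language", "text"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Train a model per candidate topic count and compare the coherence scores
for k in (2, 3, 4):
    lda = models.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10, random_state=0)
    cm = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
    print(k, "topics -> u_mass coherence:", round(cm.get_coherence(), 3))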

Topic modelling can be used for high-level summarisation of a text corpus and for:

  • Customer feedback analysis: Identifying recurring themes in customer reviews.
  • Fake news detection: Analysing and classifying news articles.
  • Medical research: Discovering patterns in clinical literature.
  • Legal document analysis: Categorising case laws and contracts.

LDA components

LDA is often used for topic identification in large text corpora and for grouping documents that share topics; in a literature survey, for example, it can surface related articles. Text summarisation is another important application of this technique. LDA consists of three key components.

Documents as topic mixtures

Each document is represented as a mixture of multiple topics, where each topic has a probability within the document. This mixture is controlled by the hyperparameter alpha: higher values of alpha mean documents are spread across more topics.

Topics as word distributions

Here, each topic is characterised by a probability distribution over words, controlled by the hyperparameter beta: a higher beta spreads each topic over more words.

Dirichlet priors

The model places Dirichlet priors (parameterised by alpha and beta) on the topic distribution in documents and the word distribution in topics, which helps in controlling sparsity. A hidden variable Z represents the latent topic assignment of each word in each document.
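In Gensim, these two priors map onto the alpha and eta parameters of LdaModel (eta is Gensim’s name for the beta prior). A minimal sketch, assuming a toy tokenised corpus:

from gensim import corpora, models

# Toy tokenised corpus (illustrative only)
texts = [["cats", "dogs", "pets"], ["stocks", "markets", "trading"], ["dogs", "training", "pets"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(
    corpus,
    num_topics=2,
    id2word=dictionary,
    alpha="auto",   # document-topic prior: higher values let documents mix more topics
    eta="auto",     # topic-word prior (beta): higher values spread topics over more words
    passes=10,
    random_state=0,
)
print(lda.show_topic(0, topn=3))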

LDA algorithm

1. Initialization: Randomly assign a topic to each word in each document.

2. Iterate:

  • For each word, reassign a topic based on how prevalent that topic is in the current document and how prevalent the word is in that topic.
  • Update the topic distributions over words and the document distributions over topics.

3. Convergence: Step 2 is repeated until the topic assignments and distributions stabilise (a toy sketch of this loop is given below).
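To make the loop concrete, here is a toy collapsed Gibbs sampling sketch of the reassignment step. The word ids, priors and iteration count are made-up assumptions, and Gensim’s LdaModel actually uses online variational Bayes rather than Gibbs sampling; the sketch only illustrates the idea:

import numpy as np

# Toy corpus: documents as lists of word ids (vocabulary size V = 5)
docs = [[0, 1, 2, 1], [3, 4, 3, 2], [0, 1, 4, 3]]
K, V, alpha, beta = 2, 5, 0.1, 0.01          # topics, vocabulary size, Dirichlet priors
rng = np.random.default_rng(0)

# Step 1 (initialisation): assign a random topic to every word occurrence
z = [[rng.integers(K) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K))              # document-topic counts
n_kw = np.zeros((K, V))                      # topic-word counts
n_k = np.zeros(K)                            # total words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Step 2 (iterate): resample each word's topic from its conditional distribution
for _ in range(50):                          # Step 3: in practice, stop once assignments stabilise
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

print("topic-word counts:\n", n_kw)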

Implementation in Python

In Python, LDA is implemented using the Gensim module, which processes raw, unstructured digital text with unsupervised machine learning algorithms. Alongside it, NLTK’s punkt models are used; punkt learns parameters such as lists of abbreviations and acronyms from a corpus in an unsupervised way and is required by the word tokenizer. To normalise the given text, it is necessary to filter out all stop-words and punctuation. For further normalisation, words are reduced to their base form. For example, stemming ‘sings’ and ‘singing’ returns ‘sing’, while irregular forms such as ‘sang’ and ‘sung’ are generally left unchanged (handling them requires lemmatisation). In general, stemming:

  • Strips common suffixes from the end of words.
  • Uses language-specific rules.
  • Can produce base forms that are not real words.
  • Is often used as a plugin component for indexing.

Python’s NLTK package provides modules for all of these functions, and they need to be imported in the program:

from gensim import corpora, models        # Gensim module
from nltk.corpus import stopwords         # stop words
from nltk.tokenize import word_tokenize   # word tokenisation
from nltk.stem import PorterStemmer       # stemming

punkt requires a one-time download with the nltk.download('punkt') command.
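As a quick check, here is a minimal stemming sketch; the printed results are what NLTK’s PorterStemmer typically produces (regular inflections are reduced, irregular forms are usually left unchanged):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["sings", "singing", "singer", "sang", "sung"]:
    print(word, "->", stemmer.stem(word))   # e.g. 'sings' and 'singing' -> 'sing'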

The complete program

The Python program is given below and the program logic is as follows:

1. Load the corpus (lines 9-14).

2. Tokenization of each document and creation of a list of words (line 16).

3. Removal of punctuation from the list of words (line 18).

4. Stop-word removal: Removal of all stop-words from the list (lines 20-21).

5. Stemming (optional): Replace all words with their base form (lines 23-24).

6. Prepare a list of lists for BOW (line 26).

7. Prepare a dictionary (a token-to-id mapping) from the filtered document list (line 28).

8. Prepare a corpus from the BOW of the filtered document list (line 29).

9. Train the LDA model on the corpus for a given number of topics with respect to the dictionary of words. The number of topics should be chosen judiciously to suit the purpose (line 31).

10. Training assigns each word a probability within each topic; print the word distribution of each topic (lines 33-34).

11. Define a function to get a suitable title (lines 36-44).

a. Create a BOW from the dictionary.

b. Get a distribution of topics from the LDA model.

c. Identify the maximum probabilistic value for the possible topics.

d. Extract the top ‘n’ words from the dominant topics.

e. Join the top ‘n’ words to form the topic title.

12. Assign new text to the trained LDA model for topic discovery (line 46).

13. Call the title discovery function (step 11) to get a possible title (line 47).

14. Print the title (line 48).

from gensim import corpora, models
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import string
import nltk

nltk.download('punkt')
nltk.download('stopwords')  # needed for stopwords.words() below

# Documents
documents = ["Natural language processing enables computers to understand, interpret, and modify human language.",
             "Nowadays natural language corpora like emails, text messages, social media newsfeeds, video calls, and instant messaging have become integral parts of our daily communication.",
             "Software for natural language processing analyzes this data to determine the sentiment or purpose of the message and responds to human communication instantly."]

# Tokenization
texts = [word_tokenize(doc.lower()) for doc in documents]

# Remove punctuation
words_no_punct = [word for lstitm in texts for word in lstitm if word not in string.punctuation]

# Remove stopwords
stop_words = set(stopwords.words("english"))
filtered_words = [word for word in words_no_punct if word.lower() not in stop_words]

# Stemming (optional)
stemmer = PorterStemmer()
filtered_doc = [stemmer.stem(word) for word in filtered_words]

# Preparation for BOW
filtered_doc = [filtered_doc]  # doc2bow requires a list of lists

# Create a dictionary and corpus
dictionary = corpora.Dictionary(filtered_doc)
corpus = [dictionary.doc2bow(text) for text in filtered_doc]

# Train LDA model
lda_model = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Print word distribution in a topic
for topic_id in range(lda_model.num_topics):
    print(f"Topic {topic_id}:", lda_model.show_topic(topic_id))

# Function to generate a title from a document
def generate_title(doc):
    bow = dictionary.doc2bow(doc)  # Convert document (token list) to bag-of-words
    topic_distribution = lda_model.get_document_topics(bow)  # Get topic distribution
    dominant_topic = max(topic_distribution, key=lambda x: x[1])[0]  # Select most probable topic
    # Extract top words from the dominant topic
    top_words = [word for word, _ in lda_model.show_topic(dominant_topic, topn=3)]
    # Generate a title by joining top words
    title = " ".join(top_words).title()
    return title

# Generate a title for a new document
new_doc = texts[0]  # tokenised first document
generated_title = generate_title(new_doc)
print("Generated Title:", generated_title)

The output is:

Topic 0: [('language', 0.028249461), ('natural', 0.028061535), ('processing', 0.028007403), ('communication', 0.027926473), ('messaging', 0.027820466), ('human', 0.027805243), ('computers', 0.02779292), ('purpose', 0.027780347), ('instantly', 0.02777591), ('emails', 0.027773501)]

Topic 1: [('language', 0.07260038), ('natural', 0.056469582), ('human', 0.040358953), ('communication', 0.0403232), ('processing', 0.040299334), ('modify', 0.02420806), ('calls', 0.024204206), ('sentiment', 0.024201188), ('understand', 0.024201008), ('enables', 0.024199806)]

Generated Title: Language Natural Human

Topic modelling identifies recurring topics in a set of documents. As an unsupervised method, it offers a simple and effective starting point for quick text classification. To determine the most likely title for a given text, it builds a set of tokens and their frequencies from the corpus. Careful corpus preparation and vocabulary selection are essential for good results, and the number of topics should be kept small enough that each target text is still covered. Compared with NMF and LSA, the LDA model is simple to comprehend and can be implemented with minimal effort in Python. I hope this post inspires NLP enthusiasts to delve further into this domain.
