Five Python Libraries that Make Web Content Extraction Simple

1
46093

Content extraction

Content extraction from Web pages occurs in a variety of domains such as information retrieval, data mining, etc. This article introduces five power-packed Python libraries that make the process of content extraction from Web resources simple and quick.

The mammoth size of the World Wide Web with its explosive growth, both in terms of its contents and its variety, poses certain important challenges to end users. Searching for a required piece of information from large volumes is one of the key problems that the average Web user faces on a daily basis. We have moved from the era of ‘information scarcity’ to the era of ‘information overload’. Almost any keyword that you fire into a search engine returns results ranging in number from a few hundreds to many thousands and even a few million. Fetching the data required to satisfy the user’s information needs is one of the challenging research issues.
An important solution to this problem of information overload is to extract the contents from Web resources and present them in a form that is comparatively simpler for the user to understand. This article presents Python based solutions that handle Web content extraction. There is an array of libraries available in Python for content extraction, as illustrated in Figure 1.

The focus of this article is to introduce the following five power-packed Python libraries, each of which has some unique features:

  • html2text
  • Lassie
  • Newspaper
  • Python-Goose
  • Textract

html2text

The major task performed by the html2text library is the conversion of HTML pages into an easy-to-read plain text format. This library was initiated by Aaron Swartz. The installation of html2text can be done easily, with the following command:

pip install html2text

After successfully completing the installation process, the html2text library can be used either from the terminal or within a Python program. The terminal usage of html2text is shown in the following code:

html2text [(filename|url) [encoding]]

For example, the following command will fetch the text version of the Open Source For You website’s home page:

$html2text https://www.opensourceforu.com

This can be achieved through a Python program also, as shown below:

import html2text
import requests
#Make the proxy settings, if your access is through a proxy server
proxies={‘http’:’10.10.80.11:3128’,’https’:’10.10.80.11:3128’}
url = “https://www.opensourceforu.com”
res = requests.get(url, proxies = proxies, timeout=5.0 )
html = res.text
# extract the text from the html version
text = html2text.html2text(html)
print(text)
Figure 1
Figure 1: Web content extraction – Python libraries

Lassie
The Lassie library enables the retrieval of component information from Web pages. For example, the key words, images, videos and description can be fetched with the Lassie library. The installation of Lassie can be carried out with the following commands:

$ pip install lassie

…or,

$ easy_install lassie

The information from the Web page is gathered with the lassie.fetch() function as illustrated below:

import lassie
import pprint
url=”https://www.opensourceforu.com/2016/04/ubuntu-16-04-lts-xenial-xerus-official/”
pprint.pprint (lassie.fetch(url))

The execution of the above script results in the following output:

{‘description’: u’Ubuntu 16.04 LTS aka “Xenial Xerus” comes with new features like snap package and LXD container to deliver you an upgraded experience. Being as an LTS platform, the new version additionally comes with five-year support.’,

‘images’:[{‘src’:u’https://www.opensourceforu.com/wp-content/uploads/2016/04/Ubuntu-16-04-lts-150x150.jpg’,
‘type’: u’og:image’},
{‘src’: u’https://www.opensourceforu.com/favicon1.ico’,
‘type’: u’favicon’}],

‘keywords’: [u’Ubuntu’,
u’Ubuntu 16.04 LTS’,
u’Ubuntu 16.04’,
u’Ubuntu Xenial Xerus’,
u’Xenial Xerus’,
u’Canonical’],
‘locale’: u’en_US’,
‘title’: u”Ubuntu 16.04 LTS ‘Xenial Xerus’ goes official - Open Source For You”,
‘url’: u’https://www.opensourceforu.com/2016/04/ubuntu-16-04-lts-xenial-xerus-official/’,
‘videos’: []}

Lassie facilitates the fetching of all the images from the Web page with the all images = true option. The following code will provide information regarding all the images present in the page:

pprint.pprint (lassie.fetch(url, all_images=True))

Newspaper
The Newspaper library, developed and maintained by Lucas Ou-Yang, is specially designed for extracting information from the websites of newspapers and magazines. The objective of this library is to extract and curate the articles from the newspapers and similar websites. To install Newspaper, use the following commands.

For lxml dependencies, type:

$ sudo apt-get install libxml2-dev libxslt-dev

PIL is for recognising images.

$ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev

For Newspaper library installation, type:

$ pip3 install newspaper3k

The NLP corpora will be downloaded:

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

The Newspaper library is used to gather information associated with articles. This includes author names, publication dates, major images in the article, videos present in the article, key words describing the article and the summary of the article. A sample program with an article from the Open Source For You magazine is given below:

From newspaper import Article url = ‘https://www.opensourceforu.com/2016/02/ionic-a-ui-framework-to-simplify-hybrid-mobile-app-development/’

article = Article(url)

#1 . Download the article
article.download()

#2. Parse the article
article.parse()

#3. Fetch Author Name(s)
print(article.authors)

#4. Fetch Publication Date
print(“Article Publication Date:”)
print(article.publish_date)
#5. The URL of the Major Image
print(“Major Image in the article:”)
print(article.top_image)

#6. Natural Language Processing on Article to fetch Keywords
article.nlp()
print (“Keywords in the article”)
print(article.keywords)

#7. Generate Summary of the article
print(“Article Summary”)
print(article.summary)

The output of the program is as follows:

[u’Article Written’, u’K S Kuppusamy’]
Article Publication Date:
2016-02-14 12:25:28+00:00
Major Image in the article:
https://www.opensourceforu.com/wp-content/uploads/2016/01/ionic-html-and-Angular-js-visual2.jpg
Keywords in the article
[u’development’, u’smartphone’, u’simplify’, u’mobile’, u’apps’, u’app’, u’hybrid’, u’framework’, u’ui’, u’browser’, u’ionic’, u’native’]

Article Summary
To assist developers in increasing the speed of app development, there are various app development frameworks.
Native app testing (device): Another method to test the app is as the native app in the device. Ionic is a promising open source mobile UI framework that makes the hybrid app development process swifter and smoother. This article presents Ionic, which is a framework for hybrid mobile app development.
So native app development requires specialized development teams.

The major features of the Newspaper library are its multi-threaded article-downloading framework, summary extraction and support for around 20 languages.

Figure 2
Figure 2: Python-Goose features

Python-Goose
Goose is a popular library for article extraction, which was originally developed for the Java ecosystem. Python-Goose is a rewrite of Goose in Python. The primary objective of Goose is to extract information from news articles. It can be installed with the following commands:

mkvirtualenv --no-site-packages goose
git clone https://github.com/grangier/python-goose.git
cd python-goose
pip install -r requirements.txt
python setup.py install

The Python version of the Goose library was developed by Xavier Gragnier. The features of Python-Goose are as illustrated in Figure 2.
The Newspaper library, which was covered earlier, uses the Goose library.

from goose import Goose

url = ‘https://www.opensourceforu.com/2016/02/ionic-a-ui-framework-to-simplify-hybrid-mobile-app-development/’
g = Goose()
article = g.extract(url=url)
print(“Title:”)
print(article.title)
print(“Meta Description”)
print(article.meta_description)

print(“First n characters of the cleaned text”)
print(article.cleaned_text[:20])
print (“URL of Main image”)
print(article.top_image.src)

The output is:

Title:
Ionic: A UI Framework to Simplify Hybrid Mobile App Development

The meta description is:

Ionic makes the task of creating hybrid apps very easy, and these can be ported to various platforms.

First n characters of the cleaned text

This article present

URL of main image

https://www.opensourceforu.com/wp-content/uploads/2016/01/Figur-1-350x390.jpg
Figure 3
Figure 3: A sample image for text extraction

Textract
The Web includes resources apart from HTML pages. Textract is a library to extract data from those resource file formats. In simple terms, it can be described as a library to extract text from any type of file—from resources such as Word documents, PowerPoint presentations, PDFs, etc. Textract attempts to extract text from gif, jpg, mp3, ogg, tiff, xls, etc, and has various dependencies to handle these file formats. To install it, use the following commands:

apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox

pip install textract

The extraction of text is carried out with the textract.process() function. The text extraction from the image shown in Figure 3 can be carried out with the following simple program:

import textract
text = textract.process(“oa.jpg”)
print text
OPEN aACCESS

The Textract extraction process depends upon the quality of the source document. Though it is not guaranteed to extract 100 per cent correct text from all the sources, it provides an insight into the type of content residing in the resource, which is definitely key information to have in many scenarios.
To summarise, this article aims to provide an eye-opener to the domain of content extraction from the Web. Though the above-mentioned libraries provide powerful functions for the content extraction process, they do pose research level challenges. Interested developers could take these up and provide better solutions.

 

1 COMMENT

  1. Thank you for this article sir. I explored the python Library – Newspaper for my task and found it to be very efficient and simple.

LEAVE A REPLY

Please enter your comment!
Please enter your name here