R Series: Text Mining


In this twenty-first article in the R series, we will learn more about the Text Mining (tm) package, a framework that offers R functions to process documents and to handle text and data formats.

Text Mining (tm) is a popular package for preparing text for machine learning and natural language processing algorithms. We shall explore the various functions provided by the tm package in this article. We will use R version 4.2.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version
R version 4.2.2 (2022-10-31) -- "Innocent and Trusting"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

You can install and load the tm package using the following commands:

> install.packages('tm')
Installing package into ‘/home/shakthi/R/x86_64-pc-linux-gnu-library/4.1’
...
also installing the dependencies ‘NLP’, ‘slam’
...

** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (tm)

> library(tm)
Loading required package: NLP

getReaders

The getReaders() function lists the available functions to extract content from various data sources, as shown below:

> getReaders()
 [1] "readDataframe"           "readDOC"
 [3] "readPDF"                 "readPlain"
 [5] "readRCV1"                "readRCV1asPlain"
 [7] "readReut21578XML"        "readReut21578XMLasPlain"
 [9] "readTagged"              "readXML"

DirSource

The DirSource() function creates a source from the files in a directory. It accepts the following arguments:

Argument      Description
directory     A character vector of directory path names
encoding      Character encoding of the files
pattern       Optional regular expression; only matching file names are used
recursive     Logical value to recurse into subdirectories
ignore.case   Logical value; if TRUE, pattern matching is case-insensitive
mode          Character string: "binary", "text", or "" (no read)

The following example demonstrates the use of the DirSource() function.

> d <- DirSource(system.file("texts", "txt", package = "tm"))
> d
$encoding
[1] ""

$length
[1] 5

$position
[1] 0

$reader
function (elem, language, id)
{
    if (!is.null(elem$uri))
        id <- basename(elem$uri)
    PlainTextDocument(elem$content, id = id, language = language)
}
<bytecode: 0x555cec9d12c8>
<environment: namespace:tm>

$mode
[1] "text"

$filelist
[1] "/home/guest/R/x86_64-pc-linux-gnu-library/4.1/tm/texts/txt/ovid_1.txt"
...
attr(,"class")
[1] "DirSource"    "SimpleSource" "Source"
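
The pattern and encoding arguments described above can be tried out on a scratch directory. The following is a minimal sketch; the directory name and the three file names are made up for illustration and are not part of the tm package.

```r
library(tm)

# Hypothetical scratch directory with three small files; only two end in .txt
doc_dir <- file.path(tempdir(), "tm_dirsource_demo")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("Open source is source code made freely available.",
           file.path(doc_dir, "a.txt"))
writeLines("Peer production is a main principle.",
           file.path(doc_dir, "b.txt"))
writeLines("Not a text document.", file.path(doc_dir, "notes.md"))

# pattern restricts the source to *.txt files; encoding is stated explicitly
src <- DirSource(doc_dir, encoding = "UTF-8", pattern = "\\.txt$")
basename(src$filelist)

# Build an in-memory corpus from the directory source
corpus <- VCorpus(src)
length(corpus)
```

Only the two .txt files should end up in the corpus; notes.md is skipped because it does not match the pattern.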

VectorSource

The Wikipedia page on ‘Open Source’ at https://en.wikipedia.org/wiki/Open_source has been downloaded as a PDF, converted to plain text using the Ghostscript (gs) tool and saved as ‘Open_source.txt’. We can read this file line by line into a character vector, and create a vector source from it using the VectorSource() function, which treats each element of the vector as one document. The VCorpus() function then builds a corpus from this source as an R object that is kept entirely in memory, as demonstrated below:

> input <- "Open_source.txt"
> connection <- file(input)
> lines <- readLines(connection)

> lines

[1] "Open source"
[2] "Open source is source code that is made freely available for possible modification and redistribution."
[3] "Products include permission to use the source code,[1] design documents,[2] or content of the product. The"
[4] "open-source model is a decentralized software development model that encourages open"
[5] "collaboration.[3][4] A main principle of open-source software development is peer production, with products"
...
> feed <- VCorpus(VectorSource(lines))
> feed
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 1465

> summary(feed)
  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
...

The VCorpus() function accepts the following arguments:

Argument        Description
x               A source object, such as one created by VectorSource() or DirSource()
readerControl   A named list of control parameters: ‘reader’ is the function used to read x, and ‘language’ is the language of the documents (default ‘en’)
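
As a minimal sketch of the readerControl argument, the sample texts that ship with tm (the same directory used in the DirSource example) can be read with an explicit language. The language code "lat" is an assumption chosen for the Latin sample texts; the reader component is left at the default supplied by the source.

```r
library(tm)

# Sample texts bundled with the tm package
txt_dir <- system.file("texts", "txt", package = "tm")

# Only the language component is set here; the reader comes from the source
ovid <- VCorpus(DirSource(txt_dir, encoding = "UTF-8"),
                readerControl = list(language = "lat"))

length(ovid)                    # number of documents in the corpus
meta(ovid[[1]], "language")     # language recorded in the document metadata
head(content(ovid[[1]]), 3)     # first lines of the first document
```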

stemCompletion

The stemCompletion() function heuristically completes word stems by looking them up in a dictionary, which can be a corpus. Using the corpus built from the ‘Open_source.txt’ file as the dictionary, the stems ‘op’ and ‘so’ complete to ‘open-source’ and ‘software’, as shown below:

> stemCompletion(c("op", "so"), feed)
           op            so
"open-source"    "software"
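
The dictionary can also be a plain character vector, and the type argument selects which of several matching completions is returned. The following is a small sketch; the stems and dictionary words are made up for illustration.

```r
library(tm)

# Hypothetical stems and a small hand-written dictionary
stems <- c("licens", "develop")
dictionary <- c("license", "licensing", "development", "developers")

# Pick the shortest dictionary word that starts with each stem
short <- stemCompletion(stems, dictionary, type = "shortest")
short

# Pick the longest dictionary word that starts with each stem
long <- stemCompletion(stems, dictionary, type = "longest")
long
```

The result is a character vector named by the original stems, so each completion can be traced back to its input.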

removeNumbers

You can remove numbers using the removeNumbers() function. For example:

> lines[3]
[1] "Products include permission to use the source code,[1] design documents,[2] or content of the product. The"
> removeNumbers(lines[3])
[1] "Products include permission to use the source code,[] design documents,[] or content of the product. The"

removePunctuation

The punctuation marks in the text can be removed using the removePunctuation() function, as demonstrated below:

> lines[12]
[1] "rise of the Internet.[10] The open-source software movement arose to clarify copyright, licensing, domain,"
> removePunctuation(lines[12])
[1] "rise of the Internet10 The opensource software movement arose to clarify copyright licensing domain"

The following arguments are accepted by the removePunctuation() function:

Argument                           Description
x                                  A character vector or text document
preserve_intra_word_contractions   Logical value to keep intra-word contractions (such as the apostrophe in "it's")
preserve_intra_word_dashes         Logical value to keep intra-word dashes (such as the hyphen in "open-source")
ucp                                Logical value to use Unicode character properties
...                                Additional arguments to be passed
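
The effect of the two preserve_* arguments can be seen on a short sentence. This is a small sketch with a made-up input string:

```r
library(tm)

# Hypothetical input containing a hyphenated word and a contraction
x <- "The open-source movement, it's growing."

# Default: every punctuation character is removed
plain <- removePunctuation(x)

# Keep hyphens that sit between two word characters
dashes <- removePunctuation(x, preserve_intra_word_dashes = TRUE)

# Keep both intra-word hyphens and intra-word apostrophes
both <- removePunctuation(x,
                          preserve_intra_word_contractions = TRUE,
                          preserve_intra_word_dashes = TRUE)

print(c(plain, dashes, both))
```

Keeping intra-word dashes is useful for a text like the one used in this article, where ‘open-source’ would otherwise be merged into ‘opensource’.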

removeWords

The removeWords() function removes the given words from the text. In the following example, the English stopwords are removed from a line; the spaces that surrounded each removed word are left in place:

> lines[24]
[1] "instance, in the early years of automobile development a group of capital monopolists owned the rights to a"

> removeWords(lines[24], stopwords("en"))
[1] "instance,   early years  automobile development  group  capital monopolists owned  rights  "

findFreqTerms

A term-document matrix can be constructed using the TermDocumentMatrix() function. You can then use the findFreqTerms() function to list the terms whose total frequency lies between a lower and an upper bound, as follows:

> tdm <- TermDocumentMatrix(lines)
> findFreqTerms(tdm, 3, 5)
 [1] "modification"   "include"        "decentralized"  "encourages"
 [5] "main"           "principle"      "production,"    "blueprints,"
 [9] "public."        "began"          "code."          "limitations"
[13] "promotes"       "became"         "redistribution" "widely"
[17] "developers"     "gained"         "hold"           "producers"
[21] "domain,"        "general"        "program"        "design."
...

The findFreqTerms() function accepts the following arguments:

Argument   Description
x          A term-document matrix or document-term matrix
lowfreq    Lower bound numeric frequency
highfreq   Upper bound numeric frequency
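
The lowfreq and highfreq bounds are easier to follow on a tiny corpus where the counts can be checked by hand. The following sketch uses three made-up documents:

```r
library(tm)

# Three hypothetical one-line documents
docs <- c("open source software", "open source license", "open hardware")
tdm_small <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Terms that occur at least twice across all documents
frequent <- findFreqTerms(tdm_small, lowfreq = 2)
frequent

# Terms that occur exactly once
rare <- findFreqTerms(tdm_small, lowfreq = 1, highfreq = 1)
rare
```

Here ‘open’ occurs three times and ‘source’ twice, so both pass the lower bound of 2, while the remaining terms occur only once.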

findMostFreqTerms

A term frequency vector can be generated using the termFreq() function, and you can use it to find the most frequently used words in the document with the findMostFreqTerms() function, as shown below:

> tf <- termFreq(lines)
> findMostFreqTerms(tf)
        the         and        open      source open-source         for
        520         296         222         174         161         123
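
The number of terms returned can be changed with the n argument. A minimal sketch on a made-up document, where the counts are easy to verify:

```r
library(tm)

# Hypothetical document: "open" occurs 3 times, "source" twice, "license" once
doc <- PlainTextDocument("open source open source open license")
tf_small <- termFreq(doc)

# Ask for the two most frequent terms only
top <- findMostFreqTerms(tf_small, n = 2)
top
```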

stopwords

A number of stopword lists are supported, such as ‘catalan’, ‘romanian’, ‘german’, and ‘SMART’ (for English). You can select a list with the ‘kind’ argument of the stopwords() function and print the words it contains, as shown below:

> stopwords(kind = "en")
 [1] "i"          "me"         "my"         "myself"     "we"
 [6] "our"        "ours"       "ourselves"  "you"        "your"
[11] "yours"      "yourself"   "yourselves" "he"         "him"
[16] "his"        "himself"    "she"        "her"        "hers"
[21] "herself"    "it"         "its"        "itself"     "they"
[26] "them"       "their"      "theirs"     "themselves" "what"
...
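
Since stopwords() returns an ordinary character vector, a list can be inspected or extended with domain-specific words before it is passed to removeWords(). The extra words and the sample sentence below are assumptions chosen for illustration.

```r
library(tm)

# Compare the sizes of two English lists and peek at the German one
length(stopwords("en"))
length(stopwords("SMART"))
head(stopwords("german"))

# Extend the English list with hypothetical corpus-specific words
custom <- c(stopwords("en"), "retrieved", "archived")

# Remove the extended list from a made-up sentence and tidy the spacing
cleaned <- removeWords("archived from the original and retrieved later", custom)
cleaned <- trimws(stripWhitespace(cleaned))
cleaned
```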

stripWhitespace

Any extra whitespace can be removed from the input text using the stripWhitespace() function, which collapses every run of whitespace into a single space. This is useful after removeWords(), which leaves behind the spaces that surrounded each removed word. For example:

> l <- removeWords(lines[24], stopwords("en"))

> l
[1] "instance,   early years  automobile development  group  capital monopolists owned  rights  "

> stripWhitespace(l)
[1] "instance, early years automobile development group capital monopolists owned rights "
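
The cleaning functions covered so far can be chained over a whole corpus with tm_map(). The following is a sketch of one possible order of steps, not a prescribed pipeline; the input sentence is made up, and the final trimws() step is an addition that is not part of tm.

```r
library(tm)

# Hypothetical one-document corpus
raw <- "The Open-Source model [3] is  decentralized."
corpus <- VCorpus(VectorSource(raw))

# Base R functions must be wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
# Extra arguments after the function are passed on to it
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
# stripWhitespace() leaves a single leading space; trim it (base R)
corpus <- tm_map(corpus, content_transformer(trimws))

clean <- as.character(corpus[[1]])
clean
```

Lower-casing comes first because the stopword list is in lower case; removing punctuation before stopwords is a design choice that also affects contractions such as "it's", so the order may need adjusting for other texts.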

tokenizer

A text document can be split into tokens using one of the tokenizers provided by the package: Boost_tokenizer(), MC_tokenizer() and scan_tokenizer(). The getTokenizers() function lists them. An example of the Boost_tokenizer() function is given below:

> getTokenizers()
[1] "Boost_tokenizer" "MC_tokenizer"    "scan_tokenizer"

> Boost_tokenizer(lines[24])
 [1] "instance,"   "in"          "the"         "early"       "years"
 [6] "of"          "automobile"  "development" "a"           "group"
[11] "of"          "capital"     "monopolists" "owned"       "the"
[16] "rights"      "to"          "a"
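
To see how the tokenizers differ, they can be run side by side on the same string. This is a small sketch with a made-up input; scan_tokenizer() splits on whitespace only, so punctuation stays attached to the tokens, while the output of the other two is simply printed for comparison.

```r
library(tm)

# Hypothetical input with a hyphen and a comma
x <- "open-source software, licensing"

# Splits on whitespace only; punctuation stays attached to the tokens
scan_tokens <- scan_tokenizer(x)
scan_tokens

# Print the other tokenizers' results for comparison
boost_tokens <- Boost_tokenizer(x)
boost_tokens
mc_tokens <- MC_tokenizer(x)
mc_tokens
```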

weightBin

A binary weight on a term-document matrix can be computed using the weightBin() function, as demonstrated below:

> tdm <- TermDocumentMatrix(lines)

> weightBin(tdm)
<<TermDocumentMatrix (terms: 4167, documents: 1465)>>
Non-/sparse entries: 9623/6095032
Sparsity           : 100%
Maximal term length: 139
Weighting          : binary (bin)
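
Binary weighting replaces every non-zero count with 1, so it records only whether a term occurs in a document. The following sketch on two made-up documents shows the matrix before and after weighting:

```r
library(tm)

# Two hypothetical documents; "open" occurs twice in the first one
docs <- c("open open source", "source")
tdm_small <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Raw term frequencies
counts <- as.matrix(tdm_small)
counts

# Binary weights: presence (1) or absence (0)
binary <- as.matrix(weightBin(tdm_small))
binary
```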

inspect

You can view detailed information on a corpus, a term-document matrix or input text using the inspect() function. For a term-document matrix, it prints a summary followed by a sample of the matrix, as illustrated below:

> inspect(tdm)
<<TermDocumentMatrix (terms: 4167, documents: 1465)>>
Non-/sparse entries: 9623/6095032
Sparsity           : 100%
Maximal term length: 139
Weighting          : term frequency (tf)
Sample             :
             Docs
Terms         1183 32 340 37 529 759 82 828 83 971
  and            0  2   0  3   1   0  0   1  0   0
  for            1  0   0  0   2   1  1   0  1   1
  free           0  0   0  0   0   0  0   0  0   0
  open           1  0   0  0   0   0  0   0  0   0
  open-source    0  0   2  0   0   0  0   0  0   0
  retrieved      0  0   0  0   0   0  0   0  0   0
  software       0  0   1  0   0   0  0   0  0   0
  source         1  0   0  0   1   0  0   0  0   0
  that           0  0   0  1   0   0  0   0  0   0

writeCorpus

The writeCorpus() function writes the text of each document in the corpus to a separate file in the given directory, as illustrated below:

> writeCorpus(feed, "/tmp/foo")

$ ls /tmp/foo/
1000.txt 1061.txt 1121.txt 1182.txt 1242.txt
...
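
The output file names can be chosen with the filenames argument. The sketch below writes a made-up two-document corpus to a scratch directory; the directory and file names are hypothetical, and the directory is created first on the assumption that writeCorpus() does not create it.

```r
library(tm)

# Hypothetical two-document corpus
corpus <- VCorpus(VectorSource(c("Open source", "Peer production")))

# Create the target directory before writing to it
out_dir <- file.path(tempdir(), "tm_writecorpus_demo")
dir.create(out_dir, showWarnings = FALSE)

# One file per document, with explicit file names
writeCorpus(corpus, path = out_dir, filenames = c("first.txt", "second.txt"))

list.files(out_dir)
readLines(file.path(out_dir, "first.txt"))
```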

You are encouraged to read the manual page of the tm package to learn more about its functions, arguments, options and usage.
