R Series: ‘stringr’ Package


We have been exploring various R packages for handling text for natural language processing. In this twenty-third article in the R, Statistics and Machine Learning series, we delve into the ‘stringr’ package, which provides a comprehensive set of functions to easily work with strings.

We will use R version 4.1.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version
R version 4.1.2 (2021-11-01) -- "Bird Hippie"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

You can install and load the ‘stringr’ package using the following commands:

> install.packages("stringr")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
...
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (stringr)

> library(stringr)

str_to_

The str_to_*() family of functions converts strings to upper case, lower case, title case and sentence case. The optional locale argument selects language-specific case rules. The general syntax is as follows:

str_to_<function>(string, locale = "en")

A few examples are given below:

> t <- "R, Statistics and Machine Learning"

> str_to_upper(t)
[1] "R, STATISTICS AND MACHINE LEARNING"

> str_to_lower(t)
[1] "r, statistics and machine learning"

> str_to_title(t)
[1] "R, Statistics And Machine Learning"

> str_to_sentence(t)
[1] "R, statistics and machine learning"
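
The locale argument matters for languages whose case rules differ from English. As a small sketch (the Turkish example and the variable names are our own illustration, not part of the examples above), the letter "i" is upper-cased differently under the "tr" locale:

```r
library(stringr)

# English rules: "i" becomes the plain capital "I"
en_upper <- str_to_upper("i")

# Turkish rules: "i" becomes the dotted capital I (U+0130)
tr_upper <- str_to_upper("i", locale = "tr")

print(en_upper)
print(tr_upper)
```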

str_count

You can count the number of matches of a pattern in a string with the str_count() function. The pattern is interpreted as a regular expression by default. The syntax is as follows:

str_count(string, pattern = "")

A couple of examples are given below for reference:

> str_count(t, "a")
[1] 4

> str_count(t, c("a", "e"))
[1] 4 2
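
Because the pattern is a regular expression, str_count() can also count character classes or whole words. A minimal sketch, assuming the same string t as above (the fruit vector is a made-up illustration):

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Count every lower-case vowel using a character class
vowel_count <- str_count(t, "[aeiou]")

# Count words: each run of word characters is one match
word_count <- str_count(t, "\\w+")

# The string argument is vectorised too: one count per element
fruit_counts <- str_count(c("apple", "banana"), "a")

print(vowel_count)   # 10
print(word_count)    # 5
print(fruit_counts)  # 1 3
```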

str_dup

You can duplicate a string with the str_dup() function, which accepts an input string and the number of times to repeat it. The repeat count can also be a vector, in which case one result is returned per count, as shown below:

> str_dup(t, 1)
[1] "R, Statistics and Machine Learning"

> str_dup(t, 2)
[1] "R, Statistics and Machine LearningR, Statistics and Machine Learning"

> str_dup(t, 1:3)
[1] "R, Statistics and Machine Learning"
[2] "R, Statistics and Machine LearningR, Statistics and Machine Learning"
[3] "R, Statistics and Machine LearningR, Statistics and Machine LearningR, Statistics and Machine Learning"

str_detect

The str_detect() function returns TRUE if the pattern matches anywhere in the input string, and FALSE otherwise. You can wrap the pattern in regex() to control matching options such as case sensitivity. The negate argument, if set to TRUE, inverts the result so that TRUE marks the elements that do not match. Examples to demonstrate this function are given below:

> str_detect(t, "a")
[1] TRUE

> str_detect(t, "[ae]")
[1] TRUE

> str_detect(t, "^s")
[1] FALSE

> str_detect(t, "g$")
[1] TRUE
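
The regex() helper and the negate argument mentioned above can be sketched as follows. The words vector is a hypothetical sample introduced only for this illustration:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"
words <- c("alpha", "beta", "pi", "mu")   # hypothetical sample vector

# Matching is case-sensitive by default;
# regex(ignore_case = TRUE) relaxes that
starts_with_r <- str_detect(t, regex("^r", ignore_case = TRUE))

# negate = TRUE marks the elements that do NOT contain "a"
no_a <- str_detect(words, "a", negate = TRUE)

# Logical results can be used directly for subsetting
words_without_a <- words[no_a]

print(starts_with_r)
print(no_a)
print(words_without_a)
```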

str_conv

The str_conv() function converts a string from the encoding you specify to UTF-8. The syntax is as follows:

str_conv(string, encoding)

Examples that use the ISO-8859-1 encoding are given below for reference:

> str_conv("\xa9", "ISO-8859-1")
[1] "©"

> str_conv("\xbc", "ISO-8859-1")
[1] "¼"

> str_conv("\xbd", "ISO-8859-1")
[1] "½"

> str_conv("\xbe", "ISO-8859-1")
[1] "¾"

str_equal

You can test whether two strings are equal under Unicode rules with the str_equal() function. It accepts the following arguments:

Argument     Description
x            A character vector
y            Another character vector
locale       Locale used for the comparison ("en" for English)
ignore_case  TRUE to ignore case when comparing

A couple of examples are shown below:

> str_equal("hello", "hi")
[1] FALSE

> str_equal("\u1342", "\u1342")
[1] TRUE
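
Comparing "using the Unicode rules" means that two strings that render identically are treated as equal even when they are built from different code points. A short sketch, assuming stringr 1.5.0 or later (the variable names are our own):

```r
library(stringr)

# ignore_case = TRUE compares without regard to case
same_ignoring_case <- str_equal("Hello", "hello", ignore_case = TRUE)

# The letter a-acute written two ways:
# "a" plus a combining acute accent, and a single precomposed code point
decomposed <- "a\u0301"
precomposed <- "\u00e1"

# Base R compares the underlying code points, so these differ
base_equal <- decomposed == precomposed

# str_equal() compares the canonical Unicode forms, so these match
unicode_equal <- str_equal(decomposed, precomposed)

print(same_ignoring_case)
print(base_equal)
print(unicode_equal)
```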

str_like

The str_like() function matches a string against a pattern written in SQL LIKE syntax, where % matches any number of characters and _ matches exactly one character. The pattern must match the entire string. The syntax is as follows:

str_like(string, pattern, ignore_case)

A few examples are given below:

> vowels <- c("a", "e", "i", "o", "u")

> str_like(vowels, "a")
[1]  TRUE FALSE FALSE FALSE FALSE

> str_like(t, "Mach")
[1] FALSE

> str_like(t, "%R%")
[1] TRUE
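
The two LIKE wildcards can be combined to express prefix, substring and single-character tests. The following is a sketch under the assumption that stringr 1.5.0 or later is installed. The patterns keep the same letter case as the text, so the results do not depend on the ignore_case setting:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# % matches any run of characters, so this tests for a prefix
starts_r <- str_like(t, "R%")

# Wrapping the text in % on both sides gives a "contains" test
has_machine <- str_like(t, "%Machine%")

# _ matches exactly one character
one_char <- str_like(c("cat", "cut", "cart"), "c_t")

print(starts_r)
print(has_machine)
print(one_char)
```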

str_match

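The str_match() function extracts the first match of a pattern, together with any capture groups, and returns the result as a character matrix. Patterns are regular expressions, as described in vignette("regular-expressions") and as implemented by the stringi package. A couple of examples are given below: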

> str_match(t, "[a-z]+")
     [,1]
[1,] "tatistics"

> str_match(t, "[a-zA-Z]+")
     [,1]
[1,] "R"
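
The matrix form becomes more useful once the pattern contains capture groups: the first column holds the full match and each further column holds one group. A minimal sketch with a hypothetical vector of key=value strings:

```r
library(stringr)

pairs <- c("colour=red", "size=42", "no match here")  # hypothetical input

# Column 1 is the full match; columns 2 and 3 are the two capture groups.
# Strings that do not match produce a row of NA values.
m <- str_match(pairs, "(\\w+)=(\\w+)")
print(m)

keys <- m[, 2]
values <- m[, 3]
```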

str_extract

The str_extract() function returns the first substring that matches a pattern, while str_extract_all() returns every match. They accept the following arguments:

Argument  Description
string    Input vector
pattern   Regular expression
group     Return the specified matched group (str_extract() only)
simplify  TRUE returns a character matrix, FALSE returns a list of character vectors (str_extract_all() only)

A few examples are given below:

> str_extract(t, "\\d")
[1] NA

> str_extract(t, "and")
[1] "and"

> str_extract(t, "[a-z]+")
[1] "tatistics"

> str_extract(t, "[a-zA-Z]+")
[1] "R"
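
The group and simplify arguments listed in the table can be sketched as follows. This assumes stringr 1.5.0 or later for the group argument. Note that simplify belongs to str_extract_all(), the variant that returns every match:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# str_extract_all() returns a list with one character vector per input string
capitals <- str_extract_all(t, "[A-Z]")[[1]]

# simplify = TRUE returns a character matrix instead of a list
capitals_matrix <- str_extract_all(t, "[A-Z]", simplify = TRUE)

# group = 2 returns only the second capture group of the first match
second_word <- str_extract(t, "(\\w+) and (\\w+)", group = 2)

print(capitals)
print(capitals_matrix)
print(second_word)
```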

str_flatten

You can collapse a character vector into a single string using the str_flatten() function. It takes the following arguments:

Argument  Description
string    Input vector
collapse  String to insert between elements
last      Optional string for the final separator
na.rm     TRUE removes missing values before flattening

A few examples given below illustrate this function:

> vowels <- c("a", "e", "i", "o", "u")

> str_flatten(vowels)
[1] "aeiou"

> str_flatten(vowels[1:3], "-")
[1] "a-e-i"

> str_flatten_comma(vowels)
[1] "a, e, i, o, u"
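
The last and na.rm arguments from the table are not covered by the examples above. A short sketch, assuming stringr 1.5.0 or later (the with_na vector is made up for illustration):

```r
library(stringr)

vowels <- c("a", "e", "i", "o", "u")

# last replaces the separator before the final element
listed <- str_flatten(vowels[1:3], collapse = ", ", last = ", and ")

# A missing value makes the whole result NA unless na.rm = TRUE
with_na <- c("a", NA, "e")
flat_na <- str_flatten(with_na)
flat_clean <- str_flatten(with_na, na.rm = TRUE)

print(listed)
print(flat_na)
print(flat_clean)
```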

str_locate

The str_locate() function returns the beginning and end position of the first pattern match for a given input string. The str_locate_all() function returns all matching occurrences. A few examples are given below:

> str_locate(t, "a")
     start end
[1,]     6   6

> str_locate(t, "$")
     start end
[1,]    35  34

> str_locate_all(t, "a")
[[1]]
     start end
[1,]     6   6
[2,]    15  15
[3,]    20  20
[4,]    29  29
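
The positions returned by str_locate() are typically fed into another function. As a sketch, the start and end columns can be passed to str_sub() to pull the matched text back out of the string:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Locate the first occurrence of "Machine"
loc <- str_locate(t, "Machine")

# Use the start and end positions to extract that substring
found <- str_sub(t, loc[1, "start"], loc[1, "end"])

print(loc)
print(found)
```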

str_sort

You can sort a character vector using the str_sort() function. The related str_order() and str_rank() functions return the sorting order and the ranks of the elements instead. The str_sort() function takes the following arguments:

Argument    Description
x           Character vector
decreasing  TRUE for highest to lowest, FALSE otherwise (default)
na_last     Boolean to handle NA values
numeric     Sort digits numerically

A couple of examples are given below:

> str_sort(vowels)
[1] "a" "e" "i" "o" "u"

> f <- c("beta", "alpha", "gamma", "delta")

> str_sort(f)
[1] "alpha" "beta"  "delta" "gamma"
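
The decreasing and numeric arguments, and the related str_order() function, can be sketched as follows. The items vector is a hypothetical set of labels chosen to show the difference between character and numeric ordering:

```r
library(stringr)

f <- c("beta", "alpha", "gamma", "delta")
items <- c("item10", "item2", "item1")   # hypothetical labels

# decreasing = TRUE reverses the sort order
f_desc <- str_sort(f, decreasing = TRUE)

# By default digits are compared character by character
items_default <- str_sort(items)

# numeric = TRUE compares embedded numbers by value
items_numeric <- str_sort(items, numeric = TRUE)

# str_order() returns the index permutation that would sort the vector
f_order <- str_order(f)

print(f_desc)
print(items_default)
print(items_numeric)
print(f_order)
```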

str_remove

The str_remove() function removes the first match of a pattern from each input string. To remove every match, use str_remove_all(). It accepts two arguments: an input character vector and a pattern. Two examples of the use of this function are given below:

> str_remove(t, "\\s")
[1] "R,Statistics and Machine Learning"

> str_remove(t, "[aeiou]")
[1] "R, Sttistics and Machine Learning"
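
Note that only the first match is removed in the examples above. A minimal sketch of the companion function str_remove_all(), which removes every match:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Remove every whitespace character, not just the first
no_spaces <- str_remove_all(t, "\\s")

# Remove every lower-case vowel
no_vowels <- str_remove_all(t, "[aeiou]")

print(no_spaces)
print(no_vowels)
```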

str_replace

The str_replace() function replaces the first occurrence of the pattern with the replacement string. The syntax usage is as follows:

str_replace(string, pattern, replacement)

A couple of examples are as follows:

> str_replace(t, "\\s", "-")
[1] "R,-Statistics and Machine Learning"

> str_replace(t, "[aeiou]", " ")
[1] "R, St tistics and Machine Learning"
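
Two common extensions are worth sketching: str_replace_all() replaces every occurrence rather than only the first, and the replacement string can refer back to capture groups with \\1, \\2 and so on. The variable names below are our own:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# Replace every whitespace character with a hyphen
hyphenated <- str_replace_all(t, "\\s", "-")

# Swap the last two words using back-references to the capture groups
swapped <- str_replace(t, "(\\w+) (\\w+)$", "\\2 \\1")

print(hyphenated)
print(swapped)
```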

str_split

You can split a string into multiple segments using the str_split() function. It has several variants that differ in the type of value they return.

The str_split() function accepts a character vector and returns a list as shown below:

> str_split(t, " ")
[[1]]
[1] "R,"         "Statistics" "and"        "Machine"    "Learning"

The str_split_1() function takes a single string and returns a character vector. For example:

> str_split_1(t, "and")
[1] "R, Statistics "    " Machine Learning"

The str_split_fixed() function accepts a character vector and returns a character matrix with a fixed number of columns. A couple of examples are given below:

> str_split_fixed(t, " ", 2)
     [,1] [,2]
[1,] "R," "Statistics and Machine Learning"

> str_split_fixed(t, " ", 3)
     [,1] [,2]         [,3]
[1,] "R," "Statistics" "and Machine Learning"

The str_split_i() function takes a character vector and returns the i-th piece of each split string as a character vector. A few examples are given below to demonstrate this function:

> str_split_i(t, " ", 1)
[1] "R,"

> str_split_i(t, " ", 2)
[1] "Statistics"

> str_split_i(t, " ", 3)
[1] "and"
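
The split functions differ mainly in their return type when given more than one string. The sketch below assumes stringr 1.5.0 or later and uses a hypothetical vector of dates. It also shows that str_split_i() accepts a negative index to count from the end:

```r
library(stringr)

t <- "R, Statistics and Machine Learning"

# A negative index counts from the end: -1 is the last piece
last_word <- str_split_i(t, " ", -1)

# str_split() is vectorised: one list element per input string
dates <- c("2021-11-01", "2022-04-22")   # hypothetical input
parts <- str_split(dates, "-")

# str_split_fixed() lines the pieces up in a matrix, one row per string
parts_matrix <- str_split_fixed(dates, "-", 3)
years <- parts_matrix[, 1]

print(last_word)
print(parts)
print(years)
```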

You are encouraged to read the stringr reference manual to learn more about its functions, arguments, options, and usage.
