R Series: ‘dplyr’ Package

0
452
R Series

In this twenty-second article in the R, Statistics and Machine Learning series, we will continue on the exploration of handling text using the ‘dplyr’ package. This package is a grammar for data manipulation, and provides simple and fast functions to handle data frame-like objects.

We will use R version 4.2.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version
R version 4.2.2 (2022-10-31) -- “Innocent and Trusting”
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

You can install and load the ‘dplyr’ package using the following commands:

> install.packages(“dplyr”)
Installing package into ‘/home/shakthi/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
...
* copying figures
* building package indices
* installing vignettes
* testing if installed package can be loaded from temporary location
* checking absolute paths in shared objects and dynamic libraries
* testing if installed package can be loaded from final location
* testing if installed package keeps a record of temporary installation path
* DONE (dplyr)

> library(dplyr)
Attaching package: ‘dplyr’

Consider the ‘Bank Marketing Data Set’ available from the UCI Machine Learning Repository available at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data set is from a Portuguese banking institution and is available freely for public research use. There are four data sets available, and we will use the read.csv() function to import the data from a bank.csv file into a data frame, as shown below:

> bank <- read.csv(file=”bank.csv”, sep=”;”)

> bank[1:3,]
  age        job marital education default balance housing loan  contact day
1  30 unemployed married   primary      no    1787      no   no cellular  19
2  33   services married secondary      no    4789     yes  yes cellular  11
3  35 management  single  tertiary      no    1350     yes   no cellular  16
  month duration campaign pdays previous poutcome  y
1   oct       79        1    -1        0  unknown no
2   may      220        1   339        4  failure no
3   apr      185        1   330        1  failure no

select

You can use the select() function to choose specific columns from the data frame. In the following example, we select the ‘age’, ‘job’ and ‘description’ fields from the data set:

> bank %>% select(age, job, education)
age job education
1 30 unemployed primary
2 33 services secondary
3 35 management tertiary
4 30 management tertiary
5 59 blue-collar secondary
6 35 management tertiary.

filter

The filter() function is used to produce a subset of the data that matches the input conditions. The bank entries with ‘blue-collar’ jobs alone can be listed as follows:

> bank %>% filter(job == “blue-collar”)
    age         job  marital education default balance housing loan   contact
1  59 blue-collar  married secondary   no  0  yes no  unknown
2  31 blue-collar married secondary no  360  yes yes cellular
3  25 blue-collar  single primary  no  -221  yes no   unknown
4  55 blue-collar  married  primary  no  627 yes no unknown
5  32 blue-collar  married secondary no 2089 yes no cellular
...

arrange

The data from selected columns can be sorted using the arrange() function. In the following example, the bank data is first sorted by ‘age’ and then by the ‘balance’ column:

> bank %>% arrange(age, balance)
     age  job  marital education default balance housing loan
1     19  student   single   unknown  no     0      no   no
2     19  student   single   primary  no     103      no   no
3     19  student   single secondary  no     302      no   no
4     19  student   single   unknown  no    1169      no   no
5     20  student   single secondary  no     291      no   no
6     20  student   single secondary  no     502      no   no
7     20  student   single secondary  no    1191      no   no
8     21  student   single secondary  no       6      no   no

relocate

The relocate() function is used to change the column positions in the output. You can use the ‘.before’ and ‘.after’ arguments to specify the location of the columns. For example:

> bank %>% relocate(age, .after = “job”)
    job age  marital education default balance housing loan
1   unemployed  30  married   primary   no    1787   no   no
2   services  33  married secondary     no    4789  yes  yes
3   management  35   single  tertiary   no    1350  yes   no
4   management  30  married  tertiary no   1476     yes  yes
5   blue-collar  59  married secondary  no    0     yes   no
6   management  35   single  tertiary   no   747      no   no
...

count

The unique values for a column can be computed using the count() function, as shown below:

> bank %>% count(age)
   age   n
1   19   4
2   20   3
3   21   7
4   22   9
5   23  20
6   24  24
7   25  44
8   26  77
9   27  94

tally

The tally() method is a low-level function that can also be used to count unique values for a column. The total number of bank entries in the CSV file is 4521, as indicated below:

> bank %>% tally()
     n
1 4521

distinct

The unique rows in a data frame can be obtained using the distinct() function. The three possible values for marital status are shown below:

> bank %>% distinct(marital) 
   marital
1  married
2   single
3 divorced

mutate

The mutate() function is used to create new columns from existing data. The balance field is in Euros, and a new USD column is computed based on the Euro-USD conversion rate as follows:

> bank %>% select(age, education, balance) %>% mutate(usd = balance * 1.08)
     age education balance      usd
1     30   primary    1787  1929.96
2     33 secondary    4789  5172.12
3     35  tertiary    1350  1458.00
4     30  tertiary    1476  1594.08
5     59 secondary       0     0.00
6     35  tertiary     747   806.76
...

pull

The pull() function accepts a numeric argument for the column number and returns its values. The last column in the data set is indexed at ‘-1’ and corresponds to whether the client had subscribed to a term deposit.

> bank %>% pull(-1)
   [1] “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no” 
  [13] “no”  “yes” “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no” 
  [25] “no”  “no”  “no”  “no”  “no”  “no”  “yes” “no”  “no”  “yes” “yes” “no” 
  [37] “yes” “yes” “yes” “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”  “no”

group_by

The data set can be categorised using the group_by() function, as demonstrated below:

> bank %>% group_by(age)
# A tibble: 4,521 × 17
# Groups:   age [67]
     age job       marital education default balance housing loan  contact   day
   <int> <chr>     <chr>   <chr>     <chr>     <int> <chr>   <chr> <chr>   <int>
 1    30 unemploy… married primary   no         1787 no      no    cellul…    19
 2    33 services  married secondary no         4789 yes     yes   cellul…    11
 3    35 manageme… single  tertiary  no         1350 yes     no    cellul…    16
 4    30 manageme… married tertiary  no         1476 yes     yes   unknown     3
 5    59 blue-col… married secondary no            0 yes     no    unknown     5
 6    35 manageme… single  tertiary  no          747 no      no    cellul…    23
 7    36 self-emp… married tertiary  no          307 yes     no    cellul…    14
 8    39 technici… married secondary no          147 yes     no    cellul…     6
 9    41 entrepre… married tertiary  no          221 yes     no    unknown    14
10    43 services  married primary   no          -88 yes     yes   cellul…    17
# ℹ 4,511 more rows

summarise

You can create a new data frame by also grouping variables using the summarise() function. The summarize() name can also be used instead.

> bank %>% summarise(age, balance)
     age balance
1     30    1787
2     33    4789
3     35    1350
4     30    1476
5     59       0

glimpse

The entire data frame can be transposed using the glimpse() function, which shows every column as a row. It tries to show as much of the data as possible in the output.

> bank %>% glimpse()
Rows: 4,521
Columns: 17
$ age       <int> 30, 33, 35, 30, 59, 35, 36, 39, 41, 43, 39, 43, 36, 20, 31, …
$ job       <chr> “unemployed”, “services”, “management”, “management”, “blue-…
$ marital   <chr> “married”, “married”, “single”, “married”, “married”, “singl…
$ education <chr> “primary”, “secondary”, “tertiary”, “tertiary”, “secondary”,…
$ default   <chr> “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, “no”, …
$ balance   <int> 1787, 4789, 1350, 1476, 0, 747, 307, 147, 221, -88, 9374, 26…

slice

You can select the first three rows by specifying them with the slice() function. It allows you to select, filter and show duplicate rows as well. For example:

> bank %>% slice(1:3)
  age        job marital education default balance housing loan  contact day
1  30 unemployed married   primary      no    1787      no   no cellular  19
2  33   services married secondary      no    4789     yes  yes cellular  11
3  35 management  single  tertiary      no    1350     yes   no cellular  16
  month duration campaign pdays previous poutcome  y
1   oct       79        1    -1        0  unknown no
2   may      220        1   339        4  failure no
3   apr      185        1   330        1  failure no
desc 
The desc() function is used to sort a column in the descending order. The bank balances can be displayed from the highest to the lowest values, as shown below: 

> bank %>% arrange(desc(balance))
   age    job  marital education default balance housing loan
1     60  retired  married   primary   no   71188    no   no
2     42  entrepreneur  married  tertiary no  42045 no   no
3     43  technician   single  tertiary  no  27733  yes   no
4     36   management  married  tertiary no  27359  yes   no
5     57   technician  married  tertiary no  27069   no  yes
6     31   housemaid   single   primary  no  26965      no   no
...

nth

A specific row from the data set can be retrieved using the nth() function. The 10th entry in the bank data frame is shown below:

> bank %>% nth(10)
age job marital education default balance housing loan contact day
1 43 services married primary no -88 yes yes cellular 17
month duration campaign pdays previous poutcome y
1 apr 313 1 147 2 failure no

You are encouraged to read the dplyr package manual to learn more functions, arguments and usage.

LEAVE A REPLY

Please enter your comment!
Please enter your name here