Transforming Data with R

In this fifth article in the ‘R, Statistics and Machine Learning’ series, we shall learn the various R functions that are available to combine, modify, select and apply functions on data.

We will be using R version 4.1.0 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets given here.

$ R --version
R version 4.1.0 (2021-05-18) -- "Camp Pontanezen"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
R is free software and comes with absolutely no warranty.

You are welcome to redistribute it under the terms of the GNU General Public License versions 2 or 3. For more information about these matters, see https://www.gnu.org/licenses/.

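Combine
The paste() function concatenates character vectors element by element, separated by a space by default. For example:

> a <- c("1.", "2.", "3.", "4.")
> b <- c("Chapter: Introduction", "Chapter: Trees", "Chapter: Graphs", "Chapter: Networks")
> paste(a, b)
[1] "1. Chapter: Introduction" "2. Chapter: Trees"       
[3] "3. Chapter: Graphs"       "4. Chapter: Networks"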

You can also pass a separator string through the sep argument to combine the character vectors. In the following example, we use ‘. ’ (a dot followed by a space) to join each pair of strings.

> a <- c("1", "2", "3", "4")
> b <- c("Chapter: Introduction", "Chapter: Trees", "Chapter: Graphs", "Chapter: Networks")
> paste(a, b, sep=". ")
[1] "1. Chapter: Introduction" "2. Chapter: Trees"       
[3] "3. Chapter: Graphs"       "4. Chapter: Networks"
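Two related options are worth knowing when combining strings. The following is a minimal sketch, using shortened chapter titles chosen only for illustration: paste0() behaves like paste() with an empty separator, and the collapse argument joins the resulting elements into one string.

```r
# Illustrative vectors (shortened versions of the titles used above)
a <- c("1", "2", "3", "4")
b <- c("Introduction", "Trees", "Graphs", "Networks")

# paste0() is paste() with sep = "" (no separator between the pieces)
labels <- paste0(a, ". ", b)
print(labels)

# collapse joins the elements of the result into a single string
toc <- paste(a, b, sep = ". ", collapse = "; ")
print(toc)
```

Here ‘labels’ is still a vector of four strings, while ‘toc’ is a single string.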

Consider the mtcars data set available in R. You can add a column to the data frame or matrix, using the cbind() function, as shown below:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
 
> data <- head(mtcars)
> year <- c(1970, 1970, 1973, 1974, 1977, 1962)
> new_mtcars <- cbind(data, year)
> new_mtcars
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb year
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1970
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1970
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1973
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1974
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1977
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1962

If you would like to add a new row to the data, use the rbind() function, as follows:

> v <- data.frame(21.4, 4, 121.0, 109, 4.11, 2.780, 18.60, 1, 1, 4, 2, 1966)
> names(v) <- c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "year")
> rownames(v) <- c("Volvo 142E")
> rbind(new_mtcars, v)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb year
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1970
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1970
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1973
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1974
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1977
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1962
Volvo 142E        21.4   4  121 109 4.11 2.780 18.60  1  1    4    2 1966

The merge() function combines two data sets by matching rows on the variables they have in common. In the following example, the ‘data’ and ‘new_mtcars’ data sets are merged. By default, the result is sorted on the common columns, so the rows appear in ascending order of mpg.

> merge(data, new_mtcars)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb year
1 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1962
2 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1977
3 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1970
4 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1970
5 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1974
6 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1973

The merge() function allows the following arguments:

  • x: A data frame to combine with.
  • y: Another data frame to combine with.
  • by: The vector of column names for merge.
  • by.x: The column names in x to be used for combining the data.
  • by.y: The column names in y to be used for the merge operation.
  • sort: A Boolean value on whether to sort the results or not.
  • incomparables: Values that cannot be matched.
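To see the ‘by’ argument in action, here is a small sketch with two hypothetical data frames (‘specs’ and ‘years’ are made up for illustration) that share a single key column. It also uses all.x, an argument of merge() that is not in the list above, to keep unmatched rows from the first data frame.

```r
# Two small hypothetical data frames that share one key column, "car"
specs <- data.frame(car = c("Mazda RX4", "Datsun 710", "Valiant"),
                    hp  = c(110, 93, 105))
years <- data.frame(car  = c("Datsun 710", "Valiant", "Volvo 142E"),
                    year = c(1973, 1962, 1966))

# Inner join on "car": only cars present in both data frames are kept
inner <- merge(x = specs, y = years, by = "car")
print(inner)

# all.x = TRUE keeps every row of x and fills the missing y values with NA
left <- merge(x = specs, y = years, by = "car", all.x = TRUE)
print(left)
```

The inner merge keeps the two cars found in both data frames, while the second call also keeps ‘Mazda RX4’ with NA for the year.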

The intersect() function returns the elements common to two vectors. Applied to the two data frames here, it returns the columns they share. For example:

> intersect(data, new_mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

You can combine vectors or data frames into a single data frame using the make.groups() function from the lattice package, as shown below:

> library(lattice)
> make.groups(new_mtcars$gear, new_mtcars$cyl)
                 data           which
new_mtcars$gear1    4 new_mtcars$gear
new_mtcars$gear2    4 new_mtcars$gear
new_mtcars$gear3    4 new_mtcars$gear
new_mtcars$gear4    3 new_mtcars$gear
new_mtcars$gear5    3 new_mtcars$gear
new_mtcars$gear6    3 new_mtcars$gear
new_mtcars$cyl1     6  new_mtcars$cyl
new_mtcars$cyl2     6  new_mtcars$cyl
new_mtcars$cyl3     4  new_mtcars$cyl
new_mtcars$cyl4     6  new_mtcars$cyl
new_mtcars$cyl5     8  new_mtcars$cyl
new_mtcars$cyl6     6  new_mtcars$cyl

Transform
Computation can be performed to update existing values or add new data. For example, the qsec column in the mtcars data set represents the time in seconds to reach a quarter mile. You can convert the data into minutes using the following calculation:

> data$minsec <- (data$qsec / 60)
> data
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb    minsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.2743333
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.2836667
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.3101667
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.3240000
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.2836667
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.3370000

The transform() function also exists to change variables in a data frame. You can update multiple columns with the transform() function, as shown below:

> transform(new_mtcars, wt=wt*0.4545, year=2021-year)
                   mpg cyl disp  hp drat       wt  qsec vs am gear carb year
Mazda RX4         21.0   6  160 110 3.90 1.190790 16.46  0  1    4    4   51
Mazda RX4 Wag     21.0   6  160 110 3.90 1.306688 17.02  0  1    4    4   51
Datsun 710        22.8   4  108  93 3.85 1.054440 18.61  1  1    4    1   48
Hornet 4 Drive    21.4   6  258 110 3.08 1.461218 19.44  1  0    3    1   47
Hornet Sportabout 18.7   8  360 175 3.15 1.563480 17.02  0  0    3    2   44
Valiant           18.1   6  225 105 2.76 1.572570 20.22  1  0    3    1   59
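The transform() function can also add columns. The sketch below assumes a made-up column name, ‘hp_per_wt’, to show that a name which does not yet exist in the data frame becomes a new column, and that the original data frame is left unchanged.

```r
data <- head(mtcars)

# transform() returns a modified copy; a name that does not exist yet
# (here the hypothetical hp_per_wt) is added as a new column
with_ratio <- transform(data, hp_per_wt = hp / wt)
print(with_ratio[, c("hp", "wt", "hp_per_wt")])

# The original data frame is not modified
print(names(data))
```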

The apply() function can be used on an array or matrix. It takes three arguments: the array (X), the dimension to operate over (MARGIN), and a function (FUN). The function is applied along the chosen dimension: MARGIN=1 applies it to each row and MARGIN=2 to each column. In the following example, the max() and min() functions return the maximum and minimum values respectively.

> a <- array(c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim=c(3, 3))
> a
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
 
> apply(X=a, MARGIN=1, FUN=max)
[1] 7 8 9
> apply(X=a, MARGIN=2, FUN=max)
[1] 3 6 9
> apply(X=a, MARGIN=1, FUN=min)
[1] 1 2 3
> apply(X=a, MARGIN=2, FUN=min)
[1] 1 4 7
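The function passed to apply() does not have to be a built-in one. As a brief sketch on the same 3x3 array, an anonymous function can compute the range of each row, and a column-wise sum can be checked against the built-in colSums():

```r
a <- array(1:9, dim = c(3, 3))

# Range (max - min) of every row, using an anonymous function
row_range <- apply(X = a, MARGIN = 1, FUN = function(r) max(r) - min(r))
print(row_range)

# Column sums via apply(); colSums() is the built-in shortcut for this case
col_sum <- apply(X = a, MARGIN = 2, FUN = sum)
print(col_sum)
print(colSums(a))
```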

If you would like to return a list, you can apply the lapply() function to a vector or list, as shown below:

> l = list(1, 3, 5, 7)
> lapply(l, function(x) x*x)
[[1]]
[1] 1
[[2]]
[1] 9
[[3]]
[1] 25
[[4]]
[1] 49

On the other hand, if you would like to return a vector, matrix or array, you can use the sapply() function.

> sapply(l, FUN=function(x) x*x)
[1]  1  9 25 49
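Because sapply() decides the shape of its result from the data, a stricter alternative is sometimes preferred. A minimal sketch with vapply(), which is not covered above, declares the expected type and length of each result through FUN.VALUE and fails if the function returns anything else:

```r
l <- list(1, 3, 5, 7)

# vapply() works like sapply() but checks that every result
# is a single numeric value, as declared by FUN.VALUE
squares <- vapply(l, function(x) x * x, FUN.VALUE = numeric(1))
print(squares)
```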

The mapply() function is a multivariate version of the sapply() function. For example:

> mapply(paste, c("1", "2", "3", "4"),
+               c(". ", ". ", ". ", ". "),
+               c("Chapter: Introduction", "Chapter: Trees", "Chapter: Graphs", "Chapter: Networks"))
                           1                            2 
"1 .  Chapter: Introduction"        "2 .  Chapter: Trees" 
                           3                            4 
      "3 .  Chapter: Graphs"     "4 .  Chapter: Networks"

The following arguments are supported by the mapply() function.

  • FUN: The function to be applied to the data.
  • …: The vectors or lists whose elements are passed to the function in turn.
  • MoreArgs: Additional arguments to the function.
  • SIMPLIFY: A Boolean value on whether to simplify the result.
  • USE.NAMES: A Boolean value on whether to use names for the values.
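The arguments listed above can be sketched as follows, with short illustrative vectors. MoreArgs supplies a fixed argument to every call, USE.NAMES=FALSE drops the names, and SIMPLIFY=FALSE keeps the result as a list when the individual results differ in length.

```r
num   <- c("1", "2", "3")
title <- c("Introduction", "Trees", "Graphs")

# MoreArgs passes sep = ". " unchanged to every call of paste()
res <- mapply(paste, num, title, MoreArgs = list(sep = ". "), USE.NAMES = FALSE)
print(res)

# rep() is called as rep(1, 3), rep(2, 2) and rep(3, 1);
# SIMPLIFY = FALSE keeps the three results in a list
reps <- mapply(rep, 1:3, 3:1, SIMPLIFY = FALSE)
print(reps)
```

Passing the separator through MoreArgs avoids repeating it once per element, as was done in the earlier example.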

Select
The shingle() function can group data into bins. You will need to load the lattice library in order to use the function. For example, the various cars in ‘new_mtcars’ are grouped based on 4, 6, or 8 cylinders, as indicated below:

> library(lattice)
> shingle(new_mtcars$cyl)
 
Data:
[1] 6 6 4 6 8 6
 
Intervals:
  min max count
1   4   4     1
2   6   6     4
3   8   8     1
 
Overlap between adjacent intervals:
[1] 0 0

The cut() function converts a continuous numeric vector into a factor by dividing its range into intervals. The horsepower in the new_mtcars data is broken into two groups, as follows:

> cut(new_mtcars$hp,breaks=2)
[1] (92.9,134] (92.9,134] (92.9,134] (92.9,134] (134,175]  (92.9,134]
Levels: (92.9,134] (134,175]

The cut() function accepts the following arguments:

  • x: A numeric vector.
  • breaks: Either the number of intervals to split the data into, or a vector of cut points.
  • labels: The labels for the factor levels.
  • include.lowest: A Boolean value on whether to include the smallest value in the bin.
  • right: A Boolean value on whether the interval should be open on the left and closed on the right.
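As a sketch of the ‘breaks’ and ‘labels’ arguments together, the horsepower values of the first six cars can be binned at explicit cut points. The cut points and label names below are arbitrary choices for illustration.

```r
hp <- head(mtcars)$hp   # 110 110 93 110 175 105

# Explicit cut points with readable labels;
# the intervals are (0,100], (100,150] and (150,200]
hp_class <- cut(hp, breaks = c(0, 100, 150, 200),
                labels = c("low", "medium", "high"))
print(hp_class)
print(table(hp_class))
```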

The bracket notation can be used on the data frame to filter the results. For example, the cars that have more than five cylinders are listed below:

> new_mtcars[new_mtcars$cyl > 5, ]
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb year
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1970
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1970
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1974
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1977
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1962

You can also use the subset() function to select the above data:

> subset(new_mtcars, cyl>5)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb year
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 1970
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 1970
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1974
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 1977
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1962

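A random sample can be obtained using the sample() function. When applied to a data frame, sample() treats it as a list of columns, so each of the following calls picks two columns at random: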

> sample(new_mtcars, 2)
                   qsec cyl
Mazda RX4         16.46   6
Mazda RX4 Wag     17.02   6
Datsun 710        18.61   4
Hornet 4 Drive    19.44   6
Hornet Sportabout 17.02   8
Valiant           20.22   6
 
> sample(new_mtcars, 2)
                  disp  mpg
Mazda RX4          160 21.0
Mazda RX4 Wag      160 21.0
Datsun 710         108 22.8
Hornet 4 Drive     258 21.4
Hornet Sportabout  360 18.7
Valiant            225 18.1
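If you want a random sample of rows rather than columns, one common approach is to sample the row indices and use them with the bracket notation. This is a minimal sketch; the seed value is arbitrary and only makes the draw repeatable.

```r
set.seed(42)            # arbitrary seed so that the draw is repeatable
data <- head(mtcars)

# Draw two row indices without replacement, then select those rows
idx <- sample(nrow(data), size = 2)
row_sample <- data[idx, ]
print(row_sample)
```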

Summary
Consider the ‘Bank Marketing Data Set’ available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The data is from a Portuguese banking institution and is available freely for public research use. There are four data sets available, and we will use the read.csv() function to import the data from a ‘bank.csv’ file into a data frame.

> bank <- read.csv(file="bank.csv", sep=";")
 
> bank[1:3,]
  age        job marital education default balance housing loan  contact day
1  30 unemployed married   primary      no    1787      no   no cellular  19
2  33   services married secondary      no    4789     yes  yes cellular  11
3  35 management  single  tertiary      no    1350     yes   no cellular  16
  month duration campaign pdays previous poutcome  y
1   oct       79        1    -1        0  unknown no
2   may      220        1   339        4  failure no
3   apr      185        1   330        1  failure no

The tapply() function can be used to provide the summary of the bank balances for the various job categories, as shown below:

> tapply(X=bank$balance, INDEX=list(bank$job), FUN=sum)
       admin.   blue-collar  entrepreneur     housemaid    management 
       586380       1026563        276381        233386       1712154 
      retired self-employed      services       student    technician 
       533414        254811        460350        129681       1022205 
   unemployed       unknown 
       139446         57065

You can also use the aggregate() function to produce the above result as a data frame. It also provides a method for time-series data.

> aggregate(x=bank$balance, by=list(bank$job), FUN=sum)
         Group.1       x
1         admin.  586380
2    blue-collar 1026563
3   entrepreneur  276381
4      housemaid  233386
5     management 1712154
6        retired  533414
7  self-employed  254811
8       services  460350
9        student  129681
10    technician 1022205
11    unemployed  139446
12       unknown   57065

The aggregate() function accepts the following arguments:

  • x: The object to apply the summary on.
  • by: A list of elements to be categorised.
  • FUN: The function to use to compute the statistic.
  • nfrequency: The number of observations per unit time.
  • …: Any arguments passed to the function.
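The aggregate() function also accepts a formula, which is not shown in the argument list above. Since ‘bank.csv’ may not be at hand, the sketch below uses the built-in mtcars data instead and compares the result with tapply().

```r
# Mean mpg for each number of cylinders, using the built-in mtcars data
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(by_cyl)

# The same grouping with tapply() returns a named vector
# instead of a data frame
by_cyl_vec <- tapply(X = mtcars$mpg, INDEX = mtcars$cyl, FUN = mean)
print(by_cyl_vec)
```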

The rowsum() function computes the sums of a numeric vector or matrix for each level of a grouping variable. For example, the sum of the ‘day’ column grouped by the job category is shown below:

> rowsum(x=bank$day, group=bank$job)
               [,1]
admin.         7803
blue-collar   14646
entrepreneur   2563
housemaid      1713
management    15751
retired        3578
self-employed  2961
services       6470
student        1377
technician    12429
unemployed     2060
unknown         602

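You can count how many times each positive integer value occurs using the tabulate() function. The first element of the result is the count of 1s, the second is the count of 2s, and so on; zeros are not counted: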

> tabulate(bank$previous)
 [1] 286 193 113  78  47  25  22  18  10   4   3   5   1   2   1   0   1   1   1
[20]   1   0   1   1   1   1
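The difference in how zeros are handled is easy to check on a small made-up vector:

```r
x <- c(0, 0, 1, 1, 1, 3)

# tabulate() counts the integers 1, 2, 3, ...; the zeros are ignored
counts <- tabulate(x)
print(counts)

# table() counts every distinct value, including 0
tab <- table(x)
print(tab)
```

tabulate() reports a count of zero for the value 2, which never occurs, whereas table() lists only the values that are present.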

The table() function can also provide a count of the observed data based on categories. For example, for the bank data set, we see the classification of people based on their marital status.

> table(bank$marital)
divorced  married   single 
     528     2797     1196

The duplicated() function flags elements that repeat an earlier value, while unique() returns the distinct values. A couple of examples are shown below:

> duplicated(new_mtcars$cyl)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
 
> unique(bank$job)
 [1] "unemployed"    "services"      "management"    "blue-collar"  
 [5] "self-employed" "technician"    "entrepreneur"  "admin."       
 [9] "student"       "housemaid"     "retired"       "unknown"
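A typical use of duplicated() is to drop repeated rows. As a small sketch on the first six cars of mtcars, negating the result keeps only the first car seen for each cylinder count:

```r
data <- head(mtcars)

# Keep only the first car seen for each cylinder count
first_per_cyl <- data[!duplicated(data$cyl), ]
print(first_per_cyl[, c("mpg", "cyl", "hp")])
```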

The summary results can be sorted using the sort() function.

> r <- rowsum(x=bank$day, group=bank$job) 
> r
               [,1]
admin.         7803
blue-collar   14646
entrepreneur   2563
housemaid      1713
management    15751
retired        3578
self-employed  2961
services       6470
student        1377
technician    12429
unemployed     2060
unknown         602
 
> sort(r)
 [1]   602  1377  1713  2060  2563  2961  3578  6470  7803 12429 14646 15751

You can also specify the ordering required for a specific column in the data using the order() function:

> data[order(data$hp), ]
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb    minsec
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 0.3101667
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 0.3370000
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.2743333
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.2836667
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 0.3240000
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.2836667
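The order() function can also sort in descending order or on more than one column. The following sketch uses the first six cars of mtcars; negating a numeric column reverses its order within the groups formed by the preceding key.

```r
data <- head(mtcars)

# Descending horsepower
by_hp_desc <- data[order(data$hp, decreasing = TRUE), ]
print(by_hp_desc[, c("hp", "cyl", "mpg")])

# Sort by cylinders first, then by mpg (descending) within each group
by_cyl_mpg <- data[order(data$cyl, -data$mpg), ]
print(by_cyl_mpg[, c("cyl", "mpg")])
```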

You are encouraged to read the R documentation for the above functions and try them out on your own data sets.
