R Series: Optimization

0
433
Optimizing

We have learnt various R profiling techniques in the previous article. In this twentieth article in the R series, we shall learn how to optimise R code.

We will use version 4.2.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version
R version 4.2.2 (2022-10-31) -- “Innocent and Trusting”
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

Recursive versus non-recursive functions

Consider a recursive function to compute the nth Fibonacci sequence:

recurFib <- function(n) {
if (n == 0) {0}
else if (n == 1 || n == 2) {1}
else { recurFib(n - 1) + recurFib(n - 2) }
}

You can verify the output sequence numbers, as shown below:

> recurFib(0)
[1] 0
> recurFib(1)
[1] 1
> recurFib(2)
[1] 1
> recurFib(3)
[1] 2
> recurFib(4)
[1] 3
> recurFib(5)
[1] 5

A non-recursive implementation is given by the following fibo() function:

> fibo <- function(n) {
+ phi <- (sqrt(5) + 1)/2
+ fib <- (phi^(n+1) - (1-phi)^(n+1)) / (2*phi - 1)
+ round(fib)
+ }

In this example, we are starting the Fibonacci number with 1, as illustrated below:

> fibo(0)
[1] 1
> fibo(1)
[1] 1
> fibo(2)
[1] 2
> fibo(3)
[1] 3
> fibo(4)
[1] 5
> fibo(5)
[1] 8

The system.time() function can be used to obtain the computation time taken by both these functions.

> recurFib(35)
[1] 9227465
> system.time(recurFib(35))
user system elapsed
6.831 0.000 6.834
> fibo(34)
[1] 9227465
> system.time(fibo(34))
user system elapsed
0 0 0

The recursive function is definitely slower than the non-recursive implementation. Moreover, R uses faster native implementation for computing power of values (^), and hence its usage is recommended.

Preallocate

You do not need to allocate memory for any object in R. In the following example, the initialisation takes time.

> v1 <- c()
> system.time(for (i in 1:10000000) {v1[i] <- i;})
user system elapsed
2.143 0.120 2.263

The length can be specified for a vector and memory can be pre-allocated, so that initialisation can be sped up, as illustrated below:

> v2 <- c(NA)
> length(v2) <- 10000000
> system.time(for (i in 1:10000000) {v2[i] <- i;})
user system elapsed
0.299 0.020 0.319

Matrix multiplication

Consider two matrices, m1 and m2:

> m1 <-matrix(1:10000, nrow=100)
> m2 <-matrix(10001:20000, nrow=100)

You can write your own version of matrix multiplication as follows:

> matrixMul <- function(mat1, mat2) {
rows <- nrow(mat1)
cols <- ncol(mat2)
output <- matrix(nrow = rows, ncol = cols)
for (i in 1:rows) {
for(j in 1:cols) {
vec1 <- mat1[i,]
vec2 <- mat1[,j]
mult1 <- vec1 * vec2
output[i,j] <- sum(mult1)
}
}
return(output)
}

The time taken for multiplying the two matrices, m1 and m2, using the above function is given below:

> system.time(matrixMul(m1,m2))
user system elapsed
0.039 0.000 0.039

The recommended practice is to perform matrix multiplication using the ‘%*%’ operator. The result for the multiplication of matrices m1 and m2 is quite instantaneous.

> system.time(m1*m2)
user system elapsed
0 0 0

Bytecode compiler

The cmpfun() function provides an interface to the byte code compiler for R. It takes the following arguments:

Argument Description
f A closure
options Compiler options

The ‘recurFib’ function can be byte-compiled to improve the performance, as shown below:

> library(compiler)
> compiled.recurFib <- cmpfun(recurFib)
> system.time(compiled.recurFib(34))
user system elapsed
4.296 0.000 4.298

The compiled function, along with the reference to the bytecode, is shown below:

> compiled.recurFib
function(n) {
if (n == 0) {0}
else if (n == 1 || n == 2) {1}
else { recurFib(n - 1) + recurFib(n - 2) }
}
<bytecode: 0x55eff06fcba8>

You can also view the disassembly of the compiled function using the disassemble() function, as follows:

> disassemble(compiled.recurFib)
list(.Code, list(12L, GETVAR.OP, 1L, LDCONST.OP,3L,EQ.OP, 4L,
BRIFNOT.OP, 5L, 13L, LDCONST.OP, 3L,RETURN.OP, GETVAR.OP,
1L, LDCONST.OP, 8L, EQ.OP, 9L, OR1ST.OP,7L,30L, GETVAR.OP,
1L, LDCONST.OP, 10L, EQ.OP, 11L, OR2ND.OP, 7L, BRIFNOT.OP,
12L, 36L, LDCONST.OP, 8L, RETURN.OP, GETFUN.OP, 14L, MAKEPROM.OP,
17L, CALL.OP, 15L, GETFUN.OP, 14L, MAKEPROM.OP, 19L, CALL.OP,
18L, ADD.OP, 20L, RETURN.OP), list({
if (n == 0) {
0
}
else if (n == 1 || n == 2) {
1
}
else {
recurFib(n - 1) + recurFib(n - 2)
}
}, n, structure(c(2L, 3L, 4L, 44L, 3L, 44L, 2L, 4L), srcfile = <environment>, class = “srcref”),
0, n == 0, if (n == 0) {
0
...

The cmpfile() function can be used to compile a large source .R file. It accepts the following arguments:

Argument Description
infile  A closure
outfile Compiler options
ascii If compiled file should be stored as ascii
env Environment for compiling
verbose Display compilation output
options Compiler options
version Workspace format version

 

Bigmemory package

The bigmemory package is used to handle large, Big Data matrices using shared memory and memory-mapped files. You can install the package as follows:

> install.packages(“bigmemory”)
Installing package into ‘/home/guest/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
...
Creating a generic function for ‘typeof’ from package ‘base’ in package ‘bigmemory’
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (bigmemory)

The ‘m’ matrix is created using the ‘matrix’ function and written to a .csv file using the ‘write.table()’ function, as shown below:

> library(bigmemory)
> m <- matrix(rnorm(30*15000, mean=0, sd=4), 30, 15000)
> write.table(m, file = “/tmp/m1.csv”)
> system.time(write.table(m, file = “/tmp/m1.csv”))
user system elapsed
0.318 0.001 0.318

You can instead use the big.matrix() function from the bigmemory package to improve the write performance, as demonstrated below:

> system.time(big.matrix(m, nrow=30, ncol=15000, backingfile=”m2.csv”, type=”double”, descriptorfile=”m2.desc”))
user system elapsed
0.006 0.000 0.006

The contents of the m2.csv file are given below for reference:

$ cat m2.desc
new(“big.matrix.descriptor”, description = list(sharedType = “FileBacked”,
filename = “m2.csv”, dirname = “/home/guest/20-optimization/”,
totalRows = 30L, totalCols = 15000L, rowOffset = c(0, 30),
colOffset = c(0, 15000), nrow = 30, ncol = 15000, rowNames = NULL,
colNames = NULL, type = “double”, separated = FALSE))

The R code optimizer

The rco package takes in R code as input, and applies various optimization techniques to improve the code. You can install the package using the following command:

> install.packages(“rco”)
Installing package into ‘/home/guest/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
--- Please select a CRAN mirror for use in this session ---
...
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (rco)

The formula to convert Fahrenheit to Celsius is given below:

celsius = (fahrenheit - 32) / 1.8

The code snippets are stored in a ‘code’ variable, as shown below:

> library(rco)
> code <- paste(
+ “f <- 100”,
+ “paren <- (f - 32)”,
+ “v <- paren / 1.8”,
+ “c <- v”,
+ “if (c > 30) {“,
+ ‘ print(“It is hot!”)’,
+ “} else {“,
+ ‘ print(“It is pleasant!”)’,
+ “}”,
+ sep=”\n”
+ )

You can invoke the optimize_code() function on the code and view the optimised code, as illustrated below:

> pass1 <- optimize_text(code, iterations = 1)
Optimization number 1

> pass1
[1] “f <- 100\nparen <- (100 - 32)\nv <- paren / 1.8\nc <- v\nif (c > 30) {\n print(\”It is hot!\”)\n} else {\n print(\”It is pleasant!\”)\n}”
> pass2 <- optimize_text(pass1, iterations = 1)
Optimization number 1

> pass2
[1] “f <- 100\nparen <- 68\nv <- 68 / 1.8\nc <- v\nif (c > 30) {\n print(\”It is hot!\”)\n} else {\n print(\”It is pleasant!\”)\n}”

The ‘optimize_text’ function accepts the following arguments:

Argument Description
text Code in character vectors
optimizers List of optimizer functions
iterations Number of times the optimizer should run

 

You are encouraged to read the ‘Help’ pages of the above R functions to learn more about their arguments, options and usage.

LEAVE A REPLY

Please enter your comment!
Please enter your name here