R Series: mlr3 Package

0
353
mlr3

In this 24th post in our ongoing exploration of the R series, we dive into the mlr3 package.

The mlr3 package offers an object-oriented programming implementation for machine learning in R. It is a successor to the mlr package, and has only the essential building blocks for machine learning. It moves away from R’s S3 classes and uses the R6 encapsulated classes with reference semantics. The software contains only computational code and algorithms, and leaves the visualisation to extra packages. The return types are documented, and type safety is provided. It uses data.table for representing tabular data and is highly efficient and performant. The number of package dependencies is greatly reduced, and hence installation and maintenance are easier compared to the mlr package. We will use R version 4.2.2 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version

R version 4.2.2 (2022-10-31) -- “Innocent and Trusting”
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

To begin, you can install and load the ‘mlr3’ package using these commands:

> install.packages(“mlr3”)

Installing package into ‘/home/shakthi/R/x86_64-pc-linux-gnu-library/4.1’

(as ‘lib’ is unspecified)
...
*** installing help indices
*** copying figures
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (mlr3)

> library(“mlr3”)

mlr_tasks

The tabular data and metadata information for a machine learning problem are stored in objects called ‘tasks’ in mlr3. There are 11 data sets available in this version of the mlr3 package as shown here:

> mlr_tasks

<DictionaryTask> with 11 stored values
Keys: boston_housing, breast_cancer, german_credit, iris, mtcars,

penguins, pima, sonar, spam, wine, zoo

You can obtain a task from the dictionary using the
tsk() function. An example task for the mtcars data set is given below:

> task_mtcars = tsk(“mtcars”)

> task_mtcars
<TaskRegr:mtcars> (32 x 11): Motor Trends
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt

The boston_housing data set contains information on various houses in Boston such as the proportion of non-retail business acres per town (indus), the percentage of lower status of the population (lstat), index of accessibility to radial highways (rad), and the average number of rooms per dwelling (rm), etc.

> task_boston_housing = tsk(“boston_housing”)

> task_boston_housing
<TaskRegr:boston_housing> (506 x 19): Boston Housing Prices
* Target: medv
* Properties: -
* Features (18):
- dbl (13): age, b, cmedv, crim, dis, indus, lat, lon, lstat, nox,
ptratio, rm, zn
- int (3): rad, tax, tract
- fct (2): chas, town

The data.table function

The data.table() function is implemented in the data.table package for storing tabular data. It is an alternative to R’s data.frame, and provides more functionality and better performance. For example:

> dt = as.data.table(task_boston_housing)

> dt

medv age b chas cmedv crim dis indus lat lon lstat
1: 24.0 65.2 396.90 0 24.0 0.00632 4.0900 2.31 42.2550 -70.9550 4.98
2: 21.6 78.9 396.90 0 21.6 0.02731 4.9671 7.07 42.2875 -70.9500 9.14
3: 34.7 61.1 392.83 0 34.7 0.02729 4.9671 7.07 42.2830 -70.9360 4.03
4: 33.4 45.8 394.63 0 33.4 0.03237 6.0622 2.18 42.2930 -70.9280 2.94
5: 36.2 54.2 396.90 0 36.2 0.06905 6.0622 2.18 42.2980 -70.9220 5.33
...

The as_task_regr function

A regression task needs to be created using the as_task_regr() function for a given data.frame type object. The subset() function is used to retrieve selective data from the data set. In this example, the ‘cylinder’ and ‘gear’ columns from the mtcars data set are selected, and the gear column is chosen as the prediction field:

> mtcars_set = subset(mtcars, select = c(“cyl”, “gear”))

> str(mtcars_set)
‘data.frame’: 32 obs. of 2 variables:
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...

> task_mtcars = as_task_regr(mtcars_set, target = “gear”, id = “car”)

> task_mtcars
<TaskRegr:car> (32 x 2)
* Target: gear
* Properties: -
* Features (1):
- dbl (1): cyl

The head function

The first few entries of the data set can be viewed using
the head() function as illustrated:

> task_mtcars$head()
gear cyl
1: 4 6
2: 4 6
3: 4 4
4: 3 6
5: 3 8
6: 3 6

The summary function

The summary() function can also be used on the data set to display the quartiles, minimum, maximum, median and mean values as illustrated here:

> summary(as.data.table(task_mtcars))
gear cyl
Min. :3.000 Min. :4.000
1st Qu.:3.000 1st Qu.:4.000
Median :4.000 Median :6.000
Mean :3.688 Mean :6.188
3rd Qu.:4.000 3rd Qu.:8.000
Max. :5.000 Max. :8.000

The data function

The complete data for the regression task can be viewed using the data() function as shown here:

> task_mtcars$data()
gear cyl
1: 4 6
2: 4 6
3: 4 4
4: 3 6
5: 3 8
6: 3 6
7: 3 8
8: 4 4
9: 4 4
...

You can also limit the number of rows and columns to be shown as illustrated below:

> task_mtcars$data(rows = c(1, 2, 3), cols = task_mtcars$feature_names)
cyl
1: 6
2: 6
3: 4

Data display with nrow, ncol

The data dimensions can be displayed using nrow and ncol as shown in this snippet:

> c(task_mtcars$nrow, task_mtcars$ncol)
[1] 32 2

The features and the target column names are fetched using feature_names and target_names, respectively, as illustrated here:

> c(Features = task_mtcars$feature_names, Target = task_mtcars$target_names)
Features Target
“cyl” “gear”

The select and filter functions

The select() function can limit the desired features to be used. In the following example, the ‘cylinder’ column is alone chosen from the mtcars regression.

> task_mtcars$select(“cyl”)

> task_mtcars
<TaskRegr:car> (32 x 2)
* Target: gear
* Properties: -
* Features (1):
- dbl (1): cyl

The rows in the data set can be selected using the filter() function as demonstrated below:

> task_mtcars$filter(1:3)

> task_mtcars$data()
gear cyl
1: 4 6
2: 4 6
3: 4 4

The partition function

The partition() function is used to split the data set into training and test data. For the given mtcars data set, we get 21 entries for the training set, and 11 data points for testing.

> subset = partition(task_mtcars)

> subset
$train
[1] 3 5 8 9 10 21 27 30 32 7 12 14 16 22 23 24 29 31 19 26 28

$test
[1] 1 2 4 25 6 11 13 15 17 18 20

Training data

The Learner class provides many machine learning algorithms in R. In this example, we will use the lrn() regression tree learner for the mtcars data set and training data:

> learning = lrn(“regr.rpart”)

> learning$train(task_mtcars, row_ids = subset$train)

> learning$model
n= 21

node), split, n, deviance, yval
* denotes terminal node

1) root 21 637.86950 19.79524
2) cyl>=5 12 74.58250 15.72500 *
3) cyl< 5 9 99.41556 25.22222 *

The predict method

The predict() method can be used for prediction from the training model for the test data as shown below:

> predict = learning$predict(task_mtcars, row_ids = subset$test)

> predict
<PredictionRegr> for 11 observations:
row_ids truth response
1 21.0 15.72500
2 21.0 15.72500
4 21.4 15.72500
---
17 14.7 15.72500
18 32.4 25.22222
20 33.9 25.22222

The autoplot method

The autoplot() method can be used to visualise the prediction object. However, you need to first install the mlr3verse package as shown here:

> install.packages(“mlr3verse”)
Installing package into ‘/home/shakthi/R/x86_64-pc-linux-gnu-library/4.1’
...
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (mlr3verse)

You can then load the ‘mlr3viz’ library, and run the autoplot function on the prediction result. Figure 1 shows the predicted and truth values for the mtcars data set.

Autoplot
Figure 1: Autoplot
> library(mlr3viz)

> predict = learning$predict(task_mtcars, row_ids = subset$test)

> autoplot(predict)

You are encouraged to read the mlr3 package reference manual to learn more about functions, arguments and usage.

LEAVE A REPLY

Please enter your comment!
Please enter your name here