R Series: Probability Distributions

0
237
Probability-Distributions

This fifteenth article in the R series will explore more probability distributions and introduce statistical tests.

We will use R version 4.2.1 installed on Parabola GNU/Linux-libre (x86-64) for the code snippets.

$ R --version

R version 4.2.1 (2022-06-23) -- “Funny-Looking Kid”
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

Welch two sample t-test

Consider the mtcars data set in the lattice library:

> head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

The displacement values are shown below:

> mtcars$disp
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0

We can prepare two data sets, P and Q, from the displacement data, as follows:

> P <- mtcars$disp[0:9]
> P
[1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8

> Q <- mtcars$disp[10:19]
> Q
[1] 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7

The boxplot for the data sets is shown in Figure 1:

Figure 1: Boxplot mtcars$disp
Figure 1: Boxplot mtcars$disp
> boxplot(P, Q)

The Welch two sample t-test can be performed using the t.test() function, as demonstrated below:

> t.test(P, Q)

	Welch Two Sample t-test

data:  P and Q
t = -0.98055, df = 15.369, p-value = 0.342
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -176.63004   65.16337
sample estimates:
mean of x mean of y 
 213.1667  268.9000

We can also run the t-test assuming equal variances by passing TRUE to the ‘var.equal’ argument, as follows:

> t.test(P, Q, var.equal=TRUE)

	Two Sample t-test

data:  P and Q
t = -0.95746, df = 17, p-value = 0.3518
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -178.54460   67.07793
sample estimates:
mean of x mean of y 
 213.1667  268.9000

Two sample Wilcoxon test

The Mann-Whitney-Wilcoxon test is used for randomly selected values from two populations. For our example, we can run the wilcox.test() function on the P and Q data sets, as follows:

> wilcox.test(P, Q)

	Wilcoxon rank sum test with continuity correction

data:  P and Q
W = 32, p-value = 0.3059
alternative hypothesis: true location shift is not equal to 0

Warning message:
In wilcox.test.default(P, Q) : cannot compute exact p-value with ties

The empirical CDFs for the data sets can be drawn using the plot() function:

Figure 2: Empirical CDF (P)
Figure 2: Empirical CDF (P)
> plot (ecdf(P), verticals=TRUE)
> plot (ecdf(Q), verticals=TRUE)
Figure 3: Empirical CDF (Q)
Figure 3: Empirical CDF (Q)

The wilcox.test() function accepts the following arguments:

Arguments Description
x A numeric vector of data values
y Optional numeric vector of data values
alternative Character string for alternative hypothesis (‘two.sided’, ‘greater’, or ‘less’)
conf.int A logical value that indicates if confidence interval should be computed
conf.level Confidence level of the interval
correct A logical value on whether to apply continuity correction
data An optional matrix or data frame
exact A logical value on whether p-value should be computed
mu Optional parameter for the null hypothesis
na.action A function to handle data that contains NAs
paired A logical value to indicate a paired test
subset An optional vector specifying a subset of observations
tol.root A positive numeric tolerance

Kolmogrov-Smirnov test

The Kolmogrov-Smirnov test or KS test is used to test if two samples come from the same distribution. It can also be modified to be used as a ‘goodness of fit’ test. The two-sample Kolmogrov-Smirnov test for the P and Q data sets is as follows:

> ks.test(P, Q)

	Exact two-sample Kolmogorov-Smirnov test

data:  P and Q
D = 0.37778, p-value = 0.2634
alternative hypothesis: two-sided

You can also perform a one-sample Kolmogrov-Smirnov test. The function accepts the following arguments:

Arguments Description
x Numeric vector of data values
y Numeric vector or character string name of a CDF
alternative Alternate hypothesis should be ‘two-sided’, ‘less’ or ‘greater’
exact NULL or logical value to specify if p-value should be computed
simulate.p.value A logical value to compute p-values by Monte Carlo simulation
data An optional data frame
subset A subset vector of data to be used
na.action A function to handle data that
contains NAs

The alternative hypothesis argument accepts the string argument values: ‘two.sided’, ‘less’, and ‘greater’. These indicate if the null hypothesis of the distribution of ‘x’ is equal to, less than, or greater than the distribution of ‘y’. Any missing values are ignored from both the samples.

Chi-Square test

The Chi-Square test is a statistical hypothesis test to determine the correlation between two categorical variables. We can determine if the cylinder and horse power variables from the mtcars data set are significantly correlated.

> data <- table(mtcars$cyl, mtcars$hp)
> print(data)
   
    52 62 65 66 91 93 95 97 105 109 110 113 123 150 175 180 205 215 230 245 264
  4  1  1  1  2  1  1  1  1   0   1   0   1   0   0   0   0   0   0   0   0   0
  6  0  0  0  0  0  0  0  0   1   0   3   0   2   0   1   0   0   0   0   0   0
  8  0  0  0  0  0  0  0  0   0   0   0   0   0   2   2   3   1   1   1   2   1

The chisq.test() function can be performed on the data variables, as shown below:

> chisq.test(data)

	Pearson’s Chi-squared test

data:  data
X-squared = 59.429, df = 42, p-value = 0.0393

A p-value of less than 0.05 shows that there is a strong correlation between cylinder and horsepower.

The chisq.test function accepts the following arguments:

Arguments Description
x A numeric vector or matrix
y A numeric vector, or ignored by x is a matrix
correct A logical value on whether to apply continuity correction
p A vector of probabilities
rescale.p A logical scalar to sum to 1
simulate.p.value A logical value to whether compute p-values by Monte Carlo simulation

Beta

The Beta function is mathematically defined by the integral for two inputs, alpha and beta, as shown below:

Figure 4: Beta distribution
Figure 4: Beta distribution

You can use the beta() function for the P and Q data sets, as follows:

> beta(P, Q)
 [1] 7.307338e-100 7.307338e-100 2.515827e-100 5.970038e-162 2.168131e-190
 [6] 8.246568e-192 1.139582e-245 1.211659e-144  2.186038e-63  1.931142e-65

The lbeta() function provides the natural logarithm of the beta function, as demonstrated below:

> lbeta(P, Q)
 [1] -228.2696 -228.2696 -229.3359 -371.2320 -436.7173 -439.9865 -564.0027
 [8] -331.3803 -144.2808 -149.0099

The dbeta() function provides the density distribution. An example is given below with a plot (Figure 5).

Figure 5: DBeta distribution
Figure 5: DBeta distribution
> p = seq(0, 5, length=50)
> plot(p, dbeta(p, 3, 5), type=’l’)

Gamma

The Gamma distribution is a continuous probability distribution that has a shape (z) and scale parameter (k), which is defined as shown in Figure 6.

Figure 6: Gamma distribution
Figure 6: Gamma distribution

The dgamma() function provides the density function. In the following example, a sequence of values is produced and a plot for its gamma distribution is generated.

> a <- seq(0, 50, by=0.01)
> y <- dgamma(a, shape=3)
> plot(y)

The lgamma() function provides the natural logarithm of the absolute value of the gamma function. For example:

Figure 7: Gamma function
Figure 7: Gamma function
> l <- lgamma(a)
> head(l)
[1] Inf 4.599480 3.900805 3.489971 3.197078 2.968879

These functions accept the following arguments:

Arguments Description
a, b Positive numeric vectors
x, n Numeric vectors
k, d Integer vectors

LEAVE A REPLY

Please enter your comment!
Please enter your name here