R Series: Data Sets

In this eighteenth article in the R series, we take a look at the various data sets available in R.

The data sets available in R cover a wide range of fields. We have data sets that give information on the performance of automobiles, the approval rating of the presidents of the United States, the magnitude of the earthquakes around Fiji, the number of international passengers travelling in aeroplanes over a certain period, and much more.
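
As a quick way to see what is on offer, the bundled data sets can be listed from the R console. The following is a minimal sketch, assuming a standard R installation in which the base datasets package is available; the variable names ds and wanted are only illustrative.

```r
# List the data sets shipped with the 'datasets' package.
ds <- data(package = "datasets")$results

# 'results' is a character matrix; the "Item" column holds the names
# and the "Title" column a one-line description.
head(ds[, c("Item", "Title")])

# Check that some of the data sets discussed in this article are present.
wanted <- c("mtcars", "airquality", "quakes")
wanted %in% ds[, "Item"]
```

Typing ?mtcars (or the name of any other data set) at the prompt opens its help page.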

mtcars

We have been using the mtcars data set, which contains fuel consumption and performance data for automobiles, taken from the 1974 Motor Trend US magazine. The data frame has 32 rows and 11 numeric columns, as shown below:

> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The various fields are described below:

Field  Description
mpg    Miles per (US) gallon
cyl    Number of cylinders
disp   Displacement (cu. in.)
hp     Gross horsepower
drat   Rear axle ratio
wt     Weight (1000 lbs)
qsec   1/4 mile time
vs     Engine (0=V-shaped, 1=straight)
am     Transmission (0=automatic, 1=manual)
gear   Number of forward gears
carb   Number of carburettors
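
The fields described above can be put to use straight away. The snippet below is a small sketch of how the data frame might be inspected and summarised; the variable names all_numeric and mpg_by_cyl are illustrative.

```r
# Confirm the size of the data frame: 32 rows (cars) and 11 columns (fields).
dim(mtcars)

# Check that every column is numeric.
all_numeric <- all(sapply(mtcars, is.numeric))
all_numeric

# Average fuel consumption (mpg), grouped by the number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
mpg_by_cyl
```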

airquality

The airquality data set provides daily air quality measurements in New York between May and September 1973. The data was obtained from the New York State Department of Conservation and the National Weather Service. The data frame has 153 observations and six variables, as follows:

> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

The description of the numeric fields is as follows:

Field    Description
Ozone    Ozone in parts per billion
Solar.R  Solar radiation in Langleys
Wind     Wind speed in mph
Temp     Temperature in degrees Fahrenheit
Month    Month of the year (1-12)
Day      Day of the month (1-31)
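
Rows 5 and 6 of the output above contain NA, which is how R marks a missing value. A minimal sketch of how such gaps might be counted and handled is given below; the variable name complete is illustrative.

```r
# Count the missing values (NA) in each column.
colSums(is.na(airquality))

# mean() returns NA when any value is missing...
mean(airquality$Ozone)

# ...unless na.rm = TRUE is passed to drop the missing values first.
mean(airquality$Ozone, na.rm = TRUE)

# Keep only the rows that have no missing value in any column.
complete <- airquality[complete.cases(airquality), ]
nrow(complete)
```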

AirPassengers

An example of time series data is provided by the AirPassengers data set given by Box & Jenkins. It contains the monthly totals of international airline passengers (in thousands) between 1949 and 1960. The first three years of the series are shown below for reference:

> AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
...
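
Unlike a data frame, AirPassengers is a time series (ts) object, so it carries its own start date and frequency. The sketch below shows a few base R functions that may be used to query and slice it; the variable names y1949 and yearly are illustrative.

```r
class(AirPassengers)      # a time series ("ts") object
start(AirPassengers)      # year and month of the first observation
end(AirPassengers)        # year and month of the last observation
frequency(AirPassengers)  # observations per year (monthly data)

# Extract a single year with window().
y1949 <- window(AirPassengers, start = c(1949, 1), end = c(1949, 12))
y1949

# Total passengers (in thousands) for each calendar year.
yearly <- aggregate(AirPassengers, nfrequency = 1, FUN = sum)
yearly
```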

presidents

Another example of time series data is the presidents data set, which gives the quarterly approval rating for the President of the United States from 1945 to 1974. It has 120 values and has been provided by the Gallup Organisation.

> presidents
     Qtr1 Qtr2 Qtr3 Qtr4
1945   NA   87   82   75
1946   63   50   43   32
1947   35   60   54   55
1948   36   39   NA   NA
1949   69   57   57   51
1950   45   37   46   39
...

Titanic

The Titanic data set summarises the people on board the Titanic in a four-dimensional array, categorised by economic status (class), sex, age and survival. The complete information on the fate of the passengers and crew is given below:

> Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20

The variables and their values are described below:

Name      Values
Class     1st / 2nd / 3rd / Crew
Sex       Male / Female
Age       Child / Adult
Survived  Yes / No
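
Since Titanic is a four-dimensional table rather than a data frame, it is usually collapsed over some of its dimensions before being read. The following is a minimal sketch using base R functions; the variable names survived, by_class, prop and titanic_df are illustrative.

```r
# The four dimensions are Class, Sex, Age and Survived.
dim(Titanic)

# Collapse over Class, Sex and Age to get the overall survival counts.
survived <- margin.table(Titanic, margin = 4)
survived

# Cross-tabulate Class against Survived, then convert each row to proportions.
by_class <- margin.table(Titanic, margin = c(1, 4))
prop <- prop.table(by_class, margin = 1)
prop

# Flatten the array into a data frame with one row per combination.
titanic_df <- as.data.frame(Titanic)
head(titanic_df)
```

The Freq column of titanic_df holds the count for each combination of the four variables.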

quakes

The quakes data set gives the locations of 1000 seismic events of magnitude MB > 4.0 that have occurred near Fiji since 1964. A sample output from the data set is given below for reference:

> head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12

The numeric variables and their descriptions are as follows:

Name      Description
lat       Latitude
long      Longitude
depth     Depth in km
mag       Richter magnitude
stations  Number of stations that reported activity
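
A short sketch of how the quakes data frame might be summarised and filtered is given below. The threshold of 5.5 and the variable names strong and deepest are arbitrary choices made for illustration.

```r
# Number of recorded events.
nrow(quakes)

# Five-number summary (plus the mean) of the magnitudes.
summary(quakes$mag)

# Select only the stronger events; 5.5 is an arbitrary cut-off.
strong <- subset(quakes, mag >= 5.5)
nrow(strong)

# The single deepest event in the data set.
deepest <- quakes[which.max(quakes$depth), ]
deepest
```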

melanoma

The melanoma data set, which is part of the boot package, contains measurements on patients who had malignant melanoma between 1962 and 1977. The tumours were completely removed from these patients by surgery, and measurements were taken. The patients were followed up until 1977. The first few rows are as follows:

> library(boot)
> head(melanoma)
  time status sex age year thickness ulcer
1   10      3   1  76 1972      6.76     1
2   30      3   1  56 1968      0.65     0
3   35      2   1  41 1977      1.34     0
4   99      3   0  71 1968      2.90     0
5  185      1   1  52 1965     12.08     1
6  204      1   1  28 1971      4.84     1

The data frame consists of the following columns:

Name       Description
time       Survival time in days since the operation
status     1=Died from melanoma, 2=Alive, 3=Died from other causes
sex        1=Male, 0=Female
age        Age in years at the time of the operation
year       Year of operation
thickness  Tumour thickness (mm)
ulcer      1=present, 0=absent
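
The status, sex and ulcer columns are stored as numeric codes, so it can help to attach readable labels before tabulating them. The sketch below assumes the boot package is installed (it normally ships with R as a recommended package); the labels are taken from the table above, and the variable names status_label and sex_label are illustrative.

```r
library(boot)  # melanoma is provided by the boot package

# Turn the numeric status codes into labelled categories.
status_label <- factor(melanoma$status, levels = 1:3,
                       labels = c("died from melanoma", "alive",
                                  "died from other causes"))

# Do the same for the sex codes.
sex_label <- factor(melanoma$sex, levels = c(0, 1),
                    labels = c("female", "male"))

# Cross-tabulate sex against outcome.
table(sex_label, status_label)

# Median tumour thickness (mm) for tumours without (0) and with (1) ulceration.
tapply(melanoma$thickness, melanoma$ulcer, median)
```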

nitrofen

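The nitrofen data set has 50 rows and five columns. Nitrofen is a herbicide that was used to control weeds in cereals and rice. Although nitrofen is non-toxic to humans, it is no longer used in the US. The data come from an experiment that measured the reproductive toxicity of nitrofen on a species of zooplankton, which is why the columns record the number of offspring per brood. The first few rows of the data frame are given below: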

> library(boot)
> head(nitrofen)
  conc brood1 brood2 brood3 total
1    0      3     14     10    27
2    0      5     12     15    32
3    0      6     11     17    34
4    0      6     12     15    33
5    0      6     15     15    36
6    0      5     14     15    34

The five fields are described below in detail:

Name    Description
conc    The nitrofen concentration (µg/litre)
brood1  Number of live offspring in the first brood
brood2  Number of live offspring in the second brood
brood3  Number of live offspring in the third brood
total   Total number of live offspring in the first three broods
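
The relationship between the columns can be checked directly. The snippet below is a small sketch, again assuming the boot package is installed; the variable names first_rows, sums_match and mean_total are illustrative.

```r
library(boot)  # nitrofen is provided by the boot package

# For the first six rows, check that 'total' equals the sum of the three broods.
first_rows <- head(nitrofen)
sums_match <- with(first_rows, brood1 + brood2 + brood3 == total)
sums_match

# Mean total number of offspring at each nitrofen concentration.
mean_total <- aggregate(total ~ conc, data = nitrofen, FUN = mean)
mean_total
```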

nuclear

The nuclear data set has information on the construction of 32 light water reactor (LWR) plants in the US in the late 1960s and early 1970s. The data was used to predict the cost of constructing further such plants. The first few entries from the data set are given below:

> library(boot)
> head(nuclear)
    cost  date t1 t2  cap pr ne ct bw cum.n pt
1 460.05 68.58 14 46  687  0  1  0  0    14  0
2 452.99 67.33 10 73 1065  0  0  1  0     1  0
3 443.22 67.33 10 85 1065  1  0  1  0     1  0
4 652.32 68.00 11 67 1065  0  1  1  0    12  0
5 642.23 68.00 11 78 1065  1  1  1  0    12  0
6 345.39 67.92 13 51  514  0  1  1  0     3  0

The description of the various numeric columns is as follows:

Name   Description
cost   Cost of construction (millions of dollars)
date   Date on which construction permit was issued
t1     Time between application for and issue of the construction permit
t2     Time between issue of operating licence and construction permit
cap    Capacity of power plant (MWe)
pr     1=Prior existence of LWR plant on site, 0=None
ne     1=Plant in north-east US, 0=Otherwise
ct     1=Cooling tower in plant, 0=None
bw     1=Nuclear steam supply system by Babcock-Wilcox, 0=Otherwise
cum.n  Cumulative number of power plants constructed by architect-engineer
pt     1=Plant with partial turnkey guarantees, 0=Otherwise
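
To give a flavour of how such data might feed a cost prediction, the sketch below fits a simple linear model with lm(). It assumes the boot package is installed, and the choice of predictors (date and capacity) and the log transformation are purely illustrative, not the model used in the original analysis.

```r
library(boot)  # nuclear is provided by the boot package

# Number of plants in the data set.
nrow(nuclear)

# Mean construction cost with (1) and without (0) partial turnkey guarantees.
aggregate(cost ~ pt, data = nuclear, FUN = mean)

# An illustrative linear model: log cost explained by permit date and log capacity.
fit <- lm(log(cost) ~ date + log(cap), data = nuclear)
coef(fit)
```

Calling summary(fit) prints the standard errors and the R-squared value of the fitted model.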

You are encouraged to explore the various other R data sets available at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html.
