Tutorial 3: Regression modelling in R
Aims
By the end of this third tutorial, you will be able to
- fit regression models in R.
- interpret linear models in context.
- understand the uncertainty in model predictions.
Linear regression modelling - an example
A common use of linear regression models is to predict quantities that are otherwise difficult or expensive to measure directly. One example is predicting a person’s percentage body fat. This is clinically useful information, because it is a good indicator of some disease outcomes.
But body fat percentage is not easy to measure precisely. A commonly used precise method relies on the fact that muscle and bone are both denser than fat. By submerging an individual in water, we can obtain a measure of their body’s volume. Knowing this and their weight, we can compute their average density, $d$, in $g/cm^3$ . Previous studies (Brozek 1963) have shown that the percentage of body fat is then given approximately by the formula
\[\left(\frac{4.57}{d} - 4.142\right) \times 100\%,\]

to within a precision of around $1\%$, which is good enough.
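As a quick illustration, Brozek’s equation is easy to evaluate in R (a small sketch; the density value used here is just an arbitrary example, not a real measurement):

```r
# Brozek's equation: convert average body density (g/cm^3) to % body fat
brozek_pct <- function(d) (4.57 / d - 4.142) * 100

brozek_pct(1.06)  # an example density of 1.06 g/cm^3 gives about 16.9% body fat
```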
The data
We’ll work with the fat dataset from the faraway package (Faraway 2004). You will need to install this package if you have not used it before, using install.packages("faraway"). Once the package is installed, it must be loaded:

library(faraway)
The help file ?fat gives some useful background information on the data we’ll be using. The data consists of 10 body circumference measurements for 252 men, together with information on their age, weight and height.
In addition, each individual’s density was measured by submerging them in water, and their percentage body fat was estimated precisely using Brozek’s equation (Johnson 1996).
Exploratory data analysis
To familiarise yourself with the structure of the data, enter the command View(fat) in the console to get a spreadsheet representation of the data.
Matching up the names of the variables with the information in the help file, we see that the first two columns give estimates of the body fat percentage using Brozek’s equation and another method, Siri’s equation. We then have density, age, weight, height and adiposity (more commonly known as BMI - body mass index).
As a simple check that we understand the data, we make a plot of brozek against density. The data points should all lie on a smooth curve.
plot(fat$density, fat$brozek)
Almost all of the observations do lie on a smooth curve, but there are some unusual observations that might be a cause for concern.
To make the relationship even clearer, we could apply the Brozek equation ourselves, and compare the result with that in the data.
plot(457/fat$density - 414.2, fat$brozek)
abline(0, 1, col = "red")
The red line is $y = x$ , showing that apart from a few outlying observations, the density and Brozek values match up well.
Now we look at the body circumference measurements, stored in columns 9 to 18 of the data frame.
One way to visualise how the measurements are distributed is to look at a pairs plot: for each pair of variables, we make a scatter plot, and the plots are arranged in a grid.
library(GGally)
ggpairs(fat[,9:18],
upper = list(continuous = wrap("cor", size = 3)))
Note the roughly elliptical shape visible in the plot of each pair of variables, with the long axis of each ellipse aligned along the diagonal. This suggests two observations:

- The data are approximately jointly normally distributed. This is what we would expect for naturally occurring data, influenced by many genetic and environmental factors.
- Pairs of variables are strongly positively correlated. This makes sense: on average, individuals with larger necks will have larger knees, forearms etc.
Linear modelling
Focusing for now on a single variable, we make a plot of the Brozek fat estimate against chest circumference.
plot(fat$chest,
fat$brozek,
xlab = "chest circumference (cm)",
ylab = "% body fat")
There is clearly a linear relationship between the two quantities, although it is somewhat noisy. We determine the equation of the best fitting line representing this relationship using the lm command (it stands for “linear model”).

Note the syntax used for fitting the model: brozek ~ chest. We say this out loud as “brozek is explained by chest”.
fit0 = lm(brozek ~ chest, data = fat)
More explicitly, the model is \[\text{brozek}_i = \beta_0 + \beta_1\,\text{chest}_i + \epsilon_i,\]
where $\text{brozek}_i$ and $\text{chest}_i$ represent the Brozek body fat estimate and chest circumference of individual $i$, $\beta_0$ and $\beta_1$ are the y-intercept and slope of the line and $\epsilon_i$ is a random error.
Model fitting
The lm function determines good values of the model parameters $\beta_0$ and $\beta_1$, in the sense of minimising a measure of the average error when using the line to make predictions.
A more mathematically complete account of how regression works can be found in the supplementary tutorial.
The model summary contains a wealth of information about the linear model that lm has determined.
summary(fit0)
##
## Call:
## lm(formula = brozek ~ chest, data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.8875 -3.8211 -0.2752 3.4950 13.8989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -46.21636 4.18460 -11.04 <2e-16 ***
## chest 0.64622 0.04136 15.62 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 250 degrees of freedom
## Multiple R-squared: 0.494, Adjusted R-squared: 0.492
## F-statistic: 244.1 on 1 and 250 DF, p-value: < 2.2e-16
There is a lot to make sense of here. We’ll start by identifying the most important values in the coefficient table. Here we see the estimate of the intercept, -46.21636, and the estimate of the slope, 0.64622.
This says that the best-fitting line is given by
\[\text{brozek}_i = -46.21636 + 0.64622\times\text{chest}_i.\]

We can add this line to our earlier plot:
plot(fat$chest,
fat$brozek,
xlab = "chest circumference (cm)",
ylab = "% body fat")
abline(fit0, col = "red")
This shows a reasonable fit to the data.
Parameter uncertainty
The second column of the coefficient table gives the standard error of each parameter estimate. This is a measure of the uncertainty in the estimate that comes from having used only a fairly small sample: slightly different data would of course have led to slightly different estimates.
We see that the standard error of the slope estimate is 0.04136. This tells us that the slope estimate of 0.64622 is only really reliable to one significant figure: if we had chosen different samples from the population, we could very easily have estimated the slope as anywhere in the interval $0.64 \pm 0.04$.
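As a rough sketch of what this means, a conventional 95% confidence interval is approximately the estimate plus or minus two standard errors; here we plug in the summary values by hand (confint(fit0) computes the exact interval):

```r
# Approximate 95% confidence interval for the slope: estimate ± 2 * SE,
# using the values reported in summary(fit0)
est <- 0.64622
se  <- 0.04136
est + c(-1, 1) * 2 * se  # roughly 0.56 to 0.73
```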
Interpreting model parameters
Note the units of the parameters we have estimated. The intercept parameter has the same units as brozek, i.e. percentage body fat, and the slope has units of percentage body fat per centimetre of chest circumference.
The interpretation of the slope parameter is that, all other things being equal, each additional centimetre of chest circumference is associated with an increase in body fat of $\sim 0.6$ percentage points.
Residual standard error
The model summary shows a residual standard error of 5.524. This is a measure of the typical variability around the line, in the vertical direction; its units are percentage body fat. It is a measure of how precise predictions from the model are expected to be: our predicted body fat percentages would only be correct to within around 5.5 percentage points. Considering the range of the data, this is not really an adequate level of precision: someone with 20% body fat might easily be estimated as having 15% or 25% body fat.
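To make this concrete, we can sketch a rough 95% prediction interval by hand from the summary values above, for a hypothetical individual with a 100 cm chest (predict() with interval = "prediction" would give the exact interval):

```r
# Rough 95% prediction interval using the fitted coefficients and the
# residual standard error reported by summary(fit0)
b0    <- -46.21636   # intercept estimate
b1    <- 0.64622     # slope estimate
sigma <- 5.524       # residual standard error

chest <- 100                     # hypothetical chest circumference (cm)
pred  <- b0 + b1 * chest         # point prediction: about 18.4% body fat
pred + c(-1, 1) * 1.96 * sigma   # roughly 7.6% to 29.2%
```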
Improving the model
We can improve the fit of the model by including all of the other variables as additional predictors.
fit1 = lm(brozek ~ age + weight + height +
neck + chest + abdom +
hip + thigh + knee +
ankle + biceps + forearm + wrist,
data = fat)
summary(fit1)
##
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest +
## abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,
## data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.264 -2.572 -0.097 2.898 9.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.29255 16.06992 -0.952 0.34225
## age 0.05679 0.02996 1.895 0.05929 .
## weight -0.08031 0.04958 -1.620 0.10660
## height -0.06460 0.08893 -0.726 0.46830
## neck -0.43754 0.21533 -2.032 0.04327 *
## chest -0.02360 0.09184 -0.257 0.79740
## abdom 0.88543 0.08008 11.057 < 2e-16 ***
## hip -0.19842 0.13516 -1.468 0.14341
## thigh 0.23190 0.13372 1.734 0.08418 .
## knee -0.01168 0.22414 -0.052 0.95850
## ankle 0.16354 0.20514 0.797 0.42614
## biceps 0.15280 0.15851 0.964 0.33605
## forearm 0.43049 0.18445 2.334 0.02044 *
## wrist -1.47654 0.49552 -2.980 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
## F-statistic: 54.63 on 13 and 238 DF, p-value: < 2.2e-16
Inspecting the summary, we see that the residual standard error has reduced to 3.988, an improvement, although still a little too imprecise for clinical use.
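As an aside, long formulas like the one above can be abbreviated using R’s “.” shorthand, which means “all other columns in the data frame”, with “-” removing unwanted terms. A small sketch using the built-in mtcars data (the same idea applies to fat, after dropping the non-predictor columns):

```r
# "." expands to all columns other than the response; "- cyl" drops cyl
f <- lm(mpg ~ . - cyl, data = mtcars)
names(coef(f))  # cyl is absent; every other column appears as a predictor
```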
Here, adding further predictors did not really lead to a substantial improvement in the model’s predictions. The main reason for this is that the predictors we added are highly correlated: they are not independent sources of information.
Including highly correlated predictors in this way is not advisable: it means that individual parameters are estimated less precisely. We can see that in fit1, the standard error of our original predictor, chest, has increased to 0.09.
Conclusion
These tutorials have guided you through some first steps in learning R. You are now well-equipped to explore data independently in R and to fit regression models.