1 Linear regression

1.1 Baby weights, Part I

The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    123.05        0.65   189.60    0.0000
smoke           -8.94        1.03    -8.65    0.0000

1. Write the equation of the regression model.

2. Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.

3. Is there a statistically significant relationship between the average birth weight and smoking?

Solution

$$\widehat{\mbox{weight}} = 123.05 -8.94 \times \mbox{smoke}$$

The estimated birth weight of babies born to smoking mothers is 8.94 ounces lower than that of babies born to non-smoking mothers.

Smoker: $$\widehat{\mbox{weight}} = 123.05 -8.94 \times 1 = 114.11$$

Non-smoker: $$\widehat{\mbox{weight}} = 123.05 -8.94 \times 0 = 123.05$$

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

$$\alpha = 0.05$$

$$T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -8.65$$

The p-value is the probability of obtaining a test statistic as extreme as -8.65 or more extreme, assuming the null hypothesis is true.

The p-value is approximately 0. Since p-value $$< \alpha$$, we reject the null hypothesis.

The data provide strong evidence that the true slope parameter is different from 0, that is, that there is an association between birth weight and smoking.
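
As a quick check, the test statistic can be recomputed from the rounded values in the table. This is only a sketch: the variable names are illustrative, and the residual degrees of freedom are an assumption (1236 cases, minus 10 with missing smoke, minus 2 estimated coefficients, based on the data summary in Part III), since the exercise does not report them.

```r
# Rounded values from the regression table for smoke
estimate  <- -8.94
std_error <- 1.03

# Test statistic for H0: beta_1 = 0
t_stat <- (estimate - 0) / std_error
round(t_stat, 2)  # close to, but not exactly, the reported -8.65 (inputs are rounded)

# ASSUMPTION: residual degrees of freedom, not stated in the exercise
df_resid <- 1224

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
p_value < 0.05    # TRUE: reject H0 at alpha = 0.05
```

With a test statistic this far from 0, the conclusion does not depend on the exact degrees of freedom.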

1.2 Baby weights, Part II

Part I introduced a data set on the birth weights of babies. Another variable we consider is parity, which is 1 if the child is the first born, and 0 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    120.07        0.60   199.94    0.0000
parity          -1.93        1.19    -1.62    0.1052

1. Write the equation of the regression model.

2. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

3. Is there a statistically significant relationship between the average birth weight and parity?

Solution

$$\widehat{\mbox{weight}} = 120.07 -1.93 \times \mbox{parity}$$

The estimated birth weight of first born babies is 1.93 ounces lower than that of other babies.

First borns: $$\widehat{\mbox{weight}} = 120.07 - 1.93 \times 1 = 118.14$$

Others: $$\widehat{\mbox{weight}} = 120.07 - 1.93 \times 0 = 120.07$$
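
With a 0/1 predictor, the intercept is the mean of the group coded 0 and the slope is the difference between the two group means. The sketch below illustrates this on a tiny made-up data set; the six birth weights are hypothetical and are not taken from the study.

```r
# HYPOTHETICAL data: six made-up birth weights (ounces), not from the study
toy <- data.frame(
  bwt    = c(118, 122, 120, 115, 117, 119),
  parity = c(0, 0, 0, 1, 1, 1)
)

fit <- lm(bwt ~ parity, data = toy)

mean_others      <- mean(toy$bwt[toy$parity == 0])
mean_first_borns <- mean(toy$bwt[toy$parity == 1])

coef(fit)[["(Intercept)"]]  # equals mean_others
coef(fit)[["parity"]]       # equals mean_first_borns - mean_others
```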

$$H_0: \beta_1 = 0$$
$$H_1: \beta_1 \neq 0$$

$$\alpha = 0.05$$

$$T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -1.62$$

The p-value is the probability of obtaining a test statistic as extreme as -1.62 or more extreme, assuming the null hypothesis is true.

The p-value is 0.1052. Since p-value $$> \alpha$$, we fail to reject the null hypothesis.

The data do not provide strong evidence that the true slope parameter is different from 0, so we cannot conclude that there is an association between birth weight and parity.
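
Because the alternative hypothesis is two-sided, "more extreme" covers both tails of the t distribution. The sketch below adds the two tail probabilities for the reported t value. The degrees of freedom are an assumption (1236 observations and 2 estimated coefficients), since the exercise does not state them.

```r
t_stat <- -1.62   # reported t value for parity

# ASSUMPTION: 1236 observations and 2 estimated coefficients
df_resid <- 1236 - 2

lower_tail <- pt(-abs(t_stat), df = df_resid)                     # P(T <= -1.62)
upper_tail <- pt(abs(t_stat), df = df_resid, lower.tail = FALSE)  # P(T >=  1.62)
p_value    <- lower_tail + upper_tail

p_value           # close to the reported 0.1052
p_value > 0.05    # TRUE: fail to reject H0 at alpha = 0.05
```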

1.3 Baby weights, Part III

In Parts I and II, we considered the variables smoke and parity, one at a time, in modeling birth weights of babies. A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (gestation), mother’s age in years (age), mother’s height in inches (height), and mother’s pregnancy weight in pounds (weight).

Use the data babies.csv (LINK) to answer the following questions.

1. Write the equation of the regression model that relates birth weights of babies (bwt) to variables gestation, parity, age, height, weight, and smoke.
2. Interpret the slopes of gestation, age and parity in this context.
3. The coefficient for parity differs from the one in the linear model shown in Part II. Why might there be a difference?
4. Calculate the residual for the first observation in the data set.
5. Interpret the adjusted $$R^2$$.

Solution

d <- read.csv("data/babies.csv")
head(d)
summary(d)
##       case             bwt          gestation         parity
##  Min.   :   1.0   Min.   : 55.0   Min.   :148.0   Min.   :0.0000
##  1st Qu.: 309.8   1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000
##  Median : 618.5   Median :120.0   Median :280.0   Median :0.0000
##  Mean   : 618.5   Mean   :119.6   Mean   :279.3   Mean   :0.2549
##  3rd Qu.: 927.2   3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000
##  Max.   :1236.0   Max.   :176.0   Max.   :353.0   Max.   :1.0000
##                                   NA's   :13
##       age            height          weight          smoke
##  Min.   :15.00   Min.   :53.00   Min.   : 87.0   Min.   :0.0000
##  1st Qu.:23.00   1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000
##  Median :26.00   Median :64.00   Median :125.0   Median :0.0000
##  Mean   :27.26   Mean   :64.05   Mean   :128.6   Mean   :0.3948
##  3rd Qu.:31.00   3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000
##  Max.   :45.00   Max.   :72.00   Max.   :250.0   Max.   :1.0000
##  NA's   :2       NA's   :22      NA's   :36      NA's   :10
res <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = d)
summary(res)
##
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight +
##     smoke, data = d)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -57.613 -10.189  -0.135   9.683  51.713
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.41085   14.34657  -5.605 2.60e-08 ***
## gestation     0.44398    0.02910  15.258  < 2e-16 ***
## parity       -3.32720    1.12895  -2.947  0.00327 **
## age          -0.00895    0.08582  -0.104  0.91696
## height        1.15402    0.20502   5.629 2.27e-08 ***
## weight        0.05017    0.02524   1.987  0.04711 *
## smoke        -8.40073    0.95382  -8.807  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.83 on 1167 degrees of freedom
##   (62 observations deleted due to missingness)
## Multiple R-squared:  0.258,  Adjusted R-squared:  0.2541
## F-statistic: 67.61 on 6 and 1167 DF,  p-value: < 2.2e-16

$$\widehat{\mbox{bwt}} = -80.41 + 0.44 \times \mbox{gestation} - 3.33 \times \mbox{parity} - 0.01 \times \mbox{age} + 1.15 \times \mbox{height} + 0.05 \times \mbox{weight} - 8.40 \times \mbox{smoke}$$

$$\hat \beta_{gestation} = 0.44$$. The model predicts a 0.44 ounce increase in the birth weight of the baby for each additional day of pregnancy, holding all other covariates constant.

$$\hat \beta_{age} = -0.01$$. The model predicts a 0.01 ounce decrease in the birth weight of the baby for each additional year of the mother's age, holding all other covariates constant.

$$\hat \beta_{parity} = -3.33$$. The model predicts that first born babies weigh 3.33 ounces less at birth than other babies, holding all other covariates constant.

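In Part II, parity was the only predictor, so its coefficient is the overall difference in average birth weight between first borns and other babies. Here, the coefficient is the difference between the two groups holding the other variables constant. If parity is correlated with other predictors in the model, the two coefficients will generally differ. The two models may also have been fit to different sets of observations, because this model drops the 62 rows with missing values.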
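
The effect of a correlated predictor can be illustrated with simulated data. The sketch below is hypothetical: the data-generating process, including the assumed link between parity and gestation, is invented for illustration and says nothing about the real babies data.

```r
set.seed(1)
n <- 5000

# HYPOTHETICAL data-generating process, chosen only for illustration
parity    <- rbinom(n, size = 1, prob = 0.25)
gestation <- 280 + 5 * parity + rnorm(n, sd = 10)  # assumed correlation with parity
bwt       <- 120 + 0.45 * (gestation - 280) - 3 * parity + rnorm(n, sd = 15)
sim       <- data.frame(bwt, parity, gestation)

# Parity alone versus parity adjusted for gestation
fit_simple   <- lm(bwt ~ parity, data = sim)
fit_multiple <- lm(bwt ~ parity + gestation, data = sim)

b_simple   <- coef(fit_simple)[["parity"]]
b_multiple <- coef(fit_multiple)[["parity"]]
c(simple = b_simple, multiple = b_multiple)
```

In this simulation the simple-regression coefficient absorbs part of the gestation effect, so it is less negative than the coefficient from the model that also includes gestation.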

First observation

d[1, ]

Observed weight = 120

Predicted weight:

$$\widehat{\mbox{bwt}} = -80.41 + 0.44 \times 284 - 3.33 \times 0 - 0.01 \times 27 + 1.15 \times 62 + 0.05 \times 100 - 8.40 \times 0 = 120.58$$

-80.41 + 0.44 * 284 - 3.33 *0 -0.01 *27 + 1.15 *62 + 0.05 *100 -8.40*0
## [1] 120.58

residual = observed weight - predicted weight = 120 - 120.58 = -0.58

The model over-predicts this baby’s birth weight.
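
The same calculation can be written with named vectors, which keeps each coefficient next to the value it multiplies. This is a sketch using the rounded coefficients and the values of the first observation from the worked equation. The vector names are illustrative. With the unrounded coefficients of the fitted model, the residual would differ slightly.

```r
# Rounded coefficients from the model summary
b <- c(intercept = -80.41, gestation = 0.44, parity = -3.33,
       age = -0.01, height = 1.15, weight = 0.05, smoke = -8.40)

# First observation, with a leading 1 for the intercept
x <- c(intercept = 1, gestation = 284, parity = 0,
       age = 27, height = 62, weight = 100, smoke = 0)

observed  <- 120
predicted <- sum(b * x)
residual  <- observed - predicted

round(c(predicted = predicted, residual = residual), 2)
```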

Adjusted $$R^2$$ = 0.2541. The model explains about 25% of the variability in birth weight, after adjusting for the number of predictors in the model.
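
The adjusted value can be recovered from the multiple R-squared and the degrees of freedom in the model summary. The sketch below uses the rounded R-squared of 0.258, so it matches the reported 0.2541 only approximately. The variable names are illustrative.

```r
r2 <- 0.258          # Multiple R-squared from the summary
p  <- 6              # number of predictors
n  <- 1167 + p + 1   # residual df plus estimated coefficients = 1174 observations used

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)
```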