Solutions: https://www.paulamoraga.com/course-aramco/99-problems-4regression-solutions.html

The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable `smoke` is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.

| | Estimate | Std. Error | t value | Pr(\(>\mid t\mid\)) |
|---|---|---|---|---|
| (Intercept) | 123.05 | 0.65 | 189.60 | 0.0000 |
| smoke | -8.94 | 1.03 | -8.65 | 0.0000 |

Write the equation of the regression model.

Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.

Is there a statistically significant relationship between the average birth weight and smoking?

**Solution**

\(\widehat{\mbox{weight}} = 123.05 -8.94 \times \mbox{smoke}\)

The estimated birth weight of babies born to smoking mothers is 8.94 ounces lower than that of babies born to non-smoking mothers.

Smoker: \(\widehat{\mbox{weight}} = 123.05 -8.94 \times 1 = 114.11\)

Non-smoker: \(\widehat{\mbox{weight}} = 123.05 -8.94 \times 0 = 123.05\)
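The two predictions can be checked numerically. The following is a minimal sketch in R that uses only the rounded estimates from the table; the names `b0`, `b1`, and `predict_bwt` are illustrative and not part of the original exercise.

```r
# Rounded estimates from the regression table (ounces)
b0 <- 123.05   # intercept: predicted birth weight for non-smokers
b1 <- -8.94    # slope: difference for smokers relative to non-smokers

# Hypothetical helper: predicted birth weight for smoke = 0 or 1
predict_bwt <- function(smoke) b0 + b1 * smoke

pred <- predict_bwt(c(non_smoker = 0, smoker = 1))
print(pred)
```

Because `smoke` only takes the values 0 and 1, the intercept is the prediction for non-smokers and the intercept plus the slope is the prediction for smokers.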

\(H_0: \beta_1 = 0\)

\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -8.65\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than -8.65, in either direction, assuming the null hypothesis is true.

The p-value is approximately 0. Since the p-value \(< \alpha\), we reject the null hypothesis.

The data provide strong evidence that the true slope parameter is different from 0 and that there is an association between birth weight and smoking.
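The test statistic and p-value can be approximately recomputed from the table entries. This is a sketch under stated assumptions: it uses the rounded estimate and standard error, so the ratio comes out near -8.68 rather than the reported -8.65, and it uses the standard normal distribution as a large-sample approximation to the t distribution because the sample size is not stated in the table.

```r
# Rounded values from the regression table
estimate  <- -8.94
std_error <- 1.03

# Test statistic for H0: beta_1 = 0
t_stat <- (estimate - 0) / std_error

# Two-sided p-value; the normal distribution is an assumed
# large-sample approximation to the t distribution
p_value <- 2 * pnorm(-abs(t_stat))

print(c(t = t_stat, p = p_value))
```

The resulting p-value is far below \(\alpha = 0.05\), in line with the 0.0000 shown in the table.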

The previous exercise (Part I) introduces a data set on the birth weight of babies. Another variable we consider is `parity`, which is 1 if the child is the first born, and 0 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from `parity`.

| | Estimate | Std. Error | t value | Pr(\(>\mid t\mid\)) |
|---|---|---|---|---|
| (Intercept) | 120.07 | 0.60 | 199.94 | 0.0000 |
| parity | -1.93 | 1.19 | -1.62 | 0.1052 |

Write the equation of the regression model.

Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

Is there a statistically significant relationship between the average birth weight and parity?

**Solution**

\(\widehat{\mbox{weight}} = 120.07 -1.93 \times \mbox{parity}\)

The estimated birth weight of first born babies is 1.93 ounces lower than that of other babies.

First borns: \(\widehat{\mbox{weight}} = 120.07 - 1.93 \times 1 = 118.14\)

Others: \(\widehat{\mbox{weight}} = 120.07 - 1.93 \times 0 = 120.07\)

\(H_0: \beta_1 = 0\)

\(H_1: \beta_1 \neq 0\)

\(\alpha = 0.05\)

\(T = \frac{\hat \beta_1-0}{SE_{\hat \beta_1}} = -1.62\)

The p-value is the probability of obtaining a test statistic as extreme as or more extreme than -1.62, in either direction, assuming the null hypothesis is true.

The p-value is 0.1052. Since the p-value \(> \alpha\), we fail to reject the null hypothesis.

The data do not provide strong evidence that the true slope parameter is different from 0 or that there is an association between birth weight and parity.
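The decision can be reproduced from the reported t value. This is a sketch only: the degrees of freedom are an assumption (1236 observations with no missing `bwt` or `parity` values, giving 1236 - 2 = 1234), since the table itself does not report them.

```r
t_stat <- -1.62   # t value from the regression table
alpha  <- 0.05

# Assumption: 1236 complete observations, so df = 1236 - 2
df <- 1234

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)
reject  <- p_value < alpha

print(p_value)
print(reject)
```

The p-value lands close to the reported 0.1052 (the small gap comes from rounding the t value), and `reject` is `FALSE`.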

We considered the variables `smoke` and `parity`, one at a time, in modeling birth weights of babies in the previous exercises (Parts I and II). A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (`gestation`), mother’s age in years (`age`), mother’s height in inches (`height`), and mother’s pregnancy weight in pounds (`weight`).

Use the data `babies.csv` (LINK) to answer the following questions.

- Write the equation of the regression model that relates birth weights of babies (`bwt`) to variables `gestation`, `parity`, `age`, `height`, `weight`, and `smoke`.
- Interpret the slopes of `gestation`, `age` and `parity` in this context.
- The coefficient for `parity` is different than in the linear model shown in exercise Part II. Why might there be a difference?
- Calculate the residual for the first observation in the data set.
- Interpret the adjusted \(R^2\).

**Solution**

```
d <- read.csv("data/babies.csv")
head(d)
```

`summary(d)`

```
## case bwt gestation parity
## Min. : 1.0 Min. : 55.0 Min. :148.0 Min. :0.0000
## 1st Qu.: 309.8 1st Qu.:108.8 1st Qu.:272.0 1st Qu.:0.0000
## Median : 618.5 Median :120.0 Median :280.0 Median :0.0000
## Mean : 618.5 Mean :119.6 Mean :279.3 Mean :0.2549
## 3rd Qu.: 927.2 3rd Qu.:131.0 3rd Qu.:288.0 3rd Qu.:1.0000
## Max. :1236.0 Max. :176.0 Max. :353.0 Max. :1.0000
## NA's :13
## age height weight smoke
## Min. :15.00 Min. :53.00 Min. : 87.0 Min. :0.0000
## 1st Qu.:23.00 1st Qu.:62.00 1st Qu.:114.8 1st Qu.:0.0000
## Median :26.00 Median :64.00 Median :125.0 Median :0.0000
## Mean :27.26 Mean :64.05 Mean :128.6 Mean :0.3948
## 3rd Qu.:31.00 3rd Qu.:66.00 3rd Qu.:139.0 3rd Qu.:1.0000
## Max. :45.00 Max. :72.00 Max. :250.0 Max. :1.0000
## NA's :2 NA's :22 NA's :36 NA's :10
```

```
res <- lm(bwt ~ gestation + parity + age + height + weight + smoke, data = d)
summary(res)
```

```
##
## Call:
## lm(formula = bwt ~ gestation + parity + age + height + weight +
## smoke, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.613 -10.189 -0.135 9.683 51.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -80.41085 14.34657 -5.605 2.60e-08 ***
## gestation 0.44398 0.02910 15.258 < 2e-16 ***
## parity -3.32720 1.12895 -2.947 0.00327 **
## age -0.00895 0.08582 -0.104 0.91696
## height 1.15402 0.20502 5.629 2.27e-08 ***
## weight 0.05017 0.02524 1.987 0.04711 *
## smoke -8.40073 0.95382 -8.807 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.83 on 1167 degrees of freedom
## (62 observations deleted due to missingness)
## Multiple R-squared: 0.258, Adjusted R-squared: 0.2541
## F-statistic: 67.61 on 6 and 1167 DF, p-value: < 2.2e-16
```

\(\widehat{\mbox{bwt}} = -80.41 + 0.44 \times \mbox{gestation} - 3.33 \times \mbox{parity} - 0.01 \times \mbox{age} + 1.15 \times \mbox{height} + 0.05 \times \mbox{weight} - 8.40 \times \mbox{smoke}\)

\(\hat \beta_{gestation} = 0.44\). The model predicts a 0.44 ounce increase in the birth weight of the baby for each additional day of pregnancy, holding all other covariates constant.

\(\hat \beta_{age} = -0.01\). The model predicts a 0.01 ounce decrease in the birth weight of the baby for each additional year in the mother's age, holding all other covariates constant.

\(\hat \beta_{parity} = -3.33\). The model predicts a 3.33 ounce decrease in the birth weight of the baby for first born babies compared to others, holding all other covariates constant.

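In Part II, `parity` was the only predictor, so its slope described the overall difference between first borns and other babies. Here, the coefficient is the difference holding all other covariates constant. If `parity` is correlated with the other variables in the model, the two estimates will generally differ. The multiple regression model is also fitted to fewer observations, because 62 observations with missing values are dropped.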
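The effect of correlated predictors on a coefficient can be illustrated with simulated data. The sketch below does not use the babies data: `x1`, `x2`, and `y` are hypothetical variables generated so that the binary predictor `x1` is correlated with `x2`, and the true coefficient of `x1` is 2.

```r
set.seed(1)
n  <- 1000
x2 <- rnorm(n)                     # a second predictor
x1 <- rbinom(n, 1, plogis(x2))     # binary predictor correlated with x2
y  <- 2 * x1 + 3 * x2 + rnorm(n)   # true coefficient of x1 is 2

coef_simple   <- coef(lm(y ~ x1))["x1"]        # x2 omitted
coef_multiple <- coef(lm(y ~ x1 + x2))["x1"]   # x2 held constant

print(c(simple = unname(coef_simple), multiple = unname(coef_multiple)))
```

When `x2` is omitted, the coefficient of `x1` absorbs part of the effect of `x2`, so the two estimates differ even though they come from the same data.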

First observation

`d[1, ]`

Observed weight = 120

Predicted weight:

\(\widehat{\mbox{bwt}} = -80.41 + 0.44 \times 284 - 3.33 \times 0 - 0.01 \times 27 + 1.15 \times 62 + 0.05 \times 100 - 8.40 \times 0 = 120.58\)

`-80.41 + 0.44 * 284 - 3.33 *0 -0.01 *27 + 1.15 *62 + 0.05 *100 -8.40*0`

`## [1] 120.58`

residual = observed birth weight - predicted birth weight = 120 - 120.58 = -0.58

The model over-predicts this baby’s birth weight.
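The same calculation can be written with vectors. This is a minimal sketch using the rounded coefficients and the covariate values of the first observation quoted above; the names `beta` and `x` are illustrative.

```r
# Rounded coefficients from the fitted model
beta <- c(intercept = -80.41, gestation = 0.44, parity = -3.33,
          age = -0.01, height = 1.15, weight = 0.05, smoke = -8.40)

# Covariate values of the first observation (leading 1 for the intercept)
x <- c(1, 284, 0, 27, 62, 100, 0)

observed  <- 120
predicted <- sum(beta * x)
residual  <- observed - predicted

print(round(c(predicted = predicted, residual = residual), 2))
```

If the fitted object `res` is available, `residuals(res)` returns residuals based on the unrounded coefficients, so the value may differ slightly from the hand calculation.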

Adjusted \(R^2\) = 25.41%. About 25% of the variability in birth weight is explained by the model, after adjusting for the number of predictors.
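The adjusted value can be recovered, approximately, from the multiple \(R^2\) with the usual formula \(1 - (1 - R^2)\frac{n - 1}{n - p - 1}\). In this sketch, `n` is inferred from the residual degrees of freedom in the output (1167 plus the 7 estimated coefficients), and `r2` is the rounded value 0.258, so the result agrees with the reported 0.2541 only up to rounding.

```r
r2 <- 0.258          # Multiple R-squared from the output (rounded)
p  <- 6              # number of predictors
n  <- 1167 + p + 1   # residual df plus estimated coefficients = 1174

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2, 4))
```

With a sample this large relative to the number of predictors, the penalty is small, which is why the adjusted value is only slightly below 0.258.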