In statistical inference we are interested in learning some quantity representing some parameter of a population. Population is large and we cannot examine all the values in the population directly. Therefore, to learn about the population parameter we take a sample of size \(n\) from the population and use the information in the samples to draw conclusions about the population.
We can estimate population parameters by providing point estimates (single values) and interval estimates (range of values).
A confidence interval for an unknown population parameter provides a range of plausible values where we may find the true population parameter.
A \((c \times 100)\)% confidence interval for a population parameter \(\mu\) based on observations \(X=(X_1, \ldots, X_n)\) is a pair of statistics \(A(X)\) and \(B(X)\) such that
\[P(A(X)\leq \mu \leq B(X))=c\] Note the values of \(A(X)\) and \(B(X)\) change with each sample.
Confidence level \(c = 1 - \alpha\) and significance level \(\alpha = 1- c\).
Interpretation:
Construct a confidence interval
Select a confidence level \(c\) that describes the uncertainty of the sampling method. Common confidence levels are 95% and 90%
Identify a sample statistic that is used to estimate the population parameter. For example, the sample mean (\(\bar X\)) to estimate the population mean (\(\mu\)), or the sample proportion (\(\hat P\)) to estimate the population proportion (\(p\))
Identify the sampling distribution of the sample statistic and the standard error (SE)
Specify the confidence interval as
\[\mbox{sample statistic $\pm$ critical value $\times$ SE}\]
A \((1-\alpha) \times 100\)% confidence interval for the population proportion \(p\) is
\[\hat P \pm z^* \times SE = \hat P \pm z^* \times \sqrt{\hat P(1-\hat P)/n}\]
where \(z^*\) is the value such that \(\alpha/2\) of the probability in the standard normal distribution is below \(z^*\)
The conditions required to construct the confidence interval are
For example, a 95% confidence interval for a proportion can be calculated as
\[(\hat P - 1.96 \times \sqrt{\hat P(1-\hat P)/n},\ \hat P + 1.96 \times \sqrt{\hat P(1-\hat P)/n})\]
We select \(z^*\) so that the area between \(-z^*\) and \(z^*\) in the standard normal distribution, N(0, 1), corresponds to the confidence level \(c=1-\alpha\). \(z^*\) is the value such that \(\alpha/2\) of the probability in the standard normal distribution is below \(z^*\).
The sampling distribution of the sample proportion \(\hat P\) is approximately normal with mean the population proportion \(p\) and standard error SE = \(\sqrt{\frac{p(1-p)}{n}}\) \[\hat P \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\] when the following conditions are satisfied:
Proof:
\(\hat P\) is the proportion of successes in \(n\) trials. So \(\hat P\) is equal to \(Y/n\) where \(Y \sim Binomial(n, p)\).
\(Y\) can be expressed as the sum of \(n\) independent, identically distributed Bernoulli random variables, \[Y = \sum_{i=1}^n X_i\] \[\mbox{with } X_i \sim Ber(p),\ E[X_i]=p,\ Var[X_i]=p(1-p).\]
The Central Limit Theorem says that when we collect a sufficiently large sample of \(n\) independent observations , the sampling distribution of the sample mean \(\bar X \sim N(\mu, \frac{\sigma}{\sqrt{n}})\). Thus,
\[\hat P = \frac{\sum_{i=1}^n X_i}{n} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right)\]
We can standardize \(\hat P\) \[Z = \frac{\hat P - p}{\sqrt{p(1-p)/n}} \sim N(0, 1)\]
\[P(-1.96 \leq \frac{\hat P - p}{\sqrt{\hat P(1-\hat P)/n}} \leq 1.96) = 0.95\]
The probability is 95% that \(\hat P\) is within 1.96 standard errors of its mean.
\[P(\hat P -1.96 \times {\sqrt{\hat P(1-\hat P)/n}} \leq p \leq \hat P + 1.96 \times {\sqrt{\hat P(1-\hat P)/n}}) = 0.95\]
A \((1-\alpha) \times 100\)% confidence interval for \(p\) is \[(\hat P -z^* \times {\sqrt{\hat P(1-\hat P)/n}},\ \hat P + z^* \times {\sqrt{\hat P(1-\hat P)/n}})\].
We select \(z^*\) so that the area between \(-z^*\) and \(z^*\) in the standard normal distribution, N(0, 1), corresponds to the confidence level \(c=1-\alpha\). \(z^*\) is the value such that \(\alpha/2\) of the probability in the standard normal distribution is below \(z^*\).
\((1-\alpha) \times 100\)% confidence interval for the population proportion \(p\) \[\hat P \pm z^* \times SE = \hat P \pm z^* \times \sqrt{\hat P(1-\hat P)/n}\]
Assumptions:
\(z^*\): value such that \(c=1-\alpha\) of the probability in the N(0,1) falls between \(-z^*\) and \(z^*\).
Value such that \(\alpha/2\) of the probability in the \(N(0,1)\) is below \(z^*\)
In a random sample of 40 students, 24 said they bought a copy of the textbook. Estimate the proportion of all students who bought the book and give a 95% confidence interval.
Solution \[\hat p \pm z^* \times SE = \hat p \pm z^* \times \sqrt{\hat p(1-\hat p)/n}\] Assumptions:
Sample statistic: \(\hat p = 24/40 = 0.6\)
Critical value: \(z^* = -1.96\)
(\(\alpha/2\) of the probability in \(N(0,1)\) is below \(z^*\))
qnorm(0.025)
## [1] -1.959964
\(\hat p \pm z^* \times SE = \hat p \pm z^* \times \sqrt{\hat p(1-\hat p)/n} = 0.6 \pm 1.96 \times \sqrt{ 0.6(1-0.6)/40} = (0.448, 0.752)\)
An estimate of the proportion of students who bought the book is 60%.
We are 95% confident that the proportion of students is between 44.8% and 75.2%.
A random sample of 826 people living in UK was surveyed to better understand their political preferences. 70% of the responses supported the political party A. Calculate a 95% confidence interval for the proportion of people that support the political party A.
Solution \[\hat p \pm z^* \times SE = \hat p \pm z^* \times \sqrt{\hat p(1-\hat p)/n}\]
Assumptions:
Sample statistic: \(\hat p = 24/40 = 0.70\)
Critical value: \(z^* = -1.96\)
(\(\alpha/2=0.05/2\) of the probability in \(N(0,1)\) is below \(z^*\))
qnorm(0.025)
## [1] -1.959964
\(\hat p \pm z^* \times SE = \hat p \pm z^* \times \sqrt{\hat p(1-\hat p)/n} = 0.70 \pm 1.96 \times 0.016 = (0.669, 0.731)\)
We are 95% confident that the true proportion of people who support political party A is between 0.669 and 0.731.
A \((1-\alpha) \times 100\)% confidence interval for the population mean \(\mu\) is
\[\bar X \pm t_{n-1}^* \times SE = \bar X \pm t_{n-1}^* \times \frac{S}{\sqrt{n}}\] where \(t^*_{n-1}\) is the value such that \(\alpha/2\) of the probability in the \(t(n-1)\) distribution is below \(t^*_{n-1}\)
The conditions required to construct the confidence interval are
When the population standard deviation is known we construct the confidence interval as follows
\[\bar X \pm z^* \times SE = \bar X \pm z^* \times \frac{\sigma}{\sqrt{n}}\]
where \(z^*\) is the value such that \(\alpha/2\) of the probability in the standard normal distribution is below \(z^*\)
For example, a 95% confidence interval for the population mean when \(\sigma\) is known can be calculated as
\[(\bar X - 1.96 \times \frac{\sigma}{\sqrt{n}},\ \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}})\]
The sampling distribution of \(\bar X\) is approximately normal with mean \(\mu\) and standard error SE = \(\frac{\sigma}{\sqrt{n}}\)
\[\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]
when the following conditions are satisfied:
We can standardize \(\bar X\) \[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)\]
\[P(-1.96 \leq \frac{\bar X - \mu}{\sigma/\sqrt{n}} \leq 1.96) = 0.95\]
The probability is about 95% that \(\bar X\) is within 1.96 standard errors of its mean.
\[P(\bar X -1.96 \times \frac{\sigma}{\sqrt{n}} \leq \ \mu \leq \bar X + 1.96 \times \frac{\sigma}{\sqrt{n}}) = 0.95\] A \((1-\alpha) \times 100\)% confidence interval for \(\mu\) is \[(\bar X -z^* \times \frac{\sigma}{\sqrt{n}} , \bar X + z^* \times \frac{\sigma}{\sqrt{n}})\]
where \(z^*\) is the value such \(\alpha/2\) of the probability in the N(0,1) distribution is below \(z^*\).
In practice, we cannot directly calculate the standard error for \(\bar X\) since we do not know the population standard deviation, \(\sigma\). When the population standard deviation \(\sigma\) is not known, we approximate \(\sigma\) with the sample standard deviation \(s\): \[SE = \frac{\sigma}{\sqrt{n}} \approx \frac{S}{\sqrt{n}}\]
When the sample size is big we can accurately estimate \(\sigma\) with \(s\). However, the estimate is less precise with smaller samples and this leads to problems when using the normal distribution to model \(\bar X\). Therefore we use the t distribution with \(n-1\) degrees of freedom.
\[T = \frac{\bar X - \mu}{S/\sqrt{n}} \sim t(n-1)\]
A \((1-\alpha) \times 100\)% confidence interval for \(\mu\) is \[(\bar X - t^*_{n-1} \frac{S}{\sqrt{n}},\ \bar X + t^*_{n-1} \frac{S}{\sqrt{n}})\]
where \(t^*_{n-1}\) is the value such \(\alpha/2\) of the probability in the t(n-1) distribution is below \(t^*_{n-1}\).
In general, we construct a confidence interval for the population mean by using a t distribution \(n-1\) degrees of freedom. Note that when we have a large number of observations, the degrees of freedom of the t distribution will be large and the t distribution will look like the standard normal distribution (when the degrees of freedom is greater than 30, the t-distribution is nearly indistinguishable from the normal distribution).
Characteristics:
The t distribution has a parameter called degrees of freedom (df). The df describes the form of the t-distribution. When the number of degrees of freedom increases, the t-distribution approaches the standard normal, N(0, 1)
R syntax
Purpose | Function | Example |
---|---|---|
Generate n random values from a t distribution |
rt(n, df) |
rt(1000, 2) generates 1000 values from a t distribution with 2 degrees of freedom |
Probability Density Function (PDF) | dt(x, df) |
dt(1, 2) density at value 1 (height of the PDF at value 1) of the t distribution with 2 degrees of freedom |
Cumulative Distribution Function (CDF) | pt(q, df) |
pt(5, 2) area under the density function of the t(2) to the left of 5 |
Quantile Function (inverse of pt() ) |
qt(p, df) |
qt(0.97, 2) value at which the CDF of the t(2) distribution is equal to 0.97 |
In a class survey, students are asked how many hours they sleep per night. In a sample of 52 students, the mean was 5.77 hours with a standard deviation of 1.572 hours. Construct a 95% confidence interval for the mean number of hours slept per night in the population from which this sample was drawn.
Solution
\(\sigma\) is unknown, we use \(s\) and a t distribution with \(n-1=52-1\) df \[\bar x \pm t^*_{51} \times SE = \bar x - t^*_{51} \times \frac{s}{\sqrt{n}}\]
Assumptions:
Sample statistic: \(\bar x = 5.77\)
Critical value: \(t^*_{51}= -2.01\)
(\(\alpha/2=0.05/2\) of the probability in t(51) is below \(t^*_{51}\))
qt(0.025, 51)
## [1] -2.007584
\(\bar x \pm t^*_{51} \times SE = \bar x \pm t^*_{51} \times \frac{s}{\sqrt{n}} = 5.77 \pm 2.01 \times \frac{1.572}{\sqrt{52}} = (5.33, 6.21)\).
We are 95% confident the population mean is between 5.33 and 6.21 hours.
Elevated mercury concentrations are an important problem for both dolphins and other animals who occasionally eat them. Calculate a 95% CI for the average mercury content in dolphins using a sample of 19 dolphins in Japan. In the sample, \(n=19\), \(\bar x = 4.4\), \(s = 2.3\), minimum \(= 1.7\) and maximum \(= 9.2\) \(\mu\)g/wet gram (micrograms of mercury per wet gram of muscle).
Solution
\(\sigma\) is unknown, we use \(s\) and a t distribution with \(n-1=19-1\) df \[\bar x \pm t^*_{18} \times SE = \bar x \pm t^*_{18} \times \frac{s}{\sqrt{n}}\]
Assumptions:
Sample statistic: \(\bar x = 4.4\)
Critical value: \(t^*_{18} = -2.10\)
(\(\alpha/2=0.05/2\) of the probability in the t(18) distribution is below \(t^*_{18}\))
qt(0.025, 18)
## [1] -2.100922
\(\bar x \pm t^*_{18} \times SE = \bar x \pm t^*_{18} \times \frac{s}{\sqrt{n}} = 4.4 \pm 2.10 \times \frac{2.3}{\sqrt{19}} = (3.29, 5.51)\)
We are 95% confident the average mercury content in dolphins is between 3.29 and 5.51 \(\mu\)g/wet gram.
A 95% CI for a population parameter denotes we are 95% confident the population parameter lies within the interval.
It is incorrect to say that there is a probability equal to 0.95 that the population parameter lies within the interval because the population parameter is a constant value and therefore the probability that the population parameter lies within the confidence interval is 0 or 1.
The interpretation of a 95% CI constructed using a sample of size \(n\) is that if we take many samples of size \(n\) from the population, and calculate the 95% CIs for each sample, then we would expect 95% of these CI to contain the value of the population parameter.
We can illustrate this by means of a simulation. Specifically, we consider a population and calculate many 95% CIs for the population mean. Then we observe 95% of the CIs contain the population mean.
Let us consider a population of size \(N=100000\)
<- rnorm(100000)
population <- mean(population)
population_mean population_mean
## [1] -0.001880581
The population mean is -0.0018806. Now we take sample of size \(n=100\) and calculate a 95% CI for the population mean.
<- sample(population, 100)
sample1 <- mean(sample1) - 1.96 * (sd(sample1)/sqrt(length(sample1)))
ll <- mean(sample1) + 1.96 * (sd(sample1)/sqrt(length(sample1)))
ul c(ll, ul)
## [1] -0.2455720 0.1386506
We are 95% CI confident that the population mean is within -0.245572 and 0.1386506.
The 95% CI is constructed in a way such that if we take many samples of size \(n=100\) from the population and calculate 95% CIs for each of them, 95% of the intervals will contain the population mean. We can see this by means of a simulation. First, we take \(M=1000\) samples of size \(n=100\) from the population.
<- 100 # sample size
n <- 1000 # number of samples
M <- matrix(NA, nrow = M, ncol = n) # matrix with the samples
samples for(m in 1:M){
<- sample(population, n)
samples[m, ] }
Then we calculate 95% CI for each sample.
<- apply(samples, 1,
ll FUN = function(samplem){
mean(samplem) - 1.96 * (sd(samplem)/sqrt(length(samplem)))
})
<- apply(samples, 1,
ul FUN = function(samplem){
mean(samplem) + 1.96 * (sd(samplem)/sqrt(length(samplem)))
})
If we take a large number of samples, the percentage of intervals that contain the population mean is 95%.
mean(ll <= population_mean & population_mean <= ul)
## [1] 0.95
We can visualize the 95% CIs together with the population mean and observe the 95% of the CIs contain the population mean. Here, we plot only the first 100 intervals to be able to see the intervals.
library(ggplot2)
<- data.frame(num = 1:M, ll = ll, ul = ul,
d contain = as.factor(ll <= population_mean & population_mean <= ul))
ggplot(data = d[1:100, ]) +
geom_segment(aes(x = num, y = ll, xend = num, yend = ul, col = contain)) +
geom_hline(yintercept = population_mean, col = "red") +
xlab("Sample number") + ylab("95% CI") + theme_bw()
Exercise
Repeat this simulation where the population parameter of interest is a proportion.
<- rbinom(n = 100000, size = 1, prob = 0.2)
population <- mean(population)
population_proportion
population_proportion
<- sample(population, 100)
sample1 <- mean(sample1)
phat <- phat - 1.96 * sqrt(phat*(1-phat)/length(sample1))
ll <- phat + 1.96 * sqrt(phat*(1-phat)/length(sample1))
ul c(ll, ul)
<- 100 # sample size
n <- 1000 # number of samples
M <- matrix(NA, nrow = M, ncol = n) # matrix with the samples
samples for(m in 1:M){
<- sample(population, n)
samples[m, ]
}
<- apply(samples, 1,
ll FUN = function(samplem){
<- mean(samplem)
phat - 1.96 * sqrt(phat*(1-phat)/length(samplem))
phat
})
<- apply(samples, 1,
ul FUN = function(samplem){
<- mean(samplem)
phat + 1.96 * sqrt(phat*(1-phat)/length(samplem))
phat
})
mean(ll <= population_proportion & population_proportion <= ul)
library(ggplot2)
<- data.frame(num = 1:M, ll = ll, ul = ul,
d contain = as.factor(ll <= population_proportion & population_proportion <= ul))
ggplot(data = d[1:100, ]) +
geom_segment(aes(x = num, y = ll, xend = num, yend = ul, col = contain)) +
geom_hline(yintercept = population_proportion, col = "red") +
xlab("Sample number") + ylab("95% CI") + theme_bw()