2 Statistical inference and the Central Limit Theorem
- Population parameters and sample statistics
- Sampling distributions
- The Central Limit Theorem
2.1 Statistical inference
In statistical inference we want to learn about a quantity, called a parameter, that describes some feature of a population. The population is usually large, so we cannot examine all of its values directly. Instead, we take a sample from the population and use the information in the sample to draw conclusions about the population parameter.
Population parameters are fixed values. We rarely know them, because it is usually difficult or impossible to obtain measurements from the entire population.
Sample statistics are quantities computed from the values in the sample. They are random variables because they vary from sample to sample.
We use sample statistics to draw conclusions about unknown population parameters.
Example
Suppose we are interested in the mean height of the population living in England. The population is large and we cannot measure everybody's height directly. We can estimate the mean height by taking a sample of, say, 5000 people and measuring their heights.
We can provide point estimates and interval estimates for population parameters.
The population mean is denoted by \(\mu\) and is unknown.
The sample size is \(n = 5000\).
Heights of the people in the sample: \((x_1, x_2, \ldots, x_{5000})\)
Point estimate = estimate that specifies a single value for the population parameter.
The sample statistic is the average height of the people in the sample: \(\bar x = \sum x_i/n\)
Interval estimate = estimate that specifies a range of plausible values for the population parameter. For example, we are 95% confident that the population mean \(\mu\) lies within \((a, b)\).
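As a rough illustration of the two kinds of estimate, the sketch below simulates a hypothetical sample of 5000 heights and computes a point estimate and an approximate 95% interval estimate for \(\mu\). The values 170 and 10 used to generate the data, the seed, and the variable names are assumptions made only for this illustration; they are not real figures for England.

```r
# Hypothetical data: simulated heights (cm) standing in for a real sample.
# The generating mean (170) and sd (10) are assumptions for illustration only.
set.seed(1)
n <- 5000
heights <- rnorm(n, mean = 170, sd = 10)

# Point estimate: the sample mean
xbar <- mean(heights)

# Interval estimate: approximate 95% confidence interval for the population mean,
# built as point estimate +/- (normal quantile) * (estimated standard error)
se <- sd(heights) / sqrt(n)
ci <- xbar + c(-1, 1) * qnorm(0.975) * se

xbar
ci
```

The interval is centered at the point estimate, and its width is driven by the estimated standard error, which shrinks as the sample size grows.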
Example
A survey is carried out at a university to estimate the proportion of undergraduate students who drive to campus to attend classes. One thousand students are randomly selected and asked whether or not they drive to campus to attend classes. The population is all of the undergraduates at that university. The sample is the group of 1000 undergraduate students surveyed. The parameter is the true proportion of all undergraduate students at that university who drive to campus to attend classes. The statistic is the proportion of the 1000 sampled undergraduates who drive to campus to attend classes.
- Population proportion \(p\) unknown
- Sample size \(n = 1000\)
- Sample statistic is the sample proportion: \(\hat p = \frac{\mbox{number of sampled students who drive}}{n}\)
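A minimal sketch of how the sample proportion could be computed in R is shown below. The survey answers are simulated, and the value 0.3 used to generate them is an arbitrary assumption for illustration, not a result from any real survey.

```r
# Hypothetical survey responses: TRUE = drives to campus, FALSE = does not.
# The value 0.3 used to simulate the answers is an assumption for illustration.
set.seed(1)
n <- 1000
drives <- rbinom(n, size = 1, prob = 0.3) == 1

# Sample proportion: number of sampled students who drive divided by n
p_hat <- sum(drives) / n
p_hat
```

Because `drives` is a logical vector, `mean(drives)` gives the same value as `sum(drives) / n`.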
Example
A study is conducted to estimate the true mean annual income of all adult residents of California. The study randomly selects 2000 adult residents of California. The population consists of all adult residents of California. The sample is the 2000 residents in the study. The parameter is the true mean annual income of all adult residents of California. The statistic is the mean income of the 2000 residents in this sample.
- Population mean \(\mu\) unknown
- Sample size \(n = 2000\)
- Sample statistic is the sample mean: \(\bar x = \sum x_i/n\)
2.2 Sampling distributions of sample statistics
A sample statistic is a random variable because it varies from sample to sample. The distribution of a sample statistic is called its sampling distribution.
For statistics such as the sample mean and the sample proportion, the sampling distribution is centered at the corresponding population parameter. Its standard deviation is called the standard error (SE).
Steps to construct the sampling distribution of a sample statistic (e.g., sample mean, sample proportion, etc.):
- Take a random sample of size \(n\) from the population
- Calculate the sample statistic (e.g., sample mean, sample proportion, etc.) for that sample
- Return the sample observations to the population and repeat the process many times
- If (theoretically) we choose every possible sample of size \(n\) from the population and calculate the sample statistic, we obtain the sampling distribution of the sample statistic
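The steps above can be sketched in R as follows. Enumerating every possible sample is not feasible in practice, so the sketch uses a large number of repeated samples as a stand-in. The population (an exponential distribution), the sample size, the number of repetitions, and all variable names are assumptions chosen only for illustration.

```r
# Hypothetical population of 10000 values (the exponential distribution is an
# arbitrary choice made for illustration)
set.seed(1)
population <- rexp(10000, rate = 1)

n <- 30      # sample size
B <- 2000    # number of repeated samples (finite stand-in for "every possible sample")

# Repeat B times: draw a random sample of size n and compute its mean.
# Each call to sample() draws from the full population, so the observations
# are effectively returned to the population between repetitions.
sample_means <- replicate(B, mean(sample(population, n)))

# The collection of sample means approximates the sampling distribution
mean(sample_means)   # close to mean(population)
sd(sample_means)     # empirical standard error
```

Calling `hist(sample_means)` displays the approximate sampling distribution.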
2.3 The Central Limit Theorem (CLT)
The Central Limit Theorem states that, for any population distribution with mean \(\mu\) and finite standard deviation \(\sigma\), if we take random samples of sufficiently large size \(n\) (a common rule of thumb is \(n \geq 30\)) from the population with replacement, the sampling distribution of the sample mean \(\bar X\) is approximately normal with mean \(\mu\) and standard error SE = \(\sigma/\sqrt{n}\).
When we choose many random samples from a population, the sampling distribution of the sample mean \(\bar X\) is centered at the population mean \(\mu\) and is less spread out than the population distribution.
\[\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]
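The mean and standard error in this statement follow from basic properties of expectation and variance. As a short sketch, assume the observations \(X_1, \ldots, X_n\) are independent draws from the population, each with mean \(\mu\) and variance \(\sigma^2\), and that \(\bar X = \sum_{i=1}^n X_i/n\). Then:

\[E(\bar X) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{n\mu}{n} = \mu\]

\[\mbox{Var}(\bar X) = \frac{1}{n^2}\sum_{i=1}^n \mbox{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \qquad \sigma_{\bar X} = \sqrt{\mbox{Var}(\bar X)} = \frac{\sigma}{\sqrt{n}}\]

These two results hold for any sample size; the CLT adds that the shape of the distribution of \(\bar X\) is approximately normal when \(n\) is large.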
Example
Consider a sample of size \(n=40\) from a population with mean \(\mu = 30\) and standard deviation \(\sigma=10\). What are the mean and standard deviation of the sample mean?
Solution:
\(\mu_{\bar X} = \mu = 30\)
\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{40}\)
\(\bar X \sim N(\mu = 30, \sigma = 10/\sqrt{40})\)
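The numerical value of the standard error can be checked with a few lines of R. The variable names are chosen only for this illustration.

```r
# Values given in the example
mu <- 30
sigma <- 10
n <- 40

# Mean and standard deviation (standard error) of the sample mean
mean_xbar <- mu
se_xbar <- sigma / sqrt(n)

mean_xbar
se_xbar   # approximately 1.58
```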
Example
The mean height of a population is 160 cm with a standard deviation of 10 cm. If we sample 50 people, what is the probability that the average height of the sample is greater than 170 cm?
Solution:
\(\mu_{\bar X} = \mu = 160\)
\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{50}\)
\(\bar X \sim N(\mu = 160, \sigma = 10/\sqrt{50})\)
The probability that the average height of the sample is greater than 170cm is \(P(\bar X > 170)\).
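This probability can be evaluated in R with `pnorm()`, using the normal approximation given by the CLT. The sketch below is one way to do it; the variable names are illustrative.

```r
# Values given in the example
mu <- 160
sigma <- 10
n <- 50

# Standard error of the sample mean
se <- sigma / sqrt(n)

# 170 expressed as a number of standard errors above the mean
z <- (170 - mu) / se

# Upper-tail probability P(X-bar > 170) under the normal approximation
p <- pnorm(170, mean = mu, sd = se, lower.tail = FALSE)

z
p
```

Here 170 cm lies about 7 standard errors above the mean, so the probability is practically zero: a sample of 50 people with an average height above 170 cm would be extremely unlikely for this population.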
2.3.1 Simulation of the Central Limit Theorem
Consider a population of size \(N = 20000\) with mean \(\mu \approx 10\) and standard deviation \(\sigma \approx 3\).
Choose a large number of samples (here 1000) of size \(n=50\) from the population.
Sample 1
[1] 9.993879 8.569221 14.002999 15.572375 10.444261 12.550169 8.663202
[8] 8.599186 5.507975 14.243104 13.483972 7.994468 8.164821 8.729816
[15] 4.048565 7.579630 13.151123 8.991758 10.193132 9.852788 9.548125
[22] 10.891643 8.743427 10.457677 8.448128 12.371910 8.938407 11.365475
[29] 10.144704 10.144975 13.745066 12.662291 10.541283 17.612397 5.350957
[36] 17.053265 13.092044 8.998322 8.928377 1.844458 10.195583 6.544973
[43] 9.072016 6.746254 8.311198 10.998591 8.477170 9.194750 6.382494
[50] 12.078654
Sample 2
[1] 11.274640 9.350546 7.731650 15.342497 4.653399 9.701417 9.671311
[8] 9.312136 14.705495 13.986007 9.943673 12.503514 4.309633 10.152540
[15] 15.706043 14.627635 9.869771 12.979526 9.165749 4.203657 6.795117
[22] 8.884930 9.856501 7.331034 9.804198 5.353006 6.538275 11.937724
[29] 9.752645 3.375027 6.760458 9.325697 11.025570 7.718277 8.991503
[36] 10.400637 6.757612 8.867469 7.044613 9.538404 6.905586 15.447653
[43] 7.225397 5.671074 11.372215 11.956342 7.166809 12.283335 4.165700
[50] 8.097065
…
Sample 1000
[1] 13.890778 10.403217 11.700928 9.440118 11.908672 13.462019 10.091501
[8] 3.924241 11.186637 10.353187 9.254783 17.722307 8.975664 10.918092
[15] 9.937171 8.929721 8.796294 8.649769 13.142073 6.731038 11.531051
[22] 3.093476 14.668897 6.828257 14.711374 14.529920 9.852645 9.568336
[29] 11.744470 11.583434 11.662382 9.914850 14.073163 12.995216 6.334600
[36] 11.839335 10.543528 10.765682 10.760557 11.733296 13.047674 12.563859
[43] 5.917514 9.095396 7.490335 10.145974 11.746630 6.383945 11.195217
[50] 14.015001
Calculate sample means
\(\bar X\) sample 1
[1] 9.984421
\(\bar X\) sample 2
[1] 9.310814
…
\(\bar X\) sample 1000
[1] 10.59508
- The mean of the sample means is equal to the mean of the population. \[\mu_{\bar X} = \mu\] (independent of sample size)
\(\mu_{\bar X}\)
[1] 9.977689
\(\mu\)
[1] 9.981909
- The standard deviation of sample means is equal to the standard deviation of the population divided by the square root of the sample size \[\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\]
\(\sigma_{\bar X}\)
[1] 0.424626
\(\frac{\sigma}{\sqrt{n}}\)
[1] 0.4202633
If the population is normal, then the sample means have a normal distribution for any sample size.
If the population is not normal (it can be skewed, discrete, etc.) but the sample size is sufficiently large (rule of thumb: \(n \geq 30\)), then the sampling distribution of the sample means is approximately normal.
Sampling distribution of sample means
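The code behind the outputs in this subsection is not shown. A minimal sketch that follows the same steps is given below, assuming the population was generated with `rnorm()`; that choice, the seed, and the variable names are assumptions, so the numbers will differ slightly from those printed above.

```r
# Assumed population: 20000 values from a normal distribution with
# mean 10 and sd 3 (how the original population was built is not stated)
set.seed(1)
pop_normal <- rnorm(20000, mean = 10, sd = 3)

n <- 50   # sample size

# Draw 1000 samples of size n with replacement and keep each sample mean
means_normal <- replicate(1000, mean(sample(pop_normal, n, replace = TRUE)))

# Mean of the sample means versus the population mean
mean(means_normal)
mean(pop_normal)

# Standard deviation of the sample means versus sigma / sqrt(n)
sd(means_normal)
sd(pop_normal) / sqrt(n)
```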
2.3.2 Simulation of the Central Limit Theorem in R
We illustrate the Central Limit Theorem (CLT) by means of a simulation. The CLT applies to any population distribution with finite variance, so the population does not need to be normal. For example, we consider the following population, which consists of 100000 numbers drawn from two different normal distributions and is therefore bimodal.
pop <- c(rnorm(20000, mean = 10, sd = 3),
         rnorm(80000, mean = 70, sd = 10))
hist(pop, main = "Population")
The mean and standard deviation of the population are the following:
mean(pop)
[1] 57.92984
sd(pop)
[1] 25.62279
Now we randomly select 1000 samples of size \(n=50\).
n <- 50
mats <- NULL
for(i in 1:1000){
  s <- sample(pop, n, replace = TRUE)
  mats <- cbind(mats, s)
}
We calculate the mean of each sample and plot the sampling distribution of the sample mean.
samplemeans <- apply(mats, 2, mean)
hist(samplemeans, main = "Sample mean")
We calculate the mean and standard deviation of the sampling distribution of the sample mean and check the CLT.
- The sample size is \(n = 50 \geq 30\).
- The sampling distribution of the sample mean is approximately normal.
- The mean of the sample mean is \(\mu_{\bar X} = \mu\)
mean(samplemeans)
[1] 57.86457
mean(pop)
[1] 57.92984
- Standard error (standard deviation of the sample mean) is \(\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\)
sd(samplemeans)
[1] 3.586244
sd(pop)/sqrt(n)
[1] 3.62361
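To see how the standard error depends on the sample size, the comparison can be repeated for several values of \(n\). The sketch below rebuilds the same kind of population and compares the empirical and theoretical standard errors; the sample sizes 10, 50 and 200, the seed, and the variable names are choices made only for this illustration.

```r
# Same kind of bimodal population as above (re-created so the block runs alone)
set.seed(1)
pop <- c(rnorm(20000, mean = 10, sd = 3),
         rnorm(80000, mean = 70, sd = 10))

# Sample sizes to compare (illustrative choices)
sizes <- c(10, 50, 200)

# For each sample size, draw 1000 samples with replacement and take the
# standard deviation of the 1000 sample means
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(1000, mean(sample(pop, n, replace = TRUE))))
})

# Standard error predicted by the CLT for each sample size
theoretical_se <- sd(pop) / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se)
```

The two columns should agree closely, and both decrease as \(n\) increases, in line with \(\sigma_{\bar X} = \sigma/\sqrt{n}\).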