2 Statistical inference and the Central Limit Theorem
- Population parameters and sample statistics
- Sampling distributions
- The Central Limit Theorem
2.1 Statistical inference
In statistical inference we are interested in learning about a quantity that describes some feature of a population, called a parameter. The population is usually large, so we cannot examine all of its values directly. Instead, we take a sample from the population and use the information in the sample to draw conclusions about the population parameter.
Population parameters are fixed values. We rarely know the parameter values because it is often difficult to obtain measures from the entire population.
Sample statistics are quantities computed from the values in the sample. They are random variables because they vary from sample to sample.
We use sample statistics to draw conclusions about unknown population parameters.
Example
Suppose we want to know the mean height of the population living in England. The population is large and we cannot measure everybody's height directly. We can estimate the mean height by taking a sample of, say, 5000 people and measuring their heights.
We can provide point estimates and interval estimates for population parameters.
- The population mean is denoted by \(\mu\) and is unknown
- Sample size is \(n = 5000\)
- Heights of the people in the sample: \((x_1, x_2, \ldots, x_{5000})\)
Point estimate = an estimate that specifies a single value for the population parameter.
Here the sample statistic is the average height of the people in the sample: \(\bar x = \sum x_i/n\)
Interval estimate = an estimate that specifies a range of plausible values for the population parameter. For example: we are 95% confident that the population mean \(\mu\) lies within \((a, b)\).
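A minimal sketch in R of the two kinds of estimate. The example provides no measurements, so the vector `heights` and the values 170 and 10 used to simulate it are hypothetical. The interval uses the large-sample normal approximation \(\bar x \pm 1.96\, s/\sqrt{n}\), which is one common choice and an assumption of this sketch.

```r
set.seed(1)
n <- 5000
# Hypothetical data: the example gives no measurements, so we simulate them
heights <- rnorm(n, mean = 170, sd = 10)

# Point estimate: the sample mean
xbar <- mean(heights)

# Interval estimate: approximate 95% confidence interval for mu
se <- sd(heights) / sqrt(n)
ci <- c(lower = xbar - 1.96 * se, upper = xbar + 1.96 * se)

xbar
ci
```

The point estimate is a single number, while the interval reports how precisely the sample pins down \(\mu\).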
Example
A survey is carried out at a university to estimate the proportion of undergraduate students who drive to campus to attend classes. One thousand students are randomly selected and asked whether or not they drive to campus to attend classes. The population is all of the undergraduates at that university. The sample is the group of 1000 undergraduate students surveyed. The parameter is the true proportion of all undergraduate students at that university who drive to campus to attend classes. The statistic is the proportion of the 1000 sampled undergraduates who drive to campus to attend classes.
- Population proportion \(p\) unknown
- Sample size \(n = 1000\)
- Sample statistic is the sample proportion: \(\hat p = \frac{\mbox{number of students who drive}}{n}\)
Example
A study is conducted to estimate the true mean annual income of all adult residents of California. The study randomly selects 2000 adult residents of California. The population consists of all adult residents of California. The sample is the 2000 residents in the study. The parameter is the true mean annual income of all adult residents of California. The statistic is the mean income of the 2000 residents in this sample.
- Population mean \(\mu\) unknown
- Sample size \(n = 2000\)
- Sample statistic is the sample mean: \(\bar x = \sum x_i/n\)
2.2 Sampling distributions of sample statistics
A sample statistic is a random variable because it varies from sample to sample. The distribution of a sample statistic is called its sampling distribution.
For statistics such as the sample mean and the sample proportion, the sampling distribution is centered at the corresponding population parameter (the population mean or the population proportion). The standard deviation of the sampling distribution is called the standard error (SE).
Steps to construct the sampling distribution of a sample statistic (e.g., sample mean, sample proportion, etc.):
- Take a random sample of size \(n\) from the population
- For each sample, calculate the sample statistic (e.g., sample mean, sample proportion, etc.)
- Return the sample observations to the population and repeat the process many times
- If (theoretically) we choose every possible sample of size \(n\) from the population and calculate the sample statistic, we obtain the sampling distribution of the sample statistic
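The steps above can be sketched in R for a sample proportion. This is an illustration under assumptions rather than part of the original text: the population is a hypothetical vector of 0/1 values in which 30% are 1, and 2000 repetitions stand in for "every possible sample".

```r
set.seed(1)
# Hypothetical population: 100000 individuals, 30% with the attribute (coded 1)
population <- rep(c(1, 0), times = c(30000, 70000))
p <- mean(population)

n <- 100      # sample size
reps <- 2000  # number of repeated samples

# Draw a sample from the full population, compute its proportion, and repeat
phat <- replicate(reps, mean(sample(population, n)))

mean(phat)               # close to the population proportion p
sd(phat)                 # empirical standard error
sqrt(p * (1 - p) / n)    # theoretical standard error of a proportion
hist(phat, main = "Sampling distribution of the sample proportion")
```

Each call to `sample()` draws from the complete population, which corresponds to returning the observations before taking the next sample.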
2.3 The Central Limit Theorem (CLT)
The Central Limit Theorem states that, for any population distribution with mean \(\mu\) and finite standard deviation \(\sigma\), if we take sufficiently large random samples from the population with replacement (a common rule of thumb is \(n \geq 30\)), the sampling distribution of the sample mean \(\bar X\) is approximately normal with mean \(\mu\) and standard error SE = \(\sigma/\sqrt{n}\).
When we choose many random samples from a population, the sampling distribution of the sample mean \(\bar X\) is centered at the population mean \(\mu\) and is less spread out than the population distribution. In the notation used here, the second argument of \(N\) is the standard deviation, not the variance.
\[\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]
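The mean and standard error in this statement follow from basic properties of expectation and variance. A short derivation, assuming the observations \(X_1, \ldots, X_n\) are independent draws from the population (which sampling with replacement ensures):

\[E(\bar X) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{n\mu}{n} = \mu\]

\[Var(\bar X) = \frac{1}{n^2}\sum_{i=1}^{n} Var(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}, \qquad \sigma_{\bar X} = \sqrt{Var(\bar X)} = \frac{\sigma}{\sqrt{n}}\]

These two results hold for any sample size. What the CLT adds is that the shape of the sampling distribution is approximately normal when \(n\) is large.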
Example
Consider a sample of size \(n=40\) from a population with mean \(\mu = 30\) and standard deviation \(\sigma=10\). What are the mean and standard deviation of the sample mean?
Solution:
\(\mu_{\bar X} = \mu = 30\)
\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{40} \approx 1.58\)
\(\bar X \sim N(\mu = 30, \sigma = 10/\sqrt{40})\)
Example
The mean height of a population is 160 cm with standard deviation equal to 10 cm. If we sample 50 people, what is the probability that the average height of the sample is greater than 170 cm?
Solution:
\(\mu_{\bar X} = \mu = 160\)
\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{50} \approx 1.41\)
\(\bar X \sim N(\mu = 160, \sigma = 10/\sqrt{50})\)
The probability that the average height of the sample is greater than 170 cm is \(P(\bar X > 170)\).
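This probability can be evaluated in R with `pnorm`. A minimal sketch, assuming the normal approximation from the CLT holds for \(n = 50\); the variable names are our own.

```r
mu <- 160      # population mean (cm)
sigma <- 10    # population standard deviation (cm)
n <- 50        # sample size

se <- sigma / sqrt(n)   # standard error of the sample mean

# P(Xbar > 170): upper-tail probability of a normal with mean mu and sd se
p <- pnorm(170, mean = mu, sd = se, lower.tail = FALSE)
p
```

Because 170 lies about 7 standard errors above the mean (\(z = (170 - 160)/(10/\sqrt{50}) \approx 7.07\)), the probability is practically zero.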
2.3.1 Simulation Central Limit Theorem
Consider a population with mean \(\mu = 10\) and standard deviation \(\sigma = 3\) (size N = 20000).
Choose a large number of samples (here 1000) of size \(n=50\) from the population.
Sample 1
[1] 9.226627 8.219675 11.289365 6.941645 13.079813 5.213394 8.847863
[8] 8.478691 12.193100 9.820603 14.353447 10.764433 4.668874 8.730043
[15] 9.391888 10.182957 7.132428 8.383272 6.484136 8.715128 7.582794
[22] 10.747343 6.756467 8.706496 10.403319 19.617679 10.204692 6.789934
[29] 7.063639 8.082506 10.980000 11.350304 9.871466 8.121133 10.673245
[36] 8.686724 13.068695 11.412490 10.723325 12.681606 7.580481 13.167183
[43] 4.440659 9.120458 12.999415 12.058967 6.366579 13.806602 9.560454
[50] 8.087458
Sample 2
[1] 10.771710 9.249494 11.771304 10.399494 11.918498 17.593210 17.130582
[8] 11.928867 8.768686 11.993739 11.885861 11.679749 9.067607 11.417243
[15] 11.222051 15.291144 11.882979 9.598734 9.247371 10.238438 9.151422
[22] 5.940615 11.381974 6.582338 7.784021 7.617595 8.664451 9.881700
[29] 14.861945 10.364972 12.416383 12.840554 14.695685 14.565459 12.612554
[36] 14.103453 2.992181 10.392556 12.292380 11.421163 14.040303 13.254904
[43] 6.216172 16.156826 11.181622 11.421163 10.161452 8.550175 11.759891
[50] 11.185458
…
Sample 1000
[1] 11.290376 8.510343 10.991390 7.999690 11.177523 10.692093 11.676804
[8] 6.798351 5.650451 12.694881 14.252034 7.402828 12.853526 8.007245
[15] 13.492490 14.068188 8.336471 13.464220 10.841278 8.682815 12.620481
[22] 12.537446 13.339643 16.921980 9.325990 11.240528 11.580733 1.558800
[29] 5.848795 13.860090 8.546732 11.266540 14.105449 5.763817 9.284455
[36] 6.150474 9.272322 11.269438 4.051140 3.479343 12.430022 7.847370
[43] 13.254236 8.969691 11.852406 7.915466 10.366838 11.966757 9.507514
[50] 10.610824
Calculate sample means
\(\bar X\) sample 1
[1] 9.65659
\(\bar X\) sample 2
[1] 11.15096
…
\(\bar X\) sample 1000
[1] 10.11257
- The mean of the sample means is equal to the mean of the population, whatever the sample size: \[\mu_{\bar X} = \mu\] In the simulation the two values agree closely but not exactly, because only 1000 samples were drawn.
\(\mu_{\bar X}\)
[1] 10.04273
\(\mu\)
[1] 10.02998
- The standard deviation of sample means is equal to the standard deviation of the population divided by the square root of the sample size \[\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\]
\(\sigma_{\bar X}\)
[1] 0.4347364
\(\frac{\sigma}{\sqrt{n}}\)
[1] 0.4226433
If the population is normal, then the sample means have a normal distribution for any sample size.
If the population is not normal (it can be skewed, discrete, etc.) but the sample is sufficiently large (\(n \geq 30\) as a rule of thumb), then the sampling distribution of the sample means is approximately normal.
Sampling distribution of sample means
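The claim for non-normal populations can be checked with a short simulation. This sketch assumes a strongly right-skewed population (exponential with rate 1, so \(\mu = 1\) and \(\sigma = 1\)) and samples directly from that distribution rather than from a finite population. The choice of distribution, the sample size \(n = 30\) and the 5000 repetitions are illustrative assumptions, not values from the text.

```r
set.seed(1)
n <- 30
reps <- 5000

# Draw a sample of size n from the skewed population and keep its mean
samplemeans_exp <- replicate(reps, mean(rexp(n, rate = 1)))

mean(samplemeans_exp)   # close to mu = 1
sd(samplemeans_exp)     # close to sigma / sqrt(n)
1 / sqrt(n)

# Compare the skewed population with the roughly symmetric sampling distribution
par(mfrow = c(1, 2))
hist(rexp(10000, rate = 1), main = "Population (exponential)")
hist(samplemeans_exp, main = "Sample means, n = 30")
```

The population histogram is heavily skewed, while the histogram of sample means should look much closer to a bell shape.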
2.3.2 Simulation Central Limit Theorem
We illustrate the Central Limit Theorem (CLT) by means of a simulation. The CLT applies to any population distribution with finite mean and standard deviation, so the population does not need to be normal. For example, we consider the following population, which consists of 100000 numbers drawn from two different normal distributions.
pop <- c(rnorm(20000, mean = 10, sd = 3),
         rnorm(80000, mean = 70, sd = 10))
hist(pop, main = "Population")
The mean and standard deviation of the population are the following:
mean(pop)
[1] 57.97226
sd(pop)
[1] 25.63992
Now we randomly select 1000 samples of size \(n=50\).
n <- 50
mats <- NULL
# Each column of mats holds one sample of size n
for(i in 1:1000){
  s <- sample(pop, n, replace = TRUE)
  mats <- cbind(mats, s)
}
We calculate the mean of each sample and plot the sampling distribution of the sample mean.
samplemeans <- apply(mats, 2, mean)
hist(samplemeans, main = "Sample mean")
We calculate the mean and standard deviation of the sampling distribution of the sample mean and check the CLT.
- The sample size is \(n = 50 \geq 30\).
- The sampling distribution of the sample mean is approximately normal.
- The mean of the sample mean is \(\mu_{\bar X} = \mu\)
mean(samplemeans)
[1] 58.0021
mean(pop)
[1] 57.97226
- Standard error (standard deviation of the sample mean) is \(\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\)
sd(samplemeans)
[1] 3.578209
sd(pop)/sqrt(n)
[1] 3.626033