2  Statistical inference and the Central Limit Theorem

Learning objectives
  • Population parameters and sample statistics
  • Sampling distributions
  • The Central Limit Theorem

2.1 Statistical inference

In statistical inference we want to learn about a quantity that describes some feature of a population, called a parameter. The population is usually large, and we cannot examine all of its values directly. Therefore, to learn about the population parameter, we take a sample from the population and use the information in the sample to draw conclusions about the population.

Population parameters are fixed values. We rarely know the parameter values because it is often difficult to obtain measures from the entire population.

Sample statistics are quantities computed from the values in the sample. They are random variables because they vary from sample to sample.

We use sample statistics to draw conclusions about unknown population parameters.

Example

Suppose we are interested in the mean height of the population living in England. The population is large, and we cannot measure everybody's height directly. We can estimate the mean height by taking a sample of, say, 5000 people and measuring their heights.

We can provide point estimates and interval estimates for population parameters.

  • Population mean is denoted by \(\mu\) and it is unknown

  • Sample size is \(n = 5000\)

  • Heights of people in the sample: \((x_1, x_2, \ldots, x_{5000})\)

  • Point estimate = estimate that specifies a single value for the population parameter.

  • The sample statistic used as the point estimate of \(\mu\) is the average height of the people in the sample: \(\bar x = \sum x_i/n\)

  • Interval estimate = estimate that specifies a range of plausible values for the population parameter. For example: we are 95% confident that the population mean \(\mu\) is within \((a, b)\)
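
The point and interval estimates above can be sketched in R. This is only an illustration: the heights are simulated (the mean of 170 and standard deviation of 10 are made-up values, not data from the example), and the interval uses the common normal-approximation form \(\bar x \pm 1.96 \, s/\sqrt{n}\), which is one possible choice of \((a, b)\).

```r
set.seed(1)                                   # arbitrary seed, for reproducibility
# Hypothetical sample of 5000 heights in cm (simulated, not real data)
heights <- rnorm(5000, mean = 170, sd = 10)

n    <- length(heights)
xbar <- mean(heights)                         # point estimate of mu
se   <- sd(heights) / sqrt(n)                 # estimated standard error of the mean

# Approximate 95% interval estimate (a, b) based on the normal approximation
ci <- xbar + c(-1, 1) * qnorm(0.975) * se

xbar
ci
```

With a sample this large the interval is narrow, which reflects how precisely the sample mean estimates \(\mu\).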

Example

A survey is carried out at a university to estimate the proportion of undergraduate students who drive to campus to attend classes. One thousand students are randomly selected and asked whether or not they drive to campus to attend classes. The population is all of the undergraduates at that university. The sample is the group of 1000 undergraduate students surveyed. The parameter is the true proportion of all undergraduate students at that university who drive to campus to attend classes. The statistic is the proportion of the 1000 sampled undergraduates who drive to campus to attend classes.

  • Population proportion \(p\) unknown
  • Sample size \(n = 1000\)
  • Sample statistic is the sample proportion: \(\hat p = \frac{\mbox{number of sampled students who drive}}{n}\)
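
As a small sketch of how \(\hat p\) is computed, the survey answers can be coded as 1 (drives) and 0 (does not drive). The answers below are simulated, and the value 0.3 used to generate them is an arbitrary assumption, not a figure from the survey.

```r
set.seed(2)                                   # arbitrary seed
# Hypothetical survey answers: 1 = drives to campus, 0 = does not
# (simulated with an assumed true proportion of 0.3)
drives <- rbinom(1000, size = 1, prob = 0.3)

n     <- length(drives)
p_hat <- sum(drives) / n                      # sample proportion
p_hat
```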

Example

A study is conducted to estimate the true mean annual income of all adult residents of California. The study randomly selects 2000 adult residents of California. The population consists of all adult residents of California. The sample is the 2000 residents in the study. The parameter is the true mean annual income of all adult residents of California. The statistic is the mean income of the 2000 residents in this sample.

  • Population mean \(\mu\) unknown
  • Sample size \(n = 2000\)
  • Sample statistic is the sample mean: \(\bar x = \sum x_i/n\)

2.2 Sampling distributions of sample statistics

A sample statistic is a random variable because it varies from sample to sample. The distribution of a sample statistic is called its sampling distribution.

For the sample statistics considered here (sample mean and sample proportion), the sampling distribution is centered at the corresponding population parameter. Its standard deviation is called the standard error (SE).

Steps to construct the sampling distribution of a sample statistic (e.g., sample mean, sample proportion, etc.):

  1. Take a random sample of size \(n\) from the population
  2. Calculate the sample statistic (e.g., sample mean, sample proportion, etc.) for that sample
  3. Return the sample observations to the population and repeat the process many times
  4. If (theoretically) we choose every possible sample of size \(n\) from the population and calculate the sample statistic, we obtain the sampling distribution of the sample statistic
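
The steps above can be sketched in R. Because enumerating every possible sample is not feasible, the sketch approximates the sampling distribution with a large number of repetitions. The population (exponential, with mean about 5) and the names `population`, `n_rep` and `sample_means` are hypothetical choices made only for illustration.

```r
set.seed(3)                                   # arbitrary seed
# Hypothetical skewed population with mean about 5
population <- rexp(10000, rate = 1 / 5)

n     <- 40                                   # sample size
n_rep <- 2000                                 # number of repetitions

# Steps 1-3: draw a sample, compute its mean, and repeat. Each call to
# sample() draws from the full population, so earlier observations are
# effectively returned before the next sample is taken.
sample_means <- replicate(n_rep, mean(sample(population, n)))

# Step 4 (approximation): summarise the simulated sampling distribution
mean(sample_means)
sd(sample_means)
mean(population)
```

The mean of `sample_means` should be close to `mean(population)`, and `hist(sample_means)` gives a picture of the approximate sampling distribution.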

2.3 The Central Limit Theorem (CLT)

The Central Limit Theorem states that for any population distribution with mean \(\mu\) and finite standard deviation \(\sigma\), if we take sufficiently large random samples from the population with replacement (a common rule of thumb is \(n \geq 30\)), the sampling distribution of the sample mean \(\bar X\) is approximately normal with mean \(\mu\) and standard error SE = \(\sigma/\sqrt{n}\).

When we choose many random samples from a population, the sampling distribution of the sample mean \(\bar X\) is centered at the population mean \(\mu\) and is less spread out than the population distribution. In the notation used here, the second argument of \(N\) is the standard deviation (the standard error), not the variance, and the distribution holds approximately:

\[\bar X \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right)\]

Example

Consider a sample of size \(n=40\) from a population with mean \(\mu = 30\) and standard deviation \(\sigma=10\). What are the mean and standard deviation of the sample mean?

Solution:

\(\mu_{\bar X} = \mu = 30\)

\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{40}\)

\(\bar X \sim N(\mu = 30, \sigma = 10/\sqrt{40})\)
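
For a numerical value, the standard error can be evaluated directly in R (a minimal sketch; the variable names are arbitrary):

```r
mu    <- 30
sigma <- 10
n     <- 40

se <- sigma / sqrt(n)   # standard deviation of the sample mean
se                      # about 1.58
```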

Example

The mean height of a population is 160 cm with a standard deviation of 10 cm. If we sample 50 people, what is the probability that the average height of the sample is greater than 170 cm?

Solution:

\(\mu_{\bar X} = \mu = 160\)

\(\sigma_{\bar X} = \sigma/\sqrt{n} = 10/\sqrt{50}\)

\(\bar X \sim N(\mu = 160, \sigma = 10/\sqrt{50})\)

The probability that the average height of the sample is greater than 170cm is \(P(\bar X > 170)\).
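
Under the normal approximation stated above, this probability can be evaluated with `pnorm()`. The sketch below is one way to do the calculation; the variable names are arbitrary.

```r
mu    <- 160
sigma <- 10
n     <- 50

se <- sigma / sqrt(n)                # standard error, about 1.41
z  <- (170 - mu) / se                # 170 is about 7.07 standard errors above mu

# Upper-tail probability P(X-bar > 170)
p <- pnorm(170, mean = mu, sd = se, lower.tail = FALSE)
z
p
```

Because 170 cm lies about 7 standard errors above the mean, the probability is effectively zero: a sample of 50 people with an average height above 170 cm would be extremely unlikely.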

2.3.1 Simulation Central Limit Theorem

Consider a population with mean \(\mu = 10\) and standard deviation \(\sigma = 3\) (size N = 20000).

Choose a large number of samples (here 1000) of size \(n=50\) from the population.

Sample 1

 [1]  6.554886 10.152664  6.148774  9.902627  8.913614 14.333313 13.831218
 [8] 10.403515 10.840045 12.895610 14.145057  7.123110  8.509465 12.062193
[15] 11.749468  9.284180  8.800596 12.555485 12.277498 11.136113  9.967396
[22] 12.658969 13.951093 11.400411  7.055950 11.362163 11.145641 10.875201
[29] 12.512283 10.525295 16.966941 12.124966  5.275331 13.954602 10.611630
[36] 14.756350  9.065559 20.005602 11.360291 13.818834 12.134266  6.560214
[43] 15.646552 11.315566  7.687711  7.552061  4.536318 12.785397  8.946024
[50]  9.123675

Sample 2

 [1]  7.620981  8.860497  8.855474 10.667563 11.613753 13.133705 14.907489
 [8]  5.920571  6.598431 13.845083  7.462270 13.419940  9.368841  9.673253
[15] 10.562266 12.107387 14.482989 11.724864  1.472879 13.732994  9.246942
[22] 10.357292 10.228058  6.088919 14.316008 12.636181 16.058293 13.567325
[29]  8.733851 12.052399  7.396530 10.503712 10.019062  8.203447  8.817919
[36]  9.916171 11.272902 11.791261 12.827883  4.853952 11.718628 10.666439
[43]  5.720364  9.248669  9.659447 10.153070 11.180060  7.361005  7.927574
[50]  7.916270

Sample 1000

 [1] 14.291950  5.074071 10.607358 15.163585 11.546604  9.126490 10.471694
 [8]  9.050994  7.883660 13.722041  8.796384 10.019421 16.606635 11.508701
[15] 10.818462  9.411843 11.963203 13.426046 11.743012  9.044704  7.317571
[22] 11.620353  6.250449 11.829777 16.604148  8.622141 10.405085 10.296743
[29]  9.740882  7.198587 10.840045 10.297996 16.304116  7.921005 11.787259
[36]  7.195109 10.125985 14.328942  6.964686  9.241014 10.545755 12.247520
[43] 11.656631 14.752071 12.764227  8.578379  7.745749  7.594452 10.639104
[50]  9.418979

Calculate sample means

\(\bar X\) sample 1

[1] 10.94603

\(\bar X\) sample 2

[1] 10.12942

\(\bar X\) sample 1000

[1] 10.62223

  • The mean of the sample means is equal to the mean of the population, regardless of the sample size: \[\mu_{\bar X} = \mu\] In the simulation the two values agree up to simulation error.

\(\mu_{\bar X}\)

[1] 9.992325

\(\mu\)

[1] 10.00199

  • The standard deviation of the sample means is equal to the standard deviation of the population divided by the square root of the sample size: \[\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\]

\(\sigma_{\bar X}\)

[1] 0.4309446

\(\frac{\sigma}{\sqrt{n}}\)

[1] 0.422629

  • If the population is normal, then the sample means have a normal distribution.

  • If the population is not normal (it can be skewed, discrete, etc.) but the sample is sufficiently large (\(n \geq 30\)), then the sampling distribution of the sample means is approximately normal.

Sampling distribution of sample means
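
The code behind this simulation is not shown here. A minimal sketch of how it might be reproduced follows, assuming a normally distributed population and sampling with replacement; both are assumptions, and because of the random draws the numbers will not match the output above exactly.

```r
set.seed(4)                                   # arbitrary seed
# Assumed population: 20000 values from a normal distribution
pop <- rnorm(20000, mean = 10, sd = 3)

n <- 50
# 1000 samples of size n; each column of `samples` is one sample
samples      <- replicate(1000, sample(pop, n, replace = TRUE))
sample_means <- colMeans(samples)

# Mean of the sample means vs. population mean
c(mean(sample_means), mean(pop))
# Standard deviation of the sample means vs. sigma / sqrt(n)
c(sd(sample_means), sd(pop) / sqrt(n))
```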

2.3.2 Simulation Central Limit Theorem

We illustrate the Central Limit Theorem (CLT) by means of a simulation. The CLT applies to any population distribution with finite variance, so the population does not need to be normal. For example, we consider the following population, which consists of 100000 numbers drawn from two different normal distributions.

pop <- c(rnorm(20000, mean = 10, sd = 3),
         rnorm(80000, mean = 70, sd = 10))

hist(pop, main = "Population")

The mean and standard deviation of the population are the following:

mean(pop)
[1] 57.99285
sd(pop)
[1] 25.6466

Now we randomly select 1000 samples of size \(n=50\).

n <- 50
# Draw 1000 samples of size n with replacement; each column of mats is one sample
mats <- replicate(1000, sample(pop, n, replace = TRUE))

We calculate the mean of each sample and plot the sampling distribution of the sample mean.

samplemeans <- apply(mats, 2, mean)
hist(samplemeans, main = "Sample mean")

We calculate the mean and standard deviation of the sampling distribution of the sample mean and check the CLT.

  • Sample size \(n = 50 \geq 30\).

  • The sampling distribution of the sample mean is approximately normal.

  • Mean of the sample mean is \(\mu_{\bar X} = \mu\)

mean(samplemeans)
[1] 57.97094
mean(pop)
[1] 57.99285

  • Standard error (standard deviation of the sample mean) is \(\sigma_{\bar X} =\frac{\sigma}{\sqrt{n}}\)

sd(samplemeans)
[1] 3.608148
sd(pop)/sqrt(n)
[1] 3.626976
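
The histogram gives a visual check of normality. As a complementary numerical check, one could compute the fraction of sample means that fall within 1.96 standard errors of the population mean, which should be close to 0.95 if the normal approximation is good. The sketch below regenerates the population so that it runs on its own; the seed and the name `coverage` are arbitrary choices, so its numbers will differ slightly from those above.

```r
set.seed(5)                                   # arbitrary seed
# Same kind of population as above: a mixture of two normal distributions
pop <- c(rnorm(20000, mean = 10, sd = 3),
         rnorm(80000, mean = 70, sd = 10))

n <- 50
samplemeans <- replicate(1000, mean(sample(pop, n, replace = TRUE)))

se <- sd(pop) / sqrt(n)                       # theoretical standard error

# Fraction of sample means within 1.96 standard errors of the population mean
coverage <- mean(abs(samplemeans - mean(pop)) <= qnorm(0.975) * se)
coverage
```

A normal quantile plot, `qqnorm(samplemeans)` followed by `qqline(samplemeans)`, is another common way to assess how close the sampling distribution is to normal.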