Hypothesis testing is a procedure to determine whether a claim about a population parameter is reasonable. Such a claim is called a hypothesis.
The alternative or research hypothesis is the research question (e.g., an alternative to the currently accepted value for the population parameter). The alternative is denoted by \(H_1\) or \(H_a\).
The null hypothesis represents a position of “no difference” (e.g., currently accepted value for a population parameter). The null hypothesis is denoted by \(H_0\).
To test a hypothesis, we select a sample from the population and use measurements from the sample and probability theory to determine how well the data and the null hypothesis agree. We want to show the alternative hypothesis is reasonable (by showing there is evidence to reject the null hypothesis).
Example
Research question we are trying to answer
Does a new drug reduce cholesterol?
Alternative hypothesis (investigator’s belief)
The new drug reduces cholesterol
Null hypothesis (no difference, hypothesis to be rejected)
The new drug has no effect
Let \(\mu_1\) and \(\mu_2\) be the cholesterol levels obtained with and without the drug, respectively.
\(H_0: \mu_1 = \mu_2\) (Null hypothesis)
\(H_1: \mu_1 < \mu_2\) (Alternative hypothesis)
A hypothesis is tested by using the following procedure:
Determine the null and alternative hypotheses.
Select an appropriate sample from the population.
Assume the null hypothesis is true. Use measurements from the sample and probability theory to determine how well the data and the null agree.
We make one of these two decisions: reject the null hypothesis, or fail to reject the null hypothesis.
Example
Suppose we think the mean weight of the oranges in a field is lower than 100 grams. We conduct a hypothesis test to test this claim.
Null and alternative hypothesis
\(H_0\): \(\mu=100\)
\(H_1\): \(\mu < 100\)
We take a sample from the population (for example, a random sample of 100 oranges).
We assume the null hypothesis is true, and calculate the mean weight of the oranges in the sample to decide whether to reject or fail to reject the null hypothesis.
We use the sample mean to make this decision.
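As a sketch of what this decision might look like numerically, here is a minimal R example assuming hypothetical sample results (a sample mean of 95 g and a standard deviation of 20 g; these numbers are not from the text):

```r
# Hypothetical numbers for illustration only (not from the text):
n <- 100     # sample size
xbar <- 95   # assumed sample mean weight (grams)
s <- 20      # assumed sample standard deviation (grams)

t <- (xbar - 100) / (s / sqrt(n))  # one-sample t statistic under H0: mu = 100
pt(t, df = n - 1)                  # left-tailed p-value for H1: mu < 100
```

A small p-value here would lead us to reject \(H_0\) in favor of \(\mu < 100\).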
| Null Hypothesis \(H_0\) | Alternative Hypothesis \(H_1\) |
|---|---|
| Assumed true until evidence indicates otherwise | We try to find evidence for the alternative hypothesis |
| “Nothing is going on” | Opposite of the null hypothesis |
| There is no relationship between the variables being studied | There is a relationship between the variables being studied |
| A particular intervention does not make a difference/has no effect | A particular intervention makes a difference/has an effect |
Hypotheses are written in terms of population parameters. For example, we can write hypotheses for a single mean (\(\mu\)), a single proportion (\(p\)), the difference between two independent means (\(\mu_1-\mu_2\)), and the difference between two proportions (\(p_1-p_2\)).
Example
For each of the research questions below, write the null and alternative hypotheses. Remember the alternative hypothesis is the investigator’s belief and the null hypothesis denotes the hypothesis of no difference, the one we wish to reject. In general the null hypothesis contains the equal sign (\(=\)). A possible set of answers is given after the questions.
Is the average monthly rent of a one-bedroom apartment in Bath less than 800 pounds?
Is the average intelligence quotient (IQ) score of all university students higher than 100?
Is the percent of students enrolled in Faculty of Science who identify as women different from 50%?
Do the majority of all university students own a dog?
In preschool, are the weights of boys and girls different?
Is the proportion of men who smoke cigarettes different from the proportion of women who smoke cigarettes in England?
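A possible set of answers, in the order of the questions (the parameter symbols below are our own notation):
\(H_0: \mu = 800\), \(H_1: \mu < 800\)
\(H_0: \mu = 100\), \(H_1: \mu > 100\)
\(H_0: p = 0.5\), \(H_1: p \neq 0.5\)
\(H_0: p = 0.5\), \(H_1: p > 0.5\)
\(H_0: \mu_{boys} = \mu_{girls}\), \(H_1: \mu_{boys} \neq \mu_{girls}\)
\(H_0: p_{men} = p_{women}\), \(H_1: p_{men} \neq p_{women}\)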
Possible outcomes for a hypothesis test:
| | \(H_0\) is True | \(H_0\) is False |
|---|---|---|
| Reject null | Type I error (false positive) | Correct (true positive) |
| Fail to reject null | Correct (true negative) | Type II error (false negative) |
When making a decision about a hypothesis test we can make two types of errors: a Type I error (rejecting the null when it is true) and a Type II error (failing to reject the null when it is false).
For example, let the null hypothesis \(H_0\) be that the patient is not pregnant. Then, a Type I error is concluding the patient is pregnant when she is not (a false positive), and a Type II error is concluding the patient is not pregnant when she is (a false negative).
Type I and II errors
\(\alpha\) = Probability of Type I error = P(rejecting \(H_0\) | \(H_0\) is true)
\(\beta\) = Probability of Type II error = P(failing to reject \(H_0\) | \(H_0\) is false)
\(\alpha\) and \(\beta\) are related. As one increases, the other decreases.
For large sample sizes, it is more likely to obtain significant results. Even small differences may be statistically significant, so we should check whether the difference is practically meaningful.
A small sample size might lead to frequent Type II errors (e.g., a huge difference may be needed to conclude there is a significant difference).
That is, for small sample sizes, we may fail to reject the null even though it is false.
Example
A man goes to trial where he is being tried for a murder. The hypotheses being tested are:
\(H_0\): Not guilty
\(H_1\): Guilty
What are the Type I and II errors that can be committed? Here, a Type I error is convicting the man when he is innocent, and a Type II error is acquitting him when he is guilty.
Before testing the hypothesis, we need to select the significance level (\(\alpha\)). The significance level of a test is the maximum value for the probability of Type I Error (incorrectly rejecting the null when it is true) we are willing to tolerate and still call the results statistically significant.
Typical values for \(\alpha\) are 0.05 and 0.01. If we choose \(\alpha = 0.05\), there will be a maximum of 5% chance of incorrectly rejecting the null when it is true.
The confidence level \(c\) represents how confident we are in our decision (e.g., 95%, 99%).
For example, if the confidence level is 95%, the significance level is \(\alpha = 1 - c = 1 - 0.95 = 0.05\).
The power of a test is the probability of correctly rejecting the null when it is false.
Power = P(reject \(H_0\) | \(H_0\) is false) = 1- P(fail to reject \(H_0\) | \(H_0\) is false) = 1 - Probability of Type II error = 1 - \(\beta\)
Power depends on the following factors: the significance level \(\alpha\), the sample size \(n\), the variability in the population, and the effect size (how far the true parameter value is from the null value).
When conducting a hypothesis test, we want both Type I and Type II errors to be small. That is, we want a small significance level (close to 0) and large power (close to 1). Unfortunately, decreasing the significance level also decreases power, which makes it difficult to conduct tests with both a small Type I error rate and high power.
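As an illustration, R’s built-in `power.t.test()` computes the power of a t test; the inputs below are hypothetical:

```r
# Power of a one-sample, one-sided t test with hypothetical inputs:
# true difference from the null value delta = 5, sd = 10, n = 30, alpha = 0.05
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample", alternative = "one.sided")
```

Increasing `n` or `delta`, or decreasing `sd`, increases the reported power.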
Consider the following one-sided (right-tailed) test:
\(H_0: \mu=100\)
\(H_1: \mu>100\)
In practice, a one-sided test such as this is tested in the same way as the following test
\(H_0: \mu\leq100\)
\(H_1: \mu>100\)
This is because if we conclude \(\mu>100\), we must also conclude \(\mu>90\), \(\mu>80\), etc.
The corresponding two-sided test is
\(H_0: \mu=100\)
\(H_1: \mu \neq 100\)
Suppose a company manufacturing computer chips claims the defective rate is 5%. Let \(p\) denote the true defective probability. We use a sample of 1000 chips from the production line to determine whether their claim is reasonable. The proportion of defectives in the sample is 8%.
Null and alternative hypothesis
\(H_0: p = p_0\) (proportion of defective chips is 5%)
\(H_1: p > p_0\) (proportion of defective chips is greater than 5%)
where \(p_0=0.05\)
We use a significance level \(\alpha = 0.05\) so the probability of incorrectly rejecting the null when it is true is 5%.
Then we calculate the test statistic which has the general form:
\[\mbox{test statistic} = \frac{\mbox{sample statistic - null parameter}}{\mbox{standard error}}\]
The test statistic is a value calculated from a sample that summarizes the characteristics of the sample and is used to determine whether to reject or fail to reject the null hypothesis. The test statistic differs from sample to sample. The sampling distribution of the test statistic under the null hypothesis must be known so we can compare the observed results to the results expected under the null hypothesis. For example, we use test statistics with normal or t distributions. These distributions have known areas, which enables us to calculate p-values (probabilities that measure how compatible the observed results are with the null hypothesis).
Sampling distribution of the test statistic under the null hypothesis.
\[Z = \frac{\hat P - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \sim N(0,1)\]
The observed value of the test statistic is \(z_{obs} = \frac{0.08-0.05}{\sqrt{0.05(1-0.05)/1000}}\) = 4.35.
Now we need to decide whether to reject the null or fail to reject the null. We reject the null if the observed value of the test statistic is so extreme that it is unlikely to occur if the null is true (where “unlikely” is determined by the significance level \(\alpha\)). Otherwise we fail to reject the null. There are two approaches to determine whether the test statistic is extreme enough to reject the null hypothesis: the p-value approach and the critical value approach. We explain these approaches below.
P-value approach: The p-value is the probability of obtaining a test statistic as extreme as or more extreme than the one observed in the direction of the alternative hypothesis, assuming the null hypothesis is true. We calculate the p-value as the area under the N(0,1) curve beyond the observed test statistic in the direction of the alternative hypothesis: \(P(Z > 4.35 \mid H_0) = 6.81 \times 10^{-6}\). Since the p-value is less than \(\alpha\), we reject \(H_0\).
Critical value approach: We determine the critical value by finding the value of the null distribution of the test statistic such that the probability of making a Type I error is the specified significance level \(\alpha = 0.05\). The critical value is the value \(z^*\) such that the probability that a N(0,1) variable exceeds \(z^*\) is 0.05, giving \(z^* = 1.64\). The observed test statistic is more extreme in the direction of the alternative hypothesis than the critical value (\(4.35 > 1.64\)). So we reject \(H_0\).
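A minimal R sketch of both approaches for this example:

```r
p0 <- 0.05    # claimed defective rate under the null
phat <- 0.08  # observed proportion of defectives in the sample
n <- 1000

z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)  # observed test statistic (about 4.35)
1 - pnorm(z)     # p-value approach: right-tail area under N(0,1)
qnorm(1 - 0.05)  # critical value approach: z* (about 1.64)
```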
If we fail to reject the null, does this mean the null is true? Maybe, but maybe not.
It could be the null is true. In this case, most of the time we will fail to reject the null (we reject with a probability \(\alpha\) (e.g., 5%)).
It could be the alternative is true, but we did not use a large enough sample to obtain a significant result (or we were just unlucky). This is a Type II error.
If we reject the null, does this mean the alternative is true? Maybe, but maybe not.
It could be the null is true. In this case, we reject the null with probability \(\alpha\) (e.g., 5%). So we were just unlucky.
It could be the alternative is true. Either the sample was large enough to obtain a significant result, or it was small but we were just lucky.
We determine the critical value by finding the value of the distribution of the test statistic such that the probability of making a Type I error is the specified significance level \(\alpha\) (e.g., 0.05).
Then we compare the test statistic value to the critical value.
The critical values define the boundaries of the critical or rejection region which is the set of values for which the null hypothesis is rejected. The acceptance region is the set of values that are consistent with the null hypothesis.
If the test statistic value is more extreme in the direction of the alternative than the critical value (falls in the critical region), we reject the null hypothesis.
If the test statistic is not more extreme in the direction of the alternative than the critical value (falls in the acceptance region), we fail to reject the null.
In this approach, the strength of evidence in support of the alternative hypothesis is measured by the p-value.
The p-value is the probability of obtaining a test statistic as extreme as or more extreme than the one observed in the direction of the alternative hypothesis, assuming the null hypothesis is true.
We compare the p-value to the significance level \(\alpha\) and decide whether to reject or fail to reject the null.
To calculate the p-value we compute the test statistic value using the sample data, and calculate the area under the test statistic distribution curve beyond the test statistic value.
In left-tailed tests, the p-value is calculated as the area under the test statistic distribution curve to the left of the test statistic.
In right-tailed tests, the p-value is calculated as the area under the test statistic distribution curve to the right of the test statistic.
In two-tailed tests, the p-value is calculated by using the symmetry of the test distribution curve. We find the p-value for a one-sided test and double it.
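A short R sketch of the three cases for a N(0,1) test statistic (the value of `z_obs` is hypothetical):

```r
z_obs <- 1.8                 # hypothetical observed test statistic
pnorm(z_obs)                 # left-tailed p-value
1 - pnorm(z_obs)             # right-tailed p-value
2 * (1 - pnorm(abs(z_obs)))  # two-tailed p-value, using symmetry
```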
State the null and alternative hypotheses.
The alternative hypothesis is the investigator’s belief and the null hypothesis denotes the hypothesis of no difference, the one we wish to reject. The null hypothesis contains the equal sign (\(=\)).
Choose the significance level \(\alpha\) (the maximum probability of incorrectly rejecting the null when it is true that we are willing to tolerate).
Usually \(\alpha = 0.05\)
Calculate the test statistic.
\(\mbox{test statistic} = \frac{\mbox{sample statistic - null parameter}}{\mbox{standard error}}\)
Find the p-value for the observed data.
The p-value is the probability of obtaining a test statistic as extreme as or more extreme than the one observed in the direction of the alternative hypothesis, assuming the null hypothesis is true.
Make one of the two decisions (reject \(H_0\) or fail to reject \(H_0\)) and state a conclusion.
Below we give examples of the following tests:
| | Population parameter | Sample statistic | Test statistic |
|---|---|---|---|
| One Mean | \(\mu\) | \(\bar X\) | \(T=\frac{\bar X - \mu_0}{S/\sqrt{n}} \sim t(n-1)\) |
| One Proportion | \(p\) | \(\hat P\) | \(Z = \frac{\hat P - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \sim N(0,1)\) |
| Difference in two proportions | \(p_1 - p_2\) | \(\hat P_1 - \hat P_2\) | \(Z = \frac{(\hat P_1 - \hat P_2)-0}{\sqrt{\frac{\hat P (1-\hat P)}{n_1} + \frac{\hat P (1-\hat P)}{n_2}}} \sim N(0,1)\) |
| Difference in two means (independent samples) | \(\mu_1 - \mu_2\) | \(\bar X_1 - \bar X_2\) | \(T = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} \sim t(\min(n_1-1,n_2-1))\) |
| Difference in two means (paired samples) | \(\mu_{diff}\) | \(\bar X_{diff}\) | \(T= \frac{\bar X_{diff}}{S_{diff}/\sqrt{n}} \sim t(n-1)\) |
https://ismayc.github.io/moderndiver-book/B-appendixB.html#one-mean
The National Survey of Family Growth conducted by the Centers for Disease Control gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health. One of the variables collected on this survey is the age at first marriage. 5,534 randomly sampled US women between 2006 and 2010 completed the survey. The women sampled here had been married at least once. The sample mean is 23.44 and the standard deviation 4.72. Do we have evidence that the mean age of first marriage for all US women from 2006 to 2010 is greater than 23 years?
Data: Sample size \(n=5534\), sample mean \(\bar x_{obs}=23.44\), standard deviation \(s=4.72\).
\(H_0: \mu = 23\)
\(H_1: \mu > 23\)
(\(\mu_0\) is 23)
Guess about statistical significance
We want to see how likely it is to observe a sample mean of \(\bar x_{obs}=23.44\) or more extreme assuming that the population mean is 23 (assuming the null is true). The two values seem quite close, but we have a large sample size. Let’s guess that the large sample size will lead us to reject the null, even though the difference is practically small.
\(\alpha = 0.05\)
A good guess to estimate the population mean \(\mu\) is the sample mean \(\bar X\). Assuming the null is true, we can standardize this original test statistic \(\bar X\) into a \(T\) statistic that follows a \(t\) distribution with degrees of freedom equal df\(=n-1\).
\[T=\frac{\bar X - \mu_0}{S/\sqrt{n}} \sim t(n-1)\]
Assumptions: Independent observations (random sample) and normality (the population is normal or the sample size is \(\geq\) 30).
```r
n <- 5534
barx <- 23.44
s <- 4.72
t <- (barx - 23) / (s / sqrt(n))
t
## [1] 6.934741
```
Probability of observing a test statistic of 6.93 or more extreme in the t-distribution.
```r
df <- n - 1
1 - pt(t, df = df)
## [1] 2.267297e-12
```
p-value < \(\alpha\), we reject the null. We have found evidence the mean age of first marriage is greater than 23.
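With the raw ages (which we do not have here), R’s `t.test()` would run the whole test in one call; the sketch below simulates a stand-in data vector just for illustration:

```r
# `ages` is simulated here, not the real survey data
set.seed(1)
ages <- rnorm(5534, mean = 23.44, sd = 4.72)
t.test(ages, mu = 23, alternative = "greater")
```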
https://ismayc.github.io/moderndiver-book/B-appendixB.html#one-proportion
The CEO of a large electric utility claims that 80 percent of his 1,000,000 customers are satisfied with the service they receive. To test this claim, the local newspaper surveyed 100 customers, using simple random sampling. 73 were satisfied and the remaining were unsatisfied. Based on these findings from the sample, can we reject the CEO’s hypothesis that 80% of the customers are satisfied?
Data: Sample size \(n=100\), sample proportion \(\hat p_{obs}=0.73\)
\(H_0: p = 0.80\)
\(H_1: p \neq 0.80\)
(\(p_0=0.80\))
Guess about statistical significance
We want to see if the sample proportion 0.73 is statistically different from \(p_0=0.8\). They seem to be close and the sample size is not big (\(n=100\)). We may guess that we do not have evidence to reject the null hypothesis.
\(\alpha = 0.05\)
A good guess to estimate the population proportion \(p\) is the sample proportion \(\hat P\). Assuming the null is true, we can standardize the original test statistic \(\hat P\) into a \(Z\) statistic that follows a \(N(0,1)\).
\[Z = \frac{\hat P - p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \sim N(0,1)\]
Assumptions: Independent observations (random sample) and the observed numbers of successes (73) and failures (27) are both greater than 10.
```r
n <- 100
p_hat <- 0.73
p0 <- 0.8
z <- (p_hat - p0) / sqrt((p0 * (1 - p0)) / n)
z
## [1] -1.75
```
Probability of observing a test statistic value of -1.75 or more extreme (in both directions) in the null distribution.
```r
2 * pnorm(z)
## [1] 0.08011831
```
p-value > \(\alpha\), so we fail to reject the null. We do not have enough evidence to reject the CEO’s claim that 80% of customers are satisfied.
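The same test can be run with R’s built-in `prop.test()`; disabling the continuity correction matches the z calculation above (`prop.test()` reports \(z^2\) as a chi-squared statistic):

```r
# 73 satisfied customers out of 100, null proportion 0.8, two-sided test
prop.test(x = 73, n = 100, p = 0.8, alternative = "two.sided", correct = FALSE)
```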
https://ismayc.github.io/moderndiver-book/B-appendixB.html#two-proportions
A 2010 survey asked 827 randomly sampled registered voters in California: “Do you support or oppose drilling for oil and natural gas off the Coast of California? Or do you not know enough to say?” Conduct a hypothesis test to determine if the data provide strong evidence that the proportion of college graduates who do not have an opinion on this issue is different from that of non-college graduates.
Data
Sample size \(n=827\)
| | no opinion | opinion |
|---|---|---|
| college | 131 | 258 |
| no college | 104 | 334 |
\(H_0: p_c = p_{nc}\)
\(H_1: p_c \neq p_{nc}\)
or
\(H_0: p_c - p_{nc} = 0\)
\(H_1: p_c - p_{nc} \neq 0\)
Guess about statistical significance
We want to know if there is a statistically significant difference between the college proportion and the non-college proportion of respondents with no opinion. The proportions seem close (college: \(131/389 \approx 0.34\); no college: \(104/438 \approx 0.24\)).
We could guess there is not enough evidence to reject the null.
\(\alpha = 0.05\)
We are interested in seeing if the observed difference in sample proportions corresponding to no opinion (\(\hat p_{1obs} - \hat p_{2obs}\)) is statistically different from 0. Assuming the null is true, we can use the standard normal distribution to standardize the difference in sample proportions
\[Z = \frac{(\hat P_1 - \hat P_2)-0}{\sqrt{\frac{\hat P (1-\hat P)}{n_1} + \frac{\hat P (1-\hat P)}{n_2}}} \sim N(0,1)\]
where \(\hat P = \frac{\mbox{total number of successes}}{\mbox{total number of cases}}\)
Assumptions: Independent observations (random sample) and the numbers of pooled successes and pooled failures are at least 10 for each group (\(n \hat p \geq 10\) and \(n (1 - \hat p) \geq 10\)). Pooled success rate: \(\hat p = (131+104)/827 = 0.28\), \(1-\hat p = 0.72\). Checks: \(\hat p \times (131+258) = 108.92\), \((1-\hat p) \times (131+258) = 280.08\), \(\hat p \times (104+334) = 122.64\), \((1-\hat p) \times (104+334) = 315.36\); all exceed 10.
```r
n1 <- 131 + 258
p1 <- 131 / (131 + 258)
n2 <- 104 + 334
p2 <- 104 / (104 + 334)
phat <- (131 + 104) / (131 + 258 + 104 + 334)
z <- (p1 - p2) / sqrt(phat * (1 - phat) / n1 + phat * (1 - phat) / n2)
z
## [1] 3.160806
```
Probability of observing a test statistic equal to 3.16 or more extreme (in both directions) when the null is true.
```r
2 * (1 - pnorm(z))
## [1] 0.001573334
```
p-value < \(\alpha\), so we reject the null. There is evidence the proportions are different; the data suggest an association between college graduation and opinion on offshore drilling for Californians. Our initial guess was wrong!
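This two-proportion test is also available via `prop.test()` (again, `prop.test()` reports \(z^2\) as a chi-squared statistic):

```r
# Successes are the "no opinion" counts; group sizes are 389 and 438
prop.test(x = c(131, 104), n = c(389, 438), correct = FALSE)
```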
https://ismayc.github.io/moderndiver-book/B-appendixB.html#two-means-independent-samples
Average income varies from one region of the country to another, and it often reflects both lifestyles and regional living expenses. Suppose a new graduate is considering a job in two locations, Cleveland, OH and Sacramento, CA, and he wants to see whether the average income in one of these cities is higher than the other. He would like to conduct a hypothesis test based on two randomly selected samples from the 2000 Census.
Data:
Cleveland: \(n_1= 212\), \(\bar x_1= 27467\), \(s_1= 27681\)
Sacramento: \(n_2= 175\), \(\bar x_2= 32428\), \(s_2= 35774\)
\(H_0: \mu_1 = \mu_2\)
\(H_1: \mu_1 \neq \mu_2\)
or
\(H_0: \mu_1- \mu_2 = 0\)
\(H_1: \mu_1 - \mu_2 \neq 0\)
Guess about statistical significance
We want to see if the average income in Cleveland is statistically different from the average income in Sacramento. In the sample we observe
Cleveland: \(n_1= 212\), \(\bar x_1= 27467\), \(s_1= 27681\)
Sacramento: \(n_2= 175\), \(\bar x_2= 32428\), \(s_2= 35774\)
The distributions seem similar. We guess there is not enough evidence to conclude the average incomes are different.
\(\alpha = 0.05\)
We wish to see if the observed difference in sample means is statistically different than 0. Assuming the null is true, we can use the \(t\) distribution to standardize the difference in the sample means \(\bar X_1 - \bar X_2\).
\[T = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} \sim t(\min(n_1-1,n_2-1))\] Assumptions: Independent observations and normality (the distributions are normal or the sample sizes are \(\geq\) 30).
```r
n1 <- 212
x1 <- 27467
s1 <- 27681

n2 <- 175
x2 <- 32428
s2 <- 35774

t <- (x1 - x2) / sqrt(s1^2 / n1 + s2^2 / n2)
t
## [1] -1.500762
```
Probability of observing a value of the test statistic equal to -1.5 or more extreme (in both directions) assuming the null is true.
```r
df <- min(n1 - 1, n2 - 1)  # 174
2 * pt(t, df)
## [1] 0.1352296
```
p-value > \(\alpha\), so we fail to reject the null. There is not enough evidence to conclude the average incomes differ.
Note that if we assume the two samples are drawn from populations with identical variances, the test statistic is calculated as
\[T = \frac{(\bar X_1 - \bar X_2) - 0}{\sqrt{\frac{S_p^2}{n_1}+\frac{S_p^2}{n_2} }} \sim t(n_1+n_2-2)\]
where \(S_p^2\) is an estimator of the pooled variance of the two groups.
\[S_p^2 = \frac{(n_1-1)S_1^2 + (n_2-1)S_2^2 }{n_1+n_2-2}\]
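A minimal R sketch of this pooled version applied to the income data above (same summary numbers as before; whether equal variances are plausible here is an assumption):

```r
n1 <- 212; x1 <- 27467; s1 <- 27681
n2 <- 175; x2 <- 32428; s2 <- 35774

sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)  # pooled variance
t_pooled <- (x1 - x2) / sqrt(sp2 / n1 + sp2 / n2)
2 * pt(-abs(t_pooled), df = n1 + n2 - 2)  # two-sided p-value
```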
https://ismayc.github.io/moderndiver-book/B-appendixB.html#two-means-paired-samples
We can also conduct hypothesis tests with paired data. If data are paired, and the response variable is quantitative, then the outcome of interest is the mean difference. In a population this is \(\mu_{diff}\) and in a sample \(\bar x_{diff}\). We would first compute the differences for each case, then treat those differences as if they are the variable of interest and conduct a single sample mean test.
Trace metals in drinking water affect the flavor and an unusually high concentration can pose a health hazard. Ten pairs of data were taken measuring zinc concentration in bottom water and surface water at 10 randomly selected locations on a stretch of river. Do the data suggest that the true average concentration in the surface water is smaller than that of bottom water?
Data: \(n=10\), \(\bar x_{diff} = -0.08\), \(s_{diff} = 0.052\)
\(\mu_1\): average concentration in surface water
\(\mu_2\): average concentration in bottom water
\(H_0: \mu_1 = \mu_2\)
\(H_1: \mu_1 < \mu_2\)
or
\(H_0: \mu_{diff} = 0\)
\(H_1: \mu_{diff} < 0\)
\(\mu_{diff}\) is the mean difference in concentration for surface water minus bottom water.
Guess about statistical significance.
We want to know if there is a statistically significant difference in concentration between surface water and bottom water; that is, whether the sample mean difference of -0.08 is statistically lower than 0. The difference seems close to 0 and the sample size is small (\(n=10\)). We guess there is not enough evidence to reject the null.
\(\alpha = 0.05\)
An estimate for the population mean difference \(\mu_{diff}\) is the sample mean difference \(\bar X_{diff}\). Assuming the null is true, we can standardize the original test statistic \(\bar X_{diff}\) into a \(T\) statistic that follows a \(t\) distribution with degrees of freedom df = \(n-1\).
\[T= \frac{\bar X_{diff}}{S_{diff}/\sqrt{n}} \sim t(n-1)\] Assumptions: Independent observations (observations among pairs are independent) and normality (the population of differences is normal or the number of pairs is \(\geq\) 30). Here we only have 10 pairs, fewer than the 30 needed, so we would need to check that the differences are approximately normal!
```r
n <- 10
xdiff <- -0.08
sdiff <- 0.052
t <- xdiff / (sdiff / sqrt(n))
t
## [1] -4.865043
```
Probability of obtaining a test statistic equal to -4.86 or lower under the null hypothesis.
```r
df <- n - 1
pt(t, df)
## [1] 0.0004447997
```
p-value < \(\alpha\), we reject the null hypothesis. We have evidence that the mean concentration in the surface water is lower than in the bottom water.
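With the raw measurements, the built-in paired test would do all of this in one call; the vectors below are simulated stand-ins, not the real data:

```r
# Hypothetical data standing in for the 10 paired measurements (not the real data)
set.seed(1)
bottom <- runif(10, 0.2, 0.8)                    # assumed bottom-water concentrations
surface <- bottom - 0.08 + rnorm(10, sd = 0.05)  # surface assumed lower on average
t.test(surface, bottom, paired = TRUE, alternative = "less")
```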