R Syntax

# Choose a file interactively
file.choose()

# Set header = TRUE if the first row of the data
# corresponds to the names of the variables

# Vector with values 3, 6, 7
c(3, 6, 7)

# Value of data d in row 3 and column 7
d[3, 7]

# Row 3 of data d
d[3, ]

# Column 7 of data d
d[, 7]

# Rows 3 and 5 of data d
d[c(3, 5), ]

# Columns 7 and 9 of data d
d[ , c(7, 9)]

# Data d without rows 3 and 5
d[-c(3, 5), ]

# Data d without columns 7 and 9
d[ , -c(7, 9)]
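
The indexing rules above can be tried on a small made-up data set. The sketch below is only an illustration: the 5 x 9 table, its values, and the temporary file are assumptions, not part of the exercises. It writes the table to a CSV file and reads it back with `read.csv(header = TRUE)`, so that the first row supplies the variable names.

```r
# Hypothetical 5 x 9 data set; values and names (V1..V9) are made up
d0 <- as.data.frame(matrix(1:45, nrow = 5, ncol = 9))

# Write it to a temporary CSV whose first row holds the variable names
path <- tempfile(fileext = ".csv")
write.csv(d0, path, row.names = FALSE)

# header = TRUE: the first row is read as variable names, not as data
d <- read.csv(path, header = TRUE)

d[3, 7]               # single value in row 3, column 7
d[3, ]                # row 3 (a one-row data frame)
d[, 7]                # column 7 (a vector)
dim(d[c(3, 5), ])     # rows 3 and 5: 2 rows, 9 columns
dim(d[-c(3, 5), ])    # without rows 3 and 5: 3 rows, 9 columns
dim(d[, -c(7, 9)])    # without columns 7 and 9: 5 rows, 7 columns
```

With your own data, `file.choose()` can replace `path` to pick the file interactively.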

R

If you do not have R installed, you can search for an “online R compiler” and run the code in a browser.

# 1 Probability distributions and the Central Limit Theorem

## 1.1 Area under the curve

What percent of a standard normal distribution $$N(\mu = 0, \sigma = 1)$$ is found in each region? Be sure to draw a graph.

1. $$Z < -1.35$$
2. $$Z > 1.48$$
3. $$-0.4 < Z < 1.5$$
4. $$|Z| > 2$$

Solution

1. $$P(Z < -1.35)$$
x <- seq(-5, 5, length.out = 100)
plot(x, dnorm(x), type = "l")
abline(v = -1.35)

pnorm(-1.35)
## [1] 0.08850799
2. $$P(Z > 1.48) = 1 - P(Z < 1.48)$$
plot(x, dnorm(x), type = "l")
abline(v = 1.48)

1-pnorm(1.48)
## [1] 0.06943662
3. $$P(-0.4 < Z < 1.5) = P(Z < 1.5) - P(Z < -0.4)$$
plot(x, dnorm(x), type = "l")
abline(v = c(-0.4, 1.5))

pnorm(1.5)-pnorm(-0.4)
## [1] 0.5886145
4. $$P(|Z| > 2) = P(Z > 2 \mbox{ or } Z < -2)$$
plot(x, dnorm(x), type = "l")
abline(v = c(-2, 2))

By symmetry of the normal distribution, $$P(Z > 2) + P(Z < -2) = 2 \times P(Z < -2)$$

2*pnorm(-2)
## [1] 0.04550026

We can also calculate $$P(Z > 2) + P(Z < -2) = (1-P(Z < 2)) + P(Z < -2)$$

1 - pnorm(2) + pnorm(-2)
## [1] 0.04550026
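
As a cross-check, each probability above is literally an area under the density curve, so it can also be obtained by numerically integrating `dnorm()`. This is a minimal sketch using base R's `integrate()`; the variable names `left`, `middle`, and `tails` are chosen here for illustration.

```r
# Area under the standard normal density to the left of -1.35
left <- integrate(dnorm, lower = -Inf, upper = -1.35)$value

# Area between -0.4 and 1.5
middle <- integrate(dnorm, lower = -0.4, upper = 1.5)$value

# Two-tailed area beyond |Z| = 2
tails <- integrate(dnorm, -Inf, -2)$value + integrate(dnorm, 2, Inf)$value

# Differences from the pnorm() results; each should be close to zero
c(left - pnorm(-1.35),
  middle - (pnorm(1.5) - pnorm(-0.4)),
  tails - 2 * pnorm(-2))
```

In practice `pnorm()` is the simpler tool; the integration only makes the "area under the curve" interpretation explicit.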

## 1.2 Overweight baggage

Suppose weights of the checked baggage of airline passengers follow a nearly normal distribution with mean 45 pounds and standard deviation 3.2 pounds. Most airlines charge a fee for baggage that weighs more than 50 pounds. Determine what percent of airline passengers incur this fee.

Solution

$$X \sim N(\mu = 45, \sigma = 3.2)$$

$$P(X > 50)$$

1-pnorm(50, mean = 45, sd = 3.2)
## [1] 0.05908512
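
The same answer can be reached by standardizing first, which ties this problem back to the standard normal calculations in 1.1. The sketch below computes the Z-score of the 50-pound cutoff and uses `lower.tail = FALSE` to get the upper-tail area directly; the variable names are illustrative.

```r
mu <- 45      # mean baggage weight (pounds)
sigma <- 3.2  # standard deviation (pounds)

# Standardize the 50-pound cutoff: z = (x - mu) / sigma
z <- (50 - mu) / sigma
z

# Upper-tail probability on the standard normal scale
p_z <- pnorm(z, lower.tail = FALSE)

# Same probability computed directly on the original scale
p_x <- pnorm(50, mean = mu, sd = sigma, lower.tail = FALSE)

# The two approaches agree
all.equal(p_z, p_x)
```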

## 1.3 LA weather

The average daily high temperature in June in LA is 77$$^\circ$$F with a standard deviation of 5$$^\circ$$F. Suppose that the temperatures in June closely follow a normal distribution.

1. What is the probability of observing an 83$$^\circ$$F temperature or higher in LA during a randomly chosen day in June?
2. How cool are the coldest 10% of the days (days with lowest average high temperature) during June in LA?

Solution

$$X \sim N(\mu = 77, \sigma = 5)$$

1. $$P(X > 83)$$
1-pnorm(83, mean = 77, sd = 5)
## [1] 0.1150697
2. Temperature $$x$$ such that $$P(X < x) = 0.10$$
qnorm(0.10, mean = 77, sd = 5)
## [1] 70.59224
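
The cutoff temperature can also be built from the standard normal quantile, using $$x = \mu + z\sigma$$. The sketch below does this and then checks the result by plugging it back into `pnorm()`; the names `z10` and `cutoff` are illustrative.

```r
mu <- 77     # mean daily high (degrees F)
sigma <- 5   # standard deviation (degrees F)

# 10th percentile of the standard normal, then rescale to degrees F
z10 <- qnorm(0.10)
cutoff <- mu + z10 * sigma
cutoff

# Round trip: the probability below the cutoff should be 0.10
pnorm(cutoff, mean = mu, sd = sigma)
```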

## 1.4 GRE scores

The mean score for the Verbal Reasoning section for all the Graduate Record Examination (GRE) takers was 151 with a standard deviation of 7, and the mean score for the Quantitative Reasoning section was 153 with a standard deviation of 7.67. Suppose that both distributions are nearly normal.

1. Write down the short-hand for these two normal distributions.
2. What is the score of a student who scored in the 80th percentile on the Quantitative Reasoning section?
3. What is the score of a student who scored worse than 70% of the test takers in the Verbal Reasoning section?

Solution

Verbal reasoning: $$X \sim N(\mu = 151, \sigma = 7)$$

Quantitative reasoning: $$Y \sim N(\mu = 153, \sigma = 7.67)$$

2. Quantitative Reasoning score $$y$$ such that $$P(Y < y) = 0.80$$
qnorm(0.80, mean = 153, sd = 7.67)
## [1] 159.4552
3. Verbal Reasoning score $$x$$ such that $$P(X > x) = 0.70$$, that is $$1 - P(X < x) = 0.70$$, so $$P(X < x) = 0.30$$
qnorm(0.30, mean = 151, sd = 7)
## [1] 147.3292
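
For the Verbal Reasoning question, the conversion from an upper-tail probability of 0.70 to a lower-tail probability of 0.30 can be left to R with `lower.tail = FALSE`. The sketch below shows that both routes give the same score and checks the Quantitative Reasoning percentile by a round trip; the variable names are illustrative.

```r
# Verbal Reasoning: score exceeded by 70% of the test takers
score_upper <- qnorm(0.70, mean = 151, sd = 7, lower.tail = FALSE)
score_lower <- qnorm(0.30, mean = 151, sd = 7)
all.equal(score_upper, score_lower)

# Quantitative Reasoning: 80th percentile, then check it with pnorm()
q80 <- qnorm(0.80, mean = 153, sd = 7.67)
pnorm(q80, mean = 153, sd = 7.67)
```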

## 1.5 Hen eggs (CLT)

The distribution of the number of eggs laid by a certain species of hen during their breeding period has a mean of 35 eggs with a standard deviation of 18.2. Suppose a group of researchers randomly samples 45 hens of this species, counts the number of eggs laid during their breeding period, and records the sample mean. They repeat this 1,000 times, and build a distribution of sample means.

1. What is this distribution called?
2. Would you expect the shape of this distribution to be symmetric, right skewed, or left skewed? Explain your reasoning.
3. Calculate the variability of this distribution and state the appropriate term used to refer to this value.
4. Suppose the researchers’ budget is reduced and they are only able to collect random samples of 30 hens. The sample mean of the number of eggs is recorded, and we repeat this 1,000 times, and build a new distribution of sample means. How will the variability of this new distribution compare to the variability of the original distribution?

Solution

1. We are building a distribution of sample statistics, in this case the sample mean. Such a distribution is called a sampling distribution.
2. Because we are dealing with the distribution of sample means, we need to check whether the Central Limit Theorem applies. Our sample size (45) is greater than 30, and we are told that random sampling is employed. With these conditions met, we expect the distribution of the sample mean to be nearly normal and therefore symmetric.
3. Because we are dealing with a sampling distribution, we measure its variability with the standard error. $$SE = \sigma/\sqrt{n} = 18.2/\sqrt{45}= 2.713$$
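4. Because $$SE = \sigma/\sqrt{n}$$, a smaller sample size gives a larger standard error: $$SE = 18.2/\sqrt{30} = 3.323$$, compared with 2.713 for $$n = 45$$. The sample means will be more variable with the smaller sample size.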
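
The researchers' procedure can be imitated by simulation. The exercise does not say how egg counts are distributed, so the sketch below assumes a right-skewed gamma population with mean 35 and standard deviation 18.2 (a continuous stand-in for counts, chosen only for illustration). The seed and the helper `sample_means()` are likewise hypothetical.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

mu <- 35
sigma <- 18.2

# Assumed population: gamma with the stated mean and standard deviation
# (mean = shape / rate, variance = shape / rate^2)
shape <- (mu / sigma)^2
rate <- mu / sigma^2

# Draw `reps` samples of size n and return their sample means
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(rgamma(n, shape = shape, rate = rate)))
}

means_45 <- sample_means(45)
means_30 <- sample_means(30)

# Simulated spread of the sample means versus sigma / sqrt(n)
c(simulated = sd(means_45), theoretical = sigma / sqrt(45))
c(simulated = sd(means_30), theoretical = sigma / sqrt(30))
```

Calling `hist(means_45)` afterwards shows the shape of the simulated sampling distribution, which should look roughly symmetric even though the assumed population is skewed.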

## 1.6 Identify the parameter (Inference)

For each of the following situations, state whether the parameter of interest is a mean or a proportion. It may be helpful to examine whether individual responses are numerical or categorical.

1. In a survey, one hundred college students are asked how many hours per week they spend on the Internet.
2. In a survey, one hundred college students are asked: “What percentage of the time you spend on the Internet is part of your course work?”
3. In a survey, one hundred college students are asked whether or not they cited information from Wikipedia in their papers.
4. In a sample of one hundred recent college graduates, it is found that 85 percent expect to get a job within one year of their graduation date.

Solution

1. Mean. Each student reports a numerical value: a number of hours.
2. Mean. Each student reports a number, which is a percentage, and we can average over these percentages.
3. Proportion. Each student reports Yes or No, so this is a categorical variable and we use a proportion.
4. Proportion. Each graduate reports whether or not they expect to get a job, so this is a categorical variable and we use a proportion.

## 1.7 Quality control (Inference)

As part of a quality control process for computer chips, an engineer at a factory randomly samples 212 chips during a week of production to test the current rate of chips with severe defects. She finds that 27 of the chips are defective.

1. What population is under consideration in the data set?
2. What parameter is being estimated?
3. What is the point estimate for the parameter?

Solution

1. The population under consideration is all computer chips produced at the factory during the week of production.
2. The parameter being estimated is the proportion of chips with severe defects among those manufactured at the factory during the week of production.
3. A point estimate for this parameter is the proportion of chips with severe defects in the sample: $$\hat p = 27/212 = 0.127$$.
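
The point estimate is a one-line calculation in R. The variable names below are illustrative.

```r
n <- 212         # chips sampled
defective <- 27  # sampled chips with severe defects

# Sample proportion, used as the point estimate of the defect rate
p_hat <- defective / n
round(p_hat, 3)
```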