In this vignette, we work through an example Z-test and flag several places where you might get stuck along the way.
Problem setup
Let’s suppose that a student is interested in estimating how many memes their professors know and love. So they go to class, and every time a professor uses a new meme, they write it down. After a year of classes, the student has recorded the following meme counts, where each count corresponds to a single class they took:
The student talks to some other students who’ve done similar studies and determines that $\sigma = 2$ is a reasonable value for the standard deviation of this distribution.
Assumption checking
Before we can do a Z-test, we need to check whether we can reasonably treat the mean of this sample as normally distributed. This is the case if either of the following holds:
- The data comes from a normal distribution.
- We have lots of data. How much? Many textbooks use 30 data points as a rule of thumb.
Since we have a small sample, we should check if the data comes from a normal distribution using a normal quantile-quantile plot.
Since the data lie close to the reference line and show no notable systematic deviations from it, it’s safe to treat the sample as coming from a normal distribution. We can proceed with our hypothesis test.
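A normal quantile-quantile plot of this kind can be sketched in base R with qqnorm() and qqline(). The vector demo_counts below is a hypothetical, simulated stand-in and not the student's recorded counts; its mean and standard deviation are arbitrary choices for illustration. Swap in the real sample to reproduce the check.

```r
# Hypothetical stand-in sample (NOT the student's data): 10 simulated values.
set.seed(1)
demo_counts <- rnorm(10, mean = 4.5, sd = 2)

# Compute the Q-Q coordinates without drawing anything:
# qq$x holds theoretical normal quantiles, qq$y the sample values.
qq <- qqnorm(demo_counts, plot.it = FALSE)

# Draw the plot and a reference line through the first and third quartiles.
# Guarded so that non-interactive runs do not open a graphics device.
if (interactive()) {
  qqnorm(demo_counts)
  qqline(demo_counts)
}
```

Points that hug the reference line, with no curvature or S-shape, are what "close to normal" looks like on this plot.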
Null hypothesis and test statistic
Let’s test the null hypothesis that, on average, professors know 3 memes. That is
$$ H_0: \mu = 3 \qquad H_A: \mu \neq 3 $$
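First we need to calculate our Z-statistic. Let’s do this with R. Remember that the Z-statistic is defined as
$$ Z = \frac{\bar x - \mu_0}{\sigma / \sqrt{n}} \sim \mathrm{Normal}(0, 1) $$
In R this looks like:
# x holds the meme counts recorded above
n <- length(x)
# calculate the z-statistic
z_stat <- (mean(x) - 3) / (2 / sqrt(n))
z_stat
#> [1] 2.371708
Calculating p-values
To calculate a two-sided p-value, we need to find
$$ P(|Z| \ge |2.37|) = P(Z \ge 2.37) + P(Z \le -2.37) = 1 - \Phi(2.37) + \Phi(-2.37) $$
where $\Phi$ is the c.d.f. of a standard normal. To do this we need the c.d.f. of a standard normal, which distributions3 provides through cdf():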
library(distributions3)
Z <- Normal(0, 1) # make a standard normal r.v.
1 - cdf(Z, 2.37) + cdf(Z, -2.37)
#> [1] 0.01778809
Note that we saved z_stat above, so we could have also done
1 - cdf(Z, abs(z_stat)) + cdf(Z, -abs(z_stat))
which is slightly more accurate since there is no rounding error.
So our p-value is about 0.0177. You should verify this with a Z-table. Note that you should get the same value from cdf(Z, 2.37) and from looking up 2.37 on a Z-table.
You may also have seen a different formula for the p-value of a two-sided Z-test, which makes use of the fact that the normal distribution is symmetric:
$$ P(|Z| \ge |2.37|) = 2 \cdot P(Z \le -|2.37|) = 2 \cdot \Phi(-2.37) $$
Using this formula we get the same result:
2 * cdf(Z, -2.37)
#> [1] 0.01778809
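As a cross-check of the two formulas, here is a minimal sketch in base R, where pnorm() stands in for cdf(Z, ·) so that the snippet runs without extra packages. The value 2.371708 is the unrounded z-statistic printed earlier; the variable names are illustrative only.

```r
z_rounded <- 2.37      # value used in the hand calculation
z_exact   <- 2.371708  # z_stat as printed earlier

# Long form: P(Z >= |z|) + P(Z <= -|z|)
p_long <- 1 - pnorm(abs(z_rounded)) + pnorm(-abs(z_rounded))

# Symmetric form: 2 * P(Z <= -|z|)
p_sym <- 2 * pnorm(-abs(z_rounded))

# Same symmetric formula with the unrounded statistic
p_exact <- 2 * pnorm(-abs(z_exact))

# The two forms agree up to floating-point tolerance
all.equal(p_long, p_sym)

# Rounding z to 2.37 only changes the p-value in the fourth decimal place
round(c(rounded = p_sym, unrounded = p_exact), 4)
```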
Finally, sometimes we are interested in one-sided Z-tests. For the test
$$ H_0: \mu \le 3 \qquad H_A: \mu > 3 $$
the p-value is given by
$$ P(Z > 2.37) $$
which we calculate with
1 - cdf(Z, 2.37)
#> [1] 0.008894043
For the test
$$ H_0: \mu \ge 3 \qquad H_A: \mu < 3 $$
the p-value is given by
$$ P(Z < 2.37) $$
which we calculate with
cdf(Z, 2.37)
#> [1] 0.991106
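The three p-value calculations above differ only in which tail they use, so they can be collected into one small function. The helper below is a hypothetical sketch, not part of distributions3; it uses base R's pnorm() as the standard normal c.d.f., and the name z_p_value and its alternative argument are assumptions made for illustration.

```r
# Hypothetical helper: p-value of a Z-test for a given z-statistic.
z_p_value <- function(z, alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pnorm(-abs(z)),  # H_A: mu != mu_0
    greater   = 1 - pnorm(z),        # H_A: mu >  mu_0
    less      = pnorm(z)             # H_A: mu <  mu_0
  )
}

# These calls should reproduce the three values printed above.
z_p_value(2.37, "two.sided")
z_p_value(2.37, "greater")
z_p_value(2.37, "less")
```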
Rejection regions
Preface: I am strongly opposed to making a dichotomous “reject/fail to reject” decision for hypothesis tests. If you do a hypothesis test, you should report the p-value, period. Picking an arbitrary significance level as a rejection threshold and treating it as a gold standard is ridiculous, as evidenced by 60 years of statistical literature laden with warnings about hypothesis testing. That said, sometimes it can be useful to think about when you would reject a null hypothesis.
We can think about three different rejection regions for a Z-test:
- The rejection region in terms of the p-value
- The rejection region in terms of the test statistic
- The rejection region in terms of the sample mean
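For a given significance level $\alpha$, all of these rejection regions are equivalent. We’ll start by thinking about the rejection region of a two-sided test. That is
$$ H_0: \mu = \mu_0 \qquad H_A: \mu \neq \mu_0 $$
We then calculate a test statistic $Z$ and a p-value $p$, and reject when $p < \alpha$. This defines our first rejection region. Using our observation from before, this is exactly equivalent to rejecting when
$$ 2 \cdot \Phi(-|Z|) < \alpha $$
and this last statement is exactly the same as rejecting when $Z < z_{\alpha/2}$ or $Z > z_{1-\alpha/2}$, where $z_q$ denotes the $q$-quantile of the standard normal, so that $\Phi(z_q) = q$. This is our second rejection region, in terms of the test statistic. Finally, recall that
$$ Z = \frac{\bar x - \mu_0}{\sigma / \sqrt{n}} $$
So we take the conditions $Z < z_{\alpha/2}$ and $Z > z_{1-\alpha/2}$ and rearrange them in terms of $\bar x$:
$$ \bar x < \mu_0 + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} $$
and
$$ \bar x > \mu_0 + z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} $$
You can also think about this in terms of $\mu_0$. We will reject the test when $\mu_0$ is not in
$$ \left( \bar x - z_{1-\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}, \; \bar x - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \right) $$
which you may recognize as the $1 - \alpha$ confidence interval for $\mu$! (Since $z_{\alpha/2} = -z_{1-\alpha/2}$, this is the familiar $\bar x \pm z_{1-\alpha/2} \cdot \sigma / \sqrt{n}$.) So the confidence interval contains all the values of $\mu_0$ that we cannot reject at the $\alpha$ level. You can perform a similar calculation for a one-sided test, resulting in a one-sided confidence bound, where one end of the interval is either $-\infty$ or $\infty$.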
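To see the equivalence numerically, the sketch below evaluates all of these rejection rules for the running example at an assumed level of 0.05. It uses base R's pnorm() and qnorm() rather than distributions3, takes the z-statistic 2.371708 printed earlier, and recovers the sample mean from it; n = 10 is the sample size used in this vignette's power calculation. The variable names are illustrative.

```r
alpha  <- 0.05      # assumed significance level
mu0    <- 3         # null value from the example
sigma  <- 2         # known standard deviation used in the example
n      <- 10        # sample size used in the power calculation
z_stat <- 2.371708  # z-statistic as printed earlier

se   <- sigma / sqrt(n)
xbar <- mu0 + z_stat * se     # sample mean recovered from z_stat

z_lo <- qnorm(alpha / 2)      # z_{alpha/2}, negative
z_hi <- qnorm(1 - alpha / 2)  # z_{1 - alpha/2}, positive

# 1. Rejection region in terms of the p-value
reject_p <- 2 * pnorm(-abs(z_stat)) < alpha

# 2. Rejection region in terms of the test statistic
reject_z <- z_stat < z_lo || z_stat > z_hi

# 3. Rejection region in terms of the sample mean
reject_xbar <- xbar < mu0 + z_lo * se || xbar > mu0 + z_hi * se

# Equivalent view: mu0 falls outside the 1 - alpha confidence interval
ci <- c(lower = xbar - z_hi * se, upper = xbar - z_lo * se)
reject_ci <- mu0 < ci[["lower"]] || mu0 > ci[["upper"]]

# All four decisions agree
c(p = reject_p, z = reject_z, xbar = reject_xbar, ci = reject_ci)
ci
```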
Power and sample size calculations
Formulas for power
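We want to make sure that we actually reject our null hypothesis in the case that it is false. That is, we would like to make sure that our test has high power. Mathematically, this means that we want $P(\text{reject } H_0 \mid H_0 \text{ is false})$ to be large. The problem here is that $H_0$ can be wrong in many different ways: the true $\mu$ could be any value other than $\mu_0$, whether close to it or far away. So calculating power as formulated above is not really possible. However, we can calculate power for specific versions of “$H_0$ is false”.
Let’s consider the case that $H_0$ is false, and in particular the true value of $\mu$ is $\mu_A$. In this case, the power of our test is $P(\text{reject } H_0 \mid \mu = \mu_A)$. Recall that we reject when $Z < z_{\alpha/2}$ or $Z > z_{1-\alpha/2}$, so the power of our test when $\mu = \mu_A$ is
$$ P(Z < z_{\alpha/2} \mid \mu = \mu_A) + P(Z > z_{1-\alpha/2} \mid \mu = \mu_A) $$
Remember that $Z = \frac{\bar x - \mu_0}{\sigma / \sqrt{n}}$. This means that, given $\mu = \mu_A$, $\bar x \sim \mathrm{Normal}(\mu_A, \sigma^2 / n)$, so $\frac{\bar x - \mu_A}{\sigma / \sqrt{n}} = Z + \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}}$ is standard normal, which lets us calculate the probabilities we need to find the power:
$$ P(Z < z_{\alpha/2} \mid \mu = \mu_A) = \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} \right) $$
and similarly
$$ P(Z > z_{1-\alpha/2} \mid \mu = \mu_A) = 1 - \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{1-\alpha/2} \right) $$
So, the power of our test, if the true population mean is $\mu_A$, is
$$ \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} \right) + 1 - \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{1-\alpha/2} \right) $$
Let’s calculate this if $\mu_A = 5$, with $\mu_0 = 3$, $\sigma = 2$, $n = 10$ and $\alpha = 0.05$.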
power_lower <- (3 - 5) / (2 / sqrt(10)) + quantile(Z, 0.025)
power_upper <- (3 - 5) / (2 / sqrt(10)) + quantile(Z, 1 - 0.025)
cdf(Z, power_lower) + (1 - cdf(Z, power_upper))
#> [1] 0.8853791
This means that the probability that we reject the null hypothesis ($H_0: \mu = 3$) if the true mean is $5$ is about 0.89.
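The power formula can be wrapped in a function so that it is easy to evaluate at several alternatives. This is a minimal sketch in base R (pnorm() and qnorm() in place of cdf() and quantile() on Z); the function name z_test_power and its argument names are hypothetical.

```r
# Hypothetical helper: power of the two-sided one-sample Z-test when the
# true mean is mu_a.
z_test_power <- function(mu_a, mu0, sigma, n, alpha = 0.05) {
  shift <- (mu0 - mu_a) / (sigma / sqrt(n))
  lower <- pnorm(shift + qnorm(alpha / 2))          # P(Z < z_{alpha/2})
  upper <- 1 - pnorm(shift + qnorm(1 - alpha / 2))  # P(Z > z_{1-alpha/2})
  lower + upper
}

# Should reproduce the value printed above for a true mean of 5
z_test_power(mu_a = 5, mu0 = 3, sigma = 2, n = 10)

# Power at several true means; at mu_a = mu0 it equals alpha
mu_grid <- c(3, 3.5, 4, 4.5, 5)
round(z_test_power(mu_grid, mu0 = 3, sigma = 2, n = 10), 3)
```

Evaluating the function on a grid like this shows how power grows as the true mean moves away from the null value.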
Formulas for sample size calculations
Often researchers like to go in the other direction: aim for a specific level of power, and calculate how many observations are needed to reach that level. To achieve a power of $1 - \beta$ for a one-sample Z-test with $H_0: \mu = \mu_0$ when the true mean is $\mu_A$, you need
$$ n \approx \left( { \sigma \cdot (z_{\alpha / 2} + z_\beta) \over \mu_0 - \mu_A } \right)^2 $$
samples. If $n$ is not an integer, round up. Often, the denominator $\mu_0 - \mu_A$ is thought of as the detectable difference. So, the question becomes how many samples are required to have sufficient power to detect a difference of some particular size.
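This equation is simply a rewrite of the equation presented above for power. Recall, the power for a two-sided test is
$$ \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} \right) + 1 - \Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{1-\alpha/2} \right) $$
Usually, only one of these terms is contributing while the other is very close to zero. Let’s say the first term is the one clearly different from zero. To determine the sample size, we want to determine $n$ such that $\Phi\left( \frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} \right) = 1 - \beta$. Or, similarly, $\frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} = \Phi^{-1}(1 - \beta)$. I.e. we need $\frac{\mu_0 - \mu_A}{\sigma / \sqrt{n}} + z_{\alpha/2} = z_{1-\beta}$.
Since $z_{1-\beta} = -z_\beta$, solving for $n$ gives $\sqrt{n} = -\sigma \cdot (z_{\alpha/2} + z_\beta) / (\mu_0 - \mu_A)$, and we have the equation above:
$$ n \approx \left( { \sigma \cdot (z_{\alpha / 2} + z_\beta) \over \mu_0 - \mu_A } \right)^2 $$
The sign is lost in the square, so only the magnitudes of the two quantiles matter.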
As an example, say the student prior to the experiment had determined that they wanted to test if the number of memes their professors know and love is 2. They want to make sure their sample size is large enough so that they are likely to reject the null hypothesis if the true number is 3. They determine that they want a probability of 0.9 of rejecting the null if the true number is 3. So, a sample size calculation looks like this:
$$ n \approx \left( { 2 \cdot (1.96 + 1.28) \over 2-3 } \right)^2 = 41.99 $$
So to make sure that they reject the null hypothesis with a probability of 0.9 if the true value is 3, they would need 42 observations, since 41.99 rounds up to 42.
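The same calculation can be done in R, using quantile(Z, p) for the standard normal quantiles instead of the rounded values 1.96 and 1.28. This gives approximately 42.03 rather than 41.99.
Note the small discrepancy. This is due to rounding error in the quantiles, and here it is just enough to move the rounded-up sample size from 42 to 43.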
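A minimal sketch of the sample size calculation in base R follows, with qnorm() supplying the quantiles. The helper name z_test_sample_size and its arguments are hypothetical; it plugs in the positive quantile magnitudes, which is harmless because the expression is squared.

```r
# Hypothetical helper: approximate sample size for a two-sided
# one-sample Z-test to reach the requested power.
z_test_sample_size <- function(mu0, mu_a, sigma, power = 0.9, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha / 2)  # about 1.96 for alpha = 0.05
  z_beta  <- qnorm(power)          # about 1.28 for power = 0.9
  (sigma * (z_alpha + z_beta) / (mu0 - mu_a))^2
}

n_exact   <- z_test_sample_size(mu0 = 2, mu_a = 3, sigma = 2)
n_rounded <- (2 * (1.96 + 1.28) / (2 - 3))^2  # hand calculation

# Compare the two versions, then round up to whole observations
c(exact = n_exact, rounded = n_rounded)
ceiling(c(exact = n_exact, rounded = n_rounded))
```

Because the two results straddle 42, it is safer to round up from the unrounded quantiles when the value lands this close to an integer.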