One sample Z-tests for a proportion
Source: vignettes/one-sample-z-test-for-proportion.Rmd
In this vignette, we work through an example Z-test for a proportion, and point out a number of places where you might get stuck along the way.
Problem setup
Let’s suppose that a student is interested in estimating what percent of professors in their department watch Game of Thrones. They go to office hours and ask each professor, and it turns out 17 out of 62 professors in their department watch Game of Thrones. Several of the faculty think Game of Thrones is a board game.
We can imagine that the data is a bunch of zeros and ones, where the data point $x_i$ is one if professor $i$ watches Game of Thrones, and zero otherwise. So the full dataset is a sequence of 62 zeros and ones:

$$
x_1, x_2, \ldots, x_{62}, \qquad x_i \in \{0, 1\}
$$
But it is much easier to just remember that there are 17 ones and 45 zeros.
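As a small sketch, the zero/one dataset can be rebuilt from the two counts in base R. The variable names `n_watch`, `n_total`, and `x` are illustrative and not part of the original vignette, and the order of the entries is arbitrary.

```r
# Rebuild the 0/1 dataset from its counts (the order of the entries is arbitrary).
n_watch <- 17   # professors who watch the show
n_total <- 62   # professors asked
x <- c(rep(1, n_watch), rep(0, n_total - n_watch))

length(x)   # sample size
sum(x)      # number of ones
mean(x)     # sample proportion, identical to 17 / 62
```

The mean of a zero/one vector is the proportion of ones, which is why the two counts carry all the information the test needs.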
Assumption checking
Before we can do a Z-test, we need to check if we can reasonably treat the mean of this sample as normally distributed. The data is definitely not from a normal distribution since it’s only zeros and ones, so we need to check if the central limit theorem kicks in.
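Most of the time we would check if there were 30 data points or more, but for a proportion, we do something slightly different. When data is binary, like we have here, the central limit theorem kicks in more slowly than usual. The standard thing to check is whether the sample contains enough ones and enough zeros:

$$
n \cdot \hat \pi > 5 \qquad \text{and} \qquad n \cdot (1 - \hat \pi) > 5
$$

Here $n$ is the sample size (62 in our case) and $\hat \pi$ is the sample average. Note that some textbooks might use $\hat p$ rather than $\hat \pi$, and some use a cutoff of 10 rather than 5. In our case we have $\hat \pi = 17 / 62 \approx 0.27$, and

$$
n \cdot \hat \pi = 17 > 5 \qquad \text{and} \qquad n \cdot (1 - \hat \pi) = 45 > 5
$$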
So we’re good to go.
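The same check can be written as a few lines of base R. This is only a sketch: the cutoff stored in `threshold` is an assumption (5 is a common textbook choice, and some books use 10), and the names `pi_hat`, `successes`, `failures`, and `clt_ok` are illustrative.

```r
n <- 62
pi_hat <- 17 / 62   # sample proportion
threshold <- 5      # assumed cutoff; some textbooks use 10

successes <- n * pi_hat         # number of ones in the sample
failures <- n * (1 - pi_hat)    # number of zeros in the sample

# TRUE when both counts are above the cutoff
clt_ok <- min(successes, failures) > threshold
clt_ok
```

With 17 ones and 45 zeros the check passes for either cutoff.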
Null hypothesis and test statistic
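Let’s test the null hypothesis that, on average, twenty percent of professors watch Game of Thrones. Writing $\pi$ for the true proportion, the corresponding null hypothesis and two-sided alternative are

$$
H_0: \pi = 0.2 \qquad H_A: \pi \neq 0.2
$$

First we need to calculate our Z-statistic. Remember that the Z-statistic for a proportion is

$$
Z = \frac{\hat \pi - \pi_0}{\sqrt{\frac{\pi_0 (1 - \pi_0)}{n}}}
$$

where $\pi_0 = 0.2$ is the proportion under the null hypothesis and $n$ is the sample size.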
Calculating p-values
In R this looks like:
n <- 62
pi <- 17 / 62
pi_0 <- 0.2
# calculate the z-statistic
z_stat <- (pi - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
z_stat
#> [1] 1.460501
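For readers who want to check the R output by hand, the same arithmetic can be written out as follows. The intermediate values are rounded, so this is a sanity check rather than an exact computation.

```latex
\begin{align}
  \hat \pi &= \frac{17}{62} \approx 0.2742 \\
  \sqrt{\frac{\pi_0 (1 - \pi_0)}{n}} &= \sqrt{\frac{0.2 \cdot 0.8}{62}} \approx 0.0508 \\
  Z &\approx \frac{0.2742 - 0.2}{0.0508} \approx 1.46
\end{align}
```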
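To calculate a two-sided p-value, we need to find

$$
\begin{align}
  P(|Z| \ge |1.46|)
  &= P(Z \ge 1.46) + P(Z \le -1.46) \\
  &= 1 - P(Z \le 1.46) + P(Z \le -1.46)
\end{align}
$$

To do this we need the c.d.f. of a standard normal: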
library(distributions3)
#>
#> Attaching package: 'distributions3'
#> The following object is masked from 'package:stats':
#>
#> Gamma
#> The following object is masked from 'package:grDevices':
#>
#> pdf
Z <- Normal(0, 1) # make a standard normal r.v.
1 - cdf(Z, 1.46) + cdf(Z, -1.46)
#> [1] 0.1442901
Note that we saved z_stat above, so we could have also done
1 - cdf(Z, abs(z_stat)) + cdf(Z, -abs(z_stat))
which is slightly more accurate since there is no rounding error.
So our p-value is about 0.14. You should verify this with a Z-table.
Note that you should get the same value from cdf(Z, 1.46) and from looking up 1.46 on a Z-table.
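If distributions3 is not installed, the same numbers can be cross-checked with base R's `pnorm()`, which is the standard normal c.d.f. by default. This is a sketch that swaps in `pnorm()` for the vignette's `cdf(Z, ...)`; the names `p_rounded` and `p_exact` are illustrative.

```r
# Recompute the Z-statistic so this block runs on its own
z_stat <- (17 / 62 - 0.2) / sqrt(0.2 * (1 - 0.2) / 62)

# Two-sided p-value using the rounded statistic, as in the text
p_rounded <- 1 - pnorm(1.46) + pnorm(-1.46)

# Two-sided p-value using the unrounded statistic
p_exact <- 1 - pnorm(abs(z_stat)) + pnorm(-abs(z_stat))

p_rounded
p_exact
```

The two values differ only slightly, which reflects the rounding of the statistic to 1.46.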
You may also have seen a different formula for the p-value of a two-sided Z-test, which makes use of the fact that the normal distribution is symmetric:

$$
P(|Z| \ge |1.46|) = 2 \cdot P(Z \le -|1.46|)
$$
Using this formula we get the same result:
2 * cdf(Z, -1.46)
#> [1] 0.1442901
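The whole procedure can be collected into one small function. The sketch below uses base R only; `z_test_proportion()` is a hypothetical helper written for illustration, not a function from distributions3. As a cross-check it compares the result with base R's `prop.test()`, which, with the continuity correction switched off, reports the square of the Z-statistic as its chi-squared statistic.

```r
# Hypothetical helper: two-sided one-sample Z-test for a proportion
z_test_proportion <- function(successes, n, pi_0) {
  pi_hat <- successes / n
  z <- (pi_hat - pi_0) / sqrt(pi_0 * (1 - pi_0) / n)
  p_value <- 2 * pnorm(-abs(z))   # symmetric two-sided formula
  list(estimate = pi_hat, statistic = z, p_value = p_value)
}

result <- z_test_proportion(17, 62, 0.2)
result$statistic
result$p_value

# Cross-check against prop.test() without the continuity correction
check <- prop.test(17, 62, p = 0.2, correct = FALSE)
all.equal(result$p_value, check$p.value)
all.equal(result$statistic^2, unname(check$statistic))
```

Because the function uses the unrounded statistic, its p-value is marginally smaller than the 0.1442901 obtained from 1.46.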