Intro to hypothesis testing
Source: vignettes/intro-to-hypothesis-testing.Rmd
Hypothesis testing is all about answering the question: for a parameter $\theta$, is a parameter value of $\theta_0$ consistent with the data in our observed sample?
We call this the null hypothesis and write

$$ H_0: \theta = \theta_0 $$

where this means that the true (population) value of the parameter $\theta$ is equal to some value $\theta_0$.
What do we do next? We assume that $\theta = \theta_0$ in the population, and then check if this assumption is compatible with our observed data. The population with $\theta = \theta_0$ corresponds to a probability distribution, which we call the null distribution.
Let’s make this concrete. Suppose that we observe data $x_1, \ldots, x_n$ and we know that our data comes from a normal distribution with known variance $\sigma^2$. Realistically, we won’t know $\sigma^2$, or that our data is normal, but we’ll work with these assumptions for now and relax them later.
Let’s suppose we’re interested in the population mean, and that our sample turns out to have mean $\bar x = 4$ with standard error $\sigma / \sqrt n = \sqrt{2/3}$. Let’s guess that the population mean is 8. In this case we would write the null hypothesis as $H_0: \mu = 8$. This is a ridiculous guess for the population mean given our data, but it’ll illustrate our point. Our null distribution is then $\text{Normal}(8, \sigma^2)$.
Now that we have a null distribution, we need to dream up a test statistic. In this class, you’ll always be given a test statistic. For now we’ll use the Z statistic.
$$ Z = {\bar x - \mu_0 \over \mathrm{se}\left(\bar x \right)} = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {4 - 8 \over \sqrt{\frac{2}{3}}} \approx -4.9 $$
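If you want to check this arithmetic, a minimal sketch in base R (using the $\bar x = 4$ and $\sigma / \sqrt n = \sqrt{2/3}$ from above) looks like:

```r
x_bar <- 4          # observed sample mean
se <- sqrt(2 / 3)   # known standard error, sigma / sqrt(n)
mu_0 <- 8           # hypothesized mean under the null

z <- (x_bar - mu_0) / se
z
#> [1] -4.898979
```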
Recall: a statistic is a function from a random sample into the real line. Since statistics are functions of random samples, they are themselves random variables.
Test statistics are chosen to have two important properties:
- They need to relate to the population parameter we’re interested in measuring
- We need to know their sampling distributions
Sampling distributions, you say! Why do test statistics have sampling distributions? Because we’re just taking a function of a random sample.
For this example, we know that

$$ Z \sim \text{Normal}(0, 1) $$

under the null hypothesis, and now we ask how probable our observed statistic is given that the null distribution is true.
The idea is that if this number is very small, then our null distribution can’t be correct: we shouldn’t observe highly unlikely statistics. This means that hypothesis testing is a form of falsification testing.
For the example above, we are interested in the probability of observing a more extreme test statistic given the null distribution, which in this case is:

$$ P\left(|Z| \ge 4.9\right) = 2 \cdot P\left(Z \le -4.9\right) \approx 9.6 \times 10^{-7} $$
This probability is called a p-value. Since it’s very small, we conclude that the null hypothesis is not realistic. In other words, the population mean is statistically distinguishable from 8 (whether or not it is practically distinguishable from 8 is entirely another matter).
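In base R, this two-sided p-value is one `pnorm()` call away, continuing from the `z` computed in the chunk above:

```r
# probability of a statistic at least as extreme as |z| under Normal(0, 1)
p_value <- 2 * pnorm(-abs(z))
p_value   # approximately 9.6e-07
```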
This is the gist of hypothesis testing. Of course there’s a bunch of other associated nonsense that obscures the basic idea, which we’ll dive into next.
Things that can go wrong
False positives
We need to be concerned about rejecting the null hypothesis when the null hypothesis is true. This is called a false positive or a Type I error.
If the null hypothesis is true, and we calculate a statistic like we did above, we still expect to see a p-value of $p$ or smaller about $100 \cdot p$ percent of the time. For small p-values this isn’t an issue, but let’s consider a different null hypothesis of $H_0: \mu = 3.9$. Now
$$ Z = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {4 - 3.9 \over \sqrt \frac 23} \approx 0.12 $$
and our corresponding p-value is

$$ P\left(|Z| \ge 0.12\right) = 2 \cdot P\left(Z \le -0.12\right) \approx 0.90 $$
and we see that this is quite probable! We should definitely not reject the null hypothesis!
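The same two lines of base R reproduce this, reusing `x_bar` and `se` from the earlier chunk and swapping in the new null mean:

```r
mu_0 <- 3.9
z <- (x_bar - mu_0) / se
z
#> [1] 0.1224745

p_value <- 2 * pnorm(-abs(z))
p_value   # approximately 0.90
```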
This leads us to a new question: when should we reject the null hypothesis? A standard choice is to set an acceptable probability of a false positive, which we call $\alpha$. One arbitrary but common choice is to set $\alpha = 0.05$, which means we are okay with a ${1 \over 20}$ chance of a false positive. We should then reject the null hypothesis when the p-value is less than $\alpha$. This is often called “rejecting the null hypothesis at significance level $\alpha$”. More formally, we might write

$$ \alpha = P\left(\text{reject } H_0 \mid H_0 \text{ true}\right) $$
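Continuing with the `z` and `p_value` computed for $H_0: \mu = 3.9$ above, the decision rule at $\alpha = 0.05$ can be phrased either in terms of the p-value or in terms of a critical value for $|Z|$:

```r
alpha <- 0.05

# two-sided critical value, about 1.96
z_crit <- qnorm(1 - alpha / 2)

# two equivalent ways to state the decision: reject the null when TRUE
abs(z) > z_crit
#> [1] FALSE
p_value < alpha
#> [1] FALSE
```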
False negatives
On the other hand, we may also fail to reject the null hypothesis when the null hypothesis is in fact false. We might just not have enough data to reject the null, for example. We call this a false negative or a Type II error. We write this as

$$ \beta = P\left(\text{fail to reject } H_0 \mid H_0 \text{ false}\right) $$
To achieve a power of $1 - \beta$ for a one sample Z-test, you need
$$ n \approx \left( { \sigma \cdot (z_{\alpha / 2} + z_\beta) \over \mu_0 - \mu_A } \right)^2 $$
where $\mu_A$ is the true mean and $\mu_0$ is the proposed mean. We’ll do an exercise later that will help you see where this comes from.
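A small base R helper that implements this formula; the function name and the example numbers are just for illustration:

```r
# approximate sample size for a two-sided one sample Z-test with known sigma
z_test_sample_size <- function(sigma, mu_0, mu_a, alpha = 0.05, power = 0.8) {
  z_alpha <- qnorm(1 - alpha / 2)   # z_{alpha / 2}
  z_beta <- qnorm(power)            # z_beta, where power = 1 - beta
  (sigma * (z_alpha + z_beta) / (mu_0 - mu_a))^2
}

# e.g. detecting a true mean 0.2 away from the null when sigma = 1.8,
# with 80 percent power, takes roughly 636 observations
z_test_sample_size(sigma = 1.8, mu_0 = 15, mu_a = 15.2)
```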
Examples
Z-test
A company claims battery lifetimes are normally distributed with $\mu = 40$ hours and $\sigma = 5$ hours. We are curious if the claim about the mean is reasonable, and collect a random sample of 100 batteries. The sample mean is 39.8 hours. What is the p-value of a Z-test for $H_0: \mu = 40$?
We begin by calculating a Z-score
$$ Z = {\bar x - \mu_0 \over {\sigma \over \sqrt n}} = {39.8 - 40 \over {5 \over \sqrt{100}}} = -0.4 $$
and then we calculate, using the fact that $Z \sim \text{Normal}(0, 1)$ under the null,

$$ P\left(|Z| \ge 0.4\right) = 2 \cdot P\left(Z \le -0.4\right) \approx 0.69 $$

We might also be interested in a one-sided test, where the alternative is $H_A: \mu < 40$. In this case only values of $Z$ less than our observed $-0.4$ count as more extreme, and the p-value is

$$ P\left(Z \le -0.4\right) \approx 0.34 $$
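Both p-values come straight from `pnorm()` in base R:

```r
x_bar <- 39.8
mu_0 <- 40
sigma <- 5
n <- 100

z <- (x_bar - mu_0) / (sigma / sqrt(n))
z
#> [1] -0.4

2 * pnorm(-abs(z))   # two-sided p-value, about 0.69
pnorm(z)             # one-sided p-value for H_A: mu < 40, about 0.34
```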
Power for Z-test
Suppose a powdered medicine is supposed to have a mean particle diameter of 15 micrometers, and the standard deviation of diameters stays steady around 1.8 micrometers. The company would like to have high power to detect mean thicknesses 0.2 micrometers away from 15. When $n = 100$, what is the power of the test if the true mean is 15.2 micrometers? Assume the company is interested in controlling type I error at an $\alpha = 0.05$ level.
We will reject the null when our Z score is less than $-z_{\alpha/2}$ or greater than $z_{\alpha/2}$, or when the Z score is less than -1.96 or greater than 1.96. Recall that the Z score is ${\bar x - \mu_0 \over {\sigma \over \sqrt n}}$, which we can rearrange in terms of $\bar x$ to see that we will reject the null when $\bar x > 15 + 1.96 \cdot {1.8 \over \sqrt{100}} \approx 15.353$ or $\bar x < 15 - 1.96 \cdot {1.8 \over \sqrt{100}} \approx 14.647$.
Now we are interested in the probability of $\bar x$ landing in this rejection region when the alternative hypothesis is true:

$$ P\left(\bar x > 15.353\right) + P\left(\bar x < 14.647\right) $$

and we know that under the alternative $\bar x \sim \text{Normal}\left(15.2, \left({1.8 \over \sqrt{100}}\right)^2\right)$, so this equals

$$ P\left(Z > {15.353 - 15.2 \over 0.18}\right) + P\left(Z < {14.647 - 15.2 \over 0.18}\right) \approx P(Z > 0.85) + P(Z < -3.07) \approx 0.198 + 0.001 \approx 0.2 $$
So we have only a power of about 20 percent. This is quite low.
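The whole power calculation, written out in base R with the same numbers as above:

```r
mu_0 <- 15
mu_a <- 15.2
sigma <- 1.8
n <- 100
alpha <- 0.05

se <- sigma / sqrt(n)

# rejection region for the sample mean under the null
lower <- mu_0 - qnorm(1 - alpha / 2) * se   # about 14.65
upper <- mu_0 + qnorm(1 - alpha / 2) * se   # about 15.35

# probability that the sample mean lands in the rejection region
# when it is really distributed Normal(mu_a, se^2)
power <- pnorm(lower, mean = mu_a, sd = se) +
  pnorm(upper, mean = mu_a, sd = se, lower.tail = FALSE)
power   # roughly 0.2
```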