Yesterday, during a presentation I gave, there were a few questions about the interpretation of p-values. My opinion is that classical hypothesis testing confuses many people and that there are a few things about it that are counterintuitive. After the presentation, I corresponded with one of the participants and ended up creating a Monte Carlo simulation to help illustrate hypothesis testing and p-value interpretation. It seemed like a logical thing to turn into a blog post.

Peter Kennedy’s Guide to Econometrics has a great treatment of Monte Carlo simulations and explains many topics in econometrics through that lens. I also had a time series professor who illustrated topics with Monte Carlos, and I found that this approach helped solidify them in my mind.

Two things follow: a link to my GitHub with output from a Monte Carlo simulation that I ran in Stata, and a write-up based on my correspondence with the participant.

Here’s my take on it: We normally use a null hypothesis of 0 when we don’t know what to expect. Some people argue a null of 0 is silly, but it’s actually a strong choice. A counterexample would be using a null hypothesis that the parameter estimate is 10,000,000,000. It would typically be very easy to reject this null hypothesis, since the test statistic is the parameter estimate from the model minus the hypothesized value, divided by the standard error of the parameter estimate. This test would show us that we don’t think the value is 10,000,000,000 and nothing more. That’s not very useful.
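In symbols, with $\beta_0$ standing for the hypothesized value (10,000,000,000 in this example), the test statistic described above is:

$$ t = \frac{\hat{\beta} - \beta_0}{\operatorname{se}(\hat{\beta})} $$

With a null that far from any plausible estimate, the numerator is enormous in absolute value relative to the standard error, so the rejection tells us almost nothing we didn’t already know.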

So we initialize the hypothesis test at 0 and then see if the model is returning any evidence to the contrary. Even if we think there is an effect, we set the test up with the null assuming there is no effect and then look for any evidence that contradicts the null.

But the Monte Carlo shows that even if y and x aren’t related at all, we are bound to find some cases where we would reject the null at a given significance level. At the 5% level, if the null is true and there is no relationship between y and x (i.e., the true beta actually is equal to 0), then we would expect to reject the null 5% of the time just by chance. The problem is that in a non-Monte Carlo setting we have one shot to run our model and can’t be sure whether a rejection was due to randomness or not.
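To make that “one shot” idea concrete, here is a minimal sketch of a single draw under a true null in Stata (the seed is just illustrative, and this is not code from the linked file):

```stata
* One draw under a true null: y is generated with no relationship to x
clear
set seed 42            // illustrative seed
set obs 10000
gen x = rnormal()
gen y = rnormal()      // independent of x, so the true beta is 0
regress y x            // most draws give p > 0.05 on x, but roughly 1 in 20 reject anyway
```

Run this once and you are in exactly the one-shot position described above: if the p-value on x happens to fall below 0.05, nothing in the output tells you it was a fluke.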

When we run the model and get, say, a p-value equal to 0.05, we have to decide what is more likely: either our null is false and we have identified an effect, or the null is true and this is one of the 5% of cases where there is no effect but the model rejects the null anyway just by chance. Stated the opposite way, if we assume the null is true and the effect is really 0, we would get a test statistic as large as the one we observed by chance 5% of the time. We see a p-value equal to 0.05 and have to decide which answer we think is more plausible: a random rejection we know happens infrequently, or an effect that’s actually there.

The columns you see for each parameter estimate (i.e., coefficient, beta hat, etc.) are all related. The coefficient gives you the model’s estimate of the effect size. It is the impact a one unit change in x has on y in a univariate model. In a multivariate model, it is the impact a one unit change in x has on y holding the other regressors constant.
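As a quick illustration of the univariate vs. multivariate interpretation, here is a sketch using Stata’s built-in auto dataset (chosen purely for convenience; it has nothing to do with the simulation):

```stata
sysuse auto, clear
regress price mpg           // univariate: effect of a one unit change in mpg on price
regress price mpg weight    // multivariate: effect of mpg on price, holding weight constant
```

The coefficient on mpg changes between the two regressions because the second one holds weight constant.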

The standard error is derived from the overall model’s error and the variance of x in a univariate model. In a multivariate model, the standard error of the coefficient on x1 is derived from the overall model’s error, the variance of x1, and the “correlation” or association between x1 and the other independent variables in the model (e.g., x2, x3, etc.). The higher this association, the smaller the denominator in the equation that calculates the standard error, and the higher the standard error will be, which results in lower test statistics and higher p-values. This is the impact of multicollinearity.
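For the multivariate case, one standard way to write this (in the notation of Wooldridge’s text, where $R_j^2$ is the R-squared from regressing $x_j$ on the other regressors) is:

$$ \operatorname{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{SST_j\,(1 - R_j^2)}}, \qquad SST_j = \sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 $$

As $R_j^2$ approaches 1, the term $(1 - R_j^2)$ shrinks, the denominator shrinks, and the standard error blows up.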

The test statistic is the ratio of the coefficient estimate to the coefficient’s standard error. It has been shown that, under the null hypothesis, dividing the parameter estimate by its standard error results in a number that is t distributed. Since we know what distribution this number follows, we can calculate the probability of a test statistic being as large as the one we see or larger. Technically, since the t distribution is continuous, we’re really looking at the probability of the test statistic being strictly larger, but this doesn’t change the intuition of the test.
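In Stata you can compute that probability directly with the ttail() function; the test statistic and degrees of freedom below are made-up numbers just to show the mechanics:

```stata
* Two-sided p-value for a hypothetical t statistic of 2.10 with 118 residual degrees of freedom
display 2*ttail(118, abs(2.10))
```

The result comes out a little under 0.05, consistent with the rule of thumb in the next paragraph.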

Once we have calculated the test statistic, we can compare its size to the critical values associated with the alpha levels of the test. The 0.05, or 5%, level is common. The actual critical value we need to exceed to reject the null hypothesis of 0 at the 5% level depends upon the sample size and the degrees of freedom, but a rule of thumb for decently sized samples is that the test statistic needs to be around 2 to reject the null of 0. Actually, it’s probably closer to 1.96 or 2.07, but these round to roughly 2 and you can think of it like this: the model needs to estimate an effect size twice as large as the standard error of the effect size for us to reject a null hypothesis of 0 at the 5% level.
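You can see where the “about 2” rule of thumb comes from by asking Stata for the two-tailed 5% critical values at a few degrees of freedom (the df values here are arbitrary examples):

```stata
display invttail(22, 0.025)      // about 2.07
display invttail(120, 0.025)     // about 1.98
display invttail(1000, 0.025)    // about 1.96, essentially the normal critical value
```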

If you’re reading this and confused about one-tailed vs. two-tailed tests, it comes down to whether we are only looking at one side of the curve of the t distribution. If you are assuming a null of zero but are willing to accept evidence of positive or negative effects, the 0.05 is split between both tails, so you have 0.025 in each tail. A two-tailed 0.05 test is “more conservative” since you need a larger test statistic on either side to reject. A one-tailed 0.05 test is “less conservative”: if I say I only care about positive effect sizes, and therefore positive test statistics, then that 0.05 area is no longer divided between the right and left tails of the distribution, and a smaller test statistic is enough to reject.
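The same invttail() function shows why the one-tailed test is less conservative. With, say, 120 degrees of freedom (an arbitrary choice):

```stata
display invttail(120, 0.025)     // two-tailed 5% test: need |t| greater than about 1.98
display invttail(120, 0.05)      // one-tailed 5% test: need t greater than about 1.66
```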

With the test statistic we now have, we then use the theoretical t distribution to look at the area under the curve beyond the test statistic (in both tails, for a two-tailed test). This area is the p-value and can be interpreted as the percentage of times we would see a test statistic at least this extreme if the null were true and there really is no effect.
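Written out for a two-tailed test, with $t$ the observed statistic and $df$ the residual degrees of freedom:

$$ p\text{-value} = \Pr\bigl(\,|T_{df}| \ge |t| \;\big|\; H_0 \text{ true}\,\bigr) = 2\,\Pr\bigl(T_{df} \ge |t|\bigr) $$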

The last two columns are the upper and lower bounds of the 95% confidence interval. This is another counterintuitive part of classical hypothesis testing, which we will get to in a moment after going over the mechanics of calculating the upper and lower bounds. A rule of thumb is that the 95% confidence interval bounds are approximately the coefficient estimate plus and minus 2 standard errors.

…. wait….. 2? Again?

Yes. And for the exact same reason we were using 2 before. It goes back to the critical value based on sample size, number of parameter estimates, and degrees of freedom once again.
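The exact calculation uses that same critical value:

$$ \hat{\beta}_j \;\pm\; t_{0.025,\,df} \times \operatorname{se}(\hat{\beta}_j) $$

With $t_{0.025,\,df}$ hovering around 2 for decently sized samples, this is the “coefficient plus or minus two standard errors” rule of thumb from above.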

Back to the interpretation of the confidence interval. Here I’m going to quote Jeffrey Wooldridge, Introductory Econometrics: A Modern Approach, 4e, p. 138, since I can’t say this any better or even add much to it:

“If random samples were obtained over and over again, with the (upper boundary) and (lower boundary) computed each time, then the unknown population value of beta_j would lie in the interval (lower bound, upper bound) for 95% of samples. Unfortunately, for the single sample that we used to construct the (confidence interval) we do not know whether beta_j is actually contained in the interval.”

Again, thinking about Monte Carlos is helpful for visualizing that, too. You can see the results of the Monte Carlos here. The file contains detailed comments at each step of the process. The Monte Carlos use sample sizes of 10,000 and run 1,000 regressions of randomly generated y on randomly generated x. You can directly see that test statistics larger in absolute value than the 5% critical value of the t distribution are observed roughly 5% of the time, as sketched below.
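For anyone who doesn’t want to open the file, here is a condensed sketch of what a simulation like this looks like in Stata. It is not the actual code from the repository, and the program name, seed, and variable names are just placeholders:

```stata
* Monte Carlo: 1,000 regressions of random y on random x, n = 10,000 each
clear all
set seed 12345                          // illustrative seed

program define nullsim, rclass
    drop _all
    set obs 10000
    gen x = rnormal()
    gen y = rnormal()                   // no true relationship, so the true beta is 0
    regress y x
    return scalar t = _b[x]/_se[x]
    return scalar p = 2*ttail(e(df_r), abs(_b[x]/_se[x]))
end

simulate t=r(t) p=r(p), reps(1000): nullsim

* Fraction of the 1,000 replications that reject the (true) null at the 5% level
gen reject = (p < 0.05)
summarize reject
```

The mean of reject is the simulated rejection rate, and it should land close to 0.05.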