FAQ: Hypothesis Testing - Dangers of Multiple T-Tests

This community-built FAQ covers the “Dangers of Multiple T-Tests” exercise from the lesson “Hypothesis Testing”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

FAQs on the exercise Dangers of Multiple T-Tests

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.


Hi there, when I input tts, a_b_pval = ttest_ind(a, b), the value of a_b_pval is 2.76676293987e-05, which clearly is not a p-value. What went wrong?

2.76676293987e-05 is scientific notation for 0.0000276676293987, which is a perfectly valid p-value. It is far below the usual 0.05 threshold, so that particular t-test has an exceptionally good chance of being judged significant.
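
If it helps to see it, you can check the comparison directly in Python; the value is just a very small float and compares to the 0.05 cutoff like any other number:

p_value = 2.76676293987e-05       # the value returned by ttest_ind here

print(p_value < 0.05)             # True: comfortably below the cutoff
print(f"{p_value:.10f}")          # 0.0000276676, the same number written out in decimal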

2 Likes

What error are we calculating using the error probability function provided? As I understand it, the p-value is essentially the probability of a type 1 error. So, over multiple t-tests:

p(type 1 error) = p_value_1 * p_value_2 * … * p_value_n

thus:

p(not(type 1 error)) = 1 - (p_value_1 * p_value_2 * … * p_value_n).

In this way, multiple t-tests would actually decrease your chance of a type 1 error.

I think my issue here actually boils down to two questions. First, what source of error does the provided error probability function calculate? Second, can p(statistical significance) decrease while p(type 1 error) also decreases?

7 Likes

Why is the solution:
error_prob = (1-(0.95**3))

I thought it would be:
error_prob = (1-(a_b_pval * a_c_pval * b_c_pval)).

I.e., the total error is a function of the individual errors - not the threshold for acceptable error?

9 Likes

In agreement with the above, with one small correction:

error_prob = 1 - (1 - a_b_pval) * (1 - a_c_pval) * (1 - b_c_pval)

This does give a higher error, roughly 0.08, but that is not the solution Codecademy went with.
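
For anyone who wants to check both versions of the arithmetic, here is a quick sketch using the three p-values printed in this exercise (treat the exact numbers as whatever your own run returns):

a_b_pval = 2.76676293987e-05
a_c_pval = 0.0210120516986
b_c_pval = 0.0598856352397

# Chance of at least one of the three tests coming back "significant"
# if each test independently had these probabilities of doing so
combined = 1 - (1 - a_b_pval) * (1 - a_c_pval) * (1 - b_c_pval)
print(round(combined, 4))          # 0.0797

# Codecademy's solution instead compounds the fixed 5% per-test threshold
print(round(1 - 0.95 ** 3, 6))     # 0.142625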

16 Likes

This is what I thought too. I found this entire exercise completely confusing.

2 Likes

I think the main confusion here comes from these lessons using the term “p-value” interchangeably for both the significance value (the threshold at which we decide a result is significant) and the actual p-value that is returned by running a T-test.

Here are the concepts to remember with T-tests:

  • We are comparing samples of different populations to see if the populations are significantly different

  • We determine a significance value (or p-value threshold) prior to conducting the T-tests that will act as a cut-off point for whether we will find significance

  • A T-test returns two values: a test statistic (tstat) and a p-value. The test statistic is basically a number that represents the difference between the sample means relative to the variation within your samples. The larger it is, the less likely the null hypothesis is true. If it is closer to 0, it is more likely there isn’t a significant difference. The p-value is the likelihood of getting a test statistic of equal or higher value to the one returned, if the null hypothesis is true.

The p-value itself is not the probability of a Type I error, but rather the probability of getting a test statistic (tstat) of equal or higher value if the null hypothesis is true (i.e., if the populations have the same mean and the observed differences were merely by chance). The smaller the p-value, the more likely there is significance.

Prior to running the T-tests, however, we decide that a p-value at .05 or less will indicate significance – thus we are accepting a risk of being wrong 5% of the time when we reject the null hypothesis. We would reject a null hypothesis equally if the p-value was .04 or .00004. Thus, we have a fixed risk of Type I error per T-test that is determined prior to running the experiment. This is the error the lesson is referring to.

This 5% accepted risk is compounded for each T-test we need to run during the experiment to compare each sample with each other sample, and that is why running multiple T-tests can be problematic.
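
As a rough illustration of that workflow (the samples below are made up, not the lesson’s data):

import numpy as np
from scipy.stats import ttest_ind

np.random.seed(0)
a = np.random.normal(loc=10.0, scale=2.0, size=50)   # hypothetical sample 1
b = np.random.normal(loc=11.0, scale=2.0, size=50)   # hypothetical sample 2

significance_threshold = 0.05       # decided before running the test
tstat, pval = ttest_ind(a, b)

if pval < significance_threshold:
    print("Reject the null hypothesis")          # accepted 5% risk of a Type I error
else:
    print("Fail to reject the null hypothesis")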

Hope this helps!

20 Likes

This is also what I thought. Why don’t they change the answer?

Hi everybody

Regarding calculating the error in two-sample t-tests: what is the threshold probability for error above which we accept the null hypothesis?

Kind regards
Ben

We agree to accept the null hypothesis (that both samples effectively represent the same population) if p > 0.05.

Here’s my take on this. (I’m writing this down to clear my head after a couple of hours in therapy with Dr. Google trying to get it sorted.)

Given these two conditions:

  1. The null hypothesis is true for every test we run.
  2. We agree to reject the null hypothesis if the t-test returns a p < 0.05

The question is:

If I do n comparisons via a t-test, what is the probability that I will see at least one result out of that series that tells me that the null hypothesis is false (i.e., at least one false positive)?

  • We can agree that if n = 1, the probability is 5%, or 0.05. The probability that I will be correct, i.e., will not make the mistake of publishing a false positive, is 0.95

  • We’re rolling a 20-sided die here. One side says “reject”, the other 19 say “accept”.

  • If I roll twice, what is the probability of at least one “reject”?

  • The best way to look at this is to realize that the probability of getting at least one hit in n tries is the complement (1 - p) of the probability of never getting a hit in n tries.

  • So, what is the probability of never getting a “reject” in 2 tries? It is P(accept) * P(accept) = 0.95**2, and the complement of that is (1 - 0.95**2), or 0.0975, nearly a 10% chance of at least one “reject” in two rolls.

  • Likewise, for any number, n, of rolls, the probability of never getting “reject”, i.e., getting “accept” on every roll, is P(accept)**n, which in our specific case is 0.95**n.

  • As above, if the probability of never getting any “reject” is 0.95**n, the probability of getting at least one “reject” is (1 - 0.95**n), or 0.142625 for n = 3.

  • So, what the problem is saying is: “If the three samples in fact come from populations with the same mean (i.e., the null hypothesis is true), then there is a 14.26% probability that you will see at least one ‘reject the null hypothesis’ result if you run the t-test three times.”
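
To see the compounding spelled out, here is a tiny Python loop over the same formula, where n is simply the number of t-tests run:

p_correct_single = 0.95    # the 19 “accept” faces of the die

for n in (1, 2, 3, 5, 10):
    at_least_one_false_positive = 1 - p_correct_single ** n
    print(n, round(at_least_one_false_positive, 6))

# 1 0.05
# 2 0.0975
# 3 0.142625
# 5 0.226219
# 10 0.401263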

6 Likes

Thank you so much. It became much clearer.

It seems the 0.05 p-value the lesson asks you to assume is an arbitrary number, given that the actual p-values are calculated in the lesson. Why wouldn’t we use the actual p-values of 2.76676293987e-05, 0.0210120516986, and 0.0598856352397?

Same question here.

If we have already calculated the specific p-values, why do we still need to use 0.05?

Thank you @el_cocodrilo and @patrickd314 for your clear explanations. That’s gold for me.

Yep me too. Can anyone please explain why we’re using what seems to be an arbitrary value of 0.05 and not the p-values we found in the lesson?

I think @el_cocodrilo’s post answers your question.

1 Like

Awesome explanation, thanks!
I have another question though: if the error probability gets bigger the more t-tests we do, is it possible to select a stricter (lower) p-value threshold for each test so the compounded error probability is still acceptable?

To put it in the context of this exercise:

1-0.95**3 = 0.142625

1-0.97**3 = 0.087327 (which is higher than 0.05, but still kind of acceptable)

1-0.99**3 = 0.029701 (which is lower than 0.05)

My first guess is that this would be a problem or at least an inconvenience of some sort, but I’m still trying to wrap my head around this :sweat_smile:

1 Like

This idea is exactly what some standard methods use to avoid the dangers described in this exercise. A simple one is the Bonferroni correction, which keeps the overall significance level low by lowering the significance level of each individual test (dividing it by the number of hypotheses). You’ll learn another approach, Tukey’s range test, in a later exercise. It is more complicated, but it is based on a similar idea.
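
If you want to see the Bonferroni idea in code, here is a minimal sketch using the three p-values from this exercise (the variable names are just illustrative):

num_tests = 3
family_wise_alpha = 0.05
per_test_alpha = family_wise_alpha / num_tests     # 0.05 / 3 ≈ 0.0167

p_values = [2.76676293987e-05, 0.0210120516986, 0.0598856352397]
for p in p_values:
    print(p, "significant" if p < per_test_alpha else "not significant")

# The overall chance of at least one false positive stays near 0.05:
print(round(1 - (1 - per_test_alpha) ** num_tests, 4))    # 0.0492

Note that under the corrected threshold only the first comparison still counts as significant; that stricter per-test bar is exactly the trade-off the correction makes.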

1 Like