Hypothesis Testing - Checking for Assumptions

Hi Codecademy team,

I just finished going through the narrative on Hypothesis Test Page 7. I have a few questions that I'm hoping the data science experts here can help me clear up, please :)

In the exercise, we are trying to figure out which distributions are not normal and which are not suitable for an ANOVA test. Lastly, we also want to check whether two of the datasets (distribution 2 and distribution 3) can be used in a numerical hypothesis test.

Below are the histogram plots for distributions 1, 2, 3 and 4:

[Image: histograms of distributions 1, 2, 3 and 4]

By looking at these histograms, I interpreted that:

  1. Distributions 1, 3 and 4 are not normal, so only distribution 2 is normally distributed. Is this true?

  2. The question also asked: "Which of these distributions would probably not be a good choice to use in an ANOVA comparison? Create a variable called not_normal and set it equal to the distribution number that would be least suited to be used in an ANOVA test."

My answer was that distributions 1, 3 and 4 are not suited for an ANOVA test because they are not normally distributed.
But the narrator’s answer is distribution 4 only. Why is that?

  3. The last question is: "Calculate the ratio of standard deviations between dist_2 and dist_3 and store it in a variable called ratio. Print it to the console. Is this “close enough” to perform a numerical hypothesis test between the two datasets?"

Below is the code I used to calculate the standard deviations of dist_2 and dist_3 and their ratio:

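import numpy as np

# dist_2 and dist_3 are the datasets provided by the exercise.
# np.std defaults to the population standard deviation (ddof=0).
dist_2_std = np.std(dist_2)
dist_3_std = np.std(dist_3)
ratio = dist_2_std / dist_3_std
print(dist_2_std)
print(dist_3_std)
print(ratio)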

The output is:

2.93237588202
5.0434543879
0.58142210804

As you can see, the ratio is 0.58, which in my opinion is not close enough to 1, so these two datasets are not suitable for a numerical hypothesis test. Is this conclusion correct? If not, why is it wrong?

Please help me demystify this. Also, please keep replies on topic so that it is easier for me and other learners to revisit this thread later.

Thank you very much,

Jimmy

Distribution 4 is not normal: the curve is bimodal (it has two peaks).
(Distribution 3 just seems to have an outlier.)
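
If it helps to see the "two peaks" idea in code, here's a rough sketch that bins a dataset and counts how many bins stick up above both of their neighbours. The helper names (bin_counts, count_peaks), the bin width, and the toy data are all made up for illustration and are not from the lesson. Counting peaks like this is only a crude heuristic that will overcount on noisy real data, so looking at the histogram is still the main check.

```python
from collections import Counter


def bin_counts(values, bin_width=1.0):
    """Count how many values fall into each fixed-width bin, in bin order."""
    counts = Counter(int(v // bin_width) for v in values)
    lowest, highest = min(counts), max(counts)
    return [counts.get(i, 0) for i in range(lowest, highest + 1)]


def count_peaks(counts):
    """Count bins that are strictly taller than both neighbouring bins."""
    padded = [0] + list(counts) + [0]  # pad so edge bins have two neighbours
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] > padded[i + 1]
    )


# Hypothetical toy data: one hump centred on 3.
one_hump = [1] * 1 + [2] * 3 + [3] * 6 + [4] * 3 + [5] * 1
# Hypothetical toy data: two humps, centred on 2 and 5 (bimodal).
two_humps = [1] * 2 + [2] * 6 + [3] * 1 + [4] * 1 + [5] * 6 + [6] * 2

print(count_peaks(bin_counts(one_hump)))
print(count_peaks(bin_counts(two_humps)))
```

The first dataset has a single peak and the second has two, which is the shape difference between a bell curve and distribution 4.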

In the lesson it says, ‘If your dataset is definitively not normal, the numerical hypothesis tests won’t work as intended.’
(Variance is the spread of your data and how far away points are from the mean… It might be helpful to go back over variance and standard deviation.)
variance & standard deviation:
https://stackabuse.com/calculating-variance-and-standard-deviation-in-python/

In order to use an ANOVA (analysis of variance) test, where you’re comparing the means of different datasets, the data in each dataset should be roughly normally distributed (think bell curve) and the standard deviations of the datasets should be roughly equal. Distribution 4 is bimodal, so it is the one that clearly fails the normality requirement.

If the variances are not different, the ratio of the standard deviations will be close to 1. 0.58 is not close to 1, so in this case there is a difference in spread between the two datasets, which matches your conclusion.
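
Here's a small sketch of that ratio check using only the standard library. statistics.pstdev is the population standard deviation, which is what np.std computes by default. The function names and the 0.1 tolerance are my own assumptions for illustration; the lesson may use a different cutoff for "close enough".

```python
import statistics


def std_ratio(sample_a, sample_b):
    """Ratio of population standard deviations (same default as np.std)."""
    return statistics.pstdev(sample_a) / statistics.pstdev(sample_b)


def is_close_to_one(ratio, tolerance=0.1):
    """True if the ratio is within the (assumed) tolerance of 1."""
    return abs(ratio - 1.0) <= tolerance


# The two standard deviations reported in the question above.
reported_ratio = 2.93237588202 / 5.0434543879
print(round(reported_ratio, 4))         # 0.5814
print(is_close_to_one(reported_ratio))  # False

# Hypothetical toy data: the second list is twice as spread out as the first.
narrow = [1, 2, 3, 4, 5]
wide = [2, 4, 6, 8, 10]
print(std_ratio(narrow, wide))
```

With any reasonable tolerance, 0.58 is far enough from 1 that the equal-spread assumption does not hold for dist_2 and dist_3.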

ANOVA will tell you that there is a difference between datasets, but it won’t tell you which ones are significantly different. That’s when you run a Tukey’s range test to see exactly which datasets differ.
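
To show why ANOVA alone can't point at the dataset that differs, here's a hand-rolled sketch of the one-way ANOVA F statistic. The function name and the three toy groups are made up for illustration, and this version skips the p-value entirely. The point is that all the groups collapse into one number.

```python
import statistics


def one_way_anova_f(*groups):
    """F statistic: variation between group means / variation within groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.fmean(all_values)
    k = len(groups)      # number of groups
    n = len(all_values)  # total number of observations

    # How far each group mean sits from the grand mean, weighted by group size.
    ss_between = sum(
        len(group) * (statistics.fmean(group) - grand_mean) ** 2
        for group in groups
    )
    # How far each observation sits from its own group mean.
    ss_within = sum(
        (x - statistics.fmean(group)) ** 2 for group in groups for x in group
    )

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical toy groups: a and b are similar, c is clearly higher.
a = [1, 2, 3]
b = [2, 3, 4]
c = [7, 8, 9]

f_stat = one_way_anova_f(a, b, c)
print(f_stat)
```

A large F says the group means are not all alike, but nothing in that single number says it is group c that stands apart. For real work you would normally use a library instead of this: scipy.stats.f_oneway returns the F statistic together with a p-value, and statsmodels has pairwise_tukeyhsd for the pairwise follow-up.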

F-test:
https://www.medcalc.org/manual/comparison_of_standard_deviations_f-test.php


Hey Lisa,

Thanks heaps for your reply… I think I got the idea behind the answers the narrator provided, although I had to read your reply about 3 times to work out which sentences went with which of my 3 questions, hahaha.

Nevertheless, thanks again!

Jimmy

Sure, np.

Hypothesis testing and stats are a LOT to take in. Sometimes it hurts the brain. (or, maybe that’s just me. ha). I still get tripped up sometimes.
However, if you ever get stuck, there’s this great stats teacher on YouTube, Mr. Nystrom, a.k.a. AP Stats Guy. He breaks down difficult concepts: hypothesis testing, null & alternative hypotheses, p-values, confidence intervals, standard deviation, etc.
https://youtu.be/S4wmS0a0Ams

Sometimes you can read and re-read stats books or info on sites and it doesn’t make sense. The way he explains everything makes sense (IMO).

Happy coding!
