Need Feedback: Hypothesis Testing Fetchmaker Project

Hi Codecademy team,

I have just finished the Hypothesis Testing module in the Data Science path, and I am feeling overwhelmed by all the information I had to absorb and understand to complete it. As some of you know, the module ends with a project about Fetchmaker, a start-up company.

I am genuinely interested in receiving some constructive feedback from other aspiring data scientists on Codecademy on my Python coding skills with regard to this particular project.
Below is a snapshot of my code:

import numpy as np
import fetchmaker
# Number 7
from scipy.stats import binom_test
# Number 9
from scipy.stats import f_oneway
# Number 10
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Number 13
from scipy.stats import chi2_contingency

# Number 1
fetch_maker = fetchmaker.dogs
# print(fetch_maker)

# Number 2
rottweiler_tl = fetchmaker.get_tail_length('rottweiler')
# print(rottweiler_tl)

# Number 3
rottweiler_tl_mean = np.mean(rottweiler_tl)
rottweiler_tl_std = np.std(rottweiler_tl)
print('The rottweiler avg tail length is {} \n'. format(rottweiler_tl_mean))
print('The rottweiler std dev of tail length is {} \n'. format(rottweiler_tl_std))

# Number 4
whippet_rescue = fetchmaker.get_is_rescue('whippet')
# print(whippet_rescue)

# Number 5
# To count the number of entries that are not zero (1)
num_whippet_rescues = np.count_nonzero(whippet_rescue)
print('The count of (1) entry in the whippet_rescue is {} \n'.format(num_whippet_rescues))

# Number 6
# To get the number of samples using np.size
num_whippets = np.size(whippet_rescue)
print('The number of samples in the whippet_rescue is {} \n'.format(num_whippets))

# Number 7 and 8
expected_percentage_whippets_rescue = 0.08
binom_test_whippets_rescues = binom_test(num_whippet_rescues, num_whippets,expected_percentage_whippets_rescue)
print('The P-Value of the whippet_rescue is {} \n'.format(binom_test_whippets_rescues))
print('So the P-Value from the Whippet_Rescue Binomial Test is %.3f and therefore, we accept the null hypothesis, which is that there is no difference between the observed number of whippet rescues and our expected whippet rescues percentage'%(binom_test_whippets_rescues))
print('\n')

# Number 9
# since these datasets are numerical, we will be using ANOVA test to ensure the probability of False Positive stays 0.05 
whippets_weight = fetchmaker.get_weight('whippet')
terriers_weight = fetchmaker.get_weight('terrier')
pitbulls_weight = fetchmaker.get_weight('pitbull')
ANOVA_mid_size_dogs = f_oneway(whippets_weight, terriers_weight, pitbulls_weight)
print('The P-value obtained from the ANOVA test on these three popular breeds is %.3f and therefore, we reject the null hypothesis, which is there is significant difference in the average weights of these three dogs, but we do not know which pair of datasets is significantly different.'% (ANOVA_mid_size_dogs[1]))
print('\n')

# Number 10
# To know which pair has a significant difference in their mean, we must use Tukey's Range test
data = np.concatenate([whippets_weight, terriers_weight, pitbulls_weight])
labels = ['whippet'] * len(whippets_weight) + ['terrier'] * len(terriers_weight) + ['pitbull'] * len(pitbulls_weight)

tukey_result = pairwise_tukeyhsd(data, labels, alpha = 0.05)
print("Below is the table generated from the Tukey's Range Test to find out which pair of datasets is statistically different: \n {}".format(tukey_result))
print('\n')

# Number 11
poodle_colors = fetchmaker.get_color('poodle')
shihtzu_colors = fetchmaker.get_color('shihtzu')
# print(poodle_colors)
# print(shihtzu_colors)

# Number 12
#First, obtain the color numbers for poodle breed
black_poodle = np.count_nonzero(poodle_colors == 'black')
brown_poodle = np.count_nonzero(poodle_colors == 'brown')
gold_poodle = np.count_nonzero(poodle_colors == 'gold')
grey_poodle = np.count_nonzero(poodle_colors == 'grey')
white_poodle = np.count_nonzero(poodle_colors == 'white')
#Secondly, obtain the color numbers for shihtzu breed
black_shihtzu = np.count_nonzero(shihtzu_colors == 'black')
brown_shihtzu = np.count_nonzero(shihtzu_colors == 'brown')
gold_shihtzu = np.count_nonzero(shihtzu_colors == 'gold')
grey_shihtzu = np.count_nonzero(shihtzu_colors == 'grey')
white_shihtzu = np.count_nonzero(shihtzu_colors == 'white')
#Next, create the contingency table using a list of lists
color_table = [[black_poodle, black_shihtzu],[brown_poodle, brown_shihtzu], [gold_poodle, gold_shihtzu], [grey_poodle, grey_shihtzu], [white_poodle, white_shihtzu]]

# Number 13
chi2, pval, dof, expected = chi2_contingency(color_table)
print('The statistic of the color_table dataset is %.3f \n'%(chi2))
print('The P-Value of the color_table dataset is %.3f \n'% (pval))
print('The degrees of freedom from the color_table dataset is {} \n'.format(dof))
print('The expected table is as follows: \n {}'.format(expected))
print('\n')
print('The conclusion from the Chi-Square test above is since the P-Value is %.3F, we reject the null hypothesis and stated that there is a significant difference between the datasets'% (pval))

Below are the outputs:

The rottweiler avg tail length is 4.2361 

The rottweiler std dev of tail length is 2.06475368749 

The count of (1) entry in the whippet_rescue is 6 

The number of samples in the whippet_rescue is 100 

The P-Value of the whippet_rescue is 0.581178010624 

So the P-Value from the Whippet_Rescue Binomial Test is 0.581 and therefore, we accept the null hypothesis, which is that there is no difference between the observed number of whippet rescues and our expected whippet rescues percentage


The P-value obtained from the ANOVA test on these three popular breeds is 0.000 and therefore, we reject the null hypothesis, which is there is significant difference in the average weights of these three dogs, but we do not know which pair of datasets is significantly different.


Below is the table generated from the Tukey's Range Test to find out which pair of datasets is statistically different: 
 Multiple Comparison of Means - Tukey HSD,FWER=0.05
==============================================
 group1  group2 meandiff  lower  upper  reject
----------------------------------------------
pitbull terrier  -13.24  -16.728 -9.752  True 
pitbull whippet  -3.34    -6.828 0.148  False 
terrier whippet   9.9     6.412  13.388  True 
----------------------------------------------


The statistic of the color_table dataset is 14.727 

The P-Value of the color_table dataset is 0.005 

The degrees of freedom from the color_table dataset is 4 

The expected table is as follows: 
 [[ 13.5  13.5]
 [ 24.5  24.5]
 [  7.    7. ]
 [ 46.5  46.5]
 [  8.5   8.5]]


The conclusion from the Chi-Square test above is since the P-Value is 0.005, we reject the null hypothesis and stated that there is a significant difference between the datasets

I know this is a lot to ask, but this project consumed a lot of my time because I am new to both statistics and Python. Some feedback would be much appreciated so that I can improve my statistics and Python coding skills.

Oh, I forgot to mention: the Python code contains comments such as # Number 1, # Number 2, and so on. I used these comments to keep track of which code belongs to which task in the module.

Thanks heaps in advance,

Jimmy

Hi Jimmy,
It’s a very detailed project, and if this is your first time doing one, I think you did a really good job. :slight_smile: You seem to have a good understanding of the project and what’s being asked.
You also used comments (I use those too, to organize data and my thought processes for myself), which is beneficial not only to you but also to anyone else reading your code.
I did this project a couple years ago for a data analysis intensive (it’s now part of the DS path) and I was just going through my code while going over yours.
Anyway, I was going to ask about the ANOVA P-value that printed as 0.000:

Is that referring to this value: 3.27641558827e-17?
You are correct. That is scientific notation: written out in full, there are 17 zeros before the 3 (one before the decimal point and 16 after it). (If you ever use Excel, that’s how it displays very small or very large numbers too.)
The wording is a bit confusing, though, with the null and alternative hypotheses. When you reject the null, you reject the claim that there is no statistically significant difference between the groups.
Remember, the null is the status quo, if you will: there is no statistically significant difference between the groups. The alternative hypothesis is just that, the alternative: there is a statistically significant difference. You reject the null in favor of the alternative when the p-value is below your threshold (here, 0.05).
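
To make the wording concrete, here is a minimal sketch of how a conclusion could be phrased from a p-value. The helper name `describe_decision` and the 0.05 threshold are assumptions for illustration; they are not part of the Fetchmaker project.

```python
def describe_decision(pval, alpha=0.05):
    """Word a hypothesis-test conclusion for a given p-value.

    Hypothetical helper; alpha=0.05 is an assumed significance threshold.
    """
    if pval < alpha:
        return "reject the null hypothesis (the difference is statistically significant)"
    # A large p-value is not proof that the null is true, so avoid saying "accept".
    return "fail to reject the null hypothesis (no statistically significant difference detected)"


# p-values taken from the outputs posted above
print(describe_decision(3.27641558827e-17))  # ANOVA
print(describe_decision(0.581178010624))     # binomial test
```

Note the phrasing "fail to reject" rather than "accept": a large p-value only means the data did not give enough evidence against the null.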

Hey Lisa,

I really really really appreciate the time you took to read through my code. So happy to hear your feedback. Now, to answer your questions:

  1. Is P-Value (0.000) from ANOVA referring to this value: 3.27641558827e-17 ?
    Yes, it is referring to 3.27641558827xxxxxx.
    I am glad you brought this up, because this number confuses me. At first, I thought it would be rounded to 3.27, but when I tried to round it with the round() function, the result was 0.0.
    So I am quite confused about how this kind of number works in Python. Why was it rounded to 0.000 and not 3.27641? A simple misunderstanding like this could lead to disaster in real life, by giving key stakeholders the wrong information.

  2. The null hypothesis for the ANOVA test was ‘There is NO significant difference in the average weights of the three popular dogs’.
    Thank you for pointing this out. I also realised that the sentence can be misleading too. Silly me!

Lisa, can you please tell me when hypothesis testing starts in the Data Scientist workflow?
My understanding of the Data Scientist/Data Analyst workflow is as follows (please feel free to correct me if I got anything wrong):

  1. Understanding the business objective (the questions we are trying to answer)
  2. Gather the relevant data
  3. Prepare and clean the relevant data
  4. Analyse the relevant data
  5. Organise and present our findings
  6. Make recommendations based on our findings

Within this workflow, where do hypothesis tests take place, and why?

I hope you can help me to satisfy my curiosity.
Thanks heaps in advance,

Jimmy

You’re welcome!

Okay, so, this is scientific notation. Python (and libraries such as Pandas) displays very small or very large numbers this way so it doesn’t have to print a long string of zeros.
It means 3.27 times 10 to the minus 17th power, or 3.27 * 10^-17: the “e-17” part tells you to move the decimal point 17 places to the left.
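
Here’s a small sketch of why the value showed up as 0.000 and how it could be displayed instead. It uses only Python’s built-in string formatting; the variable name `pval` is just for illustration.

```python
pval = 3.27641558827e-17  # the ANOVA p-value from the thread

# Fixed-point formatting and round() keep only 3 decimal places,
# so a number this small collapses to zero.
print('%.3f' % pval)   # 0.000
print(round(pval, 3))  # 0.0

# Scientific-notation formatting keeps the significant digits and the exponent.
print('%.3e' % pval)          # 3.276e-17
print('{:.2e}'.format(pval))  # 3.28e-17
```

So the leading "3.27" is never lost; it just sits 17 decimal places in, far past what `%.3f` or `round(pval, 3)` can show.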

It never occurred to me, but this is something that you can suppress if you choose to. Here’s some info on that:
https://re-thought.com/how-to-suppress-scientific-notation-in-pandas/

Translated? It means that the p-value is far below 0.05 and that you can reject the null. But I think it’s also a good idea to continue to read up on statistical significance testing, p-values, hypotheses, etc.

The second part of your query: yeah, that’s a general idea of what DSs do. I am not one (I’m a data analyst with a background in Sociology, thus the interest in hypothesis testing) but aspire to be one! :slight_smile: Each number in your list could have sub-numbers below it with more details. I think it also matters what industry you’re in. “Gathering data” could also include accessing an API, building a web scraper, etc. I can say that DAs and DSs do a lot of data cleaning. Hypothesis testing would fall under number 4 on your list, after EDA (exploratory data analysis): descriptive statistics first, and then inferential statistics (using your data samples to make generalizations about a population).
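
As a rough picture of that descriptive-then-inferential step, here’s a sketch that redoes the whippet rescue check from the original post using only the standard library. The counts (6 rescues out of 100, expected rate 0.08) come from the thread. The function `binom_two_sided_pvalue` is a hypothetical helper, and its rule for the two-sided p-value (add up every outcome that is no more likely than the observed one) is my assumption about how an exact binomial test is computed, so treat it as an illustration rather than a replacement for scipy.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k (hypothetical helper)."""
    observed = binom_pmf(k, n, p)
    # Small relative tolerance so floating-point ties are still counted.
    cutoff = observed * (1 + 1e-7)
    probs = (binom_pmf(i, n, p) for i in range(n + 1))
    return sum(prob for prob in probs if prob <= cutoff)


num_rescues, num_whippets, expected_rate = 6, 100, 0.08

# Descriptive step: summarize the sample.
observed_rate = num_rescues / num_whippets
print('Observed rescue rate: {:.2f}'.format(observed_rate))

# Inferential step: is 6 out of 100 surprising if the true rate is 8%?
pval = binom_two_sided_pvalue(num_rescues, num_whippets, expected_rate)
print('Two-sided p-value: {:.3f}'.format(pval))
```

With these numbers the p-value comes out well above 0.05, so we fail to reject the null, which agrees with the binomial test output in the original post.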

Besides critical thinking, I think another key skill to have is the ability to explain high level technical concepts to nontechnical audiences (stakeholders). So, know your audience when presenting. For instance–in presentations don’t go into too much technical detail, but, have it at the ready and be able to explain your analysis and findings more technically if asked (obv).

I highly encourage you (and others) to read up on what exactly DS (or data analysts) actually do–day to day, what types of tools they use and technologies they should be familiar with.
ex:
https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists

Also, here’s Catherine Zhou, a DS at CC talking about what she does:
https://news.codecademy.com/what-does-a-data-scientist-do/

There’s also a really great blog on Medium called “Towards Data Science”.
https://towardsdatascience.com

I hope this helps! :slight_smile:

Hey Lisa,

Thank you so much for your continuous support on my questions. They really help me to fill some of the knowledge gap I have.

Kind Regards,
Jimmy

Have a look at my solution:

https://gist.github.com/ca343772cd36ba2ec2a9ca59a14931b1