Python scipy.stats chi2_contingency

I have a question about interpreting the chi2 contingency test from the scipy.stats package.

In an example from the Familiar project, in the Hypothesis Testing part of the Data Analysis with Python course on Codecademy, two different packages were compared.

Iron level | vein | artery
low | 200 * 0.7 | 145 * 0.2
normal | 200 * 0.2 | 145 * 0.2
high | 200 * 0.1 | 145 * 0.6

After observing that the p-value from the chi2 contingency test is less than 0.05, the project claimed that
‘The Artery Package Is Proven To Make You Healthier’!!!

The test only shows that iron level depends on the package, not that one package is superior to the other.
I would like to understand how one can conclude which package makes one healthier.

I appreciate any help.



I did this project a few years ago. Let me go back and look at my code and see if I can help you out here.

But first, if I recall correctly, I think there are two parts to this project: one is a one-sample t-test comparing the sample mean to the population mean, and the second is a chi-square test on the iron levels, checking whether there is a significant difference between the two packs (right?)



You bring up a good point. Technically, the test results don’t prove the Artery Package makes you healthier, but they do show a correlation between using the Artery Pack and higher Iron levels. To understand better, let’s dive in a little:

Goal of the Project

In this exercise, you are imagining that you work for a startup and you are trying to find insights in the data that allow you to make marketing claims. Ideally, you will want to show that both packs have health benefits, but then up-sell the customers on the Artery Pack by showing that it has additional benefits over the Vein Pack.

Understanding the Data

In this section of the project, you weren’t able to show that the Artery Pack increased the subscribers’ lifespans. However, lifespan may not be the only way to measure a health benefit. Low Iron levels can lead to Iron-Deficiency Anemia, which can result in health problems. So, when you look at the table below, you can see that 70% of Vein Pack subscribers report having low Iron levels, while only 20% of Artery Pack subscribers report the same.

Iron Level | Vein Pack | Artery Pack
Low | 200 * 0.7 | 145 * 0.2
Normal | 200 * 0.2 | 145 * 0.6
High | 200 * 0.1 | 145 * 0.2

This could mean something if it is not simply due to chance. To check, we run a Chi-squared test of independence, which tells us whether the distribution of a categorical variable (here, Iron level) differs significantly between groups (here, the two packs). Our null hypothesis is that Iron level is independent of the pack, meaning any difference we think we see arose by chance. We can reject the null hypothesis at the 0.05 significance level if the p-value comes back below 0.05.


Because our p-value for this test comes back as 2.92271335499e-19, far below 0.05, we can reject the null hypothesis: the difference we observe between the two packs is statistically significant. The test itself only says that the Iron-level distributions differ; it is the table that shows the direction, with Artery Pack subscribers far less likely to report low Iron levels than Vein Pack subscribers.
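If you want to reproduce this yourself, here is a minimal sketch using scipy.stats.chi2_contingency. The counts are the ones from the table in this answer, written out as whole numbers (for example 200 * 0.7 = 140); the variable names are just my own choices. It also prints each pack's share per Iron level, since that is where the direction of the difference comes from, not from the test.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table in this answer.
# Rows: Low, Normal, High. Columns: Vein Pack, Artery Pack.
observed = np.array([
    [140, 29],   # Low:    200 * 0.7, 145 * 0.2
    [40, 87],    # Normal: 200 * 0.2, 145 * 0.6
    [20, 29],    # High:   200 * 0.1, 145 * 0.2
])

# Test of independence between pack and iron level.
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.3g}")

# The test does not report a direction, so look at the share of each
# pack's subscribers that falls into each iron-level category.
column_shares = observed / observed.sum(axis=0)
print(column_shares.round(2))
```

The p-value should come out on the order of 1e-19, in line with the figure quoted above, and the first row of the shares shows the 70% versus 20% gap in the low category.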

Of course, this doesn’t really show that "The Artery Package Is Proven To Make You Healthier!" This marketing claim is absolutely exaggerated, no question about it. However, it might be completely valid depending on the advertisement laws of your country (advertisers get a lot of leeway, particularly when it comes to health supplements).

The most important takeaway here is how to perform a Chi-squared test to check for significant differences between categories of data.

The Nitty-Gritty

For most people, I think the above explanation should suffice, but if you want to really dissect this project, consider the following:

Iron levels — Better or worse?

We have shown that there is a significant difference between the Iron levels of Vein Pack and Artery Pack subscribers, and we’ve focused on the positive association with the Artery Pack. But what if the data is really showing a negative association with the Vein Pack? In this exercise we don’t know the customers’ Iron levels before they used the products, so our analysis is just as consistent with the Vein Pack lowering our customers’ Iron levels as with our initial conclusion that the Artery Pack boosted them. Without that baseline data, our conclusion is misleading (at best), or completely wrong (at worst).

The Vein Pack increases life expectancy — or does it?

During the first part of this project, we determined that: "The Vein Pack Is Proven To Make You Live Longer!" by using a 1-Sample T-Test. Although this is another example of marketing exaggeration, the really concerning thing is that our T-Test never actually showed that life expectancy for Vein Pack users is higher than the average life expectancy of 71; it only showed that the mean lifespan of Vein Pack users differs significantly from 71. This is because scipy.stats.ttest_1samp performs a two-sided test by default, rather than a one-sided test (documentation here).

So how do we know the direction of the significance? Is it higher or lower (longer or shorter life expectancy)? If you’re interested in how to find this out, check out my recent answer to that exact question here: Doubt about the two-sided t-test in a project
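For a quick idea of what that check can look like, here is a rough sketch. It assumes SciPy 1.6 or later, where ttest_1samp accepts an alternative argument, and the lifespans list is made up purely for illustration (it is not the project's data). The idea is to compare the sample mean with 71 and, if you specifically want to test "longer", to run the one-sided version.

```python
from scipy.stats import ttest_1samp

# Hypothetical lifespans, invented for illustration only
# (NOT the project's actual Vein Pack data).
lifespans = [76, 73, 75, 79, 68, 74, 77, 72, 80, 71]
population_mean = 71  # average life expectancy quoted in the project

# Default behaviour: two-sided test (is the mean different from 71?).
two_sided = ttest_1samp(lifespans, population_mean)

# One-sided test (is the mean greater than 71?).
# Assumes SciPy >= 1.6, which added the `alternative` argument.
one_sided = ttest_1samp(lifespans, population_mean, alternative="greater")

sample_mean = sum(lifespans) / len(lifespans)
print(f"sample mean = {sample_mean}")
print(f"t = {two_sided.statistic:.3f}")
print(f"two-sided p = {two_sided.pvalue:.4f}")
print(f"one-sided p = {one_sided.pvalue:.4f}")
```

The sign of the t statistic (or simply comparing the sample mean to 71) gives the direction, and when the statistic is positive the "greater" p-value is half the two-sided one.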

Stay inquisitive, and happy coding!