You bring up a good point. Technically, the test results don’t prove the Artery Package makes you healthier, but they do show a correlation between using the Artery Pack and higher Iron levels. To understand better, let’s dive in a little:
Goal of the Project
In this exercise, you are imagining that you work for a startup and you are trying to find insights in the data that allow you to make marketing claims. Ideally, you will want to show that both packs have health benefits, but then up-sell the customers on the Artery Pack by showing that it has additional benefits over the Vein Pack.
Understanding the Data
In this section of the project, you weren’t able to show that the Artery Pack increased the subscribers’ lifespans. However, lifespan may not be the only way to measure a health benefit. Low Iron levels can lead to Iron-Deficiency Anemia, which can result in health problems. So, when you look at the table below, you can see that 70% of Vein Pack subscribers report having low Iron levels, while only 20% of Artery Pack subscribers report the same.
| Iron level | Vein Pack | Artery Pack |
|---|---|---|
| Low | 200 * 0.7 | 145 * 0.2 |
| | 200 * 0.2 | 145 * 0.6 |
| | 200 * 0.1 | 145 * 0.2 |
This could mean something if it is not simply due to chance. To check, we run a Chi-squared test. A Chi-squared test essentially tells us whether two categorical variables (here, which pack a customer subscribes to and their Iron level) are associated. Our null hypothesis is that there is no association between the two (meaning any difference we think we see is due to chance). We reject the null hypothesis if the p-value comes back below 0.05.
Because our p-value for this test comes back as 2.92271335499e-19, far below 0.05, we can reject the null hypothesis: the difference we observe between the two groups is statistically significant. This means that we can confidently say that there is an association between subscribing to the Artery Pack and the customers’ Iron levels.
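If you want to re-run the test yourself, here is a minimal sketch using scipy.stats.chi2_contingency on the counts from the table above. The variable names are my own, and I am assuming the rows are Iron-level categories (the first being low Iron) and the columns are the Vein Pack and Artery Pack; it should give a p-value on the same order as the one quoted above.

```python
from scipy.stats import chi2_contingency

# Rows: Iron-level categories (first row = low Iron).
# Columns: Vein Pack (200 subscribers), Artery Pack (145 subscribers).
# The layout is an assumption based on the table in this post.
iron_table = [
    [200 * 0.7, 145 * 0.2],
    [200 * 0.2, 145 * 0.6],
    [200 * 0.1, 145 * 0.2],
]

# chi2_contingency returns the test statistic, the p-value,
# the degrees of freedom, and the expected counts under the null.
chi2, p_value, dof, expected = chi2_contingency(iron_table)

print("chi2 statistic:", chi2)
print("p-value:", p_value)
print("degrees of freedom:", dof)

if p_value < 0.05:
    print("Reject the null hypothesis: pack and Iron level are associated.")
else:
    print("Cannot reject the null hypothesis.")
```

Note that a tiny p-value only tells you the association is unlikely to be chance; it says nothing about which pack caused what.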
Of course, this doesn’t really show that "The Artery Package Is Proven To Make You Healthier!" This marketing claim is absolutely exaggerated, no question about it. However, it might be legally permissible depending on the advertising laws of your country (advertisers get a lot of leeway, particularly when it comes to health supplements).
The most important takeaway here is how to perform a Chi-squared test to check for significant differences between categories of data.
Diving into the weeds: the flaws of this project and some food for thought
For most people, I think the above explanation should suffice, but if you want to really dissect this project, consider the following:
Iron levels — Better or worse?
We have shown that there is a significant difference between the Iron levels of Vein Pack and Artery Pack subscribers, and we’ve focused on the positive correlation to the Artery Pack. But what if the data is really showing a negative correlation to the Vein Pack? In this exercise we don’t know the customers’ Iron levels before they use the products, so our analysis cannot tell two explanations apart: that the Artery Pack boosted our customers’ Iron levels (our initial conclusion), or that the Vein Pack lowered them. Without that baseline data, our conclusion is misleading (at best), or completely wrong (at worst).
The Vein Pack increases life expectancy — or does it?
During the first part of this project, we determined that "The Vein Pack Is Proven To Make You Live Longer!" by using a 1-Sample T-Test. Although this is another example of marketing exaggeration, the really concerning thing is that our T-Test never actually showed that life expectancy for Vein Pack users is higher than the average life expectancy of 71; it only showed that the Vein Pack users’ average lifespan is significantly different from 71. This is because scipy.stats.ttest_1samp performs a two-sided test by default, rather than a one-sided test (see the SciPy documentation).
So how do we know the direction of the significance? Is it higher or lower (longer or shorter life expectancy)? If you’re interested in how to find this out, check out my recent answer to that exact question here: Doubt about the two-sided t-test in a project
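As a quick illustration of one common approach, the sketch below checks the sign of the t-statistic and halves the two-sided p-value to get a one-sided one. The lifespans here are made-up numbers generated for the example (not the project's data), and the variable names are my own.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical lifespans in years, standing in for the Vein Pack data.
rng = np.random.default_rng(0)
vein_lifespans = rng.normal(loc=74, scale=6, size=200)

POPULATION_MEAN = 71  # average life expectancy used in the project

# Two-sided test: is the sample mean different from 71 at all?
t_stat, p_two_sided = ttest_1samp(vein_lifespans, POPULATION_MEAN)

# The sign of the t-statistic gives the direction of the difference.
# For the one-sided hypothesis "mean > 71", halve the two-sided
# p-value when t is positive; otherwise the one-sided p-value is large.
if t_stat > 0:
    p_greater = p_two_sided / 2
else:
    p_greater = 1 - p_two_sided / 2

print("sample mean:", vein_lifespans.mean())
print("t-statistic:", t_stat)
print("two-sided p-value:", p_two_sided)
print("one-sided p-value (mean > 71):", p_greater)
```

Only if the t-statistic is positive and the one-sided p-value is below 0.05 would the data support "higher than 71" rather than just "different from 71".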
Stay inquisitive, and happy coding!