Hello,
I am confused about why we have to assign two variables when calculating the Pearson correlation. For example:
from scipy.stats import pearsonr
corr_price_sqfeet, p = pearsonr(housing.price, housing.sqfeet)
print(corr_price_sqfeet)
Here I am wondering why we assign both corr_price_sqfeet and p to pearsonr(housing.price, housing.sqfeet). What does p represent?
Thanks in advance!
It's because you're measuring the linear dependence between two variables, in this case housing price and square feet. The coefficient ranges from -1 to 1: negative values mean a negative relationship, positive values mean a positive one, and 0 means no linear relationship.
More:
https://docs.scipy.org/doc//scipy-1.2.3/reference/generated/scipy.stats.pearsonr.html
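To make the -1 to 1 range concrete, here is a small sketch in plain Python. The pearson_r helper and the toy numbers are made up for illustration (they are not scipy code and not the housing data); the helper just implements the textbook formula.

```python
import math

def pearson_r(xs, ys):
    """Textbook sample Pearson correlation (illustrative helper, not scipy's code)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy data, chosen so the three cases are easy to see.
x = [-2, -1, 0, 1, 2]

r_up = pearson_r(x, [2 * v + 1 for v in x])   # y rises with x in a straight line
r_down = pearson_r(x, [-3 * v for v in x])    # y falls as x rises
r_curve = pearson_r(x, [v ** 2 for v in x])   # y depends on x, but not linearly

print(r_up, r_down, r_curve)
```

The last case is why "no relationship" is better read as "no linear relationship": y is completely determined by x there, yet the coefficient comes out at 0.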
Is this an instance of unpacking? That is, would corr_price_sqfeet match to housing.price and p to housing.sqfeet?
If so, I am still not appreciating the functionality of assigning the variables to the arguments in this way.
Hm… I'm not sure I'm following what "unpacking" is.
You've imported the pearsonr function from scipy.stats, and it requires two variables, so you're passing in the two quantitative variables that you think might have a relationship: housing price and square feet.
You're calculating the correlation (corr_price_sqfeet) but also the p-value (though that value only relates to the sample data). The Pearson coefficient on its own doesn't tell us anything about significance, just whether the two quantitative variables in the sample data are positively, negatively, or not at all correlated. You're seeing the strength of the relationship (from -1 to 1).
Usually after a Pearson test one would do a hypothesis test (t-test) to see if there's any significance. Your null hypothesis (H0) would be something like: square footage has no effect on sale price. Your alternative (Ha) would be: square footage does affect sale price.
By "[unpacking](https://www.geeksforgeeks.org/unpacking-a-tuple-in-python/)" I mean assigning variables to arguments. Although, I no longer think this is relevant here.
So in the above example, it seems that corr_price_sqfeet is assigned whatever value pearsonr() returns. But what is the p value? When I print it out, it does not make much sense in relation to my data…
Also, thank you for the general stats insight!
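On the unpacking question, it may help to see the same pattern with a built-in. This sketch uses divmod purely as an analogy (it has nothing to do with correlation): like pearsonr, it takes two arguments and returns two results, and the names on the left are matched by position to the returned values, not to the arguments.

```python
# divmod takes two arguments and returns a tuple of two results.
result = divmod(17, 5)         # the tuple (3, 2)
quotient, remainder = result   # tuple unpacking: first name gets 3, second gets 2

# The same thing in one step, which is the form used with pearsonr above.
q, r = divmod(17, 5)

# If only the first result is wanted, index it or throw the second away with "_".
first_only = divmod(17, 5)[0]
q_only, _ = divmod(17, 5)

print(quotient, remainder, first_only)
```

So corr_price_sqfeet and p are the first and second values that pearsonr returns for the pair of columns; neither one corresponds to housing.price or housing.sqfeet individually.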
Yep, the first value is the Pearson correlation coefficient (often written as r in formulas, where it denotes the sample correlation coefficient).
As for the second thing…
The p-value is used for things like hypothesis testing…
meaning checking "Are they really correlated, or is it something that just happened by chance?"
Or stated in more detail: "Are these variables really correlated in the population? Or did this correlation just happen by chance for this particular sample, and the variables are not actually correlated in the population?"
(The second of those is roughly the null hypothesis and the first is the alternative hypothesis, if you've heard of those.)
The p-value is the probability of seeing a correlation at least this strong in a sample if the second case were true. It is calculated from things including the correlation coefficient and how many observations there are in the sample.
I know this isn't a precise explanation (and parts of this may be inaccurate).
Correct me if I'm wrong please.
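One way to get a feel for the "did this just happen by chance?" idea is a shuffle (permutation) test. This is only a sketch in plain Python: the numbers are made-up toy data, pearson_r and permutation_p_value are hypothetical helper names, and as far as I know scipy computes its p-value from a formula rather than by shuffling, so the two will not match exactly.

```python
import math
import random

def pearson_r(xs, ys):
    """Textbook sample Pearson correlation (illustrative helper, not scipy's code)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Estimate how often random re-pairings give |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Counting the observed pairing itself keeps the estimate above 0.
    return (hits + 1) / (n_shuffles + 1)

# Made-up toy data standing in for the housing columns.
sqfeet = [500, 650, 800, 950, 1100, 1250, 1400, 1550]
price = [900, 1150, 1100, 1500, 1450, 1800, 1750, 2100]

r = pearson_r(sqfeet, price)
p = permutation_p_value(sqfeet, price)
print(r, p)
```

A small p here means that almost no random re-pairing of prices with square footage produced a correlation as strong as the real pairing, which is the sense in which the correlation is unlikely to be a fluke of the sample.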
I vaguely recall doing this lesson but forgot where it is on the DS path. Do you have a link?
Yes, correct.
The p-value indicates whether the correlation is statistically significant (conventionally, when p < 0.05).
But, like I said, this is just a first step in hypothesis testing.
Very cool. Thank you.
Here is the link, if you are still curious 