Calculating Pearson Correlation -- two variables?


I am confused about why we have to assign two variables when calculating the Pearson correlation. For example:

from scipy.stats import pearsonr

corr_price_sqfeet, p = pearsonr(housing.price, housing.sqfeet)


Here I am wondering why we assign both corr_price_sqfeet and p to the result of pearsonr(housing.price, housing.sqfeet). What does p represent?

Thanks in advance!

It’s because you’re measuring the linear dependence between two variables, in this case housing price and square feet. The coefficient ranges from -1 to 1: negative values mean a negative relationship, positive values mean a positive one, and 0 means no linear relationship.


Is this an instance of unpacking? That is, would corr_price_sqfeet match to housing.price and p to housing.sqfeet?
If so, I am still not appreciating the functionality of assigning the variables to the arguments in this way.

Hm…I’m not sure I’m following what “unpacking” is.

You’ve imported the pearsonr function from scipy.stats, and it takes two arguments, so you’re passing in the two quantitative variables that you think might have a relationship: housing price and square feet.
You’re calculating the correlation (corr_price_sqfeet) but also the p value (p), and both are computed from the sample data. The correlation coefficient on its own doesn’t show us any significance; it only shows whether the two quantitative variables in the sample data are positively, negatively, or not at all correlated, and how strong that relationship is (on a scale from -1 to 1).

Usually after a Pearson test one would do a hypothesis test (t-test) to see if there’s any significance. Your null hypothesis (H0) would be something like “square footage is not related to sale price,” and your alternative (Ha) would be “square footage is related to sale price.”
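
To make the two returned values concrete, here is a minimal sketch. The arrays are made-up stand-ins for housing.price and housing.sqfeet (the real data isn’t shown in this thread), so the printed numbers will not match yours. The point is that pearsonr hands back a pair of values, and writing two names on the left of the = splits that pair: the first name gets the correlation coefficient and the second gets the p-value. Neither name is tied to one of the two arguments.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical stand-ins for housing.sqfeet and housing.price (made-up numbers)
sqfeet = np.array([850, 900, 1200, 1500, 1800, 2100, 2400, 3000], dtype=float)
price = np.array([150, 160, 210, 250, 310, 330, 400, 480], dtype=float)  # in thousands

# One call returns two values together
result = pearsonr(price, sqfeet)

# Tuple unpacking: first value -> correlation coefficient, second -> p-value
corr_price_sqfeet, p = result

print("correlation coefficient:", corr_price_sqfeet)
print("p-value:", p)
```

Swapping the order of the arguments (sqfeet first, price second) would give the same two numbers, which is another hint that the names on the left are not matched to the arguments.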

By ‘unpacking’ I mean assigning variables to arguments, although I no longer think this is relevant here.

So in the above example, it seems that corr_price_sqfeet is assigned whatever value pearsonr() returns. But what is the p value? When I print it out, it does not make much sense in relation to my data…

Also, thank you for the general stats insight!

Yep, the first thing is the Pearson correlation coefficient (often represented by r in formulas, where it means the sample correlation coefficient).

As for the second thing …
The p-value is used for stuff like hypothesis testing …
meaning checking “Are they really correlated? Or is it something that just happened by chance?”

Or stated in more detail: “Are these variables really correlated in the population? or did this correlation just happen by chance for this particular sample, and the variables are not actually correlated in the population?”
(That’s roughly the null hypothesis vs. the alternative hypotheses here, if you’ve heard of those.)

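The p-value is a probability tied to the second scenario: roughly, how likely you would be to see a correlation at least this strong in a sample purely by chance if the variables were not actually correlated in the population. (It is calculated from things including the correlation coefficient and how many observations there are in the sample.)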
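
As a rough illustration of how the p-value depends on the correlation coefficient and the sample size, the sketch below recomputes it by hand with the standard t-statistic for a correlation, t = r * sqrt((n - 2) / (1 - r^2)), using n - 2 degrees of freedom. The data are made up, and this assumes the usual two-sided test of “no correlation in the population”. It is meant to show the idea, not necessarily how scipy computes the value internally.

```python
import numpy as np
from scipy import stats

# Made-up sample, not the housing data from the question
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([2, 1, 4, 3, 7, 5, 8, 6], dtype=float)

# What scipy reports
r, p_scipy = stats.pearsonr(x, y)

# Recompute the p-value from r and the sample size n
n = len(x)
t_stat = r * np.sqrt((n - 2) / (1 - r**2))        # t-statistic for a correlation
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided tail probability

print("r:", r)
print("p from pearsonr:", p_scipy)
print("p recomputed by hand:", p_manual)
```

The two p-values should agree up to floating-point rounding. With the same r, a larger n gives a larger t-statistic and so a smaller p-value.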

I know this isn’t a precise explanation (and parts of this may be inaccurate).
Correct me if I’m wrong please.

I vaguely recall doing this lesson but forgot where it is on the DS path. Do you have a link?

Yes, correct.

The p value determines whether the correlation is statistically significant (conventionally, p < 0.05).

But, like I said, this is just a first step in hypothesis testing.
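
For what it’s worth, the decision step can be sketched like this. The helper name is_significant is made up for illustration, the data are invented, and 0.05 is only the conventional cutoff, so treat this as a sketch rather than the one right way to do it.

```python
from scipy.stats import pearsonr

ALPHA = 0.05  # conventional cutoff (an assumption); choose it before looking at the data

def is_significant(p_value, alpha=ALPHA):
    """Hypothetical helper: True if we would reject 'no correlation' at level alpha."""
    return p_value < alpha

# Made-up sample, not the housing data from the question
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 7, 5, 8, 6]

r, p = pearsonr(x, y)

if is_significant(p):
    print(f"r = {r:.2f}, p = {p:.3f}: unlikely to be chance alone")
else:
    print(f"r = {r:.2f}, p = {p:.3f}: could plausibly be chance")
```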

Very cool. Thank you.
Here is the link, if you are still curious :)