FAQ: Associations: Quantitative and Categorical Variables - Inspecting Overlapping Histograms

This community-built FAQ covers the “Inspecting Overlapping Histograms” exercise from the lesson “Associations: Quantitative and Categorical Variables”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Master Statistics with Python

FAQs on the exercise Inspecting Overlapping Histograms



How come we don’t use seaborn (sns.histplot()) in this lesson to draw the histogram as we did in the previous lessons?

Furthermore: Are there different use cases for the two methods? Is it possible to draw overlapping histograms with seaborn as well?


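For Seaborn, there are different parameters you can add to the plot. (I think multiple = "dodge" is meant for comparing groups, but it places the bars side by side; the default, multiple = "layer", is the one that draws them overlapping.)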

Check the docs:
https://seaborn.pydata.org/generated/seaborn.histplot.html

Perhaps even something like this could work if we have binary variables?

sns.histplot(data=students, x='G3', alpha=0.5, stat='density', common_norm=False, hue='address')

We could use it for more variables, but it’d become quickly unreadable.


This is about lesson 4 of 6, Inspecting Overlapping Histograms.

How can you tell if the association is strong or weak? What does this ultimately mean if it is weak or strong?

This is what it says in the lesson:
“By inspecting this histogram, we can clearly see that the entire distribution of scores at GP (not just the mean or median) appears slightly shifted to the right (higher) compared to the scores at MS. However, there is also still a lot of overlap between the scores, suggesting that the association is relatively weak.”

So is it the large overlap between the two histograms that makes the association weak?


Did you look at the Cheatsheet for the lesson? (It’s linked within it and explains in detail).

For this, you’re comparing a binary categorical variable (2 schools) and a quantitative variable (scores). You’re trying to see if there’s a meaningful difference between the two groups’ scores, for example between their means. If the histograms overlap almost completely, the plot doesn’t show much of a difference between the means. You could also run a two-sample t-test to check whether the data are consistent with the null hypothesis (null hypothesis: there is no difference between the means; alternative hypothesis: there is a difference between the means). For that, the samples should be large or roughly normally distributed, and the standard deviations of the two samples should be about equal.
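As a rough sketch of what that t-test could look like, here is an example with scipy.stats.ttest_ind. The scores are randomly generated (the means, spreads, and sample sizes are made up, not taken from the lesson's data), and the variable names are only illustrative. Passing equal_var=False runs Welch's version of the test, which does not assume equal standard deviations.

```python
import numpy as np
from scipy import stats

# Made-up scores for two schools; these numbers are NOT the lesson's data.
rng = np.random.default_rng(42)
scores_gp = rng.normal(loc=10.5, scale=3.0, size=300)
scores_ms = rng.normal(loc=9.7, scale=3.0, size=50)

# Student's two-sample t-test: assumes both groups share the same variance.
t_equal, p_equal = stats.ttest_ind(scores_gp, scores_ms)

# Welch's t-test: drops the equal-variance assumption.
t_welch, p_welch = stats.ttest_ind(scores_gp, scores_ms, equal_var=False)

print(f"mean difference: {scores_gp.mean() - scores_ms.mean():.2f}")
print(f"Student t = {t_equal:.2f}, p = {p_equal:.3f}")
print(f"Welch   t = {t_welch:.2f}, p = {p_welch:.3f}")
```

A small p-value is evidence against the null hypothesis of equal means; a large one only means the data do not rule it out.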

Thanks, I saw this from the cheat sheet. But can you explain in another way why overlapping means less association? I was thinking that if the histograms match/overlap, there’d be an association, but here overlap means the opposite.

Overlaid Histograms

Overlaid histograms can be used along with mean and median differences to assess an association between a quantitative variable and a categorical variable. After normalizing the histograms, more overlap indicates less association and less overlap indicates a stronger association. The example image shows math scores for students at two different schools. We see that scores tend to be higher for students at the GP school, but there is a lot of overlap in these distributions — suggesting that the association is not very strong.
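The "more overlap means less association" idea can be checked with numbers. The sketch below is my own illustration, not something from the cheat sheet: histogram_overlap is a hypothetical helper that adds up the area shared by two density-normalized histograms, and the samples are randomly generated.

```python
import numpy as np

def histogram_overlap(a, b, bins=20):
    """Area shared by two density-normalized histograms (0 = none, 1 = identical)."""
    # Use the same bin edges for both groups so the bars line up.
    edges = np.histogram_bin_edges(np.concatenate([a, b]), bins=bins)
    dens_a, _ = np.histogram(a, bins=edges, density=True)
    dens_b, _ = np.histogram(b, bins=edges, density=True)
    widths = np.diff(edges)
    # In each bin, the shared area is the shorter bar times the bin width.
    return float(np.sum(np.minimum(dens_a, dens_b) * widths))

rng = np.random.default_rng(1)
base = rng.normal(10, 3, 2000)
slightly_shifted = rng.normal(11, 3, 2000)   # small shift in the mean
far_shifted = rng.normal(20, 3, 2000)        # large shift in the mean

weak = histogram_overlap(base, slightly_shifted)
strong = histogram_overlap(base, far_shifted)

print(f"overlap with a small shift: {weak:.2f}")    # close to 1: weak association
print(f"overlap with a large shift: {strong:.2f}")  # close to 0: strong association
```

When the overlap is high, knowing which group a score came from tells you little about the score, which is what a weak association means.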

You’re trying to determine if there is a correlation between attendance at a school and the test scores. So, your research question is something like, ‘Does attendance at a certain school affect one’s test scores?’ (Or, maybe it’s due to something else entirely, which you can further test for later).
From the cheat sheet:
“When variables are associated, information about the value of one variable provides information about the value of the other variable.”

That’s the association you’re testing for here–your binary categorical variable, School, and your quantitative variable, Test Scores. If there is no association (or correlation) it could be due to another variable entirely.

You’re trying to find a significant difference between the two (means), so if the histograms overlap in the visualization, there’s not much difference between the means of the two scores from the two different schools. In the previous lesson, about mean and median differences it states: “Highly associated variables tend to have a large mean or median difference.” Take a look at the page on box plots again and see the examples of how the data is distributed.

There is a YT channel that I’ve found really helpful for explaining the logic of hypothesis testing, by Mr. Nystrom, a.k.a. AP Stats Guy. He’s a high school stats teacher who explains stats in an approachable way. He has a video, “Basic Logic of hypothesis testing”, which is really helpful. Another useful video is about the Null and Alternative hypothesis.
In this case, the Null is that school attendance doesn’t affect test scores, the Alt. hypothesis is that school attendance does affect test scores.


@lisalisaj
Hi, what do you mean by a cheatsheet for the lesson?
Does such a thing still exist?
Or it existed in the past?

Looking at it now, 8 months later… I see that there’s no link to concept cheat sheets in this particular lesson. Perhaps they’re now gone(?) Each section and subsection used to have a link at the bottom of the left pane for a “Cheat Sheet” that would take you to another page that had definitions & examples of each concept learned. (You could also download the sheet as a doc.)

If you look under Get Unstuck there’s a link to the CC docs. However, there’s nothing there for testing associations or Seaborn. Better off looking at the Seaborn docs or the links I posted back in Dec. Sadly now, I guess they want people to rely on generative AI for explanations. I wouldn’t bother with that. It’s better to just read documentation.


The excerpts (from the cheatsheet as mentioned by blog1753350213) in this thread match this cheatsheet: https://www.codecademy.com/learn/stats-associations-between-variables/modules/stats-associations-between-variables/cheatsheet


@mtrtmk Nice find. How did you find this cheatsheet, if I can ask? :slightly_smiling_face:
I see that there is a cheatsheets catalogue
https://www.codecademy.com/resources/cheatsheets/all
Did you search for this particular cheatsheet in the catalogue?

Earlier posts in the thread suggested that a cheatsheet did indeed exist previously.

I tried using the search from the main Codecademy page to search for keywords such as seaborn, histogram, etc. That showed some articles and docs on the Codecademy website but not the cheatsheet. I don’t think the cheatsheets show up in the search. However, if you know the name of your course/module, you could navigate to the cheatsheets page and use the browser’s find feature (usually CTRL+F) to search for some keyword.

Since there were some excerpts and quotes from the cheatsheet in the thread, I did a Google search for an exact phrase and limited the results to a specific site:

“information about the value of one variable provides information about the value of the other variable.” site:codecademy.com

The first result was a link to the cheatsheet.


Very nice. I didn’t remember that google search has a “site” parameter.


I’m using Seaborn to try to recreate the overlapping histogram in the exercise, but I can’t figure out why the histogram from the exercise, drawn with plt from matplotlib.pyplot, is slightly shifted compared to the Seaborn one. It doesn’t seem to make much sense, but I welcome any thoughts on this.


The code used is:

# normed=True was removed in Matplotlib 3.1; density=True is the replacement
plt.hist(scores_urban, color = "blue", label = "Urban", density = True, alpha = 0.5)
plt.hist(scores_rural, color = "red", label = "Rural", density = True, alpha = 0.5)

versus using the Seaborn histplot:

sns.histplot(data = students, x = "G3", hue = "address", stat = "density", element="bars", bins = 10, common_norm = False)

Hi. One thing I notice is that in the case of seaborn, you pass a bins parameter to create ten bins, whereas you don’t pass a similar argument to pyplot. Perhaps this is causing the difference.

Another thought, if you have both histogram calls in your Code Editor, you can ask the AI Assistant to explain the difference in the histograms…

The bins parameter specifies how many equal-width intervals (bars) the observations are split into. Setting it is the only way to make the histograms similar: in Matplotlib, plt.hist creates 10 bins by default (you can count them by the number of bars), while Seaborn picks the number of bins automatically unless you specify it, so the two plots can look drastically different if the bin counts don’t match.
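One possible explanation for the slight shift (my reading, not something confirmed in this thread): each plt.hist call splits its own group's min-to-max range into 10 bins, while sns.histplot with hue computes one set of bin edges from all the data by default (common_bins=True). The sketch below uses made-up scores and NumPy only to compare the edges; the array names mirror the ones in the question but the values are random.

```python
import numpy as np

# Made-up integer grades; the two groups cover different ranges on purpose.
rng = np.random.default_rng(7)
scores_urban = rng.integers(5, 21, size=300)   # values from 5 to 20
scores_rural = rng.integers(0, 20, size=90)    # values from 0 to 19

# What two separate plt.hist(..., bins=10) calls would use:
# each group's own range cut into 10 equal bins.
edges_urban = np.histogram_bin_edges(scores_urban, bins=10)
edges_rural = np.histogram_bin_edges(scores_rural, bins=10)

# One set of edges computed from the combined data, shared by both groups.
all_scores = np.concatenate([scores_urban, scores_rural])
shared_edges = np.histogram_bin_edges(all_scores, bins=10)

print("urban edges: ", np.round(edges_urban, 2))
print("rural edges: ", np.round(edges_rural, 2))
print("shared edges:", np.round(shared_edges, 2))

# To line the Matplotlib bars up, pass the shared edges to both calls, e.g.
# plt.hist(scores_urban, bins=shared_edges, density=True, alpha=0.5)
# plt.hist(scores_rural, bins=shared_edges, density=True, alpha=0.5)
```

If the per-group edges differ, the bars of the two Matplotlib histograms start and end at different x positions, which would look like a small horizontal shift next to the Seaborn plot.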

The AI assistant unfortunately isn’t helpful in this case. It talks about the two methods, but I couldn’t find a prompt that gets it to explain the graphical difference between the outputs.