FAQ: Statistical Distributions with NumPy - Review

This community-built FAQ covers the “Review” exercise from the lesson “Statistical Distributions with NumPy”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science

Introduction to Statistics with NumPy

FAQs on the exercise Review

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!

Agree with a comment or answer? Like to up-vote the contribution!

Need broader help or resources? Head here.

Looking for motivation to keep learning? Join our wider discussions.

Learn more about how to use this guide.

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

Is someone able to elaborate on the differences between these two scenarios? Also, can someone explain when you are supposed to use each of these two functions, and why?
np.random.normal(loc, scale, size) vs. np.random.binomial(n, p, size)


Maybe I’m misunderstanding something, but in the ‘Election Results’ project we are asked to generate a set of 10,000 binomially distributed values to simulate election surveys. This is fine, but it states that we use the number 10,000 because that’s the number of voters, i.e. generating a 70-person poll for each individual voter, which doesn’t really make sense.

My other small nitpick is that we are encouraged to use binomially distributed sets to estimate the probability of a certain number of successes. This is also mostly fine, but it is an approximation and has a chance (which shrinks as your set grows) of giving you a wrong answer, when there are ways of calculating the exact probability directly.
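To illustrate that nitpick, here’s a sketch comparing simulation against the exact binomial formula. The numbers (a 70-person poll, 54% support, exactly 40 successes) are made up for the example, not taken from the project:

```python
import math

import numpy as np

# Hypothetical scenario: probability of exactly k "yes" answers in an
# n-person poll where each respondent says yes with probability p.
n, p, k = 70, 0.54, 40

# Exact probability from the binomial formula: C(n, k) * p^k * (1-p)^(n-k)
exact = math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Simulated estimate: the fraction of 10,000 simulated polls with exactly k successes
rng = np.random.default_rng(0)
simulated = np.mean(rng.binomial(n, p, size=10_000) == k)

print(f"exact: {exact:.4f}, simulated: {simulated:.4f}")
```

The simulated value wanders around the exact one from run to run; the exact formula gives the same (correct) answer every time.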

Overall the information has been generally correct, just someone who knows statistics here being nitpicky.

1 Like

The first (binomial distribution) will give you a number of values equal to ‘size’ that will be between 0 and N. The mean will be around N * P. Each value represents an individual trial where something of probability P is attempted N times and the successes counted.
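As a quick sketch of that behavior (the numbers here are just illustrative):

```python
import numpy as np

# 10 coin flips per trial (N=10, P=0.5), repeated 100,000 times
N, P = 10, 0.5
draws = np.random.binomial(N, P, size=100_000)

print(draws.min(), draws.max())  # every value lies between 0 and N
print(draws.mean())              # close to N * P = 5
```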

The second case will give you a number of values equal to ‘size’ distributed around the mean (‘loc’), but with the standard deviation (‘scale’) defined instead of a min/max value. This standard pattern of distribution occurs in many real-world datasets.
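A corresponding sketch for the normal case (again with made-up numbers, here loosely modelling heights in cm):

```python
import numpy as np

# 100,000 values centered on loc=170 with standard deviation scale=8
values = np.random.normal(loc=170, scale=8, size=100_000)

print(values.mean())  # close to loc = 170
print(values.std())   # close to scale = 8
```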

I don’t think this course goes into their relationship in any detail, but you can think of the first case as a specific scenario that also satisfies the second case.
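That relationship can be sketched numerically: for large N, binomial draws look approximately normal with loc = N * P and scale = sqrt(N * P * (1 - P)). The parameter values below are arbitrary:

```python
import numpy as np

N, P = 1000, 0.3
binom_draws = np.random.binomial(N, P, size=100_000)

# Matching normal approximation: mean N*P, standard deviation sqrt(N*P*(1-P))
normal_draws = np.random.normal(loc=N * P, scale=np.sqrt(N * P * (1 - P)), size=100_000)

print(binom_draws.mean(), normal_draws.mean())  # both near N * P = 300
print(binom_draws.std(), normal_draws.std())    # both near sqrt(210), about 14.5
```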

Can someone please help me understand what the parameter normed=True within plt.hist() does?

Here on the manual it says:

normed : bool, optional

Deprecated; use the density keyword argument instead.

density : bool, optional

If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations. If stacked is also True, the sum of the histograms is normalized to 1.

Default is None for both normed and density. If either is set, then that value will be used. If neither are set, then the args will be treated as False.

If both density and normed are set an error is raised.

I don’t think I understand the explanation on the manual…

Deprecated means it’s an argument you should now avoid using, as it will probably be removed at some point. The current docs don’t actually mention normed: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

In short, don’t use normed at all unless you’re using older versions of matplotlib that don’t support the better alternatives.

Say you had 100 total observations, your first bin (x-axis) spanned 0 to 4, and the number of observations falling within this bin interval was 20. With density=True, the new height (frequency density) of this bin would then be 20 / (100 * 4), or 0.05. The calculation is frequency_density(interval) = observations / (total_observations * interval_width).

This ensures that if you sum each bin’s height * width (its area), the combined total is 1. If you just plotted the frequency (even a normalised frequency) rather than frequency density, then certain bins might appear to contain far more observations than they actually do (unequal bin sizes make interpretation difficult otherwise).
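You can check that area calculation numerically with np.histogram, which does the same maths as plt.hist without drawing anything (the data and bin edges below are just an example, deliberately with unequal bin widths):

```python
import numpy as np

# 10,000 standard-normal observations, binned into unequal-width bins
data = np.random.normal(0, 1, size=10_000)
counts, edges = np.histogram(data, bins=[-4, -1, 0, 0.5, 4], density=True)

# With density=True, each count is observations / (total * bin_width),
# so the total area (height * width summed over bins) is exactly 1.
bin_widths = np.diff(edges)
print((counts * bin_widths).sum())  # 1.0
```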

Wikipedia’s article on histograms goes into what makes a histogram and why normalising it is often useful.

1 Like

Thank you @tgrtim for the detailed explanation!

1 Like