Is someone able to elaborate on the differences between these two scenarios? Also, can someone explain when you are supposed to use each of these two functions, and why?
np.random.normal(loc, scale, size) vs. np.random.binomial

Maybe I’m misunderstanding something, but in the ‘Election Results’ project we are asked to generate a set of 10,000 binomially distributed values to simulate election surveys. This is fine, but it states that we use the number 10,000 because that’s the number of voters, i.e. generating a 70-person poll for each individual voter, which doesn’t really make sense.

My other small nitpick is that we are encouraged to use binomially distributed sets to estimate the probability of a certain number of successes. This is also mostly fine, but a simulation is only an approximation and has a chance (albeit a smaller chance the larger your set) of giving you a wrong answer, when there are ways of calculating the exact probability directly.
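To illustrate the “exact calculation” route mentioned above: the binomial probability mass function can be computed directly with the standard library, no simulation needed. (The poll size and probability here are arbitrary example numbers, not taken from the project.)

```python
from math import comb

def binom_pmf(k, n, p):
    """Exact probability of exactly k successes in n independent
    trials, each with success probability p (the binomial PMF)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability that exactly 35 of 70 surveyed voters pick a candidate
# whose true support is 50%.
print(binom_pmf(35, 70, 0.5))
```

Unlike a simulated estimate, this answer is exact and does not change from run to run.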

Overall the information has been generally correct, just someone who knows statistics here being nitpicky.

np.random.binomial(n, p, size) will give you a number of values equal to ‘size’, each between 0 and n. The mean will be around n*p. Each value represents a single trial in which something with success probability p is attempted n times and the successes are counted.

np.random.normal(loc, scale, size) will give you a number of values equal to ‘size’, distributed around the mean (‘loc’) with spread set by the standard deviation (‘scale’) rather than by a min/max value. This standard pattern of distribution (the normal, or Gaussian, distribution) occurs in many real-world datasets.
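A minimal sketch of both calls side by side (the parameter values are arbitrary; `rng.binomial`/`rng.normal` are the Generator equivalents of `np.random.binomial`/`np.random.normal`):

```python
import numpy as np

rng = np.random.default_rng(0)

# Binomial: count successes in n=70 trials with p=0.6, repeated 10,000 times.
binom_draws = rng.binomial(n=70, p=0.6, size=10_000)
print(binom_draws.min(), binom_draws.max())  # always within [0, 70]
print(binom_draws.mean())                    # close to n*p = 42

# Normal: values centred on loc=42, with standard deviation scale=4.
normal_draws = rng.normal(loc=42, scale=4, size=10_000)
print(normal_draws.mean())  # close to 42
print(normal_draws.std())   # close to 4
```

Note that the binomial draws are always integers in [0, n], while the normal draws are unbounded floats.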

I don’t think this course goes into their relationship in any detail, but you can think of the binomial case as a specific scenario that is approximately described by the normal case: for large n, a binomial(n, p) distribution looks very much like a normal distribution with loc = n*p and scale = sqrt(n*p*(1-p)).
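To make that relationship concrete: for large n, a binomial(n, p) sample is well approximated by a normal sample with loc = n*p and scale = sqrt(n*p*(1-p)) (the de Moivre–Laplace theorem). A quick check with arbitrary parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, size = 70, 0.6, 100_000

binom = rng.binomial(n, p, size)
# Matching normal approximation: mean n*p, std dev sqrt(n*p*(1-p)).
approx = rng.normal(loc=n * p, scale=np.sqrt(n * p * (1 - p)), size=size)

print(binom.mean(), approx.mean())  # both close to 42
print(binom.std(), approx.std())    # both close to sqrt(16.8) ~ 4.1
```

The two samples have nearly identical means and spreads; the main remaining difference is that the binomial values are discrete and bounded.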

normed : bool, optional

Deprecated; use the density keyword argument instead.

density : bool, optional

If True , the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will sum to 1. This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations. If stacked is also True , the sum of the histograms is normalized to 1.

Default is None for both normed and density . If either is set, then that value will be used. If neither are set, then the args will be treated as False .

If both density and normed are set an error is raised.

I don’t think I understand the explanation on the manual…

In short, don’t use normed at all unless you’re using older versions of matplotlib that don’t support the better alternatives.

Say you have 100 total observations, your first bin (on the x-axis) spans 0 to 4, and 20 observations fall within that bin interval. With density=True the new height (frequency density) of this bin is 20 / (100 * 4) = 0.05; in general, frequency_density(interval) = observations_in_interval / (total_observations * interval_width).

This is done so that if you sum height * width (the area) of every bin, the total is 1. If you plotted plain frequency (or even a normalised frequency) rather than frequency density, bins of unequal width would be hard to compare: a wide bin might appear to contain far more observations than it actually does.
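The same calculation can be verified with numpy’s histogram function, which takes the same density argument (the data here is constructed so that exactly 20 of 100 observations land in the first bin):

```python
import numpy as np

rng = np.random.default_rng(2)
# 100 observations: exactly 20 in [0, 4), the other 80 in [4, 20).
data = np.concatenate([rng.uniform(0, 4, 20), rng.uniform(4, 20, 80)])

counts, edges = np.histogram(data, bins=[0, 4, 8, 12, 16, 20], density=True)
print(counts[0])  # 20 / (100 * 4) = 0.05
```

The heights times the bin widths sum to 1, which is exactly what density=True guarantees.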

Wikipedia goes into what makes a histogram and why normalising it is often useful.