There is a word that sneaks in there that is just a wee bit deceiving (not purposely, of course): 'Normal'. Why is it called ‘normal distribution’?
In planar geometry where we have two axes, x and y, there are four quadrants divided by the intersecting axes such that both x and y are zero at the point of intersection, the Origin.
Now, given that we can only assume that the distribution of Real numbers above and below the x-axis and left and right of the y-axis is exactly even, the space in the graph is exactly cut into four quadrants.
We learn that values along the x-axis are independent, and are given the name domain. This is the expanse of our independently arising data, or x-value.
Conversely, the y-value is a function of x, and the set of values that function can produce is called the range. So we have ‘domain’ and ‘range’ as descriptive terms for our graph.
That is graphing basics, but we’re not done. We wouldn’t have a graph if we didn’t have a function. The graph describes the curve of the function, or in some/many cases, the curve of the relation or simply put, the locus of the curve. Forget all this stuff for now.
A line is the most elementary curve there is. It has zero curvature, so why do we still call it a curve? Well, it is a locus. It describes a path. A line is a path from point A to point B. That makes it a curve.
Back to this even distribution of the Cartesian Plane (the 2-D thing we opened with): it follows that the y-axis cuts the x-axis exactly in half, since we said everything is distributed evenly about the Origin. Given this exemplary status, the y-axis has earned the special title, Normal.
Thinking about all the above, we may conclude that the Normal is always intersecting the x-axis. Bring statistics into the frame and now we have a specialized use for this 2-D framework.
First, the Normal is always the statistical mean (aka average) of a sample or population data set. Population means everything and sample means segment. We may not be able to get everyone’s opinion on something but we can get all the medical data on everyone on the planet. That’s the basic difference if you see these terms in your reading.
If you have 3000 students in your school and you polled all of them, giving you 3000 data points, that would also qualify as a population. From that you could derive segments that would be random samples from the data set. Selective samples have questionable statistical value given the bias, which will be reflected in a Standard Normal Curve of that data. It will be skewed one way or the other, and not render with any balance or symmetry.
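To see the difference between a random sample and a selective one, here is a minimal sketch. The 3000 scores are made up for illustration (drawn with a mean of 70 and a spread of 10, both assumptions), and the "selective" sample simply takes the 300 highest scorers, which is one hypothetical way a sample could be biased.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: one score for each of the 3000 students.
population = [random.gauss(70, 10) for _ in range(3000)]

# A random sample: every student is equally likely to be picked.
random_sample = random.sample(population, 300)

# A selective (biased) sample: only the 300 highest scorers.
selective_sample = sorted(population)[-300:]

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
selective_mean = statistics.mean(selective_sample)

print("population mean:", round(pop_mean, 2))
print("random sample mean:", round(random_mean, 2))
print("selective sample mean:", round(selective_mean, 2))
```

The random sample's mean should sit close to the population's, while the selective one is pulled well to the right.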
The cat is out of the bag now that I have used that term, ‘Standard Normal Curve’, and I can only hope that the reader is making a connection to the above described graph in the planar coordinate system. Statistics is really just a system of math with built-in constraints, one of which is the Normal. It depends on having a midpoint, which is conveniently provided by the Origin so long as there is some way to translate the data to reflect about that point.
To the rescue comes the z-score which is calculated about the mean (mu). The z-score of mu is zero. Voila, distribution about the Normal. Takes a minute to think about this but just remember that under typical circumstances, all Real numbers are evenly distributed about both axes, and more particularly for our purposes, about the y-axis, the Normal.
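A small sketch of that translation, using the usual z-score recipe (subtract mu, divide by sigma). The `heights` list and the `z_scores` helper are made up for illustration, and sigma here is assumed to be the population Standard Deviation.

```python
import statistics

def z_scores(data):
    """Return the z-score of every value: (x - mu) / sigma."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in data]

heights = [150, 160, 170, 180, 190]  # hypothetical data; mu is 170
zs = z_scores(heights)

print(zs)                 # the middle value, equal to mu, gets a z-score of 0
print(sum(zs) / len(zs))  # the z-scores themselves average out to 0 (up to rounding)
```

Whatever the original units were, the translated data now balances about zero.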
The funny thing about the z-score is that, in practice, it hardly ever reaches negative 4 on the left of the Normal, nor plus 4 on the right of the Normal. The function that derives this value is computed from both mu (µ) and sigma (σ) (Standard Deviation), and for normally distributed data about 99.99% of the values land between those lower and upper marks. There is no hard wall there, though: the tails of the curve run on forever, they just get vanishingly thin.
We’re sneaking into the Calculus but only slightly. An elementary concept of calculus is limits, or what we commonly refer to as the ‘limiting value of f(x)’. In other words, it is the value the function closes in on, without necessarily ever reaching it, as x approaches from the right, from the left, or both.
If f(x) is the function that defines the height of the Standard Normal Curve above each z-score x, then,
lim f(x) = 0
x -> -∞
lim f(x) = 0
x -> ∞
Recall the z-score is a function that we have assigned to the domain. We cannot restrict or limit the domain in a usual sense, but the function that draws the curve over that domain flattens toward zero so quickly that almost nothing is left beyond 4 on either side. So it is the Maths that make extreme z-scores so rare, not anything arbitrary. Looking a little bit quadratic, perchance?
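To put numbers on how rarely a z-score lands beyond 4 in either direction, here is a minimal sketch. It assumes the data really is normally distributed and uses the standard identity that the share within k Standard Deviations is erf(k / √2); the helper name `fraction_within` is made up.

```python
import math

def fraction_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(k, round(fraction_within(k), 5))
```

The share climbs quickly toward 1 as k grows but stays just short of it, which is why values past 4 are so seldom seen.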
We will by now be used to where the Normal lies in a Standard Normal Curve. The curve itself is described by a function which we will reserve for the moment. The z-score we can give away, though: subtract mu from x and divide by sigma. The curve naturally reflects about the Normal, is what we’ll reveal at this point. What do we mean by ‘reflect’?
Hold up anything in front of a mirror. What do you see? The thing is reflected in its entirety, with some differences. Do you part your hair on the left? It looks like the right in the mirror, right? Think on that.
Numbers on the right of the Normal are positive and increasing to the right. Those on the left are negative and decreasing to the left (a big negative number is smaller than a small one, go figure). The further left a negative number sits, the smaller it is.
Let’s look at the x-axis of a Standard Normal Curve:
--- -3 ---- -2 ---- -1 ---- 0 ---- 1 ---- 2 ---- 3 ---
Remember, 0 marks the MEAN: it is the z-score of mu, so the data points are distributed evenly on either side of it, and that is clearly the Normal.
Here is where it gets interesting: The total area under the curve of this graph is exactly 1. Between z-scores of -4 and 4 the area already comes eerily close to 1 (about 0.99994), with the sliver that remains tucked away in the far tails.
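A rough numerical check of that area. This sketch assumes the usual formula for the standard normal density, which the text has not spelled out, and adds up thin slices with the trapezoid rule; the function names and the step count are arbitrary choices.

```python
import math

def standard_normal_pdf(z):
    """Height of the standard normal curve at z (the usual density formula)."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def area_under_curve(lo, hi, steps=10_000):
    """Approximate the area between lo and hi with the trapezoid rule."""
    width = (hi - lo) / steps
    total = 0.5 * (standard_normal_pdf(lo) + standard_normal_pdf(hi))
    for i in range(1, steps):
        total += standard_normal_pdf(lo + i * width)
    return total * width

print(area_under_curve(-4, 4))    # just shy of 1
print(area_under_curve(-10, 10))  # the wider the interval, the closer to 1
```

Because the z in the formula is squared, the height at -z matches the height at z, which is the reflection about the Normal in numeric form.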
Okay, now the question arises, what do those z-scores actually represent? Answer: Standard Deviations. Each whole-number step along the x-axis is one sigma. Our data is normally distributed across those division lines in our graph. When there are no skewing factors, such as the bias described earlier, the data falls evenly on both sides of the Normal, and each Standard Deviation (graph segment between z-scores) on one side is mirrored by its partner on the other.
Starting to make sense? I doubt it, but let’s keep going. Using the Standard Normal Curve we can weight an analysis of a population or sample such that all of them fall somewhere under this curve. Universities use the Bell Curve to award grades (position the student from left to right in terms of merit) which can be a bit of a drag if you end up on the wrong side of the curve, but hey, they need some statistical approach or their funding could be ‘skewed’ to the left.
If one’s grade puts them more than one Standard Deviation to the left of the Normal (meaning a z-score left of -1), forget about a passing grade, since it means more than 80% of the class (roughly 84%, on a normal curve) was ahead of them. One does not want to be in this group at the end of the term.
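The share of the class on each side of z = -1 can be checked with Python's standard library, assuming the grades follow a normal curve exactly (real classes rarely do):

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

below = standard_normal.cdf(-1)  # share of the class to the left of z = -1
ahead = 1 - below                # share of the class to the right of it

print(round(below, 4), round(ahead, 4))
```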
But how do we arrive at sigma? Unfortunately that requires some Maths, which is as much as to say the square root of a bunch of squares (the differences between each data point and the mean, squared), first added up then divided by their count. I’m not going to get into the particulars; that is for you to do, assuming you made it this far.
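For anyone who wants a head start on those particulars, here is a minimal sketch of the recipe, assuming the data is a whole population (for a sample, the usual convention divides by one less than the count). The `scores` list and the `population_sigma` name are made up for illustration.

```python
import math
import statistics

def population_sigma(data):
    """Square root of the average squared difference from the mean."""
    mu = sum(data) / len(data)
    squared_diffs = [(x - mu) ** 2 for x in data]
    return math.sqrt(sum(squared_diffs) / len(data))

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; the mean is 5

print(population_sigma(scores))   # 2.0
print(statistics.pstdev(scores))  # the standard library's version, for comparison
```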
Bottom line, Statistics is Maths. If that scares you then this is not a unit or a field you want to get into. Once you learn Statistics, a whole plethora of scientific fields become open for you to explore.