FAQ: Why Data Science? - A Day with the Inference Team

This community-built FAQ covers the “A Day with the Inference Team” exercise from the lesson “Why Data Science?”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Data Science Foundations
Data Scientist: Inference Specialist
Data Scientist: Machine Learning Specialist
Data Scientist: Natural Language Processing Specialist
Data Scientist: Analytics Specialist

FAQs on the exercise A Day with the Inference Team

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

Join the Discussion. Help a fellow learner on their journey.

Ask or answer a question about this exercise by clicking reply below!
You can also find further discussion and get answers to your questions over in Language Help.

Agree with a comment or answer? Like it to up-vote the contribution!

Need broader help or resources? Head to Language Help and Tips and Resources. If you are wanting feedback or inspiration for a project, check out Projects.

Looking for motivation to keep learning? Join our wider discussions in Community

Learn more about how to use this guide.

Found a bug? Report it online, or post in Bug Reporting

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

Hey there!
I just don’t get how it works. This mean and standard deviation, what are they about? What do they mean? Why should the distribution change? If we have these learning hours for Codecademy, how and why are they changing when we change the mean and standard deviation? How does it work?

  • The mean is a statistic that gives us a single value summarizing all our data. The plot will move to the right as the mean gets bigger.

  • The standard deviation tells us how spread out our data are. So if you increase it, the plot will look wider.
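A quick way to see both effects is to compare the densities of a few normal distributions. Here is a minimal sketch using Python's standard-library `statistics.NormalDist`; the means and sigmas are made-up illustration values, not anything from the exercise:

```python
from statistics import NormalDist

base = NormalDist(mu=10, sigma=2)     # e.g. a mean of 10 learning hours
shifted = NormalDist(mu=15, sigma=2)  # bigger mean: same shape, moved right
wider = NormalDist(mu=10, sigma=4)    # bigger sigma: same center, flatter and wider

# The peak of the bell curve sits at the mean:
print(base.pdf(10) > base.pdf(12))        # True: density is highest at mu
print(shifted.pdf(15) > shifted.pdf(10))  # True: the peak has moved to 15

# A larger sigma lowers and widens the peak:
print(wider.pdf(10) < base.pdf(10))       # True: flatter curve at the center
```

So increasing the mean slides the whole curve along the x-axis, while increasing the standard deviation stretches it sideways without moving its center.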


Part of the reason you might not understand it is because the normal distribution is not labeled with any values.

In statistics, the mean is most commonly denoted µ (pronounced “mu”). It represents the average value, or the center, of the distribution, and it indicates the location of the peak of the bell-shaped curve.

The standard deviation, most often represented as σ and pronounced “sigma,” measures the spread or dispersion of the data points around the mean. It tells us how much the values vary from the mean. A smaller standard deviation indicates that the data points are closer to the mean, resulting in a narrower and taller bell curve. A larger standard deviation means the data points are more spread out, leading to a wider and flatter curve.

Together, the mean and standard deviation fully define a normal distribution. By knowing these two parameters, you can understand the characteristics of the distribution, such as the probability of values falling within certain ranges or the likelihood of observing particular values.
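For example, those range probabilities can be computed directly from the two parameters. A minimal sketch with Python's `statistics.NormalDist` (the mean and sigma here are made up for illustration):

```python
from statistics import NormalDist

# A normal distribution with mean 5 hours and standard deviation 1.5 hours
hours = NormalDist(mu=5, sigma=1.5)

# Probability that a value falls within one standard deviation of the mean
within_one_sd = hours.cdf(5 + 1.5) - hours.cdf(5 - 1.5)
print(round(within_one_sd, 4))  # ~0.6827, the familiar "68%" of the 68-95-99.7 rule
```

The same subtraction of two `cdf` values gives the probability for any range you care about.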

I recommend reviewing some intro statistics material. Try Khan Academy, which is free, or OpenIntro, also free, and that one has free textbooks. I think real soon I’m gonna have to review stats myself!


There is a word that sneaks in there that is just a wee bit deceiving (not purposely, of course): 'Normal'. Why is it called ‘normal distribution’?

In planar geometry where we have two axes, x and y, there are four quadrants divided by the intersecting axes such that both x and y are zero at the point of intersection, the Origin.

Now, since the Real numbers are distributed evenly above and below the x-axis, and to the left and right of the y-axis, the space in the graph is cut into four equal quadrants.

We learn that values along the x-axis are independent and are given the name ‘domain’. This is the expanse of our independently arising data, or x-values.

Conversely, the y-value is a function of x and within the bounds of the function defined as a given range. So we have ‘domain’ and ‘range’ as descriptive terms for our graph.

That is graphing basics, but we’re not done. We wouldn’t have a graph if we didn’t have a function. The graph describes the curve of the function, or in some/many cases, the curve of the relation or simply put, the locus of the curve. Forget all this stuff for now.

A line is the most elementary curve there is. It has zero curvature, so why do we still call it a curve? Well, it is a locus. It describes a path. A line is a path from point A to point B. That makes it a curve.

Back to this even distribution of the Cartesian Plane (the 2-D thing we opened with): it follows that the y-axis cuts the x-axis exactly in half, since we said everything is distributed evenly about the Origin. Given this exemplary status, the y-axis has earned the special title, Normal.

Thinking about all the above, we may conclude that the Normal always intersects the x-axis. Bring statistics into the frame and now we have a specialized use for this 2-D framework.

First, the Normal is always the statistical mean (aka average) of a sample or population data set. Population means everything and sample means a segment of it. We usually can’t measure everyone, so we work with samples. That’s the basic difference if you see these terms in your reading.

If you have 3000 students in your school and you polled all of them, giving you 3000 data points, that would also qualify as a population. From that you could derive segments that would be random samples from the data set. Selective samples have questionable statistical value given the bias, which will be reflected in a Standard Normal Curve of that data. It will be skewed one way or the other, and not render with any balance or symmetry.

The cat is out of the bag now that I used the term ‘Standard Normal Curve’, and I can only hope that the reader is making a connection to the graph described above in the planar coordinate system. Statistics is really just a system of math with built-in constraints, one of which is the Normal. It depends on having a midpoint, which is conveniently provided by the Origin, so long as there is some way to translate the data to reflect about that point.

To the rescue comes the z-score, which is calculated about the mean (µ). The z-score of µ is zero. Voila, distribution about the Normal. It takes a minute to think about this, but just remember that under typical circumstances, all Real numbers are evenly distributed about both axes, and more particularly for our purposes, about the y-axis, the Normal.
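Concretely, the z-score is just (x − µ) / σ: subtract the mean, divide by the standard deviation. A minimal sketch (the sample numbers are made up):

```python
def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations it lies from the mean."""
    return (x - mu) / sigma

print(z_score(10, 10, 2))  # 0.0: the mean itself always maps to z = 0
print(z_score(14, 10, 2))  # 2.0: two standard deviations above the mean
print(z_score(6, 10, 2))   # -2.0: two standard deviations below the mean
```

This is the translation that centers any data set on the Origin, so the Normal really does sit at the mean.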

The funny thing about the z-score is that, in practice, it rarely lands beyond negative 4 on the left of the Normal or plus 4 on the right. To be precise, the z-score, computed from mu (µ) and sigma (σ) (Standard Deviation), is not actually bounded, but for normally distributed data the probability of falling outside ±4 standard deviations is vanishingly small (under 0.01%).

We’re sneaking into the Calculus, but only slightly. An elementary concept of calculus is the limit, or what we commonly refer to as the ‘limiting value of f(x)’. The bell curve is a function of the z-score, and its height tails off toward zero in both directions without ever touching the axis:

lim f(z) = 0
z -> -∞
lim f(z) = 0
z -> ∞

So the z-score itself is not capped at ±4; rather, the curve’s height, and with it the probability, becomes negligible beyond that point, which is why the picture is usually drawn from −4 to 4. It is the Maths that create this practical limit, not anything arbitrary.
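One can check just how little probability lies beyond ±4 standard deviations. A minimal sketch using `statistics.NormalDist` for the standard normal (mean 0, sigma 1):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Probability mass within +/- 4 standard deviations of the mean
inside = std_normal.cdf(4) - std_normal.cdf(-4)
print(round(inside, 6))  # ~0.999937: less than 0.01% lies out in the tails
```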

We will by now be used to where the Normal lies in a Standard Normal Curve. The curve itself is drawn over standardized values: subtract mu from x and divide by sigma, giving the z-score. The resulting curve naturally reflects about the Normal, is what we’ll reveal at this point. What do we mean by, ‘reflect’?

Hold up anything in front of a mirror. What do you see? The thing is reflected in its entirety, with some differences. Do you part your hair on the left? It looks like the right in the mirror, right? Think on that.

Numbers on the right of the Normal are positive and increasing to the right. Those on the left are negative and decreasing to the left (a negative number with a large absolute value is smaller than one with a small absolute value, go figure). Negative numbers get smaller by definition of negativity.

Let’s look at the x-axis of a Standard Normal Curve:

--- -3 ---- -2 ---- -1 ---- 0 ---- 1 ---- 2 ---- 3 ---

Remember, 0 is where x equals the MEAN, so the data points are distributed evenly on either side of it: clearly the Normal.

Here is where it gets interesting: the total area under the curve of this graph is exactly 1. Even the slice between −4 and +4 comes eerily close to 1 all on its own, since almost nothing is left in the tails.
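We can check this numerically with a simple trapezoid-rule integration of the standard normal density; the step count and integration bounds here are arbitrary choices for the sketch:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def area_under_pdf(lo, hi, steps=10_000):
    """Trapezoid-rule estimate of the area under the density between lo and hi."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        a = lo + i * width
        b = a + width
        total += (std_normal.pdf(a) + std_normal.pdf(b)) / 2 * width
    return total

print(round(area_under_pdf(-8, 8), 6))  # ~1.0: the whole curve encloses unit area
print(round(area_under_pdf(-4, 4), 6))  # ~0.999937: almost all of it within +/-4
```

That unit area is exactly what lets us read probabilities off slices of the curve.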

Okay, now the question arises, what do those z-scores actually represent? Answer, Standard Deviations. Each point on the x-axis is a sigma. Our data is normally distributed across those division lines in our graph. When there are no skewing factors, such as the bias earlier described, the data falls evenly on both sides of the Normal, and is reflected equally in each Standard Deviation (graph segment between z-scores).

Starting to make sense? I doubt it, but let’s keep going. Using the Standard Normal Curve we can weight an analysis of a population or sample such that all of them fall somewhere under this curve. Universities use the Bell Curve to award grades (position the student from left to right in terms of merit) which can be a bit of a drag if you end up on the wrong side of the curve, but hey, they need some statistical approach or their funding could be ‘skewed’ to the left.

If one’s grade lands more than one Standard Deviation to the left of the Normal (meaning left of −1), forget a passing grade, since it means roughly 84% of the class scored ahead of that. One does not want to be in this group at the end of the term.
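That 84% figure falls straight out of the standard normal CDF; a minimal sketch:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

below = std_normal.cdf(-1)  # fraction of the class at or below z = -1
above = 1 - below           # fraction scoring higher
print(round(below, 4))  # ~0.1587: about 16% of students land here
print(round(above, 4))  # ~0.8413: roughly 84% of the class is ahead
```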

But how do we arrive at sigma? Unfortunately that requires some Maths: take each data point’s deviation from the mean, square it, add the squares up, divide by their unit count, then take the square root. I’m not going to get into the particulars; that is for you to do, assuming you made it this far.
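As a sketch, here is that recipe written out by hand next to the standard library's own `statistics.pstdev` (the data values are made up):

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7, 9, 6]

mu = sum(data) / len(data)                      # the mean
squared_deviations = [(x - mu) ** 2 for x in data]
sigma = math.sqrt(sum(squared_deviations) / len(data))  # population std dev

# The standard library agrees with the hand-rolled version:
print(math.isclose(sigma, statistics.pstdev(data)))  # True
```

Note this is the population formula (divide by n); for a sample you would divide by n − 1 instead, which is what `statistics.stdev` does.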

Bottom line, Statistics is Maths. If that scares you, then this is not a unit or a field you want to get into. Once you learn Statistics, though, a whole plethora of scientific fields opens up for you to explore.


I would add that being a data scientist (or data analyst) is like being a detective or archaeologist, b/c you’re ultimately a storyteller: you’re uncovering patterns and correlations (and maybe [hopefully] causality) in the data. If you’re not good at math, that doesn’t mean you shouldn’t go into data. You don’t need to be a math major to go into data. You do have to be open and willing to learn it, understand it, and know how to apply it.


If data could be categorized as fossil records in rock, it would surely be fixed. Data in real-time terms is anything but archaeological (or paleontological). It’s spewing out like flood basalt on a shield volcano. However, I agree that there is a story to be told if the observer is able to suss it from the data, data that would be qualified and constrained under experimental criteria.

Data based on social criteria is not only difficult to collect, but suspect in so many ways, thus very hard to objectify. An analyst with skills in this area might see patterns and trends where the math doesn’t immediately reveal it. On this basis I must walk back my statement and make Maths partly optional. Correct, we do not need to be Maths majors. Anyone who is smart enough to deal with the above described problem has someone around them who can do the math.

I meant it more figuratively, in the sense of discovery and uncovering things. That, and just as an archaeologist or paleontologist would use a brush to carefully remove debris from a fossil or artifact, one is also careful with the data when deleting empty cells or NULLs, for example, b/c they may be important to the overall analysis.

As for social data or questionnaires: yea, they can be difficult to collect, but that’s why creating (unbiased) survey questions is a skill to have and to learn as well. With demographic data (Census data), you can uncover patterns, but you won’t know until you do some random sampling surveys for more specifics. I’m just thinking about all of this as one w/ a graduate degree in sociology and via my own experiences. TBH, the only time math ever really made sense to me was when I took stats in grad school. Why? B/c I finally had an excellent professor and was dealing with data that I actually cared about. :slight_smile:
:woman_technologist: :bar_chart: :chart_with_upwards_trend: