What do the horizontal and vertical axes represent in a KDE plot?

Question

In the context of this exercise, what do the horizontal and vertical axes represent in a KDE plot?

Answer

For this lesson, the KDE plots we work with use univariate data, so only one of the axes represents actual values in the data.

The horizontal axis (x-axis) of a KDE plot covers the range of values in the data set. This is similar to the x-axis of a histogram.

The vertical axis (y-axis) of a KDE plot represents the kernel density estimate of the probability density function of a random variable. A height on this axis is a probability density (probability per unit of x), not a probability in itself. The probability of a value falling between the points x1 and x2 is the area under the curve between those two points.
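To make the "area between x1 and x2" idea concrete, here is a minimal sketch using only the Python standard library. It assumes Gaussian kernels and a hand-picked bandwidth of 1.0; the sample values, the function names, and the bandwidth are illustrative choices, not something the lesson prescribes.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function x -> estimated density, built from Gaussian kernels."""
    norm = len(data) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # One bell curve per data point, summed and normalized.
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data) / norm

    return density

def area_between(f, x1, x2, steps=2000):
    """Approximate the area under f from x1 to x2 with the trapezoidal rule."""
    width = (x2 - x1) / steps
    total = 0.5 * (f(x1) + f(x2))
    for i in range(1, steps):
        total += f(x1 + i * width)
    return total * width

# Illustrative univariate sample (assumed values).
data = [1, 3, 5, 4, 3, 6, 4, 9, 7, 5, 4, 7, 6, 6, 2, 5, 4, 4, 6, 5, 8]
density = gaussian_kde(data, bandwidth=1.0)  # bandwidth is an assumption

# Estimated probability that a value falls between 4 and 6.
prob_4_to_6 = area_between(density, 4, 6)
print(f"density at x=5: {density(5):.3f}")
print(f"estimated P(4 <= x <= 6): {prob_4_to_6:.3f}")
```

The y-axis value `density(5)` is only a height; the probability comes from the area, which is why `prob_4_to_6` depends on both the height of the curve and the width of the interval.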


What do the numbers on the y-axis mean?

The y-axis shows the value of the probability density function.

What is a probability density function?

Frequency, in layman's terms.

Okay so the replies here seem rather curt.

The X values are simple: they are the values we pass in. Remember that the values are only one-dimensional, so the x-axis is basically the range over which they occur. If the data were 1 3 5 4 3 6 4 9 7 5 4 7 6 6 2 5 4 4 6 5 8 or something like that, the X values would run from about 1 to 9, with the smoothed curve extending a little past each end.

The Y values here represent the probability density at a given X value: how much probability is concentrated per unit of X around that point. A single Y value is not a probability in itself and can even be greater than 1. If you don’t understand probability distributions, it is worth reading up on them in more depth. But basically the curve should be “normalized”, which means that the total area under it equals 1, meaning a 100% chance that a value will fall somewhere under the curve.
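A quick numerical check of the "normalized" point, as a sketch assuming a Gaussian-kernel KDE with a made-up, tightly clustered sample and a bandwidth of 0.01 (both are assumptions for illustration). It shows that the curve's height can be well above 1 while the total area still comes out at about 1.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

# Tightly clustered, made-up sample: small spread forces a tall curve.
data = [0.10, 0.12, 0.11, 0.13, 0.12]
h = 0.01  # assumed bandwidth

# Midpoint-rule integration over a range that comfortably covers the data.
lo, hi, steps = 0.0, 0.25, 5000
dx = (hi - lo) / steps
heights = [kde(lo + (i + 0.5) * dx, data, h) for i in range(steps)]

peak = max(heights)            # tallest point on the y-axis
total_area = sum(heights) * dx  # should be close to 1

print(f"peak density: {peak:.1f}")
print(f"total area:   {total_area:.4f}")
```

So a y-axis that goes above 1 on a KDE plot is not a bug; it just means the data are packed into a narrow range of X.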

Understanding this fully involves calculus, so if you don’t know that, just know that the area under the curve from one x to another x is the total probability that an X value falls in that range. When we talk about a 95% interval, we mean the range of X values for which that area is 0.95, that is, a 95% chance that a given new value will appear within that range.
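As a rough sketch of the "area equals 0.95" idea, the snippet below walks along the x-axis, accumulates area under a Gaussian-kernel KDE of the sample from this post, and records where the running total passes 2.5% and 97.5%. The bandwidth of 1.0, the integration range, and the variable names are assumptions made for illustration.

```python
import math

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

data = [1, 3, 5, 4, 3, 6, 4, 9, 7, 5, 4, 7, 6, 6, 2, 5, 4, 4, 6, 5, 8]
h = 1.0  # assumed bandwidth

# Accumulate area from left to right using the midpoint rule.
lo, hi, steps = -5.0, 15.0, 4000
dx = (hi - lo) / steps
cumulative = 0.0
lower = upper = None
for i in range(steps):
    x = lo + (i + 0.5) * dx
    cumulative += kde(x, data, h) * dx
    if lower is None and cumulative >= 0.025:
        lower = x  # 2.5% of the area lies to the left of this point
    if upper is None and cumulative >= 0.975:
        upper = x  # 97.5% of the area lies to the left of this point

print(f"about 95% of the estimated probability lies between {lower:.2f} and {upper:.2f}")
```

This cuts 2.5% off each tail, which is one common way to pick the range; other choices of interval with the same 0.95 area are possible.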

This is the foundation of a lot of statistics and getting this idea solid is crucial for understanding lots of data analysis and pretty much all modeling.


Hey, thanks for the explanation. I still don’t understand how this is a lot different to a histogram, where the Y values in a histogram represent the frequency that x falls within a given range (bin). Histograms can also be normalised. So what’s the real advantage of using a KDE plot of X vs PDF over a super smooth, normalised, histogram (with fine bins)?

I gave KDE plots vs histograms a quick google and was a little bit traumatised by the amount of hairy maths involved in various links that try to explain the difference. I guess the topic is just really complicated?

Below is my amateur understanding. If this is wrong, please point it out to me so that I learn!

To me it looks like the concept of resolution would be a good analogy.

Let’s say a certain edge is jagged and has serrations (teeth, in simpler terms). However, the serrations differ in size, some large and some small.
You might be tasked with counting the number of serrations.

Taking a picture of that edge at a low resolution, we would easily be able to make out the large serrations but fail to count the small ones, or even to realize that small ones exist. Taking a picture at high resolution, we would be able to see that there are actually small serrations that we had previously not noticed and add those to the count. We can then keep increasing the resolution to see if we can find yet smaller serrations, until we feel it is no longer necessary to look for even smaller ones.

We can think of the number of bins of the histogram as its resolution. But what exactly is the resolution of a KDE, and where are the bins?
A term related to differential is infinitesimal, meaning “infinitely small”. The KDE tries to estimate what the histogram would look like if its bins were of infinitely small width, that is, how the histogram would look at infinite resolution. (Strictly speaking, a KDE has its own smoothing setting, the bandwidth, which plays a role similar to bin width.)

This is why we use a KDE as opposed to just a histogram with many fine bins. It doesn’t just make the number of bins high; it shows you what the shape looks like as the number of bins approaches infinity.
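Here is a small sketch of why "just use lots of fine bins" breaks down on a small sample. It compares a normalized histogram at two bin counts with a Gaussian-kernel KDE of the sample from earlier in the thread. The helper names, the bin counts, and the bandwidth of 1.0 are all assumptions for illustration.

```python
import math

def histogram_density(data, bins, lo, hi):
    """Normalized histogram: bar heights scaled so the total bar area is 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # put x == hi in the last bin
        counts[i] += 1
    return [c / (len(data) * width) for c in counts]

def kde(x, data, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    norm = len(data) * h * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / norm

data = [1, 3, 5, 4, 3, 6, 4, 9, 7, 5, 4, 7, 6, 6, 2, 5, 4, 4, 6, 5, 8]

coarse = histogram_density(data, 4, 1, 9)    # 4 wide bins
fine = histogram_density(data, 100, 1, 9)    # 100 narrow bins
empty_fine_bins = sum(1 for d in fine if d == 0)

# KDE heights on the same range, with an assumed bandwidth of 1.0.
kde_heights = [kde(1 + 8 * i / 100, data, 1.0) for i in range(101)]

print("coarse histogram heights:", [round(d, 3) for d in coarse])
print("empty bins out of 100:   ", empty_fine_bins)
print("smallest KDE height:     ", round(min(kde_heights), 4))
```

With only 21 data points, the 100-bin histogram is mostly empty bins separated by tall spikes, while the KDE stays a smooth, positive curve. How smooth it is depends on the bandwidth, so the KDE trades the bin-count choice for a bandwidth choice.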

TLDR
It allows us to identify even minute bumps, peaks, and valleys in the outline of the univariate dataset’s distribution, as if the resolution were infinite.

Additional Recommended Research:
To see how “infinite bins” is really a thing that calculus does, try reading about Simpson’s Rule and the Trapezoidal Rule for estimating the area under a curve, which involve placing a finite number of bins under the curve in the given interval. Then integral calculus comes along and does what Simpson’s Rule and the Trapezoidal Rule do, but with an infinite number of bins.
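For anyone who wants to see the "more bins gets closer to the integral" effect in numbers, here is a minimal sketch using the Trapezoidal Rule on the standard normal curve between -1 and 1. The choice of curve, interval, and bin counts is an assumption for illustration; the exact area comes from `math.erf`.

```python
import math

def normal_pdf(x):
    """Standard normal probability density function."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def trapezoid_area(f, a, b, bins):
    """Trapezoidal Rule: area under f from a to b using `bins` equal strips."""
    width = (b - a) / bins
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * width) for i in range(1, bins))
    return total * width

# Exact area under the standard normal curve between -1 and 1.
exact = math.erf(1 / math.sqrt(2))

# Absolute error of the approximation for increasing numbers of strips.
errors = {
    bins: abs(trapezoid_area(normal_pdf, -1, 1, bins) - exact)
    for bins in (4, 16, 64, 256)
}

for bins, err in errors.items():
    print(f"{bins:>3} bins: error = {err:.2e}")
```

The error shrinks as the strips get narrower, which is the finite version of what integration does in the limit.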

A histogram is to Simpson’s Rule as a KDE is to integration.
