Do box plots provide the five number summary?

Question

In the context of this exercise, do box plots provide us the five number summary?

Answer

Yes, box plots display and provide all the values of a five number summary for a dataset, which are the

minimum
first quartile
median
third quartile
maximum

If we take a look at the example box plots in the exercise, we can see all the values from the five number summary of each dataset.
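
For anyone who wants to see the five values as numbers rather than read them off a plot, here is a minimal sketch. It assumes NumPy is available and uses a small made-up sample (not one of the exercise's CSV files). The quartiles come from `np.percentile`, which interpolates linearly between data points by default.

```python
import numpy as np

# Hypothetical sample data, not one of the exercise's datasets.
data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

# The five-number summary: minimum, Q1, median, Q3, maximum.
minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()

five_number_summary = (minimum, q1, median, q3, maximum)
print(five_number_summary)
```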

The lines that stretch out from each box plot at the top and bottom are known as “whiskers” and show the maximum and minimum, within a specified range, that are NOT outliers. So, these may not be the actual minimum and maximum of the data. If the minimum and maximum are outliers, you can see them as dots that are at the very top or bottom of the plot, extending beyond the whiskers.

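The ‘box’ part of the box plot, which is filled in with color, shows the first quartile, median, and third quartile as three horizontal lines: the bottom edge of the box is the first quartile, the top edge is the third quartile, and the line inside the box is the median (which is not necessarily at the center of the box).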


How can we define outliers? There are many dots. Why not include this question in the lesson?

  • Box plots or histograms give you a quick visual of what your data looks like

  • If you spot a lot of outliers you may need to transform your data, e.g. with a log transform

  • Or if it’s skewed you may need to look at how your data was obtained

All this leads up to a statistical analysis. For this lesson, going into detail about outliers would be too much, as the purpose is to create a visual.
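
As a rough illustration of the log-transform point, the sketch below counts outliers before and after taking logs. The data is synthetic (a log-normal sample, chosen here only because it is right-skewed), and "outlier" is taken to mean a point more than 1.5 times the interquartile range beyond the quartiles, which is the default whisker convention in matplotlib and seaborn. The helper name `count_outliers` is made up for this example.

```python
import numpy as np

def count_outliers(values):
    """Count points outside the 1.5 * IQR fences around the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(np.sum((values < lower) | (values > upper)))

# Synthetic right-skewed data; an assumption made for illustration only.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

raw_outliers = count_outliers(skewed)
log_outliers = count_outliers(np.log(skewed))
print(raw_outliers, log_outliers)
```

On skewed data like this, the log-transformed values should show far fewer points beyond the whiskers than the raw values do.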


I find the terminology around quartile ranges a bit confusing. If the filled area is the interquartile range (the middle 50% of the distribution), why don’t we refer to it as the 2nd and 3rd quartiles, or refer to the areas between the whiskers and the box as the 1st and 4th quartiles?
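
A note on the terminology question above: in the usual statistical definition, a quartile is a single cut point rather than a region, so there are three of them (Q1, Q2, Q3) and they split the data into four groups of roughly equal size. The sketch below, using made-up data (the integers 1 to 100) and assuming NumPy, shows the three cut points and how many values land in each of the four groups.

```python
import numpy as np

# Hypothetical data: the integers 1..100.
data = np.arange(1, 101)

# The three quartiles are single values (cut points), not ranges.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Count how many points fall into each of the four segments they create.
segments = [
    int(np.sum(data <= q1)),
    int(np.sum((data > q1) & (data <= q2))),
    int(np.sum((data > q2) & (data <= q3))),
    int(np.sum(data > q3)),
]
print(q1, q2, q3, segments)
```

The box spans the two middle groups (from Q1 to Q3), which is why it covers the middle 50% of the data.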


[Image: box-plot-300x196]

  • The bottom black horizontal line of the blue box plot is the minimum value (excluding outliers).
  • The first black horizontal line of the rectangle in the blue box plot is the first quartile (FQ), or 25th percentile.
  • The second black horizontal line of the rectangle in the blue box plot is the second quartile (SQ), the 50th percentile, or median.
  • The third black horizontal line of the rectangle in the blue box plot is the third quartile (TQ), or 75th percentile.
  • The top black horizontal line of the blue box plot is the maximum value (excluding outliers).
  • The small diamond shapes of the blue box plot are outliers: unusually extreme values, which may or may not be erroneous data.

Here’s a link you may find helpful (the annotations on the image are mine): Boxplot


Expanding more on quartiles, here’s a Wikipedia article I found very useful.

Also, I found displaying the boxplot like this offered a clearer visualization of the data (but that’s just a personal preference):

sns.boxplot(data=df, x='value', y='label')

[Image: Screenshot 2020-06-12 at 11.16.00]

Cheers! :beer:


The box plot summarises key statistics in a box-and-whiskers format.
The box represents the middle 50% of the data, starting at Quartile 1 and finishing at Quartile 3.
The whiskers represent the range of the data, from minimum to maximum, but they have a limited length: each can only be as long as 1.5 times the length of the box. Any data beyond that point are considered outliers.
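
The 1.5 rule described above can be sketched in a few lines. This assumes NumPy and a small made-up sample with one unusually large value. The names `lower_fence` and `upper_fence` are just labels chosen for this example. Plotting libraries may compute quartiles slightly differently, so treat this as an approximation of what a plot shows rather than an exact reproduction.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
data = np.array([2, 4, 4, 5, 7, 8, 9, 11, 12, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # the length of the box

# Fences: the furthest a whisker is allowed to reach.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker ends at the most extreme data point inside its fence.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences is drawn as an individual outlier point.
outliers = data[(data < lower_fence) | (data > upper_fence)]
print(whisker_low, whisker_high, outliers)
```

Here the upper whisker stops at the largest value inside the fence rather than at the true maximum, and the true maximum is reported as an outlier instead.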

What does the value on the y-axis actually represent?
How is it calculated? I see some different values in the CSV files.
Can someone please briefly explain?

This source might be helpful to understand the terms and concepts of whisker, quartiles, and the box.

Can anyone please explain the syntax of the label definition ("label": ["set_one"] * n … ) posted below? Specifically, why do we multiply each list by n, and why do we add them together? This excerpt is from the class exercise, boxplot part II.

import numpy as np
import pandas as pd

# Load each dataset from its CSV file as a NumPy array
set_one = np.genfromtxt("dataset1.csv", delimiter=",")
set_two = np.genfromtxt("dataset2.csv", delimiter=",")
set_three = np.genfromtxt("dataset3.csv", delimiter=",")
set_four = np.genfromtxt("dataset4.csv", delimiter=",")

n = 500
# One label per value, so each dataset's name is repeated n times
df = pd.DataFrame({
    "label": ["set_one"] * n + ["set_two"] * n + ["set_three"] * n + ["set_four"] * n,
    "value": np.concatenate([set_one, set_two, set_three, set_four])
})

There are two things going on: one is using the * operator to repeat a list, like [3] * 3 == [3, 3, 3], and the second is using the + operator to concatenate the resulting lists. Try printing some of them out if you like, both with concatenation and separately, to help it make sense.

It’s basically just the elements of the "label" column in the data frame. pd.DataFrame({"label": ["set one"] * 2 + ["set two"] * 2}) would come out like:

     label
0  set one
1  set one
2  set two
3  set two
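
To try the two operators separately, as suggested above, here is a small plain-Python sketch (no libraries needed). The variable names `repeated` and `labels` are made up for the example.

```python
# The * operator repeats the contents of a list.
repeated = ["set_one"] * 3
print(repeated)

# The + operator joins two lists end to end.
labels = ["set_one"] * 2 + ["set_two"] * 2
print(labels)

# The combined list has one entry per repetition.
print(len(labels))
```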

I see. Then can I interpret the n in ["set_one"] * n as the number of rows in set_one?
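
That reading seems right: the "label" column needs exactly one entry per value, and pandas will not build a DataFrame from columns of different lengths. So n = 500 presumably matches the number of values in each CSV file (an assumption, since the files are not shown here). A minimal sketch of the same idea that derives the count from each array instead of hard-coding it, using small made-up arrays in place of the CSV data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the arrays loaded from the CSV files.
set_one = np.array([1.0, 2.0, 3.0])
set_two = np.array([4.0, 5.0, 6.0])

# Derive each repeat count from the array itself instead of hard-coding n.
df = pd.DataFrame({
    "label": ["set_one"] * len(set_one) + ["set_two"] * len(set_two),
    "value": np.concatenate([set_one, set_two]),
})
print(df)
```

Using len() this way keeps the labels lined up with the values even if the datasets have different sizes.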