In the context of this exercise, do box plots give us the five-number summary?
Answer
Yes. Box plots display all five values of a dataset's five-number summary:
minimum
first quartile
median
third quartile
maximum
If we take a look at the example box plots in the exercise, we can see all the values from the five number summary of each dataset.
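To make the five values concrete, here is a minimal sketch of computing them with NumPy. The `data` array is made up for illustration and is not from the exercise, and `np.percentile` uses its default linear interpolation, which is only one of several common ways to define quartiles.

```python
import numpy as np

# Hypothetical sample data, not taken from the exercise
data = np.array([1, 3, 4, 7, 8, 10, 12, 15, 21])

# The five-number summary: min, Q1, median, Q3, max
minimum = data.min()
q1, median, q3 = np.percentile(data, [25, 50, 75])
maximum = data.max()

print(minimum, q1, median, q3, maximum)
```

These are the same five values a box plot of `data` would show, as long as none of the points are drawn as outliers.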
The lines that stretch out from each box plot at the top and bottom are known as “whiskers”. They show the largest and smallest values that are NOT outliers, that is, values that fall within a specified range. So, these may not be the actual minimum and maximum of the data. If the minimum and maximum are outliers, you can see them as dots at the very top or bottom of the plot, beyond the whiskers.
The ‘box’ part of the box plot, which is filled in with color, shows the first quartile, median, and third quartile as three horizontal lines drawn across the bottom, middle, and top of the box.
Box plots and histograms give you a quick visual of what your data looks like.
If you spot a lot of outliers, you may need to transform your data, e.g. with a log transform.
Or if it's skewed, you may need to look at how your data was obtained.
All this leads up to a statistical analysis. For this lesson, going into detail about outliers would be too much, as the purpose is to create a visual.
I find the terminology around quartile ranges a bit confusing. If the filled area is the interquartile range (middle 50% of the distribution), why don’t we refer to it as the 2nd and 3rd quartiles, or refer to the areas between the whiskers and the box as the 1st and 4th quartiles?
A box plot summarises key statistics in a box-and-whiskers format.
The box represents the middle 50% of the data, starting at Quartile 1 and finishing at Quartile 3.
The whiskers represent the range of the data, from minimum to maximum. But whiskers have a limited length: each one can only be as long as 1.5 times the length of the box. Any data points beyond that are considered outliers.
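The 1.5 rule can be sketched in a few lines. This assumes the common convention (Matplotlib's default, `whis=1.5`) where each whisker stops at the furthest data point that is still within 1.5 box lengths of the box. The `data` array and the variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample data with one obviously large value
data = np.array([1, 3, 4, 7, 8, 10, 12, 15, 40])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # length of the box (interquartile range)

# Furthest a whisker is allowed to reach on each side
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme points still inside the fences
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the fences would be drawn as a separate dot
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(whisker_low, whisker_high, outliers)
```

Here the top whisker ends at 15 rather than at the true maximum, and 40 is the only point flagged as an outlier.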
Can anyone please explain the syntax of the definition of labels (“label”: [“set_one”] * n … ) posted below? Specifically, why do we multiply each list by n, and why do we add them together? This excerpt is from the class exercise boxplot part II.
There are two things going on: one is using the * operator to repeat a list, like [3] * 3 == [3, 3, 3], and the second is using the + operator to concatenate the resulting lists. Try printing some of them out if you like, both with concatenation and separately, to help it make sense.
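Here is a small sketch of both operators using plain Python lists. The value of `n` and the second label name `"set_two"` are chosen just for illustration; the exercise may use different names and sizes.

```python
n = 3  # hypothetical number of values in each dataset

# * repeats a list n times
repeated_one = ["set_one"] * n
repeated_two = ["set_two"] * n

# + joins the two repeated lists end to end
labels = repeated_one + repeated_two

print(repeated_one)
print(labels)
```

The result is one label per data value, so every row in the data frame records which dataset it came from.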
It’s basically just the elements of the “label” column in the data frame, pd.DataFrame({"label": ["set one"] * 2 + ["set two"] * 2}) would come out like-