Are standard deviations relative to the data they are from?

Question

In the context of this exercise, are standard deviations relative to the data they are from?

Answer

Yes. A standard deviation depends on the values in the dataset, so it is expressed in the same units and on the same scale as the data.

If you have a set of large numbers and a set of small numbers, both with a similar distribution shape, the dataset with the larger numbers will most likely have a larger standard deviation because the values themselves are larger.

In this exercise, the values in pumpkin are roughly 10 times larger than the values in acorn_squash, with a somewhat similar spread. As a result, the standard deviation will most likely be bigger for pumpkin.
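To make the scale effect concrete, here is a minimal sketch. The arrays `small` and `large` are made-up values, not the exercise's actual pumpkin and acorn_squash data; `large` is simply `small` multiplied by 10, so both have exactly the same shape.

```python
import numpy as np

# Hypothetical weights for illustration only (NOT the exercise's arrays).
small = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
large = small * 10  # same distribution shape, values 10 times larger

std_small = np.std(small)
std_large = np.std(large)

# The standard deviation scales by the same factor as the data.
print("std of small:", std_small)
print("std of large:", std_large)
print("ratio:", std_large / std_small)
```

The ratio comes out at 10 (up to floating-point rounding), even though neither dataset is "more spread out" than the other in relative terms.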

If you wanted to make the comparison fairer, you might consider normalizing or scaling the datasets to a similar scale before comparing the standard deviations.
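One possible way to put two datasets on a comparable footing is the coefficient of variation (standard deviation divided by the mean). This is only a sketch of one scaling choice, not necessarily what the exercise intends; the helper name `coefficient_of_variation` and both arrays are hypothetical.

```python
import numpy as np

def coefficient_of_variation(values):
    """Return std / mean, a unitless measure of relative spread."""
    values = np.asarray(values, dtype=float)
    return np.std(values) / np.mean(values)

# Hypothetical data for illustration only (NOT the exercise's arrays).
small = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
large = np.array([25.0, 28.0, 40.0, 52.0, 55.0])

print("raw std:", np.std(small), np.std(large))
print("relative spread:", coefficient_of_variation(small),
      coefficient_of_variation(large))
```

With these made-up numbers the raw standard deviation is much bigger for `large`, yet the relative spread is slightly bigger for `small`, which is why comparing raw standard deviations across different scales can mislead.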


I thought the standard deviation was used in the context of distance from the mean in a normal distribution. What is the value of knowing the standard deviation in the context of this exercise?

I don’t understand this comment.
Isn’t the mean a better indication of the values in the distribution?
I have always used the standard deviation as a measure of how spread out my data are, not how large the numbers in the data are.
Can you expand on this comment to clarify that, please?


I suppose the author of the comment wants to highlight that comparing the spread of two datasets by comparing their standard deviations is unreliable, because the datasets may be on different number scales.
Standard deviation still remains a good statistical indicator of a dataset’s spread, as you’ve already written, but only when used in the context of that specific dataset.


Please have a look at the following link:
https://www.codecademy.com/paths/data-science/tracks/learn-statistics-with-python/modules/variance-and-standard-deviation/lessons/standard-deviation/exercises/using-standard-deviation

It is written:
" …By finding the number of standard deviations a data point is away from the mean, we can begin to investigate how unusual that datapoint truly is. In fact, you can usually expect around 68% of your data to fall within one standard deviation of the mean, 95% of your data to fall within two standard deviations of the mean, and 99.7% of your data to fall within three standard deviations of the mean… If you have a data point that is over three standard deviations away from the mean, that’s an incredibly unusual piece of data! "
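The percentages quoted above hold for data that is approximately normally distributed. As a rough check, the sketch below draws simulated normal data (the mean of 100, standard deviation of 15, and sample size are arbitrary assumptions) and counts how much of it falls within one, two, and three standard deviations. The helper `fraction_within` is a hypothetical name.

```python
import numpy as np

# Simulated, normally distributed data; all parameters are arbitrary.
rng = np.random.default_rng(seed=0)
data = rng.normal(loc=100.0, scale=15.0, size=100_000)

mean = data.mean()
std = data.std()

def fraction_within(data, mean, std, k):
    """Fraction of points no more than k standard deviations from the mean."""
    return np.mean(np.abs(data - mean) <= k * std)

for k in (1, 2, 3):
    print(f"within {k} std: {fraction_within(data, mean, std, k):.3f}")
```

The printed fractions should land close to 0.68, 0.95, and 0.997. For data that is strongly skewed or has heavy tails, the percentages can differ noticeably.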

In the context of the exercise, after calculating the standard deviations and the means and looking back at the pumpkin dataset, you would expect about 95% of the data to fall within two standard deviations of the mean, i.e. about 95% of the pumpkins to weigh between roughly 209 and 2653.

So, although you might have suspected from the start that the first data point, 68 (in the pumpkin dataset), was not “normal” (it is the minimum and particularly far from the mean), the standard deviation now helps you support that suspicion: 68 is well below 209, which suggests that a pumpkin weighing 68 belongs not just to the outer 5% of the data but to an even smaller percentage.
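The distance of 68 from the mean can be expressed as a number of standard deviations (a z-score). The sketch below does not use the exercise's arrays; it back-computes an approximate mean and standard deviation from the rounded bounds 209 and 2653 quoted above, so the result is only approximate.

```python
# Approximate bounds of mean ± 2 standard deviations, as quoted in this post.
lower, upper = 209.0, 2653.0

# Back-computed from the rounded bounds, so these are estimates.
mean = (lower + upper) / 2   # midpoint of the interval
std = (upper - lower) / 4    # the interval spans 4 standard deviations

z = (68 - mean) / std        # standard deviations between 68 and the mean
print(round(z, 2))           # -2.23
```

On these estimates, 68 sits a little more than two standard deviations below the mean: outside the band that should hold about 95% of the data, but not beyond three standard deviations.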

Maybe, after the above calculations, the judges should investigate whether the pumpkin is “fake”, that is, whether it was entered by the same team on purpose to manipulate the competition’s results (i.e. to increase the standard deviation and win, as the exercise’s instructions imply)!
Although I am not able to describe a “fake” pumpkin, this possibility is worth investigating :)

