Same thing happened here.
Also, the labels in the last table don’t change for Population (ax.set_xlabel("City Population")) and Age (ax.set_ylabel("User Age")).
Did anyone else think the story told by the histogram and scatter plots was conflicting?
In the scatter plot, the trend line showed a correlation between increasing age and increasing population size, but the histograms clearly showed that the mean age of people living in lower-population (rural) areas was higher than that of people in higher-population (urban) areas.
I was wondering where I might be able to download the csv files used in the exercises, so I could play with the data using Jupyter notebooks and try and replicate the results. Thanks!
I realize this thread is 2 years old and I may not get a response, but I agree. Isn’t the data initially implying that the older the user, the more rural the location? Shouldn’t the population of a city decrease as user age increases? The scatter plot’s trend line shows slightly older folks living in larger cities, whereas the histogram showed the opposite.
You have to take into consideration the purpose of the two different charts and the outliers within the data.
For the histogram and box charts, the rural area is defined as anything less than 100,000 people and urban is anything greater. Therefore, if you look at the plot from this exercise, you’d note that on the line representing 100k there are a few outlier areas where the age is rather high, and they drag the average for that area up to just over 30. This is shown in the violin plots and histograms from earlier exercises.
Now, if you factor in all of the urban areas, every place except that one line on the plot where the population is 100k, the average for all of those is 29ish.
But this plot is not showing one average across all of those individual places; it breaks them down into population-based segments and shows an average for each of those points.
Take into account just the points for the cities where the population is over 8m. It’s a rather high average age there, though it’s only a few points.
The regression line is an attempt to fit a straight line from left to right with the least amount of error between the line and the means for each of those points. Clearly, at each point there is a margin between what the average is and where the line runs, but given that the averages at 2m, 4m, and 8m are a fair bit higher than elsewhere, the line trends up as the city population increases.
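If it helps to see that in numbers, here is a rough sketch of an ordinary least-squares fit through a handful of per-population mean ages. The points are made up to mimic the shape described above (a highish mean at 100k, lower means in the mid-sized cities, higher means at 2m, 4m, and 8m); they are not the exercise data, and `fit_line` is just a hypothetical helper name. The course plot may fit its line differently.

```python
# Made-up (population in millions, mean age) points, for illustration only.
# These are NOT the course data; they only mimic the shape described above.
points = [
    (0.1, 31.0),  # the "rural" line, pulled up by a few older outliers
    (0.5, 28.0),
    (1.0, 28.5),
    (2.0, 32.0),
    (4.0, 33.0),
    (8.0, 34.0),  # only a few big cities, but they sit far to the right
]


def fit_line(pts):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(points)
print(f"slope: {slope:.3f} years of age per million residents")

# Gap between each mean and the line (the "margin" at each point).
for x, y in points:
    print(f"{x:>4}m  mean={y:.1f}  line={slope * x + intercept:.1f}")
```

With these numbers the slope comes out positive even though the 100k point sits above its mid-sized neighbors, because the few large-population points are far from the rest on the x-axis and pull the line up.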
Going back to the original point, if we had rural, suburban, and urban, where suburban was, say, 100k to 2m people, you’d find that the mean age in rural would be 31ish, the mean age in suburban might be 28ish, and the mean for urban, with the new definition of cities with more than 2m people, might be 33.
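Here is a small sketch of that regrouping idea. The user records are invented so that the group means land on the hypothetical 31 / 28 / 33 figures above; they are not the exercise data, and the function names (`two_bins`, `three_bins`, `mean_age_by`) are only for illustration.

```python
from statistics import mean

# Made-up (city_population, user_age) records, for illustration only.
# Chosen so the group means match the hypothetical figures in this post.
users = [
    (50_000, 29), (80_000, 31), (90_000, 33),            # under 100k
    (500_000, 27), (500_000, 28), (500_000, 29),         # mid-sized cities
    (1_000_000, 28), (1_000_000, 27), (1_000_000, 29),
    (4_000_000, 32), (8_000_000, 34),                    # a few very large cities
]


def two_bins(pop):
    """Rural/urban split at 100,000 people, as in the histogram exercise."""
    return "rural" if pop < 100_000 else "urban"


def three_bins(pop):
    """Hypothetical split: rural < 100k, suburban 100k to 2m, urban > 2m."""
    if pop < 100_000:
        return "rural"
    return "suburban" if pop <= 2_000_000 else "urban"


def mean_age_by(label, records):
    """Group ages by the label function and return the mean age per group."""
    groups = {}
    for pop, age in records:
        groups.setdefault(label(pop), []).append(age)
    return {name: mean(ages) for name, ages in groups.items()}


print(mean_age_by(two_bins, users))
print(mean_age_by(three_bins, users))
```

With this toy data the two-way split gives rural 31 against urban 29.25, so rural looks older, while the three-way split gives 31, 28, and 33. The many younger users in mid-sized cities swamp the few older users in the very large ones when everything over 100k is lumped together as "urban".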