In the context of this exercise, which covers the data science process, how might we prevent data from being disproportionate as we collect it?
A dataset is disproportionate when it does not represent the overall population well.
For example, a survey of national standardized test scores might have been conducted only in affluent areas. This would bias the results toward higher-income households, and household income is correlated with higher test scores. To prevent issues like this, there are a few things to keep in mind when collecting data.
In the standardized testing example, one way to prevent the issue would have been to collect data from many kinds of areas: affluent, middle-class, and less affluent. In general, data should be collected from a variety of sources so that no single group is over-represented and skews the results. More sources help, as long as together they reflect the makeup of the population.
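One common way to collect from several groups in proportion is stratified sampling. The following is only a minimal sketch of that idea, not a definitive method: the `stratified_sample` function, the `area` field, and the 20/50/30 population split are all hypothetical and chosen purely for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, total, seed=0):
    """Draw roughly `total` records so each group keeps its population share."""
    rng = random.Random(seed)

    # Group the records by the chosen variable (for example, area type).
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for members in groups.values():
        # Give each group a quota proportional to its size, at least one record.
        quota = max(1, round(total * len(members) / len(records)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical population: 20% affluent, 50% middle-class, 30% less affluent.
population = (
    [{"area": "affluent"} for _ in range(200)]
    + [{"area": "middle-class"} for _ in range(500)]
    + [{"area": "less affluent"} for _ in range(300)]
)

sample = stratified_sample(population, key="area", total=100)
sample_counts = Counter(record["area"] for record in sample)
print(sample_counts)
```

Because each quota is rounded separately, the sample size can differ slightly from `total` for other group sizes. The sketch also assumes you already know which group each record belongs to, which is often the hard part in practice.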
In addition, it is important to keep in mind the variables that could make the data disproportionate. One such variable in the example is household income, which is most likely higher in affluent areas. There could also be others, such as parents’ education level.
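If reference figures for a variable are available, one way to check for this is to compare each category's share in the collected data with its share in the population. This is a rough sketch under stated assumptions: the `representation_gaps` helper, the survey counts, and the population shares below are made up for illustration and are not real figures.

```python
from collections import Counter


def representation_gaps(sample_values, population_shares):
    """Return (sample share - population share) for each category."""
    if not sample_values:
        raise ValueError("sample_values must not be empty")
    counts = Counter(sample_values)
    n = len(sample_values)
    return {
        category: counts.get(category, 0) / n - share
        for category, share in population_shares.items()
    }


# Hypothetical survey that was concentrated in affluent areas.
survey_areas = ["affluent"] * 70 + ["middle-class"] * 25 + ["less affluent"] * 5

# Hypothetical known population shares for the same variable.
known_shares = {"affluent": 0.2, "middle-class": 0.5, "less affluent": 0.3}

gaps = representation_gaps(survey_areas, known_shares)
for category, gap in gaps.items():
    # A positive gap means the category is over-represented in the sample.
    print(f"{category}: {gap:+.2f}")
```

The same check can be repeated for any other variable you suspect, such as parents’ education level, provided a trustworthy population breakdown exists for it.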
When collecting data, it may not be possible to get a dataset that perfectly represents the population, but keeping these considerations in mind can help.