What are some common things to do when cleaning data?

Question

In the context of this exercise, what are some other things to keep in mind when cleaning data?

Answer

Cleaning is one of the most important steps in working with data, because it leaves the data organized and usable for evaluation and analysis.

When cleaning data, we usually want to normalize and categorize it. Normalizing reduces redundancy, typically by splitting a table or DataFrame into separate tables or DataFrames so that each fact is stored only once. Pandas also includes useful methods for the cleanup itself, such as drop_duplicates(), which removes repeated rows, and dropna(), which removes rows or columns that contain NaN or None values.
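As a rough illustration of those two pandas methods, here is a minimal sketch. The table and its column names are made up for the example and are not the lesson's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one exact duplicate row and one missing city.
users = pd.DataFrame({
    "name": ["Ana", "Ben", "Ben", "Cara"],
    "city": ["Brooklyn, NY", "Austin, TX", "Austin, TX", np.nan],
})

# drop_duplicates() keeps the first copy of each fully identical row.
deduped = users.drop_duplicates()

# dropna(subset=...) drops only the rows where the listed columns are missing.
cleaned = deduped.dropna(subset=["city"])

print(cleaned)
```

Both methods return a new DataFrame and leave the original untouched, so the result has to be assigned to a name to be kept.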

Furthermore, if it is reasonable to do so, we can remove missing or invalid data that might otherwise distort our evaluation.

44 Likes

Hi, I am sorry, but I can’t figure out what the issue is when adding the location column to the new data frame. All I get is a data frame with two columns, and the second one just says “False”. Any idea what the cause could be? Thank you.

Can you share a screenshot?

2 Likes

With no ‘Key’ during the merge, how are the populations being added to the correct row?

1 Like

If you do not pass a key (i.e., the “on” parameter), pandas’ merge does not fall back to the index. It joins on every column name the two tables have in common. Here the shared column is the city name, which is how the populations end up on the correct rows.
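To see what a merge does without a key, here is a small sketch. The two tables are hypothetical stand-ins with made-up population figures, and the rows of the second table are deliberately in a different order.

```python
import pandas as pd

# Hypothetical tables; "city" is the only column name they share.
users = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "city": ["Brooklyn, NY", "Austin, TX"],
})
pops = pd.DataFrame({
    "city": ["Austin, TX", "Brooklyn, NY"],   # note: different row order
    "population": [950000, 2600000],          # made-up figures
})

# No `on` argument: pd.merge joins on the shared column name, "city".
merged = pd.merge(users, pops)

# DataFrame.join, by contrast, aligns rows by index, so here it pairs
# the rows purely by position and gives Ana the wrong population.
by_index = users.join(pops[["population"]])

print(merged)
print(by_index)
```

Passing on="city" explicitly gives the same result as the first merge and makes the intent easier to read.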

4 Likes

The “key” for both data sets is the city name.

I also don’t understand how the new column gets populated when I paste the code.

Hi there,

In this example, .loc is being used to create a new column called “location”. It takes the form df.loc[:, 'New_Column'] = 'value'. The value here is the new classification of rural or urban.
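Here is a minimal sketch of that pattern. The city names, the figures, and the 100,000 cutoff are assumptions for illustration, not the lesson's actual values.

```python
import pandas as pd

# Hypothetical data; the cutoff below is an assumed threshold.
df = pd.DataFrame({
    "city": ["Smallville", "Metropolis"],
    "population": [4000, 1200000],
})

# Create the column by giving every row the same starting value ...
df.loc[:, "location"] = "rural"

# ... then overwrite only the rows selected by a boolean condition.
df.loc[df["population"] >= 100000, "location"] = "urban"

print(df)
```

The first argument to .loc picks the rows (":" means all of them) and the second picks the column, which is created if it does not exist yet.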

3 Likes

And here’s the link to the pandas documentation for .loc:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html?highlight=loc#pandas.DataFrame.loc

2 Likes

Merging is nifty! It is kind of annoying that the text needs to be EXACTLY the same, though. I tested this by removing a space from “Brooklyn, NY” in the data set so it read “Brooklyn,NY”, and the merge just skips Brooklyn entirely when you run it.

Do these data usually come nicely and neatly packaged this way ready for merging?
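One common way around the exact-match problem is to normalize the key column in both tables before merging. The sketch below assumes made-up tables and a hypothetical helper, normalize_city, and only handles stray whitespace around commas.

```python
import pandas as pd

# Hypothetical tables; the first key is missing the space after the comma.
users = pd.DataFrame({
    "name": ["Ana", "Ben"],
    "city": ["Brooklyn,NY", "Austin, TX"],
})
pops = pd.DataFrame({
    "city": ["Brooklyn, NY", "Austin, TX"],
    "population": [2600000, 950000],   # made-up figures
})

def normalize_city(cities: pd.Series) -> pd.Series:
    """Trim outer whitespace and force exactly one space after each comma."""
    return cities.str.strip().str.replace(r"\s*,\s*", ", ", regex=True)

# Without cleaning, the default inner merge silently drops Brooklyn.
raw = pd.merge(users, pops, on="city")

# After normalizing the key in both tables, every row finds its match.
users["city"] = normalize_city(users["city"])
pops["city"] = normalize_city(pops["city"])
fixed = pd.merge(users, pops, on="city")

print(len(raw), len(fixed))
```

Real data sets often need more than this, for example consistent capitalization or abbreviations, so the helper is a starting point rather than a complete fix.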

1 Like

This is unrelated, but how do I copy the user_data and pop_data CSV files into PyCharm to run the code on a separate platform? Do I create a Python file and paste the data?

My browser window, where you see these tables, is completely blank. I tried refreshing the webpage and rerunning the code, but it is still blank.

2 Likes

Why does the example use only the overall population of a city to classify it as rural or urban? Wouldn’t it be a better approach to take the population density into account as well to get a more accurate value (or is it for example purposes only :) )?

After some research I found a definition of urban and rural areas on here: https://ourworldindata.org/urbanization

  • Urban centre: must have a minimum of 50,000 inhabitants plus a population density of at least 1,500 people per square kilometre (km²) or a density of built-up area greater than 50 percent.
  • Urban cluster: must have a minimum of 5,000 inhabitants plus a population density of at least 300 people per square kilometre (km²).
  • Rural: fewer than 5,000 inhabitants.

3 Likes
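The thresholds quoted above could be applied to a table roughly like this. It is only a sketch: the data is invented, the built-up-area alternative is ignored, and the "unclassified" label for places that fit none of the quoted rules is an assumption of this example.

```python
import pandas as pd

def classify(population: float, density: float) -> str:
    """Classify a place using the population and density thresholds quoted above."""
    if population >= 50_000 and density >= 1_500:
        return "urban centre"
    if population >= 5_000 and density >= 300:
        return "urban cluster"
    if population < 5_000:
        return "rural"
    # The quoted definitions do not cover this case; the label is an assumption.
    return "unclassified"

# Hypothetical places; density is in people per square kilometre.
places = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "population": [60_000, 6_000, 3_000, 10_000],
    "density": [2_000, 400, 10, 100],
})

# Apply the rule row by row and store the result in a new column.
places["category"] = places.apply(
    lambda row: classify(row["population"], row["density"]), axis=1
)

print(places)
```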

Is it common for data analysts to carry out dimensional analysis to reduce the number of variables and simplify the data analysis process?

The goal of dimensional analysis would be to make sure the data ends up in the right column (has the right units).

1 Like

Aside

F = ma

We learned this simple “formula” in around Grade 8, give or take. It was never explored very deeply and only in scalar terms. We didn’t even get to learn the term, ‘rectilinear’. The hardest thing we had to do was resolve for the unknown, and I’m not sure we even went to much length with that, either. All very general.

Jump forward three or four grades and everything changes. Now both F and a are vectors and we must resolve the forces on both the x and y (and later z) axes. I would have really dug a force table in grade 8 science. But lo, had to wait years to actually play with one, and do the math–copious math assignments on this subject.

But I digress. We never really got to explore the units since they were always a given. Just do the math and you’ll have the right answer. That’s Junior High science.

We now know that the unit of force is the newton. Unit-wise, the constant k below is N*s**2*kg**-1*m**-1, and when we multiply it by mass (kg) and acceleration (m*s**-2) and analyze the dimensions, everything cancels but N. This is the primary goal of dimensional analysis: to arrive at the correct units.

The beauty is that we don’t need any values, so we can use 1 as a placeholder. Given what we have above, we can now revise the formula to better reflect the reality behind this concept, at least to a small degree.

F = kma

     N-s^2
k = -------
     kg-m
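Written out as a worked check, the units of k, mass, and acceleration combine as follows. This is just the cancellation spelled out in LaTeX, with square brackets meaning "units of".

```latex
[k]\,[m]\,[a]
  = \frac{\mathrm{N\,s^{2}}}{\mathrm{kg\,m}}
    \cdot \mathrm{kg}
    \cdot \frac{\mathrm{m}}{\mathrm{s^{2}}}
  = \mathrm{N}
```

Since the newton is defined as 1 N = 1 kg·m/s², k has the value 1 and no net dimension in SI units, which is why it never appears when F = ma is written in school.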

That was my thought as well being a geographer. But I assume it’s just a matter of expediency and simplicity since this is an introductory lesson in data cleaning and manipulation.

Yes, just create new CSV files in PyCharm (or new files with a .csv extension) and then paste the data into them. The pop_data.csv file is large, so you may have to change PyCharm’s file size settings to allow it. You could also download the file from the simplemaps website linked in the Get Data section of the lesson and then import it into PyCharm.
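Once the files exist locally, loading them might look like the sketch below. To keep the example self-contained it writes two tiny stand-in files to a temporary folder first; the column layout shown is an assumption, and in a real project the files would simply sit in the project directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for pasting the data into files: write two tiny CSVs to a
# temporary folder. The columns here are hypothetical.
project_dir = Path(tempfile.mkdtemp())
(project_dir / "user_data.csv").write_text('name,city\nAna,"Brooklyn, NY"\n')
(project_dir / "pop_data.csv").write_text('city,population\n"Brooklyn, NY",2600000\n')

# read_csv accepts a path and returns a DataFrame.
user_data = pd.read_csv(project_dir / "user_data.csv")
pop_data = pd.read_csv(project_dir / "pop_data.csv")

print(user_data.head())
print(pop_data.head())
```

Values that contain a comma, such as "Brooklyn, NY", need to be wrapped in double quotes inside the CSV file so they are not split into two columns.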

I’m only getting 3 rows of data after merging the population table in the exercise (table is below). Has the CSV file changed? Could there be any other reason? From previous posts, I can see that the print command used to return 15 rows.
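One way to investigate missing rows after a merge is to switch to a left merge with indicator=True, which labels the rows that found no partner. The sketch below uses made-up tables and is not a diagnosis of the exercise itself.

```python
import pandas as pd

# Hypothetical tables: two user rows have keys that do not match exactly
# (different capitalization, trailing space).
users = pd.DataFrame({
    "name": ["Ana", "Ben", "Cara"],
    "city": ["Brooklyn, NY", "austin, tx", "Boise, ID "],
})
pops = pd.DataFrame({
    "city": ["Brooklyn, NY", "Austin, TX", "Boise, ID"],
    "population": [2600000, 950000, 230000],   # made-up figures
})

# The default inner merge keeps only rows whose keys match exactly.
inner = pd.merge(users, pops, on="city")

# A left merge keeps every user row; the added "_merge" column says
# whether each row was found in both tables or only in the left one.
checked = pd.merge(users, pops, on="city", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(len(inner))
print(unmatched[["name", "city"]])
```

Looking at the unmatched keys usually shows quickly whether the cause is spelling, spacing, or a city that is simply absent from the second table.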

2 Likes

I think the code works, but I don’t see the new column indicating whether a row is Urban or Rural. The table has too many columns and we just can’t scroll far enough to the right. (Don’t tell me to get a second monitor, I already know I need one lol)
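If the table is printed as text, for instance when running the code locally, two things that may help are lifting pandas' column display limit and printing only the columns of interest. This is a sketch with an invented wide table; whether it helps inside the lesson's own table view is not something I can confirm.

```python
import pandas as pd

# Hypothetical wide table: 30 filler columns plus the two we care about.
df = pd.DataFrame({f"col_{i}": [i] for i in range(30)})
df["city"] = "Brooklyn, NY"
df["location"] = "urban"

# Tell pandas not to abbreviate the column list with "..." when printing.
pd.set_option("display.max_columns", None)

# Or sidestep the width problem by selecting only the relevant columns.
subset = df[["city", "location"]]

print(subset)
```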