In the context of this exercise, what are some other things to keep in mind when cleaning data?
Answer
Cleaning is an important step in the data process because it makes the data organized and usable for evaluation and analysis.
When cleaning data, we usually want to normalize and categorize it. Normalizing reduces redundancy, typically by splitting a table or dataframe into separate tables or dataframes so that repeated values are stored only once. Pandas also includes useful cleanup methods, such as drop_duplicates(), which removes repeated rows, and dropna(), which removes rows or columns that have NaN or None values.
Furthermore, we can also remove missing or invalid data that might otherwise distort our evaluation, if it is reasonable to do so.
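As a rough illustration of the two cleanup calls mentioned in this answer, here is a minimal sketch. The `users` table and its columns are made up for the example and are not the lesson's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not the lesson's dataset.
users = pd.DataFrame({
    "name": ["Ann", "Ann", "Bob", "Cy"],
    "city": ["Brooklyn, NY", "Brooklyn, NY", None, "Austin, TX"],
    "age": [31, 31, 45, np.nan],
})

# Remove rows that are exact repeats of an earlier row.
deduped = users.drop_duplicates()

# Keep only rows with no missing values in any column.
complete = deduped.dropna()

# Keep rows as long as "city" is present, even if other columns are missing.
has_city = deduped.dropna(subset=["city"])

print(len(deduped), len(complete), len(has_city))
```

Whether dropping incomplete rows is reasonable depends on the analysis; `subset` is one way to be less aggressive about it.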
Hi, I am sorry, but I can't figure out what the issue is when adding the location column to the new data frame. All I get is a data frame with two columns, and the second one just says "False". Any idea what the cause could be? Thank you.
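If you do not pass a key (i.e. the "on" parameter), pd.merge joins on whatever columns the two tables have in common; it only matches on the index if you ask it to (left_index/right_index, or DataFrame.join). These two tables presumably share a column (the city name), so that is how they are being merged seamlessly.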
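To make the default key behaviour concrete, here is a small sketch. The two tables are hypothetical stand-ins for the lesson's user_data and pop_data, and the column names and population figures are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's tables; values are invented.
user_data = pd.DataFrame({
    "user": ["a", "b", "c"],
    "city": ["Brooklyn, NY", "Austin, TX", "Nowhere, ZZ"],
})
pop_data = pd.DataFrame({
    "city": ["Brooklyn, NY", "Austin, TX"],
    "population": [2_500_000, 950_000],
})

# With no "on" argument, merge uses the columns both tables share ("city").
implicit = pd.merge(user_data, pop_data)

# Spelling the key out gives the same result and is easier to read later.
explicit = pd.merge(user_data, pop_data, on="city")

print(implicit.equals(explicit))
print(explicit)
```

The default is an inner join, so the row with no match in pop_data is dropped from both results.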
In this example, the .loc indexer is being used to create a new column called "location". It is in the form df.loc[:, 'New_Column'] = "value". The value here is the new classification of rural or urban.
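Here is a minimal sketch of that pattern with a row condition in place of the `:`. The table, the column names, and the 100,000 cutoff are assumptions for illustration, not necessarily what the lesson uses.

```python
import pandas as pd

# Hypothetical data; the 100,000 threshold is an assumption for this sketch.
df = pd.DataFrame({
    "city": ["Brooklyn, NY", "Smallville, KS"],
    "population": [2_500_000, 45_000],
})

# The first argument to .loc picks the rows, the second names the column.
# Assigning to a column that does not exist yet creates it.
df.loc[df["population"] < 100_000, "location"] = "rural"
df.loc[df["population"] >= 100_000, "location"] = "urban"

print(df)
```

By contrast, assigning the comparison itself (something like `df["location"] = df["population"] < 100_000`) stores True/False values rather than labels, which may be one way to end up with a column that only says "False".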
Merging is nifty! Kind of annoying that the texts need to be EXACTLY the same, though. I tested by removing a space from "Brooklyn, NY" in the data set so it was "Brooklyn,NY", and it just skips Brooklyn entirely when you run it.
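One possible workaround for the exact-match requirement is to normalize the key column on both sides before merging. The sketch below assumes keys of the form "City, ST"; `normalize_city` is a hypothetical helper, not part of pandas or the lesson.

```python
import pandas as pd

def normalize_city(cities: pd.Series) -> pd.Series:
    """Trim outer whitespace and force exactly one space after each comma."""
    return cities.str.strip().str.replace(r"\s*,\s*", ", ", regex=True)

# Hypothetical tables with inconsistent spacing in the key column.
user_data = pd.DataFrame({"user": ["a", "b"],
                          "city": ["Brooklyn,NY", " Austin ,  TX "]})
pop_data = pd.DataFrame({"city": ["Brooklyn, NY", "Austin, TX"],
                         "population": [2_500_000, 950_000]})

user_data["city"] = normalize_city(user_data["city"])
pop_data["city"] = normalize_city(pop_data["city"])

merged = pd.merge(user_data, pop_data, on="city")
print(merged)
```

This only handles spacing; differences in spelling or capitalization would need their own cleanup step.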
Do these data usually come nicely and neatly packaged this way ready for merging?
This is unrelated, but how do I copy the user_data and pop_data csv files into PyCharm to run the code on a separate platform? Do I create a Python file and paste the data?
Why does the example only take the overall population of a city to make the classification for rural / urban? Wouldn't it be a better approach to take the population density into account as well to get a more accurate classification (or is it for example purposes only)?
Urban centre*: must have a minimum of 50,000 inhabitants plus a population density of at least 1,500 people per square kilometre (km²) or a built-up area density greater than 50 percent.
Urban cluster: must have a minimum of 5,000 inhabitants plus a population density of at least 300 people per square kilometre (km²).
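For anyone who wants to try a density-aware classification, the two definitions above could be sketched roughly like this. The function name, the parameter names, and the choice to label everything else "rural" are my assumptions, not part of the quoted definitions or the lesson.

```python
def classify(population: int, density_per_km2: float,
             built_up_share: float = 0.0) -> str:
    """Classify a place using the thresholds quoted above.

    built_up_share is the built-up fraction of the area (0.0 to 1.0).
    Treating everything that fails both tests as "rural" is an assumption.
    """
    if population >= 50_000 and (density_per_km2 >= 1500 or built_up_share > 0.5):
        return "urban centre"
    if population >= 5_000 and density_per_km2 >= 300:
        return "urban cluster"
    return "rural"

print(classify(2_500_000, 14_000))   # large and dense
print(classify(8_000, 450))          # small town, moderately dense
print(classify(60_000, 120))         # populous but spread out
```

The last call shows the point of the question: a place can pass a population-only cutoff and still not count as urban once density is considered.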
We learned this simple "formula" in around Grade 8, give or take. It was never explored very deeply and only in scalar terms. We didn't even get to learn the term "rectilinear". The hardest thing we had to do was solve for the unknown, and I'm not sure we even went to much length with that, either. All very general.
Jump forward three or four grades and everything changes. Now both F and a are vectors and we must resolve the forces on both the x and y (and later z) axes. I would have really dug a force table in Grade 8 science. But lo, had to wait years to actually play with one, and do the math (copious math assignments on this subject).
But I digress. We never really got to explore the units since they were always a given. Just do the math and you'll have the right answer. That's Junior High science.
We now know that the unit of force is the newton, which unit-wise translates to a constant, N*s**2*kg**-1*m**-1; when we multiply that by the units of mass and acceleration and analyze the dimensions, everything cancels out but N. This is the primary goal of dimensional analysis... to arrive at the correct units.
Beauty is, we don't need any values, so we can use 1 as a placeholder. Given what we have above, we can now revise the formula to more closely resemble the equation behind this concept, at least to a small degree.
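A rough way to check that cancellation is to track only the unit exponents and ignore the values entirely. This is my own sketch of the idea, reading the revised formula as F = k * m * a with k carrying the units N*s**2*kg**-1*m**-1; the dictionary representation and the `multiply_units` helper are assumptions, not anything from the post.

```python
def multiply_units(*units: dict) -> dict:
    """Multiply unit expressions by adding exponents, dropping any that reach 0."""
    total = {}
    for unit in units:
        for symbol, exponent in unit.items():
            total[symbol] = total.get(symbol, 0) + exponent
    return {symbol: exp for symbol, exp in total.items() if exp != 0}

# Units only; the numeric value of each quantity is treated as 1.
k = {"N": 1, "s": 2, "kg": -1, "m": -1}   # N*s**2*kg**-1*m**-1
mass = {"kg": 1}                           # kg
acceleration = {"m": 1, "s": -2}           # m*s**-2

force_units = multiply_units(k, mass, acceleration)
print(force_units)  # {'N': 1}
```

The kg, m, and s exponents sum to zero, so only N is left.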
That was my thought as well, being a geographer. But I assume it's just a matter of expediency and simplicity since this is an introductory lesson in data cleaning and manipulation.
Yes, just create new CSV files in PyCharm, or a new file with a .csv extension, and then paste the data into that file. The pop_data.csv file is large, so you may have to change PyCharm file size settings to allow it. You could also download the file from the simplemaps website back in the Get Data section of the lesson and then import it into PyCharm.
I'm only getting 3 rows of data after merging the population table when doing the exercise (table is below). Has the csv file changed? Any other reasons? I'm seeing from previous posts that the print command returned the 15 rows.
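One way to investigate missing rows like this is a left merge with `indicator=True`, which keeps every row from the first table and marks the ones that found no match. The tables below are hypothetical stand-ins, and the merge key "city" is an assumption about the lesson's data.

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's tables.
user_data = pd.DataFrame({"user": ["a", "b"],
                          "city": ["Brooklyn,NY", "Austin, TX"]})
pop_data = pd.DataFrame({"city": ["Brooklyn, NY", "Austin, TX"],
                         "population": [2_500_000, 950_000]})

# how="left" keeps all rows of user_data; indicator=True adds a "_merge"
# column that says whether each row matched ("both") or not ("left_only").
check = pd.merge(user_data, pop_data, on="city", how="left", indicator=True)

unmatched = check.loc[check["_merge"] == "left_only", "city"]
print(unmatched.tolist())
```

If most of the cities show up as unmatched, the key values in the two files probably differ in spelling or spacing.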
I think the code works, but I don't see the new column indicating whether it's Urban or Rural. The table has too many columns and we just can't scroll far enough to the right. (Don't tell me to get a second monitor, I already know I need one lol.)
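If the column is there but hidden by the display, a couple of things might help: widening the pandas display options, or printing only the columns of interest. The wide table below is made up for the sketch, and the column names are assumptions.

```python
import pandas as pd

# Hypothetical wide table: 30 filler columns plus the one we care about.
wide = pd.DataFrame({f"col_{i}": [i] for i in range(30)})
wide["location"] = ["urban"]

# Show every column instead of collapsing the middle ones into "...".
pd.set_option("display.max_columns", None)
# Allow wider lines before pandas wraps the output.
pd.set_option("display.width", 200)

# Alternatively, print just the columns you want to inspect.
print(wide[["col_0", "location"]])

# Or confirm the column exists without printing the data at all.
print("location" in wide.columns)
```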