Cleaning US Census Data: data cleaning with pandas

I’m not sure whether this is a problem or just me being particularly narrow in my thinking. Throughout the exercises leading up to this project, and in the project itself, the instructions have said to view both the df.columns and the df.dtypes information (see picture).

My question is: why ask for both? If print(df.dtypes) returns every column name along with its data type, wouldn’t that accomplish the goals of both commands? I can see the argument that we don’t always need the data types, which is why .columns exists, but does it really need to?

@chrisnosky5924999070,

Excellent question.

Codecademy likely has you use both df.columns and df.dtypes so that you become familiar with both properties of a DataFrame.

As for why Pandas includes both, I’d guess that there are a number of reasons. First, columns and dtype (singular, since it forces one type onto every column) are both parameters that you can pass when you are creating a DataFrame object (see documentation here).
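Here is a minimal sketch of passing both at construction time. The data and column names are made up for illustration; note that the constructor argument is spelled dtype and applies a single type to all columns.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
rows = [[1, 2], [3, 4]]

# columns sets the column labels; dtype forces one type on every column.
df = pd.DataFrame(rows, columns=['a', 'b'], dtype='float64')

print(list(df.columns))  # the labels we passed in
print(df.dtypes)         # both columns report float64
```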

As for their various uses outside of DataFrame creation, it’s important to understand what each returns when you call it. df.dtypes returns a Pandas Series object containing the data types, with the column names as the index.
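Because df.dtypes is a Series, you can index and filter it like any other Series. A rough sketch, using a small made-up DataFrame (the column names and values are assumptions, not from the project):

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    'state': ['Ohio', 'Utah'],
    'population': [11_800_000, 3_400_000],
    'income': [58_000.5, 74_000.0],
})

dtypes = df.dtypes
print(type(dtypes).__name__)   # Series
print(list(dtypes.index))      # the column names act as the index
print(dtypes['income'])        # look up one column's type by name

# Filter the Series to find the names of all float64 columns.
float_cols = dtypes[dtypes == 'float64'].index.tolist()
print(float_cols)
```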

df.columns, on the other hand, returns a Pandas Index object holding the column labels (usually strings). It isn’t a plain Python list, but you can loop over it and index into it much as you would a list of strings. Let’s say you want to print out the first value of each column. You can do this with df.columns:

for column in df.columns:
  # .iloc[0] selects the first row by position, whatever the row labels are
  print(df[column].iloc[0])
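Beyond looping, the Index supports the usual sequence operations. A small sketch with invented column names:

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({'state': ['Ohio'], 'population': [11_800_000]})

cols = df.columns
print('state' in cols)   # membership test
print(cols[0])           # positional access
print(len(cols))         # number of columns
print(cols.tolist())     # convert to a real Python list when you need one
```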

You could also perform more complex calculations or manipulations using df.columns if you wanted to.

For example:

for col in df.columns:
  # print mean of numeric columns
  if df.dtypes[col] in ['int64', 'float64']:
    print('Column: {}'.format(col))
    print('Mean: {}'.format(df[col].mean()))

  # print number of True values in boolean columns
  elif df.dtypes[col] == 'bool':
    print('Column: {}'.format(col))
    print('Total True: {}'.format(df[col].sum()))

  # change the first value in object columns to "First value!"
  elif df.dtypes[col] == 'object':
    # use .loc in a single step; chained df[col][0] = ... may not update df
    df.loc[df.index[0], col] = 'First value!'

  # do nothing for columns with other data types
  else:
    continue

print(df.head())  # Print the DataFrame to see changes

Hopefully this helps you understand when df.columns might be useful. Happy coding!

That’s a good enough answer for me. As I said, I was primarily curious why the projects typically had us do both. Your answer reminds me that the limited lens I’ve been exposed to so far isn’t all that Python and its various libraries are used for. Thanks!