Project : Census Variables

datalocky · July 19, 2021, 10:57am

Hello !
Could someone share with me the code on how to solve this part of the project?

> * Create a new variable called marital_codes by Label Encoding the marital_status variable. This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.

tgrtim · July 19, 2021, 11:05am

As this is a learning environment, simply asking for code is frowned upon; the community guidelines are worth viewing:
https://discuss.codecademy.com/faq

You’d likely get a better response by following the guidance laid out in the FAQ on how best to set up questions (by and large, these forums follow more of a Q&A style, especially under “get-help”).

byte7345305615 · September 15, 2021, 11:10am

Dear fellow coders,

Please find here the solution for the census variables project

import codecademylib3

# Import pandas with alias
import pandas as pd

# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

#1
print(census.head())

# 3
print(census.dtypes)

# 4
print(census.birth_year.unique())

# 5. 
census['birth_year'] = census['birth_year'].replace('missing', 1967)
print(census['birth_year'].head())

#6
census['birth_year'] = census['birth_year'].astype('int')
#8
print(census['birth_year'].mean())

# 9

# converting type of columns to 'category'
census['higher_tax'] = census['higher_tax'].astype('category')

#  encoding
census['higher_tax'] = census['higher_tax'].cat.codes
print(census.higher_tax.unique())

# print out the median of the higher_tax variable
print(census['higher_tax'].median()) 

# 10
census = pd.get_dummies(census, columns = ['marital_status'] )

print(census.head())
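
For anyone comparing the two encodings mentioned in this thread, here is a minimal sketch of label encoding (one integer column, as the marital_codes task asks for) next to one-hot encoding (what pd.get_dummies produces). The small DataFrame is made up for illustration and is not the project's census_data.csv.

```python
import pandas as pd

# Hypothetical rows standing in for the census data.
df = pd.DataFrame(
    {"marital_status": ["single", "married", "divorced", "married", "widowed"]}
)

# Label encoding: a single integer column.
# astype("category") sorts the categories alphabetically before numbering them.
df["marital_codes"] = df["marital_status"].astype("category").cat.codes
print(df)

# One-hot encoding: one indicator column per category,
# and the original marital_status column is dropped.
dummies = pd.get_dummies(df, columns=["marital_status"])
print(dummies.columns.tolist())
```

Label encoding keeps the table narrow but implies an order between the codes, while one-hot encoding avoids that implied order at the cost of extra columns.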

yunies · September 27, 2021, 2:20am

Hi,

This is my code. I'd appreciate your feedback, thanks.

trantai494 · September 27, 2021, 2:22pm

Thanks for sharing.

jadaandersen · December 2, 2021, 7:36am

Hi guys, here’s my solution.

import codecademylib3
import numpy as np
# Import pandas with alias
import pandas as pd

# Read in the census dataframe
census = pd.read_csv('census_data.csv', index_col=0)

#print(census.head())
#print(census.dtypes)

#check why birth year was classed as an object
#print(census['birth_year'].unique())

#change the missing data to 1967
census['birth_year'] = census['birth_year'].replace(['missing'], 1967)
#change birth year data type to int
census['birth_year'] = census['birth_year'].astype('int')
#print(census.dtypes)

#average birth year
#print(census['birth_year'].mean())

#convert higher_tax to categorical
census['higher_tax'] = pd.Categorical(census['higher_tax'], ['strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'], ordered=True)
print(census['higher_tax'].unique())

census['higher_tax'] = census['higher_tax'].cat.codes
print(census.head())

# median for higher tax is neutral
#print(census['higher_tax'].median())


#census = pd.get_dummies(census, columns=['marital_status'])
#print(census.head())

#print(census['marital_status'].unique())
census['marital_status'] = pd.Categorical(census['marital_status'], ['single', 'married', 'divorced', 'widowed'], ordered=True)
census['marital_codes'] = census['marital_status'].cat.codes

#Create an age column and group them by 5 year intervals
census['age'] = 2021 - census['birth_year']
age_bins = np.arange(min(census['age'])- 4, 100, 5)
census['age_group'] = pd.cut(census['age'], bins=age_bins)

#Python recognises the age_group variable as categorical. Encode age groups.
census['age_group'] = census['age_group'].cat.codes

print(census.head())
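
A small sketch of the binning step above, using made-up ages rather than the project data, to show which 5-year interval and which code each age ends up with:

```python
import numpy as np
import pandas as pd

# Hypothetical ages, not taken from census_data.csv.
ages = pd.Series([23, 31, 47, 58, 64])

# Bin edges every 5 years, starting 4 below the minimum age
# so that the youngest respondent falls inside the first bin.
bins = np.arange(ages.min() - 4, 100, 5)

# pd.cut returns an ordered categorical of right-closed intervals, e.g. (19, 24].
groups = pd.cut(ages, bins=bins)
print(groups)

# Each code is the zero-based position of the interval among all bins.
print(groups.cat.codes.tolist())
```

Any age outside the bin edges would become NaN in the cut result and get the code -1, so it is worth checking the maximum age before encoding.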

yagsa · June 6, 2023, 11:53am

Hello there,
I have 2 questions about this learning project:

1. I couldn’t understand what this code does:

census['higher_tax']= census['higher_tax'].cat.codes
print(census['higher_tax'].median())
#output is 2.0

Or, put differently: can you please explain what cat.codes is? I checked the documentation but couldn’t grasp the concept.
2. I couldn’t understand the median output either.

Please help. Will appreciate.

lisalisaj · June 6, 2023, 12:37pm

You’re overwriting the higher_tax column, converting its categorical values to numbers. cat.codes returns a Series of integers (with the same index) that correspond to the categories in that column, higher_tax.

So, this:

'strongly disagree', 'disagree', 'neutral', 'agree', 'strongly agree'

df: index & response:
0  strongly disagree
1  disagree
2  neutral
3  agree
4  strongly agree

Becomes something like this (the codes are zero-based positions in the list of categories):

[0, 1, 2, 3, 4]

index, column in df:
0 0
1 1
2 2
3 3
4 4

Which you can then perform descriptive stats on. In this case using median to obtain the median value (response) of that column. So, you found that 2.0 is the median response to that question in the census.

“Codes are an array of integers which are the positions of the actual values in the categories array.” (from the docs below)

Docs:
https://pandas.pydata.org/docs/reference/api/pandas.Categorical.codes.html
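
Here is a minimal sketch of the idea, assuming the five response labels are declared in their logical order. The sample responses are invented for illustration; the real project reads them from census_data.csv.

```python
import pandas as pd

# Hypothetical sample responses.
responses = pd.Series(
    ["agree", "neutral", "strongly disagree", "disagree", "strongly agree"]
)

# Declare the categories in their logical order.
order = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
higher_tax = pd.Series(pd.Categorical(responses, categories=order, ordered=True))

# .cat.codes gives each row the zero-based position of its category in `order`.
codes = higher_tax.cat.codes
print(codes.tolist())

# Map the median code back to its label.
median_code = int(codes.median())
print(median_code, "->", higher_tax.cat.categories[median_code])
```

Looking the median code up in cat.categories is one way to turn a result like 2.0 back into a readable response.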

yagsa · June 6, 2023, 6:28pm

Ahh so, the median output is ‘Neutral’. Got it now.
Thank you @lisalisaj! This was helpful! : )

lisalisaj · June 6, 2023, 6:34pm

you’re welcome.
The median is the middle value of a sorted data sample: half of the values fall below it and half above. It’s a measure of central tendency. (And sometimes it’s more representative than the mean, especially if there are outliers in the data.)
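
A quick sketch of the outlier point, with made-up numbers (four similar values plus one extreme one):

```python
import pandas as pd

# Hypothetical values: four similar numbers and one extreme outlier.
values = pd.Series([30, 32, 35, 37, 400])

# The mean is pulled upward by the outlier.
print(values.mean())

# The median stays at the middle value of the sorted data.
print(values.median())
```

Here the mean lands well above four of the five values, while the median still sits in the middle of the typical ones.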