What does cat.codes do?


I’m doing the Data Scientist: Natural Language Processing Specialist career path, and I’ve gotten to the point where we’re doing variable types. On the variable types review page, it asks me to use cat.codes, but it never went over this in the lesson. Is anyone willing to explain exactly what cat.codes means and does?

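cat.codes maps each value in a categorical column to an integer code. It returns a Series of integers with the same index as the original column, where each code is the position of that value's category in the list of categories (missing values get -1). For an ordered (ordinal) categorical, the codes follow the category order, so you can compute order-based summary statistics such as the median on the column.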
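To make that concrete, here is a minimal sketch of what the codes look like. The sample values and the names health and health_cat are made up for illustration and are not from the lesson's dataset.

```python
import pandas as pd

# Hypothetical sample data, not from the lesson's dataset.
health = pd.Series(["Good", "Poor", None, "Fair", "Good"])

# Build an ordered categorical with categories listed from worst to best.
health_cat = pd.Series(
    pd.Categorical(health, categories=["Poor", "Fair", "Good"], ordered=True)
)

# Each code is the position of the value in the categories list;
# the missing value gets the code -1.
codes = health_cat.cat.codes
print(codes.tolist())  # [2, 0, -1, 1, 2]
```

Here 'Poor' becomes 0, 'Fair' becomes 1, 'Good' becomes 2, and the missing entry becomes -1.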
See the Pandas documentation:


And more here on categorical variables:


Example from the Summary Statistics lesson on NYC Tree census data:

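import numpy as np
import pandas as pd

# Categories listed from worst to best, so the codes are 0 (Poor), 1 (Fair), 2 (Good)
health_categories = ['Poor', 'Fair', 'Good']

# Convert the column to an ordered categorical (nyc_trees is the lesson's DataFrame)
nyc_trees['health'] = pd.Categorical(nyc_trees['health'], health_categories, ordered=True)

# Take the median of the integer codes
median_index = np.median(nyc_trees['health'].cat.codes)

# Map the median code back to its category label
median_health_status = health_categories[int(median_index)]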
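If you want to try this without the lesson's data, the following is a self-contained sketch of the same idea. The small DataFrame is a hypothetical stand-in for nyc_trees, and filtering out the -1 codes is an extra step I've added as an assumption about how you would want to treat missing values; it is not part of the lesson's code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's nyc_trees DataFrame.
nyc_trees = pd.DataFrame({"health": ["Good", "Fair", "Poor", "Good", None, "Good"]})

health_categories = ["Poor", "Fair", "Good"]
nyc_trees["health"] = pd.Categorical(
    nyc_trees["health"], health_categories, ordered=True
)

codes = nyc_trees["health"].cat.codes
# Assumption: ignore missing values, which cat.codes encodes as -1.
valid_codes = codes[codes >= 0]

# Median of the codes, then map it back to the category label.
median_index = np.median(valid_codes)
median_health_status = health_categories[int(median_index)]
print(median_health_status)  # Good
```

Note that np.median averages the two middle codes when there is an even number of values, so a result like 1.5 is truncated by int() to the lower category.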


Here is a link to the lesson:


