Is there a better way to classify values in a DataFrame?

Hello, I am in a bit of a pinch with this project.

My problem is with part b of the last task, which reads as follows:

  • Create a new variable called age_group, which groups respondents based on their birth year. The groups should be in five-year increments, e.g., 25-30, 31-35, etc. Then label encode the age_group variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.

I wrote the following code and thankfully it works. However, writing this lengthy if/elif function just to sort the values into age groups feels inefficient and time-consuming. Is there a more elegant way to write the parse_values function?

def parse_values(birth_year):
  x = 2021 - birth_year
  if x <= 15:
    return '10-15'
  elif x > 15 and x <= 20:
    return '15-20'
  elif x > 20 and x <= 25:
    return '20-25'
  elif x > 25 and x <= 30:
    return '25-30'
  elif x > 30 and x <= 35:
    return '30-35'
  elif x > 35 and x <= 40:
    return '35-40'
  elif x > 40 and x <= 45:
    return '40-45'
  elif x > 45 and x <= 50:
    return '45-50'
  elif x > 50 and x <= 55:
    return '50-55'
  elif x > 55 and x <= 60:
    return '55-60'
  elif x > 60 and x <= 65:
    return '60-65'
  elif x > 65 and x <= 70:
    return '65-70'
  elif x > 70 and x <= 75:
    return '70-75'
  elif x > 75 and x <= 80:
    return '75-80'
  elif x > 80 and x <= 85:
    return '80-85'
  
census['age_group'] = census.birth_year.apply(parse_values)
ages = census.age_group.unique()
census.age_group = pd.Categorical(census.age_group, ages, ordered=True)
census['age_group_codes'] = census.age_group.cat.codes

Thanks in advance to everyone who is willing to help.

@fancyboii
There is almost always another (if not better) way to accomplish something when it comes to coding.
However, trying to optimize your code too soon is usually a bad idea.

You also have to take into consideration the readability of your code and how long it takes you to formulate a given solution.

The way you solved this problem is totally valid and it’s easy to read (though your labels do make it seem like there is an overlap – '45-50', '50-55'). It also took you no more than a couple minutes to type that out, especially if you copy-pasted each elif statement.

There is value in that, and it shouldn’t be underestimated. Both the time you save by writing a quick solution and the time your teammates save when reading through clear code translate into money saved by your employer. Not to mention the time you save when you have to debug your code later.

So yes, you could have come up with a solution like this…

import math

# Age as of 2021, computed for the whole column at once
census['age'] = 2021 - census['birth_year']

def parse_age_group(age):
    base = 5  # width of each age group in years

    def nearest_multiple(number, base):
        # Round up to the next multiple of base, e.g. 27 -> 30
        return base * math.ceil(number / base)

    max_age = nearest_multiple(age, base)
    # e.g. age 27 -> '26 - 30'
    return f'{max_age - base + 1} - {max_age}'

census['age_group'] = census['age'].apply(parse_age_group)
print(census.head())

…but you could have written your original solution 10 times over before figuring out how to do it this “shorter” way. Also, it only takes a second to glance at your code and know what it does, whereas this code may take someone a few minutes to wrap their head around.
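For what it's worth, pandas also ships a built-in binning function, pd.cut, that can do the grouping and the label encoding in a few lines. Here is a minimal sketch of that approach. The sample birth years, the bin edges (10 to 85), and the label format are my own assumptions for illustration, not something taken from your project, so adjust them to your data:

```python
import pandas as pd

# Hypothetical sample data standing in for the real census DataFrame.
census = pd.DataFrame({'birth_year': [2005, 1996, 1991, 1986, 1950, 1940]})

# Age as of 2021, same reference year as in the question.
census['age'] = 2021 - census['birth_year']

# Assumed bin edges every five years: 10, 15, ..., 85.
# With the default right=True each bin includes its upper edge,
# so the bin (15, 20] covers ages 16 through 20.
edges = list(range(10, 90, 5))
labels = [f'{low + 1}-{high}' for low, high in zip(edges[:-1], edges[1:])]

# pd.cut returns an ordered categorical when labels are given.
census['age_group'] = pd.cut(census['age'], bins=edges, labels=labels)

# Integer codes follow the order of the bins, youngest group first.
census['age_group_codes'] = census['age_group'].cat.codes

print(census)
```

Two things to keep in mind with this sketch: ages that fall outside the edges end up as NaN with a code of -1, and because the categories come from the bins, the codes follow age order rather than the order in which the groups first appear in the data.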

Just something to think about :slightly_smiling_face:

Happy coding!


Thank you very much for your help!

Yeah, I wondered about that, and thank you for your great observations.
I corrected the labels, thank you for pointing that out haha. I did everything in haste because I wanted to post this here and couldn’t wrap my head around a better way to code this :sweat: .

I do have a question about the code you wrote: what does the f' in the return f'{max_age - 4} - {max_age}' do? Is that like a short form of the .format() method? If so, where can I find more of those shortcuts?

Again thank you for your help! :smiley:

It’s known as an f-string. It is similar to using .format(), but not entirely the same. Each has its strengths and weaknesses. In most cases, though, f-strings are the better and more readable option (assuming you’re using Python 3.6 or higher).
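To make the comparison concrete, here is a small sketch that builds the same label both ways. The variable names and values are made up for illustration:

```python
max_age = 30

# .format() fills the {} placeholders with its arguments, in order.
with_format = '{} - {}'.format(max_age - 4, max_age)

# An f-string evaluates the expressions inside {} right in the string.
with_fstring = f'{max_age - 4} - {max_age}'

print(with_format)   # 26 - 30
print(with_fstring)  # 26 - 30

# Both accept the same format specifiers after a colon.
share = 0.4567
formatted_share = f'{share:.1%}'
print(formatted_share)  # 45.7%
```

One practical difference: an f-string is evaluated where it is written, so .format() is still handy when you want to define a template once and fill it in later.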

Check out this Real Python article for a good introduction to using f-strings.
