Date-A-Scientist Capstone Project: scraping a string

Hi,

I’ve only just started with this project and I’m already stuck and I would really appreciate some help.

So in this project you get tons of info about male applicants on a dating site, including the zodiac sign. I’m trying to write some code that will predict the sign of a user based on other habits and characteristics described in the dataframe. There’s a column for sign (named ‘sign’), but here’s the thing: not only does it show the sign, it also shows how the user feels about sharing it, all in one string.

So instead of just “aries”, the value is “aries and it’s not important” or “aries and it’s really important”.

I figured I’d create a column that assigns a numerical value to each user according to their sign, regardless of the user’s feelings about it. That is, detect which zodiac sign is contained in the string in the ‘sign’ column, and fill the corresponding row with a number, as shown in the following dict:

sign_categories = {'aries':1, 'taurus':2, 'gemini':3, 'cancer':4, 'leo':5, 'virgo':6, 'libra':7, 'scorpio':8, 'sagittarius':9, 'capricorn':10, 'aquarius':11, 'pisces':12}

This way “aries”, “aries and it’s not important” and “aries and it’s really important” would all be a 1 in the new column.

First I turned keys and values into lists:

sign_keys = list(sign_categories)
sign_values = list(sign_categories.values())

And then I tried applying a lambda function inside a loop to get it done:

for i in range(12):
   df['sign_category'] = df.sign.apply(lambda x: sign_values[i] if sign_keys[i] in x)

But it won’t work. This gets a syntax error, and if I add an else statement it goes “argument of type ‘float’ is not iterable”. :pensive:

If I don’t add the if statement it runs just fine, but of course it doesn’t provide the desired result. So my guess is that the problem I’m having is how to properly extract the data I want from the string in the ‘sign’ column (but then again, I don’t really know).

So what I’d like to know is what it is that I’m doing wrong, and of course whether there’s a simpler, more convenient way of doing this, since I have the vague suspicion that this could be way easier and my lack of experience is preventing me from seeing it.

You’ll find the whole code below. Thank you in advance, and please excuse my English; I haven’t written anything in this beautiful language in a while.

Best regards,
Federico

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

df = pd.read_csv('profiles.csv')
df.head()

df.sign.dropna(inplace=True)
df.sign.value_counts()

sign_categories = {'aries':1, 'taurus':2, 'gemini':3, 'cancer':4, 'leo':5, 'virgo':6, 'libra':7, 'scorpio':8, 'sagittarius':9, 'capricorn':10, 'aquarius':11, 'pisces':12}

sign_keys = list(sign_categories)
sign_values = list(sign_categories.values())

for i in range(12):
   df['sign_category'] = df.sign.apply(lambda x: sign_values[i] if sign_keys[i] in x)
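
For reference, here is a minimal sketch of one way the mapping described above could be done without a loop. It is an illustration under some assumptions, not a definitive fix: the `toy` dataframe and its strings are made up to stand in for profiles.csv, and it assumes the sign is always the first word of the string. A few things it works around: a conditional expression inside a lambda needs an else branch (hence the syntax error); missing values in the column are NaN, which is a float, so `in` fails on them; calling dropna on `df.sign` most likely does not remove those rows from `df` itself; and the loop overwrites the whole column on every pass.

```python
import numpy as np
import pandas as pd

sign_categories = {'aries': 1, 'taurus': 2, 'gemini': 3, 'cancer': 4,
                   'leo': 5, 'virgo': 6, 'libra': 7, 'scorpio': 8,
                   'sagittarius': 9, 'capricorn': 10, 'aquarius': 11,
                   'pisces': 12}

# Hypothetical toy data standing in for the 'sign' column of profiles.csv
toy = pd.DataFrame({'sign': ["aries",
                             "aries and it's not important",
                             "leo and it's really important",
                             np.nan]})

# Assumption: the sign is the first word. Missing values stay NaN here.
first_word = toy['sign'].str.split().str[0]

# Look each word up in the dict; missing or unknown entries become NaN
toy['sign_category'] = first_word.map(sign_categories)

print(toy)
```

Because one row is missing, the new column comes out as floats (1.0, 1.0, 5.0, NaN). Dropping the missing rows first with `toy.dropna(subset=['sign'])` would be one way to avoid that.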