Odd behavior when applying BeautifulSoup parser to Pandas Series elements

I have a Pandas Series from the OKCupid Portfolio Project. The Series elements look like HTML. For example:

print(df.series[0]) 
>>> doesn’t have kids, but might want them

I made the below parser function:

def html_parser(raw_html):
    soup = BeautifulSoup(raw_html, 'html.parser')
    soup_string = soup.get_text()
    soup_string = re.sub('<.*>', ' ', soup_string)
    return soup_string

When I try to execute:

df.series.apply(html_parser)

I get this error from the BeautifulSoup class constructor:

~\anaconda3\lib\site-packages\bs4\__init__.py in __init__(self, markup, features, builder, parse_only, from_encoding, exclude_encodings, element_classes, **kwargs)
    308         if hasattr(markup, 'read'):        # It's a file-type object.
    309             markup = markup.read()
--> 310         elif len(markup) <= 256 and (
    311                 (isinstance(markup, bytes) and not b'<' in markup)
    312                 or (isinstance(markup, str) and not '<' in markup)

TypeError: object of type 'float' has no len()
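A minimal sketch that reproduces the error, assuming the Series contains missing values (pandas stores an empty cell as float NaN, even in a column that otherwise holds strings — the column names below are made up):

```python
import io
import math
import pandas as pd

# An empty CSV cell becomes NaN -- a float -- even in a column of strings.
df = pd.read_csv(io.StringIO("age,essay\n25,doesn't have kids\n30,\n"))

print(type(df['essay'][1]))        # <class 'float'>
print(math.isnan(df['essay'][1]))  # True
# len(df['essay'][1]) raises: TypeError: object of type 'float' has no len()
```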

When I change the parser function to the following:

def html_parser(raw_html):
    return len(soup_string)
new_series = df.series.apply(html_parser)
print(new_series[0])
>>> TypeError: object of type 'float' has no len()

def html_parser(raw_html):
    return type(soup_string)
new_series = df.series.apply(html_parser)
print(new_series[0])
>>> <class 'str'>

This confounds me: in the former case I get a complaint that the passed parameter is a float and therefore has no len(), while in the latter I get confirmation that the passed parameter is actually a string!
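One state of affairs that would reproduce the second result — this is an assumption about the notebook state, since a soup_string variable may have survived at the top level from an earlier run — is that the function reads a leftover global rather than its own parameter:

```python
# Hypothetical leftover from an earlier cell: soup_string exists at top level.
soup_string = "leftover value from a previous run"

def html_parser(raw_html):
    # No local assignment to soup_string, so Python falls back to the global.
    return type(soup_string)

print(html_parser(float('nan')))  # <class 'str'>, regardless of the argument
```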

I found a solution to the problem, but I would like to see if anyone can offer an explanation for the above behavior.

Here is the solution:

def html_parser(raw_html):
    raw_html = str(raw_html) #The addition of this line solves the problem
    soup = BeautifulSoup(raw_html, 'html.parser')
    soup_string = soup.get_text()
    soup_string = re.sub('<.*>', ' ', soup_string)
    return soup_string

You imported the requests library and bs4, right? I'm confused as to why you didn't just use the HTML parser there.
Something like:

soup = BeautifulSoup(data, 'html.parser')
articles = []  # or whatever you want to define here
print(soup.prettify())

Then you can define variables and grab your data using soup.select('.tag_name_here'), no?

Then also define a function to process the page and iterate over all the data.
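A sketch of that approach — the markup and the .profile class name are made up for illustration:

```python
from bs4 import BeautifulSoup

# Toy markup; the class name 'profile' is hypothetical.
html_doc = """
<div class="profile">doesn&rsquo;t have kids</div>
<div class="profile">loves dogs</div>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
# select() takes a CSS selector and returns the matching tags.
texts = [tag.get_text() for tag in soup.select('.profile')]
print(texts)
```

Note that html.parser decodes entities like &rsquo; into real Unicode characters along the way.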


Did you happen to do a pip install lxml to parse?

I found this article that mentions handling entities too:
https://lxml.de/elementsoup.html

The html_parser I defined does more than just basic parsing; I confess it should probably be named html_formatter:

def html_parser(raw_html):
    raw_html = str(raw_html) #The addition of this line solves the problem
    soup = BeautifulSoup(raw_html, 'html.parser')
    soup_string = soup.get_text()
    soup_string = re.sub('<.*>', ' ', soup_string)
    return soup_string

I do use the bs4 parser; then I extract the text from the soup object using get_text(), and strip any remaining HTML tags in preparation for Natural Language Processing.

I didn’t use soup.select because a lot of the text is not enclosed in html tags (<.*>). Instead, I’m accessing them using the Data Series Index.

I iterate over all the Series of object datatype using:

object_cols = [series for series in df.columns if df[series].dtype == 'O']
for series in object_cols:
    df[series] = df[series].apply(lambda row: html_parser(row) if pd.notnull(row) else row)
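As a side note, pandas can produce that same column list directly with select_dtypes — a sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'essay0': ['a<br/>b', None], 'age': [25, 30]})

# Equivalent to checking df[col].dtype == 'O' by hand.
object_cols = df.select_dtypes(include='object').columns
print(list(object_cols))  # ['essay0']
```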

When I’m done working on the OKCupid Portfolio Project I’ll take a look at what others have uploaded and compare.

What does the data look like, exactly?

I just figured you could use the soup.find('tag name').text attribute, and then your data would be one long string that you could break up from there into a df with cols. Something along the lines of what is going on in this article.

Before I show the data I would like to clarify the legality. The project I'm working on in relation to this thread is OkCupid.
Are all data given for Codecademy "Portfolio Projects" okay to share publicly? I know that the Yelp project's data is not okay to share, per its terms and conditions. None of the other projects require agreement to a T&C - can it be automatically assumed that their data can be shared?

What I can say now is that the text is not enclosed in HTML tags, which is why I am not using something like find() or select(), but it contains HTML artifacts like "&rsquo;", which represents ’ (a right single quotation mark), and which the BeautifulSoup parser has successfully processed/parsed.
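As a quick sanity check on the entity decoding, the standard library's html.unescape handles the same entities — the sample string below is made up to match the data shown earlier:

```python
import html

raw = "doesn&rsquo;t have kids, but might want them"
# html.unescape converts named entities like &rsquo; to Unicode characters.
print(html.unescape(raw))  # doesn’t have kids, but might want them
```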

I haven't yet tried to read the article you linked, since it asks me to make a free account or sign up with Google/Facebook, and I still have to decide if I'm okay with that.

I have no idea as I’ve not yet completed those projects in the DS path (I do know how to build a web scraper though b/c of prior knowledge). So, perhaps don’t post it.

Ah, right, that's a Medium article and one gets 5 free articles a month (and yes, you have to log in with a Gmail account for some [not all] articles). Sorry about that.
I did post another article above about how to handle entities.

Glad your code has worked out. Looking forward to seeing the final notebook!

I'm tired of seeing:

[screenshot of the Medium sign-up wall]

for the umpteenth time! I'll cave and sign up.

Medium?
Yep, I feel your pain. Truth be told, I have 3 different logins there. :shushing_face: Then I finally broke down and subscribed to access the DS content I wanted to read. It all used to be free a few years ago.