Web Scraping (Chocolate Scraping with Beautiful Soup Q12)

I am creating a DataFrame but am getting a ValueError:

```python
import codecademylib3_seaborn
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

webpage = requests.get("https://content.codecademy.com/courses/beautifulsoup/cacao/index.html")
soup = BeautifulSoup(webpage.content, "html.parser")

soup.find_all(attrs={"class", "Rating"})

ratings = []
for rating in soup.select(".Rating")[1]:
    ratings.append(float(rating))

plt.hist(ratings)
plt.show()

comp_tags = soup.select(".Company")
companies = []
for name in comp_tags[1:]:
    companies.append(name.get_text())

d = {"Company": companies, "Rating": ratings}
df = pd.DataFrame.from_dict(d)
```

Not quite sure what would throw the error here…

The traceback mentions which line causes the error. That doesn't mean the issue starts there, since problems often propagate, but it's a good place to start looking. It also mentions that certain arrays are not the same length; it may be worth having a close look at the contents of d.
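For example (a minimal sketch with made-up list lengths, not your actual scraped data), you could print how many values sit behind each key before building the DataFrame:

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists; the real ones come from soup.select(...)
companies = ["A. Morin", "Bonnat", "Callebaut"]  # 3 names
ratings = [3.75, 4.0]                            # only 2 ratings

d = {"Company": companies, "Rating": ratings}

# How long is the value list behind each key?
print({key: len(values) for key, values in d.items()})
# {'Company': 3, 'Rating': 2}  -> pandas can't line these up into equal-length columns
```

If those lengths differ, that mismatch is what the DataFrame construction will complain about.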

Thanks, that’s a really helpful tip. I think my error goes back to line 22 where I’m having trouble converting ‘rating’ to a float().

When it refers to arrays needing to be the same length, is the array the column of data? I'm not sure what is meant by an array here.


Just in case you're not familiar with them, a lot of pandas features are based on or related to numpy, which uses arrays as a data type (numpy.ndarray — NumPy v1.21 Manual). These aren't restricted to a single type or a single dimension, but in most simpler cases they are sequences of a single type (e.g. 500 ints), which allows for a lot of memory and speed improvements for mathematical operations (pandas will try to use something like this for a Series if it can).
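A tiny example of what an ndarray looks like in practice (just illustrative numbers):

```python
import numpy as np

# A 1D array: a fixed-length sequence with a single dtype
arr = np.array([3.75, 2.5, 3.0])
print(arr.dtype)   # float64
print(arr.shape)   # (3,)

# Mixing types forces numpy to fall back to a common dtype (here, strings)
mixed = np.array([3.75, "A. Morin"])
print(mixed.dtype)  # e.g. <U32
```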

Back to pandas again: a typical DataFrame acts like 2D tabular data (database or spreadsheet style); see pandas.DataFrame — pandas 1.3.2 documentation. The docs even mention that you can often treat a DataFrame as a dictionary of Series objects (pandas.Series — pandas 1.3.2 documentation). Each Series then contains some kind of array for its rows.
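Roughly, that dictionary-of-Series view looks like this (a small made-up frame, not the cacao data):

```python
import pandas as pd

df = pd.DataFrame({"Company": ["A. Morin", "Bonnat"], "Rating": [3.75, 4.0]})

# Column access behaves much like a dictionary lookup and returns a Series
col = df["Rating"]
print(type(col))        # <class 'pandas.core.series.Series'>
print(col.to_numpy())   # [3.75 4.  ] -> the underlying array of row values
```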

As a DataFrame is 2D tabular data, the number of rows in each column should match (even if that means filling in NaNs or similar).
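One way to see that padding behaviour (an illustrative sketch, not necessarily the fix you want here): if the columns are passed as Series rather than plain lists, pandas aligns them on the index and fills the gap with NaN instead of raising an error:

```python
import pandas as pd

# Series of different lengths are aligned by index; the missing row becomes NaN
d = {"Company": pd.Series(["A. Morin", "Bonnat", "Callebaut"]),
     "Rating": pd.Series([3.75, 4.0])}

df = pd.DataFrame(d)
print(df)
# (roughly)
#      Company  Rating
# 0   A. Morin    3.75
# 1     Bonnat    4.00
# 2  Callebaut     NaN
```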

So, back to the actual error :grin: if you're passing a dictionary to pandas, it will typically treat the keys as column names and the values as the rows for that column. So it's very useful to know what's in the dictionary d, as its keys and values are expected to become a tabular data structure.
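With plain lists, that same length mismatch is exactly what raises the ValueError (again just a sketch with dummy values):

```python
import pandas as pd

d = {"Company": ["A. Morin", "Bonnat", "Callebaut"], "Rating": [3.75, 4.0]}

try:
    df = pd.DataFrame.from_dict(d)  # keys -> column names, list items -> rows
except ValueError as err:
    print(err)  # e.g. "All arrays must be of the same length"
```

So comparing the lengths of companies and ratings before building the frame should show where the two lists diverge.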

Could it be possible to build web scraping from jupyter-dash, mate?

I haven’t heard of jupyter dash-mate. Have you used it before?