Honey production exercise

In the Data Science path, MACHINE LEARNING: SUPERVISED LEARNING :robot:
project: Honey Production


Why does the lesson use mean instead of sum when grouping the data by year?
The solution is:
prod_per_year = df.groupby('year').totalprod.mean().reset_index()
Shouldn't it be:
prod_per_year = df.groupby('year').totalprod.sum().reset_index()

Hi eyalbre,

You should ask yourself what information you want to extract from the model. Are you interested in the sum of all the honey produced in 2050? Would the sum represent the trend of production per beekeeper?



My question is about what the solution claims to have found. It uses the mean, yet at the end it says it has found the total production, which sounds wrong from my point of view.

Yeah, that’s actually a bit confusing. I think this comes from the fact that the corresponding column in the data set, which refers to the production of one producer, is called ‘totalprod’, i.e. total production.

So I guess by ‘total production’ they always mean the production of one producer in the corresponding year.

But I agree, it’s a bit confusing.
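To make the difference concrete, here is a minimal sketch comparing the two aggregations. The numbers are made up for illustration and are not taken from the lesson's data set; only the column names 'year' and 'totalprod' come from the thread.

```python
import pandas as pd

# Hypothetical rows (one row per producer per year), NOT the real honey data.
# Note that 1999 has fewer rows than 1998.
df = pd.DataFrame({
    "year": [1998, 1998, 1998, 1999, 1999],
    "totalprod": [100.0, 200.0, 300.0, 150.0, 250.0],
})

# Average production per row (per producer) in each year
mean_per_year = df.groupby("year").totalprod.mean().reset_index()

# Combined production of all rows in each year
sum_per_year = df.groupby("year").totalprod.sum().reset_index()

print(mean_per_year)  # totalprod: 200.0 for 1998, 200.0 for 1999
print(sum_per_year)   # totalprod: 600.0 for 1998, 400.0 for 1999
```

In this toy example the sum drops from 1998 to 1999 only because one row is missing, while the mean stays flat. Which of the two is the right target depends on the question you want the model to answer.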

Why does the predict() method in linear regression expect a 2D array? I understand how to apply it, but I'm not sure of the explanation behind it.

I understand that these two lines of code produce different outputs, but why is the second one necessary?
X_future = np.array(range(2013, 2050))
X_future = X_future.reshape(-1,1)

Good question. I commented out the .reshape() to see what would happen.

It produces an error from .predict(): “ValueError: Expected 2D array, got 1D array instead”

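.predict() won’t accept a simple 1D list of years. Assuming this is scikit-learn’s LinearRegression (the error message matches), it expects input of shape (n_samples, n_features): one row per sample and one column per feature. A 1D array is ambiguous, since it could be many samples with one feature or one sample with many features. reshape(-1,1) removes the ambiguity by putting the years into a single column, which is a 2D shape with one feature.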
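Here is a small sketch of the shapes involved, assuming scikit-learn's LinearRegression. The training data is invented (it lies exactly on y = 2x + 1) and only stands in for the lesson's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data on the line y = 2*x + 1 (not the honey data)
X = np.array([2010, 2011, 2012]).reshape(-1, 1)  # 3 samples, 1 feature
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)

X_future = np.array(range(2013, 2050))
print(X_future.shape)       # (37,)   1D: just a list of years

X_future_2d = X_future.reshape(-1, 1)
print(X_future_2d.shape)    # (37, 1) 2D: 37 samples, 1 feature each

y_future = model.predict(X_future_2d)
print(y_future.shape)       # (37,)   one prediction per sample

# Passing the 1D array raises the error quoted above
rejected_1d = False
try:
    model.predict(X_future)
except ValueError:
    rejected_1d = True
print(rejected_1d)          # True
```

The -1 in reshape(-1, 1) tells NumPy to work out the number of rows itself, so the same line works for any number of years.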