Hello everyone! I am working through the Regression models module, and I am trying to fit a straight line to this cloud of green dots:
However, as you can see, the blue line does not fit the cloud well.
Both axes are on a log scale, which is why I fitted the line on the logs of the values rather than on the raw values, as shown in the code.
I also tried without the log values, which is why there are 2 pictures showing the result with and without the log values, but the fit appeared as multiple lines instead of a single one.
Would you recommend any other way to fit this cloud of dots, given the logarithmic scale? And why does the fit appear as several lines instead of only one?
# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
filtered_data = pd.read_csv("data.csv")
# Log-transform both columns; non-positive values become NaN because log is undefined for them
log_data = filtered_data.copy()
log_data['Data'] = np.log(filtered_data['Data'].where(filtered_data['Data'] > 0))
log_data['Revenue'] = np.log(filtered_data['Revenue'].where(filtered_data['Revenue'] > 0))
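# Fit OLS on the log-transformed values (the fit was not shown in the post; this is an assumed reconstruction)
# Drop NaN rows and sort by Revenue so the fitted line is drawn as one continuous line
fit_data = log_data.dropna(subset=['Revenue', 'Data']).sort_values('Revenue')
X = sm.add_constant(fit_data['Revenue'])
results = sm.OLS(fit_data['Data'], X).fit()
# Predictions are in log space; exponentiate to return to the original units
predicted_emissions = np.exp(results.predict(X))
# Plot the actual data points
plt.scatter(filtered_data['Revenue'], filtered_data['Data'], c="green", s=1, label='Actual gross emissions', alpha=0.5)
# Plot the fitted line against the original (unlogged) Revenue values
plt.plot(np.exp(fit_data['Revenue']), predicted_emissions, c="blue", label='Log-log OLS fit')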
# Set logarithmic scale for both axes
plt.xscale("log")
plt.yscale("log")
# Add labels and title
plt.xlabel("Revenue (Million Euros)", fontsize=12, fontweight="bold")
plt.ylabel("Gross emissions (tCO2e)", fontsize=12, fontweight="bold")
plt.title("Gross emissions vs Revenue", fontsize=14, fontweight="bold")
plt.legend(loc='upper left')
plt.show()
![scatter_line|578x458](upload://6cG10hY6CLt1hcawXi8dQYF6R9u.png)
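
One possible way to approach both questions, sketched under assumptions rather than taken from the module: fit the line in log space, sort the x values before plotting (`plt.plot` connects points in the order they are given, so unsorted x values produce a zigzag that can look like several lines), and exponentiate the predictions so they are in the same units as the log-scaled axes. The sketch uses `np.polyfit` instead of statsmodels only to stay self-contained; the helper name `fit_loglog_line` and the synthetic data are made up for illustration.

```python
import numpy as np

def fit_loglog_line(x, y):
    """Fit log(y) = intercept + slope * log(x); return the line on the original scale."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only strictly positive pairs, since log is undefined otherwise.
    mask = (x > 0) & (y > 0)
    x, y = x[mask], y[mask]
    # Least-squares fit in log space; for deg=1, polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
    # Sort x so that a line plot is drawn as one continuous line.
    x_line = np.sort(x)
    # Back-transform the predictions from log space to the original units.
    y_line = np.exp(intercept + slope * np.log(x_line))
    return slope, intercept, x_line, y_line

# Synthetic data (invented for this sketch): y = 3 * x**0.8 with multiplicative noise.
rng = np.random.default_rng(0)
revenue = rng.uniform(1, 1e4, size=200)
emissions = 3.0 * revenue**0.8 * np.exp(rng.normal(0, 0.1, size=200))

slope, intercept, x_line, y_line = fit_loglog_line(revenue, emissions)
```

Plotting `x_line` against `y_line` with `plt.plot` on log-log axes should then give a single straight line. Under this model the slope is the exponent of a power-law relationship between the two variables, which may or may not be a good description of the real data.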