Hi,

Undisclosed income, coded as “-1”, accounts for about 73% of the responses, and this has made it extremely difficult to examine the relationship between income, ethnicity, and orientation.

I have learned as much as possible, or so I think, about cleaning data.

For my charts and percentile analyses, I can extract the valid data and produce results, though it means I must work with a small sample size. I even created an interactive chart that allows the user to look at income averages based on cross-sections of orientation and ethnicity or multi-ethnic combinations.

I am working hard here, guys, but I am struggling to produce meaningful results with linear regression. I had previously determined that there was no linear relationship, but possibly a polynomial one. Still, I keep coming back to these special -1s used to represent undisclosed data, which I don’t find in other columns.

Does the -1 mean anything special, or is the income data simply that incomplete, with -1 used instead of NaN for no particular reason that should dictate how I handle it? This may be one of those “dumb” questions I need to ask to clear my head.

But why use -1 instead of NaN?

I have learned to weed out both NaN and non-finite numbers, along with other advanced cleaning techniques, but those -1 values keep popping back up even when I think I have effectively filtered them out.

My filtering attempts have included everything I could find online: straightforward Boolean masks and the pandas methods that identify null, NaN, and non-finite values. I can’t find much relevant information online, and this issue was not covered in the curriculum as I recall. Any advice or insights would be greatly appreciated.
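For what it’s worth, part of the answer may be that -1 is an ordinary integer rather than a missing-value marker, so NaN- and finite-oriented filters never touch it. A minimal sketch (on a made-up toy column, not the real dataset) of converting the sentinel to NaN explicitly and then dropping it:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset: -1 marks undisclosed income.
df = pd.DataFrame({"income": [20000, -1, 80000, -1, 50000]})

# -1 is an ordinary integer, so NaN-based filters ignore it entirely.
print(df["income"].isna().sum())  # 0 -- nothing looks "missing" yet

# Turn the sentinel into NaN, then drop it like any other missing value.
df["income"] = df["income"].replace(-1, np.nan)
cleaned = df.dropna(subset=["income"])
print(len(cleaned))  # 3 rows with disclosed income remain
```

An equivalent one-liner that avoids NaN entirely is a plain Boolean mask, `df[df["income"] != -1]`.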

Thank you,

Robert Pfaff

There really is no relevant link to leave, so I will post the code:

Here I have commented out the option to treat inf values as NA, because I kept getting an error message that the linear regression module in sklearn could not run my model because of NaN values. That confused me for quite some time until I realized the problem was actually “infinite” values, which I had never really heard of.

# Commented out the inf-as-NA option, leaving the -1s in the dataset:
# with pd.option_context('mode.use_inf_as_na', True):
feature_data.dropna(inplace=True)
print(len(feature_data))
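As an aside, infinite values usually sneak in through arithmetic such as dividing by zero, and they can be screened out explicitly with `np.isfinite` instead of relying on the `mode.use_inf_as_na` option. A sketch on a toy column:

```python
import numpy as np
import pandas as pd

# Toy column where a divide-by-zero quietly produces inf.
s = pd.Series([1.0, 2.0, 0.0])
rates = 1.0 / s                     # last entry becomes inf
print(np.isfinite(rates).tolist())  # [True, True, False]

# Keep only the finite rows; no pandas option needed.
finite_only = rates[np.isfinite(rates)]
print(len(finite_only))  # 2
```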

Below, I have posted the rest of my linear regression code. I hope I have NOT overwhelmed you and that my question makes an iota of sense. My mind is a bit fried at the moment. Thanks again.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# y is the target column, X has the features
X = feature_data[['orientation']]
y = feature_data['income']

# Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Use transform (not fit_transform) on the test set, so it is scaled
# with the statistics learned from the training data.
X_test_scaled = scaler.transform(X_test)

print("Scaled Datasets X and Y")
print(len(X_train_scaled))
print(len(X_test_scaled))
print()
print(len(y_train))
print(len(y_test))

Output:

Scaled Datasets X and Y

28658

7165

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Linear regression - gives a formula for the correlation.
# In regression, we must remove NaN values.

# Creating a regression model
mlr = LinearRegression()

# Fitting the model with training data
mlr.fit(X_train_scaled, y_train)

print(f"coefficient for linear regression model is: {mlr.coef_}")
print(f"intercept for our model is: {mlr.intercept_}")

predictions = mlr.predict(X_test_scaled)
print()

# Model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))
print()

r_squared = mlr.score(X_test_scaled, y_test)
print(f"R2 value of our model is: {r_squared}")

Output:

coefficient for linear regression model is: [-4161.89094351]

intercept for our model is: 25612.324446925813

mean_squared_error : 13246984576.179667

mean_absolute_error : 41150.7720104241

R2 value of our model is: 0.001647549928928016