Hi,

Undisclosed income, coded as “-1”, accounts for about 73% of the responses, and this has made it extremely difficult to examine the relationship between income, ethnicity, and orientation.

I have learned as much as possible, or so I think, about cleaning data.

For my charts and percentile analyses, I can extract the valid data and produce results, though it means I must work with a small sample size. I even created an interactive chart that allows the user to look at income averages based on cross-sections of orientation and ethnicity or multi-ethnic combinations.

I am working hard here, guys, but I am struggling to produce meaningful results with linear regression. I had previously determined that there was no linear relationship, but possibly a polynomial one. Still, I keep coming back to these special -1s used to represent undisclosed data, which I don’t find in other columns.

Does the -1 mean anything special, or is the income data simply that incomplete, with -1 used instead of NaN for no particular reason that should dictate how I handle it? This may be one of those “dumb” questions I need to ask to clear my head.

But why use -1 instead of NaN?

I have learned to weed out both NaN and non-finite numbers, along with other advanced cleaning techniques, but those -1 values keep popping back up even when I think I have effectively filtered them out.

My filtering attempts have included everything I could find online: straightforward Boolean masks and the pandas methods that identify null, NaN, and non-finite values. I can’t find much relevant information online, and this issue was not covered in the curriculum as I recall. Any advice or insights would be greatly appreciated.
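For what it’s worth, part of the answer may be that -1 is an ordinary integer rather than a missing-value marker, so NaN- and finite-oriented filters never touch it. A minimal sketch (on a made-up toy column, not the real dataset) of converting the sentinel to NaN explicitly and then dropping it:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset: -1 marks undisclosed income.
df = pd.DataFrame({"income": [20000, -1, 80000, -1, 50000]})

# -1 is an ordinary integer, so NaN-based filters ignore it entirely.
print(df["income"].isna().sum())  # 0 -- nothing looks "missing" yet

# Turn the sentinel into NaN, then drop it like any other missing value.
df["income"] = df["income"].replace(-1, np.nan)
cleaned = df.dropna(subset=["income"])
print(len(cleaned))  # 3 rows with disclosed income remain
```

An equivalent one-liner that avoids NaN entirely is a plain Boolean mask, `df[df["income"] != -1]`.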

Thank you,

Robert Pfaff

There really is no relevant link to leave, so I will post the code:

Here I have commented out the option to treat inf values as NA, because I kept getting an error message that the linear regression module in sklearn could not run my model because of NaN values. That confused me for quite some time until I realized the problem was actually “infinite” values, which I had never really heard of.

# Commented out the inf-as-NA option, leaving the -1s in the dataset:
# with pd.option_context('mode.use_inf_as_na', True):
feature_data.dropna(inplace=True)
print(len(feature_data))
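As an aside, infinite values usually sneak in through arithmetic such as dividing by zero, and they can be screened out explicitly with `np.isfinite` instead of relying on the `mode.use_inf_as_na` option. A sketch on a toy column:

```python
import numpy as np
import pandas as pd

# Toy column where a divide-by-zero quietly produces inf.
s = pd.Series([1.0, 2.0, 0.0])
rates = 1.0 / s                     # last entry becomes inf
print(np.isfinite(rates).tolist())  # [True, True, False]

# Keep only the finite rows; no pandas option needed.
finite_only = rates[np.isfinite(rates)]
print(len(finite_only))  # 2
```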

Below, I have posted the rest of my linear regression code. I hope I have NOT overwhelmed you and that my question makes an iota of sense. My mind is a bit fried at the moment. Thanks again.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# y is the target column, X has the features
X = feature_data[['orientation']]
y = feature_data['income']

# Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Use transform (not fit_transform) on the test set, so it is scaled
# with the statistics learned from the training data.
X_test_scaled = scaler.transform(X_test)

print("Scaled Datasets X and Y")
print(len(X_train_scaled))
print(len(X_test_scaled))
print()
print(len(y_train))
print(len(y_test))

Output:

Scaled Datasets X and Y

28658

7165

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Linear regression - gives a formula for the correlation.
# In regression, we must remove NaN values.

# Creating a regression model
mlr = LinearRegression()

# Fitting the model with training data
mlr.fit(X_train_scaled, y_train)

print(f"coefficient for linear regression model is: {mlr.coef_}")
print(f"intercept for our model is: {mlr.intercept_}")

predictions = mlr.predict(X_test_scaled)
print()

# Model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))
print()

r_squared = mlr.score(X_test_scaled, y_test)
print(f"R2 value of our model is: {r_squared}")

Output:

coefficient for linear regression model is: [-4161.89094351]

intercept for our model is: 25612.324446925813

mean_squared_error : 13246984576.179667

mean_absolute_error : 41150.7720104241

R2 value of our model is: 0.001647549928928016