How to Handle Income Data in Machine Learning Capstone Project?


The value “-1”, used for undisclosed income, accounts for about 73% of the responses, which has made it extremely difficult to examine the relationship between income, ethnicity, and orientation.

I have learned as much as possible, or so I think, about cleaning data.

For my charts and percentile analyses, I can extract the valid data and produce results, though it means I must work with a small sample size. I even created an interactive chart that allows the user to look at income averages based on cross-sections of orientation and ethnicity or multi-ethnic combinations.

I am working hard here, guys, but I am struggling to produce meaningful results with linear regression. I had previously determined that the relationship was not linear but polynomial. Still, the use and possible meaning of these special -1s, which represent undisclosed data and which I don’t find in other columns, is stuck in my head.

Does it mean anything special, or is the income data simply that incomplete, with -1 used instead of NaN for no particular reason that should dictate how I handle it? This may be one of those “dumb” questions I need to ask to clear my head.

But why use -1 instead of NaN?

I have learned to weed out both NaN and non-finite numbers, along with other advanced cleaning techniques, but those -1 values keep popping back up even when I think I have effectively filtered them out.

My filtering techniques have included everything I could find online: straightforward Boolean masks and pandas methods that identify null, NaN, and non-finite values. I can’t find much relevant information online, and this issue was not covered in the curriculum as I recall. Any advice or insights would be greatly appreciated.
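One reason the -1s survive null/NaN filters is that -1 is a perfectly valid, finite integer: `isnull()`, `isna()`, and `isfinite()` will never flag it. A minimal sketch of converting the sentinel to a real NaN first, using a toy frame (the actual dataset uses `feature_data`; the values here are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; -1 marks undisclosed income
df = pd.DataFrame({"income": [-1, 20000, -1, 80000, 50000]})

# -1 is a sentinel, not a real income, so isna()/isfinite() won't catch it.
# Convert it to NaN explicitly, then the usual NaN tools apply.
df["income"] = df["income"].replace(-1, np.nan)

valid = df["income"].dropna()  # only disclosed incomes remain
print(valid.tolist())          # [20000.0, 80000.0, 50000.0]
```

After the `replace`, every downstream NaN-aware step (`dropna`, masking, sklearn's error checks) treats the undisclosed entries consistently.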

Thank you,

Robert Pfaff

There really is no relevant link to leave, so I will post the code:

Here I have commented out the option to treat inf values as NA because I kept getting an error message that the linear regression module in sklearn could not run my model because of NaN values. That confused me for quite some time until I realized the problem was “infinite” values, which I had never really heard of.

# Commented out, leaving the -1s in the dataset
# with pd.option_context('mode.use_inf_as_na', True):
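For context on the error above: sklearn’s complaint about NaNs can indeed be triggered by infinite values, which pandas does not treat as missing by default. A small sketch of spotting and converting them (toy data, not the actual dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.inf, 3.0])

print(s.isna().sum())        # 0 -- inf is not NaN, so isna() misses it
print(np.isfinite(s).sum())  # 2 -- isfinite() catches it

# Convert infinities to NaN so one cleaning path handles both
s_clean = s.replace([np.inf, -np.inf], np.nan)
print(s_clean.isna().sum())  # 1
```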

Below, I have posted the rest of my linear regression code. I hope I have NOT overwhelmed you, and my question makes an iota of sense. My mind is a bit fried at the moment. Thanks again.

#Y is the target column, X has the features

X = feature_data[['orientation']]
y = feature_data['income']

#Split the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only: reuse the training fit to avoid data leakage

print("Scaled Datasets X and Y")


Scaled Datasets X and Y

# Linear regression - gives a formula for the correlation.
# In regression, we must remove NaN values.

# creating a regression model
mlr = LinearRegression()

# fitting the model with the training data
mlr.fit(X_train_scaled, y_train)

print(f"coefficient for linear regression model is: {mlr.coef_}")
print(f"intercept for our model is: {mlr.intercept_}")

predictions = mlr.predict(X_test_scaled)

# model evaluation
print('mean_squared_error : ', mean_squared_error(y_test, predictions))
print('mean_absolute_error : ', mean_absolute_error(y_test, predictions))

r_squared = mlr.score(X_test_scaled, y_test)

print(f"R2 value of our model is: {r_squared}")


coefficient for linear regression model is: [-4161.89094351]
intercept for our model is: 25612.324446925813

mean_squared_error : 13246984576.179667
mean_absolute_error : 41150.7720104241

R2 value of our model is: 0.001647549928928016

Using NaN or -1 is contextual. If you’re data mining, some algorithms do not work well with null data (how important is it to use those particular algorithms? are there alternatives that could work just as well here?), in which case one has to decide what to do with the value. Do you impute (and if so, with what value)? Do you drop the row (which could introduce bias, depending on the reason behind the nulls)? There is no clear-cut solution that covers every situation.
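To make the impute-vs-drop trade-off concrete, here is a small sketch on toy data (column names echo the post, but the values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data after -1 -> NaN conversion
df = pd.DataFrame({"income": [np.nan, 20000, np.nan, 80000, 50000],
                   "orientation": [0, 1, 1, 0, 1]})

# Option 1: drop rows with missing income (risks bias if non-response
# correlates with what you're measuring)
dropped = df.dropna(subset=["income"])

# Option 2: impute, e.g. with the median of the disclosed values
imputed = df.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

print(len(dropped))                      # 3
print(int(imputed["income"].isna().sum()))  # 0
```

With 73% missing, option 2 means most of the column is a made-up number, which is why dropping (and accepting the small sample) is often the more honest choice here.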

Then there’s also the consideration that NaN is represented as a float. Floats tend to take up more memory than ints. Potentially, depending on what you’re doing and what hardware you’re doing it on, you could run into performance limitations.
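A quick illustration of the float point: introducing a single NaN silently upcasts a whole integer column to float64, though pandas' nullable integer dtype can avoid that when memory matters.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])
print(ints.dtype)      # int64

# One NaN upcasts the entire column, because NaN only exists as a float
with_nan = pd.Series([1, 2, np.nan])
print(with_nan.dtype)  # float64

# pandas' nullable integer dtype keeps ints and allows missing values
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)  # Int64
```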

Some technical stuff:

There’s also the question of how much of your data set this impacts. If it’s a considerable amount, and it’s a fundamental data point in what you’re calculating, you probably need a much larger data set so that the incomplete entries carry less weight.

I do think that making this decision is not trivial. These considerations are important in good data preparation. Which choice is right will depend mostly on what you have in mind for the algorithms in the pipeline.


Thank you for responding to my question.

It confirmed what I suspected to be true, but did not want to assume. I am going to treat the -1s as NaN values because they represent about 73% of the income data. Imputing the mean does not make much sense to me, since it’s almost certainly not even close to the real mean. I can’t impute any value for that many missing values with much confidence, and we have no information in the instructions as to why they were used instead of NaNs. I appreciate your time, and I will review those links now.

Thank you again for your response.



Definitely agreed that imputation is not really the way to go here 🙂
