Startup Transformation project standardization help #support

How can I standardize two columns from different dataframes?
I want to explore the relationship between Income and Productivity, but these two columns come from different dataframes and are on different scales.

https://www.codecademy.com/paths/data-science/tracks/dscp-summary-statistics/modules/dacp-data-transformation/projects/data-transformation-project

Here is my code:
import codecademylib3

from sklearn import preprocessing

from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt

import pandas as pd

import seaborn as sns

import numpy as np

# load in financial data

financial_data = pd.read_csv('financial_data.csv')

# code goes here

print(financial_data.head())

# storing each variable

months = financial_data['Month']

revenues = financial_data['Revenue']

expenses = financial_data['Expenses']

# creating plot of revenue over past six months

plt.plot(months, revenues)

plt.xlabel('Month')

plt.ylabel('Amount ($)')

plt.title('Revenue')

plt.show()

# creating plot of expenses over last six months

plt.clf()

plt.plot(months, expenses)

plt.xlabel('Month')

plt.ylabel('Amount ($)')

plt.title('Expenses')

plt.show()

# load in expenses data

expenses_overview = pd.read_csv('expenses.csv')

print(expenses_overview.head(7))

expense_categories = ['Salaries', 'Advertising', 'Office Rent', 'Other']

proportions = [0.62, 0.15, 0.15, 0.08]

# creating pie chart of different expense categories

plt.clf()

plt.pie(proportions, labels = expense_categories)

plt.title('Expense Categories')

plt.axis('equal')

plt.tight_layout()

plt.show()

# load in employee data

employees = pd.read_csv('employees.csv')

print(employees.head())

# Sort dataframe by Productivity column

sorted_productivity = employees.sort_values(by = ['Productivity'])

print(sorted_productivity)

# Storing first 100 rows of sorted_productivity

employees_cut = sorted_productivity.head(100)

print(employees_cut)

# calculating average commute time

commute_times = employees['Commute Time']

commute_times_log = np.log(commute_times)

print(commute_times.describe())

# making histogram of (log-transformed) commute time

plt.clf()

plt.hist(commute_times_log)

plt.title('Employee Commute Times')

plt.xlabel('Commute Time')

plt.ylabel('Frequency')

plt.show()

# exploring relationship between Income and Productivity

# standardizing

productivities = employees['Productivity']

scaler = StandardScaler()

standardized_productivity = scaler.fit_transform(productivities)

standardized_revenue = scaler.fit_transform(revenues)

But this standardization code doesn’t work.

Did you get an error message like this?

“ValueError: Expected 2D array, got 1D array instead:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.”

Did you try the suggestion of reshaping the data?

Yes, the error was like this. But how do I reshape the data? Please write some example code.

I don’t have anything but I presume it would be similar to the function from a few lessons prior to this one (Data Centering and Scaling):

def standardize(lst, mean, std_dev):
  standardized = []

  for value in lst:   
    standardized_num = (value - mean) / (std_dev)    
    standardized.append(standardized_num)

  return standardized

?

Plus, the question just asks you to put your answer in a string. (But, I understand why you’d want to figure it out I guess).
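
For the reshape that the error message suggests, here is a minimal sketch. The values are made up to stand in for employees['Productivity'], and the names productivity_2d and standardized_alt are only for illustration. StandardScaler expects a 2D input (rows by features), so a single column has to be turned into a shape of (n, 1) first, either with reshape(-1, 1) or by selecting the column with double brackets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for employees['Productivity'] (assumption).
employees = pd.DataFrame({'Productivity': [60.0, 75.0, 90.0, 105.0]})

scaler = StandardScaler()

# A Series is 1D; reshape its values into one column with n rows.
productivity_2d = employees['Productivity'].to_numpy().reshape(-1, 1)
standardized_productivity = scaler.fit_transform(productivity_2d)

# Double brackets select a one-column DataFrame, which is already 2D.
standardized_alt = StandardScaler().fit_transform(employees[['Productivity']])

print(standardized_productivity.shape)
print(standardized_productivity.ravel())
```

Both routes should give the same numbers; .ravel() flattens the result back to 1D if you need that for plotting.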


Hi,
In the example they used two different columns (age and income) as the data before the transformation.
So if I use

income_product = employees[['Salary', 'Productivity']]
standardized_data = scaler.fit_transform(income_product)

it works, but I’m not sure if this is the required result. I wanted to see a plot of it:

plt.scatter(income_product['Productivity'], income_product['Salary'])

But it was not informative to me, and it failed on the standardized data.
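
One likely reason the plot fails on the standardized data (an assumption, since the error isn't shown): by default fit_transform returns a plain NumPy array, so the column labels are gone and standardized_data['Productivity'] no longer works. A minimal sketch with made-up rows, where standardized_df is a name introduced only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; the real employees.csv has many more (assumption).
employees = pd.DataFrame({
    'Salary': [40000.0, 52000.0, 61000.0, 75000.0],
    'Productivity': [60.0, 75.0, 90.0, 105.0],
})

income_product = employees[['Salary', 'Productivity']]
standardized_data = StandardScaler().fit_transform(income_product)

# fit_transform returned a NumPy array here, so put the labels back.
standardized_df = pd.DataFrame(
    standardized_data,
    columns=income_product.columns,
    index=income_product.index,
)

print(type(standardized_data).__name__)
print(standardized_df.head())
```

After that, plt.scatter(standardized_df['Productivity'], standardized_df['Salary']) should work. The point cloud will have the same shape as before, because standardizing only shifts and rescales each axis.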


This is how I standardized Productivity and Salary. Put the column names in a list and iterate through it, selecting each column with double brackets, because the scaler requires a 2-dimensional array.

income_product = ['Salary', 'Productivity']
scaler = StandardScaler()
for item in income_product:
  # employees[[item]] is a one-column DataFrame (2D), which the scaler accepts
  employees[item] = scaler.fit_transform(employees[[item]])

Then you can proceed with a scatterplot:

plt.scatter(employees.Salary, employees.Productivity)
plt.show()
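
As a quick sanity check, here is a sketch of the same standardization done with plain pandas on made-up rows. The names cols, standardized, corr_before and corr_after are only for illustration. Using ddof=0 matches the population standard deviation that StandardScaler uses, and the correlation between the two columns should come out the same before and after, since each column is only shifted and rescaled.

```python
import pandas as pd

# Made-up rows standing in for employees.csv (assumption).
employees = pd.DataFrame({
    'Salary': [40000.0, 52000.0, 61000.0, 75000.0],
    'Productivity': [60.0, 75.0, 90.0, 105.0],
})

cols = ['Salary', 'Productivity']

# (value - mean) / std for every column at once; ddof=0 is the population std.
standardized = (employees[cols] - employees[cols].mean()) / employees[cols].std(ddof=0)

# Pearson correlation before and after standardizing.
corr_before = employees['Salary'].corr(employees['Productivity'])
corr_after = standardized['Salary'].corr(standardized['Productivity'])

print(standardized.mean().round(6))
print(round(corr_before, 4), round(corr_after, 4))
```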