Random forest exercise

https://www.codecademy.com/paths/data-science/tracks/dspath-supervised/modules/decision-trees/projects/ranodm-forest-income

I am getting an error when I run the program using the 5 labels that are used as input for the variable data.

If you are on step 9, then you should get an error. Read the instructions for this step again.

If you are on one of the next steps - post your code and post the error message.

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

income_data=pd.read_csv("income.csv",header=0,delimiter=",")
print(income_data.iloc[0])

labels=income_data[[" income"]]
data = income_data["age", " capital-gain", " capital-loss", " hours-per-week", " sex"]
Error
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2890, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('age', ' capital-gain', ' capital-loss', ' hours-per-week', ' sex')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "income.py", line 15, in <module>
    data = income_data["age", " capital-gain", " capital-loss", " hours-per-week", " sex"]
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2975, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2892, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('age', ' capital-gain', ' capital-loss', ' hours-per-week', ' sex')

We’ll also want to pick which columns to use when trying to predict income. For now, let’s select "age" , "capital-gain" , "capital-loss" , "hours-per-week" , and "sex" . Create a new variable named data that contains only those columns. The syntax for this is very similar to selecting only one column:

many_columns = data_frame_name[["a", "b", "c"]]

In this example, many_columns now contains the columns "a" , "b" , and "c" from data_frame_name .

You should use double brackets. Double brackets allow us to pass a list of column labels into the __getitem__ method of the dataframe.
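To see why the bracket style matters, here is a minimal sketch in plain Python. ShowKey is a hypothetical stand-in class, not part of pandas; it only reports what its __getitem__ method receives. With single brackets, Python packs the comma-separated labels into one tuple, and a dataframe then looks for a single column whose name is that whole tuple, which matches the KeyError in the traceback above.

```python
class ShowKey:
    """Hypothetical stand-in for a DataFrame; it only reports the key it receives."""

    def __getitem__(self, key):
        return type(key).__name__, key


frame = ShowKey()

# Single brackets with commas: Python packs the labels into ONE tuple key.
single = frame["age", "sex"]

# Double brackets: the inner brackets build a list, which is passed as the key.
double = frame[["age", "sex"]]

print(single)  # ('tuple', ('age', 'sex'))
print(double)  # ('list', ['age', 'sex'])
```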

Thank you for your assistance, I really appreciate it. It wasn't working with double brackets either yesterday; I think it was a problem in the system.

You’re very welcome.

I am still having problems updating the columns. I have checked the help video and there is no difference; I am still getting that huge error. It is on the line data = income_data[["age", " capital-gain", " capital-loss", " hours-per-week", "sex-int"]]

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

income_data=pd.read_csv("income.csv",header=0,delimiter=",")
print(income_data.iloc[0])

income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)

labels=income_data[[" income"]]

data = income_data[["age", " capital-gain", " capital-loss", " hours-per-week", "sex-int"]]

train_data,test_data,train_labels,test_labels=train_test_split(data, labels, random_state=1)
forest=RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
print(forest.score(test_data, test_labels))

Can you please help me? I can't move forward with the exercise until I solve this problem.

The problem is the inconsistent naming of the columns. If you want to use delimiter=",", that’s fine, but you need to use the name " sex", not "sex", in this line:

income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)

The better approach would be to use delimiter=", " and not add the leading space:

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

income_data=pd.read_csv("income.csv",header=0,delimiter=", ")
print(income_data.iloc[0])

income_data["sex-int"] = income_data["sex"].apply(lambda row: 0 if row == "Male" else 1)

labels=income_data[["income"]]

data = income_data[["age", "capital-gain", "capital-loss", "hours-per-week", "sex-int"]]

train_data,test_data,train_labels,test_labels=train_test_split(data, labels, random_state=1)
forest=RandomForestClassifier(random_state=1)
forest.fit(train_data, train_labels)
print(forest.score(test_data, test_labels))

Thank you. I had to add a space for age as well.

I would still suggest that you use delimiter=", ". This will make your life a little bit easier.

Currently your code runs, but it is incorrect. Take a look at this line:

income_data["sex-int"] = income_data[" sex"].apply(lambda row: 0 if row == "Male" else 1)

The problem is that sex will never have the value "Male" if you use delimiter=",". According to your processed dataframe, all the records are female. To correct this, you have to add a leading space to "Male" -> " Male". But you see how abstract and nonintuitive this is, right? It would be better to simply use delimiter=", ".
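The difference between the two delimiters can be reproduced on a tiny sample. This is only a sketch: csv_text is made-up data that assumes income.csv uses the same "comma plus space" layout, and engine="python" is added here as an assumption to make the parser choice explicit for the two-character delimiter.

```python
import io

import pandas as pd

# Made-up two-row sample in the same "comma plus space" layout as income.csv.
csv_text = "age, sex, income\n39, Male, <=50K\n28, Female, >50K\n"

# delimiter=",": the space after each comma stays in the names and the values.
with_comma = pd.read_csv(io.StringIO(csv_text), header=0, delimiter=",")
print(list(with_comma.columns))      # ['age', ' sex', ' income']
print(with_comma[" sex"].tolist())   # [' Male', ' Female']

# Comparing against "Male" never matches " Male", so every row becomes 1.
wrong = with_comma[" sex"].apply(lambda row: 0 if row == "Male" else 1)
print(wrong.tolist())                # [1, 1]

# delimiter=", ": names and values come out without the leading space.
# engine="python" is passed explicitly because multi-character separators
# are handled by that engine.
with_comma_space = pd.read_csv(
    io.StringIO(csv_text), header=0, delimiter=", ", engine="python"
)
right = with_comma_space["sex"].apply(lambda row: 0 if row == "Male" else 1)
print(right.tolist())                # [0, 1]
```

With delimiter=",", the code runs without any error but silently labels every record as 1, which is why the mistake is easy to miss.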

I’m having a similar problem as I get this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ' sex'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "income.py", line 14, in <module>
    income_data["sex-int"] = income_data[" sex"].apply(lambda row: 0 if row == "Male" else 1)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 2995, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py", line 2899, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ' sex'

This is my current code:

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

income_data = pd.read_csv("income.csv", header=0, delimiter=", ")
print(income_data.iloc[0])

income_data["sex-int"] = income_data[" sex"].apply(lambda row: 0 if row == "Male" else 1)

labels = income_data[[" income"]]
data = income_data[["age", " capital-gain", " capital-loss", " hours-per-week", " sex-int"]]

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, random_state = 1)
forest = RandomForestClassifier(random_state = 1)
forest.fit(train_data, train_labels)

What should I change?
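One way to track down a KeyError like this is to compare the labels the code asks for with the labels the dataframe actually has. The sketch below uses plain lists so it runs on its own. missing_columns is a hypothetical helper, and the available list is an assumption about what delimiter=", " produces, based on the earlier reply in this thread.

```python
def missing_columns(available, requested):
    """Return the requested labels that match no available label exactly."""
    return [name for name in requested if name not in available]


# Labels as delimiter=", " would produce them (assumed, based on the thread).
available = ["age", "capital-gain", "capital-loss", "hours-per-week", "sex", "income"]

# Labels requested in the code above, leading spaces included.
requested = ["age", " capital-gain", " capital-loss", " hours-per-week", " sex", " income"]

missing = missing_columns(available, requested)

# Printing the list shows the quotes, so leading spaces are easy to spot.
print(missing)
```

In the real script, list(income_data.columns) could be passed as available. Any label reported as missing is one that will raise a KeyError when used to index the dataframe.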