FAQ: Multiple Linear Regression - Training Set vs. Test Set

This community-built FAQ covers the “Training Set vs. Test Set” exercise from the lesson “Multiple Linear Regression”.

Paths and Courses
This exercise can be found in the following Codecademy content:

Machine Learning

FAQs on the exercise Training Set vs. Test Set

In the context of the Streeteasy project, the lesson shows me the following:

streeteasy = pd.read_csv("https://raw.githubusercontent.com/sonnynomnom/Codecademy-Machine-Learning-Fundamentals/master/StreetEasy/manhattan.csv")

df = pd.DataFrame(streeteasy)

Why is there a need to create another dataframe (df) when streeteasy is already a dataframe?
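
A small sketch that may help with this question: pd.read_csv already returns a DataFrame, so wrapping the result in pd.DataFrame(...) just produces a second DataFrame with the same contents. The inline CSV text and its values below are made up for illustration and only stand in for manhattan.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for manhattan.csv (made-up values, for illustration only)
csv_text = "rent,bedrooms,size_sqft\n2550,0,480\n11500,2,2000\n4500,1,916\n"

streeteasy = pd.read_csv(io.StringIO(csv_text))
print(type(streeteasy))          # read_csv already gives a DataFrame

df = pd.DataFrame(streeteasy)    # wraps the same data in a second DataFrame object
print(df.equals(streeteasy))     # compares the contents of the two objects
```

Under that assumption, the extra pd.DataFrame(...) call is probably not needed for this exercise, and reading the CSV straight into df should behave the same way.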

I understand the mechanics of obtaining the x and y train and test data, but in the context of the exercise and the data we have processed, what is the interpretation of these arrays? What conclusion can we draw from the results obtained?

There’s conflicting information between the exercise and the article on Training Set, Validation Set and Test Set.
https://www.codecademy.com/paths/machine-learning/tracks/regression-skill-path/modules/multiple-linear-regression-skill-path/lessons/multiple-linear-regression-streeteasy/exercises/training-vs-test
The two sources say:
“In general, putting 80% of your data in the training set and 20% of your data in the test set is a good place to start.”
“In general, putting 80% of your data in the training set, and 20% of your data in the validation set is a good place to start.”
Which of these applies to the exercise? Isn’t the test set supposed to be untouched data that you use only for the final evaluation?

To select all those columns, you can simply drop the columns you don’t want:

x = df.drop(['rent'], axis=1)
y = df.rent
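
To illustrate the drop approach, here is a sketch on a toy DataFrame. The column names and values are made up and are not the real StreetEasy data. Everything except rent ends up in x, and rent alone ends up in y.

```python
import pandas as pd

# Toy data standing in for the StreetEasy DataFrame (values are made up)
df = pd.DataFrame({
    "rent": [2550, 11500, 4500],
    "bedrooms": [0.0, 2.0, 1.0],
    "size_sqft": [480, 2000, 916],
})

x = df.drop(["rent"], axis=1)   # all columns except the target
y = df.rent                     # the target column as a Series

print(list(x.columns))
print(y.name)
```

One caveat: if the real dataset contains columns the model should not use (IDs or text columns, for example), those would need to be added to the drop list as well.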

Hi! Can anyone explain what setting an integer random_state does in train_test_split? Thank you!

According to the documentation, train_test_split shuffles the data before splitting by default. If we don’t specify random_state, the shuffle differs from run to run. If we specify random_state, the same shuffle is reproduced, so the split is identical across multiple function calls.
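
A quick way to see this is to split the same toy arrays twice with the same random_state. The arrays and the value 6 are arbitrary choices for illustration, not values taken from the lesson.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(10).reshape(-1, 1)   # ten toy samples, one feature each
y = np.arange(10)

# Same random_state -> same shuffle -> same split
a_train, a_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=6)
b_train, b_test, _, _ = train_test_split(x, y, test_size=0.2, random_state=6)
print(np.array_equal(a_test, b_test))   # True

# No random_state -> the split may differ from one call to the next
c_train, c_test, _, _ = train_test_split(x, y, test_size=0.2)
print(c_test.ravel())
```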

Thank you! I’ve seen that written on W3Schools too but had some trouble understanding the phrase haha. So does that mean that, given the same data and the same random_state integer, the training sets across different runs will include identical data? Can I think of the random_state integer as a label for one particular way of splitting?

Yes, if you pass the same data and the same random_state integer, you will get the same datasets back across different runs.

The function probably shuffles the given dataset, returns the first (1 - test_size) fraction of the shuffled data as the training set, and returns the rest as the test set. If we set random_state to a fixed integer, the result of the shuffle is fixed, so we get the same datasets every time.
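
The idea described above could be sketched by hand with NumPy. This is only an illustration of shuffle-then-slice, not scikit-learn’s actual implementation, and manual_split is a hypothetical helper name.

```python
import numpy as np

def manual_split(data, test_size=0.2, seed=None):
    """Shuffle row indices, then slice them into train and test parts."""
    rng = np.random.default_rng(seed)          # fixed seed -> fixed shuffle
    indices = rng.permutation(len(data))       # shuffled row positions
    n_train = int(round(len(data) * (1 - test_size)))
    return data[indices[:n_train]], data[indices[n_train:]]

data = np.arange(10)
train_1, test_1 = manual_split(data, seed=6)
train_2, test_2 = manual_split(data, seed=6)
print(np.array_equal(train_1, train_2))  # True: same seed, same split
print(len(train_1), len(test_1))         # 8 2
```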

Dope dope, thank you!

Hello, why do we use df[]? Is it similar to reshape(-1, 1)? Thank you for your reply!

I’m a little confused about something: why did we not use a validation set here? And when is it appropriate to use a validation set? Do we have to choose between a validation set and a test set?

I’m confused about the difference between df["rent"] and df[["rent"]]. Aren’t they both DataFrames with only the "rent" column? Why do I need the second set of [ ] to get the correct answer?
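
A short sketch that may clear this up: the two expressions return different types. The DataFrame below uses made-up values for illustration.

```python
import pandas as pd

df = pd.DataFrame({"rent": [2550, 11500, 4500]})  # made-up values

single = df["rent"]      # one pair of brackets -> pandas Series (1-D)
double = df[["rent"]]    # list of column names -> pandas DataFrame (2-D)

print(type(single).__name__, single.shape)   # Series (3,)
print(type(double).__name__, double.shape)   # DataFrame (3, 1)

# reshape(-1, 1) on the underlying values gives the same 2-D shape as a NumPy array
print(single.to_numpy().reshape(-1, 1).shape)  # (3, 1)
```

scikit-learn estimators generally expect the feature matrix to be 2-D, which is likely why the exercise asks for the double brackets when selecting features. In that sense df[["rent"]] plays a role similar to reshape(-1, 1).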