@core4436729277,
Excellent questions.
You should not try to optimize `random_state`, as it is a parameter included for the sole purpose of making sure your results are reproducible. If you don't pass anything for `random_state`, then each time you (or someone else) runs your code, `train_test_split` will split your data into training/test sets in a different random way, and the outcome will be different each time. Passing an integer for `random_state` ensures that, so long as you use that same integer, your training data and test data will always be split the same way, and thus produce the same results.
In an ideal world, you shouldn't see too much of a difference between results with different `random_state` values, but this will be entirely dependent on the size and uniformity of the data (in the lesson, Codecademy did say there would be a little more variance than usual because this dataset isn't very large). There will, however, always be some difference simply due to which X% of the data ended up in the training set and which X% ended up in the test set.
In this case, you split the data 80%/20% for training/test sets. However, the data points in that 80% training set will be different based on whether your `random_state` was set to 100 or to 42. So, it follows that you should see a little difference in the accuracy depending on which data points your model was trained on.
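You can see this for yourself by training the same classifier on two different splits. A minimal sketch, again assuming the iris dataset and a k-nearest-neighbors classifier in place of your own data and model (the two accuracies will usually differ a little, though on some datasets they may happen to match):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = {}
for seed in (42, 100):
    # Each seed shuffles the data differently before the 80/20 split
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    scores[seed] = clf.score(X_te, y_te)
    print(f"random_state={seed}: test accuracy = {scores[seed]:.3f}")
```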
This directly ties into your second question. Because `train_test_split` will keep randomly selecting data points to add to your training and test sets until that 80/20 split is reached, the "nearest neighbors" will be different depending on which `random_state` integer you supply (or they will vary each time you run your code if a `random_state` wasn't provided). Since your "nearest neighbors" will differ depending on where `train_test_split` put your data points, it may take more or fewer of these neighbors (`k`) to make an accurate prediction. The "levels" that you see in the graphs directly correspond to this, as they represent where a certain `k` is underfitting (a large `k` performs poorly on the training data and also cannot be generalized to predict new data) or overfitting (a small `k` does well on the training data but performs poorly on new data).
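If you want to see that underfitting/overfitting pattern directly, you can sweep over `k` and compare training vs. test accuracy on a fixed split. A rough sketch, once more using iris as a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=100)

results = {}
for k in (1, 5, 50, 100):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # At k=1 training accuracy is perfect, since each training point
    # is its own nearest neighbor; at very large k, predictions get
    # smoothed toward the majority class and both accuracies drop
    results[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"k={k}: train={results[k][0]:.3f}, test={results[k][1]:.3f}")
```

Plotting test accuracy against `k` for a couple of different `random_state` values reproduces the shifting "levels" you noticed in the graphs.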
Hopefully this helps clear up some of your confusion.
Happy coding!