You should not try to optimize
random_state; it is a parameter included for the sole purpose of making sure your results are reproducible. If you don't include anything for
random_state, then each time you (or someone else) runs your code,
train_test_split will split your data in a different random way for your training/test sets and the outcome will be different each time. Including an integer as an argument for
random_state ensures that so long as you use that same integer, your training data and test data will always be split the same way, and thus produce the same results.
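Here's a minimal sketch of that behavior (assuming scikit-learn is installed; the toy data is made up for illustration):

```python
from sklearn.model_selection import train_test_split

X = list(range(10))  # a tiny stand-in dataset

# Same integer for random_state -> identical splits every run.
a_train, a_test = train_test_split(X, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(X, test_size=0.2, random_state=42)

# No random_state -> the split may change from run to run.
c_train, c_test = train_test_split(X, test_size=0.2)

print(a_test == b_test)  # True: same seed, same split
```

The unseeded split (`c_test`) is not guaranteed to match either of the seeded ones, which is exactly why unseeded runs can produce different results each time.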
In an ideal world, you shouldn’t see too much of a difference between results with different
random_state values, but this depends entirely on the size and uniformity of the data (in the lesson, Codecademy did say there would be a little more variance than usual because this dataset isn’t very large). There will, however, always be some difference simply because of which data points ended up in the training set and which ended up in the test set.
In this case, you split the data 80%/20% for training/test sets. However, the data points in that 80% training set will be different based on whether your
random_state was set to 100 or to 42. So, it follows that you should see a little difference in the accuracy depending on which data points your model was trained on.
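To see that difference concretely, here's a hedged sketch comparing those two seeds. I don't know your exact dataset, so this uses scikit-learn's built-in breast-cancer data as a stand-in, and k=5 is an arbitrary choice:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

scores = {}
for seed in (42, 100):
    # Each seed shuffles the data differently before the 80/20 split,
    # so the model trains on a different subset each time.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    scores[seed] = model.score(X_test, y_test)

print(scores)  # the two accuracies are typically close but not identical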
This directly ties into your second question. Because
train_test_split will keep randomly selecting data points to add to your training and test sets until that 80/20 split is reached, the “nearest neighbors” will be different depending on which
random_state integer you supply (or they will vary each time you run your code if a
random_state wasn’t provided). Since your “nearest neighbors” will differ depending on where
train_test_split put your data points, it may take more or fewer of these neighbors (
k) to make an accurate prediction. The “levels” that you see in the graphs directly correspond to this, as they represent where a certain
k is underfitting (large
k: performs poorly on the training data and also cannot generalize to predict new data) or overfitting (small
k: does well on the training data but performs poorly on new data).
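You can reproduce those "levels" yourself with a sweep over k. Again a sketch under the same assumptions (scikit-learn available, built-in breast-cancer data standing in for your dataset, arbitrary k values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100)

train_scores, test_scores = {}, {}
for k in (1, 5, 25, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Small k: training accuracy is very high but test accuracy lags (overfit).
    # Large k: both accuracies sag as predictions get over-smoothed (underfit).
    train_scores[k] = model.score(X_train, y_train)
    test_scores[k] = model.score(X_test, y_test)
    print(k, train_scores[k], test_scores[k])
```

Plotting `test_scores` against k is exactly the curve from the lesson's graphs; the flat plateaus appear because several consecutive k values can classify the same test points correctly before the next point "flips."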
Hopefully this helps clear up some of your confusion.