KNearestNeighbor Breast Cancer Classifier

Hi there

Can anyone clarify a few things for me?

  1. How do I arrive at the optimum ‘random state’? The attached images below show 3 different random states used for plotting which ‘k’ results in the highest prediction accuracy. There doesn’t appear to be a pattern to where each plot peaks when different random states are used. This makes it difficult to decide which ‘k’ is optimal, as the algorithm performs differently for different values of ‘k’ under different random states. Is the point of this exercise that, for this particular dataset, ‘k’ is arbitrary?

[Images: Snip1, Snip2, Snip3 (plots of accuracy against ‘k’ for three different random states)]

  2. The plots show a number of distinct levels of accuracy, with the plot jumping from one level to another. What is the significance of these distinct levels with regard to the way the KNeighborsClassifier algorithm works?

Many thanks 🙂


Excellent questions.

You should not try to optimize random_state, as it is a parameter that is included for the sole purpose of making your tests reproducible. If you don’t pass anything for random_state, then each time you (or someone else) run your code, train_test_split will split your data into training/test sets in a different random way, and the outcome will be different each time. Passing an integer for random_state ensures that, so long as you use that same integer, your training data and test data will always be split the same way and thus produce the same results.
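To see this reproducibility in practice, here is a minimal sketch. It assumes the lesson's data is the breast cancer dataset that ships with scikit-learn (load_breast_cancer) and that the split uses test_size=0.2; the variable names are only for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: the lesson uses scikit-learn's built-in breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)

# Same integer twice: the two splits are identical.
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=100)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=100)
same_seed_identical = np.array_equal(X_test_a, X_test_b)

# Different integer: the rows that land in the test set change.
_, X_test_c, _, y_test_c = train_test_split(X, y, test_size=0.2, random_state=42)
different_seed_identical = np.array_equal(X_test_a, X_test_c)

print("same random_state gives same test set:", same_seed_identical)
print("different random_state gives same test set:", different_seed_identical)
```

Neither split is "better" than the other; the integer only pins down which rows go where.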

In an ideal world, you shouldn’t see too much of a difference between results with different random_state values, but this depends entirely on the size and uniformity of the data (in the lesson, Codecademy did say there would be a little more variance than usual because this dataset isn’t very large). There will, however, always be some difference simply due to which portion of the data ended up in the training set and which portion ended up in the test set.

In this case, you split the data 80%/20% for training/test sets. However, the data points in that 80% training set will be different based on whether your random_state was set to 100 or to 42. So, it follows that you should see a little difference in the accuracy depending on which data points your model was trained on.
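As a rough illustration, the sketch below repeats the accuracy-versus-k experiment for a few random_state values. It assumes scikit-learn's load_breast_cancer data; the seed 7 and the range of k from 1 to 30 are arbitrary choices, not values from the lesson. Averaging the curves over several splits is one common way to get a steadier view of which k works well, although it is not the only one.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the lesson uses scikit-learn's built-in breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)

seeds = [42, 100, 7]      # 42 and 100 come from the text above; 7 is arbitrary
k_values = range(1, 31)   # arbitrary range chosen to keep the example quick


def accuracy_curve(seed):
    """Return test accuracy for each k, using one 80/20 split fixed by `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    scores = []
    for k in k_values:
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return np.array(scores)


curves = {seed: accuracy_curve(seed) for seed in seeds}

# The best k for each individual split.
for seed, scores in curves.items():
    best_k = k_values[int(np.argmax(scores))]
    print(f"random_state={seed}: best k={best_k}, accuracy={scores.max():.3f}")

# Averaging over the splits smooths out some of the split-to-split noise.
mean_curve = np.mean(list(curves.values()), axis=0)
best_mean_k = k_values[int(np.argmax(mean_curve))]
print("k with the best mean accuracy:", best_mean_k)
```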

This ties into your second question. Because train_test_split shuffles the data before carving out the 80/20 split, the “nearest neighbors” of any given test point will be different depending on which random_state integer you supply (or they will vary each time you run your code if a random_state wasn’t provided). Since the neighbors differ depending on where train_test_split put your data points, the k that gives the most accurate predictions differs too. In general, a small k tends to overfit (it does well on the training data but performs poorly on new data), while a large k tends to underfit (it is not good on the training data and does not generalize to new data either). The distinct “levels” themselves come from the size of the test set: accuracy is the number of correct predictions divided by the number of test points, so it can only move in steps of one test point. Each jump in the graph is the model getting one or more test points right or wrong as k changes, and with a test set this small each step is clearly visible.
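The “levels” can also be checked numerically. On a fixed test set, accuracy is the count of correct predictions divided by the test-set size, so every value on the plot should be a whole number of test points. The sketch below assumes scikit-learn's load_breast_cancer data, an 80/20 split with random_state=100, and k from 1 to 100; these settings are illustrative rather than taken from the lesson.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the lesson uses scikit-learn's built-in breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)
n_test = len(y_test)

accuracies = []
for k in range(1, 101):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))
accuracies = np.array(accuracies)

# Convert each accuracy back into a count of correctly classified test points.
correct_counts = np.round(accuracies * n_test).astype(int)

print("test set size:", n_test)
print("smallest possible accuracy step:", 1 / n_test)
print("distinct counts of correct predictions:", sorted(set(correct_counts.tolist())))
```

Each distinct count corresponds to one horizontal level in the plot, so a larger test set would give finer steps.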

Hopefully this helps clear up some of your confusion.

Happy coding!