[Machine Learning] The Very Real Dangers of Improper Evaluation and Validation

If we are not careful, we may end up fooling ourselves or drawing the wrong conclusions from an article or a peer’s project.

In a perfect world, your machine learning training data will be neat, tidy, and well-balanced, but in reality we often get data that is skewed, messy, and poorly formatted. These factors come into play in analysis, hypothesis testing, visualization, and even in machine learning.

When it comes to evaluation metrics, we have a range of choices. The choice of metric determines which kinds of errors and flaws the evaluation will be sensitive to. Rather than committing to a single metric, you could also look at several of them in an exploratory manner to probe where your model is weak. Regression models and classification models each have their own set of evaluation metrics.
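
To make the "look at several metrics" idea concrete, here is a minimal sketch using sklearn's metric functions. The labels and predictions are made up for illustration and do not come from any real model.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels for a binary classifier: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Compute several metrics side by side instead of trusting a single number.
metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

On this toy data the accuracy is 0.70 while the recall is only 0.50, so half of the true positives are missed. Which of these numbers matters most depends on the question you are trying to answer.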

It may not be immediately apparent when something is wrong with your own ML work or someone else’s. Here’s an example. You might see that your model has great accuracy, only to find upon closer inspection that always guessing the most common label works just as well. This is a typical case when the categorical label distribution is heavily skewed.
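
A minimal sketch of that situation, assuming a made-up label distribution of 95 negatives and 5 positives. sklearn's DummyClassifier with strategy="most_frequent" ignores the features and always predicts the majority class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical skewed labels: 95 negatives, 5 positives.
y = np.array([0] * 95 + [1] * 5)
# Placeholder features; the dummy baseline does not look at them.
X = np.zeros((len(y), 1))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)
print(f"accuracy: {baseline_accuracy:.2f}, recall: {baseline_recall:.2f}")
```

The baseline reaches 95% accuracy without finding a single positive case. A real model on data like this needs to beat that number, and on more than accuracy alone, before its score means much.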

It is similar to how a clinical study of the effectiveness and safety of a drug can conclude that the drug can be used, only for it to be rejected after peer review finds that the experiment was not properly set up or properly evaluated.

In all Codecademy projects we are encouraged to share our work and request feedback. Take advantage of the community. By the way, the community does not magically run by itself. You are the community. We aspiring Data Scientists need to treat the Project thread like a peer-reviewed journal publication.

I also have a suggestion:

  • Hold live virtual events, an “ML Roast” where some brave consenting souls offer up their completed projects to be constructively reviewed live. You also get an additional platform for showcasing your project. There are already videos on YouTube where popular Data Scientist YouTubers review portfolio projects.

Things to consider before declaring a project or exercise complete:

  • Ensure that you are not mixing up the inputs and outputs of your models and of your validation and evaluation functions, especially if you are reusing variable names. Using separate, descriptive variable names for each stage saves you from having to keep track of which variable currently holds what.
  • Ensure that you are not mixing up your training, validation, and test sets, for instance by evaluating on data the model was trained on.
  • Check that you are not mixing up transformed and untransformed data, or forgetting to apply the necessary transforms or inverse transforms at certain stages. For instance, if you transformed your training set before training, you should apply the same transform, fitted on the training set, to your test set before predicting.
  • Check what features the model considered, and how strongly it considered each, in order to make its predictions. In an excellent video by 3Blue1Brown, it is demonstrated that although a neural network did seem to make correct predictions, it was not learning what we were hoping it would learn and gave confident predictions when fed random nonsense.
  • As noted above, try feeding the model random nonsense. Also compare it against dummy baselines (see sklearn’s DummyClassifier and DummyRegressor).
  • Use the metric appropriate for answering the initial question or business goal. E.g. in screening for a disease you do not want to miss anyone who could be positive, but you may not mind incorrectly labelling a few healthy people as positive, which favors a metric such as recall over plain accuracy.
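
The points about keeping data sets and transforms straight can be sketched as follows. This is only an illustration under assumptions: the data is synthetic, and StandardScaler and LogisticRegression are stand-ins for whatever transform and model your project uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only: 200 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y = (X[:, 0] > 50.0).astype(int)

# Distinct names for each split make it harder to mix them up later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit the scaler on the training set only, then reuse it on the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, no refitting

# Train and evaluate on the scaled versions of the matching splits.
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
test_accuracy = model.score(X_test_scaled, y_test)
print(f"test accuracy: {test_accuracy:.2f}")
```

Giving the scaled arrays their own names (X_train_scaled, X_test_scaled) makes it obvious at each call which version is being used. sklearn's Pipeline can bundle the scaler and the model so that the transform is applied consistently at fit and predict time.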

Things to consider when looking at ML resources:

  • Check whether what is being discussed in the article is also covered in the API docs (e.g. the sklearn docs). Does it match? There are cases where articles do not follow the officially recommended procedure documented there.
  • Compare between articles. There can be differences of opinion, especially given the experimental nature of ML methodology.


Disclaimer: I am not a seasoned ML practitioner. Comment if there are issues with this writeup.