# FAQ: Accuracy, Recall, Precision, and F1 Score - F1 Score

This community-built FAQ covers the “F1 Score” exercise from the lesson “Accuracy, Recall, Precision, and F1 Score”.

Paths and Courses
This exercise can be found in the following Codecademy content:

## FAQs on the exercise F1 Score

There are currently no frequently asked questions associated with this exercise – that’s where you come in! You can contribute to this section by offering your own questions, answers, or clarifications on this exercise. Ask or answer a question by clicking reply below.

If you’ve had an “aha” moment about the concepts, formatting, syntax, or anything else with this exercise, consider sharing those insights! Teaching others and answering their questions is one of the best ways to learn and stay sharp.

## Join the Discussion. Help a fellow learner on their journey.

Agree with a comment or answer? Like it to up-vote the contribution!

Found a bug? Report it!

Have a question about your account or billing? Reach out to our customer support team!

None of the above? Find out where to ask other questions here!

What should the accuracy, recall, precision, and F1 score be to say that our model is good? TIA

I think you’ve already found the answer yourself, but for others who have the same question, here’s a quote from the description of the next exercise:

The decision to use precision, recall, or F1 score ultimately comes down to the context of your classification. Maybe you don’t care if your classifier has a lot of false positives. If that’s the case, precision doesn’t matter as much.

As long as you have an understanding of what question you’re trying to answer, you should be able to determine which statistic is most relevant to you.

For example, let’s say we’re talking about tests to find out if a person has a particular disease. Then each statistic means:

• Accuracy: the percentage of all people whose disease status (positive or negative) the test determines correctly.
• Recall: the percentage of people who actually have the disease that the test correctly identifies as positive.
• Precision: the percentage of people the test identifies as positive who actually have the disease.
• F1 score: the harmonic mean of recall and precision, combining both into a single measure.
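All four statistics can be computed directly from the confusion-matrix counts. Here is a minimal sketch in Python; the counts are made-up numbers for illustration only:

```python
# Hypothetical confusion-matrix counts for a disease test
tp, fp, fn, tn = 40, 10, 5, 45  # true positives, false positives, false negatives, true negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all people classified correctly
recall = tp / (tp + fn)                      # fraction of sick people the test catches
precision = tp / (tp + fp)                   # fraction of positive results that are correct
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f} f1={f1:.3f}")
```

Note that F1 simplifies to `2 * tp / (2 * tp + fp + fn)`, which makes it clear that true negatives never enter the score.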

We learned in this lesson that accuracy alone is not enough: even a test with high accuracy is not necessarily good. If the probability of having the disease is very low, even a test that always returns negative will have high accuracy.
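This "accuracy paradox" is easy to verify numerically. A small sketch, assuming a population of 1,000 people with 1% prevalence and a test that labels everyone negative:

```python
# Assumed population: 1,000 people, 1% of whom actually have the disease
n = 1000
n_sick = 10

# A test that never says "positive"
tp, fp = 0, 0
fn, tn = n_sick, n - n_sick

accuracy = (tp + tn) / n   # 0.99 -- looks excellent
recall = tp / (tp + fn)    # 0.0  -- misses every sick person

print(accuracy, recall)
```

Accuracy comes out at 99% even though the test is useless for its actual purpose, which is exactly why recall matters here.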

We want all four statistics to be high, but there is a trade-off between recall and precision, so which one to emphasize depends on what we are focusing on. For example, if preventing false positives is important to us, we should focus on increasing precision. Alternatively, if we want to find as many people who have the disease as possible, overlooking as few as we can, we should focus on recall.


Hi! Could you explain how the harmonic mean is a better way of averaging? And should we always strive for a lower F1 score? Also, why are we not using accuracy as part of the F1 score, since it seems impossible to paint the whole picture without any of these measurements?

If we want to calculate the average of multiple ratios, we need to pay attention to what the numerator and denominator of each ratio are. In particular, when the ratios share the same denominator, the arithmetic mean is suitable, and when they share the same numerator, the harmonic mean is suitable. For example:

• What is the average speed if a car runs at 20 km/h for a certain amount of time and then at 60 km/h for the same amount of time? - The answer is 40 km/h, which is the arithmetic mean of 20 km/h and 60 km/h.

• What is the average speed if a car runs a certain distance at 20 km/h and then the same distance at 60 km/h? - The answer is 30 km/h, which is the harmonic mean of 20 km/h and 60 km/h.
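The two speed examples above can be checked with a few lines of Python:

```python
a, b = 20, 60  # speeds in km/h

# Equal *time* at each speed -> arithmetic mean
arithmetic = (a + b) / 2

# Equal *distance* at each speed -> harmonic mean
harmonic = 2 / (1 / a + 1 / b)

print(arithmetic, harmonic)
```

The arithmetic mean gives 40 km/h and the harmonic mean gives 30 km/h, matching the two scenarios.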

Since recall = TP / (TP + FN) and precision = TP / (TP + FP) share the same numerator (the number of true positives), the harmonic mean is the suitable average.


I think the explanation in the exercise gives one example where the arithmetic mean disagrees with our intuition. Suppose there is a disease test that classifies everyone as positive. If only 1% of the population actually has the disease, then recall would be 1 and precision would be 0.01. We would intuitively judge this test much worse than the arithmetic mean of 0.505 suggests, and would feel that the F1 score of about 0.0198 is the better evaluation.
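Those numbers are straightforward to verify. A quick sketch using the quoted recall and precision for the always-positive test:

```python
# An always-positive test on a population with 1% disease prevalence
recall, precision = 1.0, 0.01

# Arithmetic mean looks deceptively decent
arithmetic_mean = (recall + precision) / 2

# Harmonic mean (the F1 score) reflects how bad the test really is
f1 = 2 * recall * precision / (recall + precision)

print(arithmetic_mean, f1)
```

The arithmetic mean is 0.505 while the F1 score is roughly 0.0198, which is why the harmonic mean matches our intuition here.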

This explanation, however, assumes readers share that intuition, and may not persuade a reader who does not.


Oh wow, thanks for the detailed answer! Yeah, I read the 0.505 score as pretty bad too, and I think that’s why I was confused.
