A confusion matrix summarizes a binary classifier’s outcomes into four counts—true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This calculator turns those counts into common evaluation metrics (accuracy, precision, recall, F1, and related measures) so you can understand performance beyond a single headline number.
Tip: These fields are counts (non‑negative integers). If you’re working from a dataset, they should sum to the total number of evaluated examples.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Let N be the total number of samples: N = TP + FP + TN + FN.
Example formula (precision): Precision = TP / (TP + FP).
Accuracy is the fraction of all predictions that are correct. It’s most informative when classes are reasonably balanced and misclassification costs are similar. With heavy class imbalance, accuracy can look “high” even when the model is poor at finding the minority class.
Precision answers: “When the model predicts positive, how often is it right?” Prioritize precision when false positives are costly (e.g., incorrectly flagging legitimate transactions as fraud).
Recall answers: “Of all actual positives, how many did we catch?” Prioritize recall when false negatives are costly (e.g., missing a disease case).
F1 balances precision and recall via the harmonic mean. It drops sharply if either precision or recall is low, so it’s useful when you want a single number and care about both error types for the positive class.
Specificity measures how well the model avoids false alarms among actual negatives. It’s especially relevant in screening settings where the negative class is large and you want to control false positives.
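For concreteness, here is a minimal Python sketch of how these counts translate into the metrics above (the helper name `confusion_metrics` is ours for illustration, not part of the calculator); it returns NaN when a denominator is zero:

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common evaluation metrics from the four confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n if n else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")    # undefined if no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else float("nan")       # a.k.a. sensitivity / TPR
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # a.k.a. TNR
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else float("nan"))             # harmonic mean of precision and recall
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }
```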
Suppose you are evaluating a spam filter. Out of 95 actual spam messages, it correctly flags 90 and misses 5. Out of 200 legitimate emails, it incorrectly flags 15 as spam and correctly leaves 185 alone:
Total N = 90 + 5 + 15 + 185 = 295.
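Plugging these counts into the sketch above (TP = 90, FN = 5, FP = 15, TN = 185) gives, roughly:

```python
metrics = confusion_metrics(tp=90, fp=15, tn=185, fn=5)

# accuracy          = (90 + 185) / 295   ≈ 0.932
# precision         = 90 / (90 + 15)     ≈ 0.857
# recall            = 90 / (90 + 5)      ≈ 0.947
# specificity       = 185 / (185 + 15)   = 0.925
# f1                = 2 * 0.857 * 0.947 / (0.857 + 0.947) ≈ 0.900
# balanced accuracy = (0.947 + 0.925) / 2 ≈ 0.936
```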
Interpretation: the filter catches most spam (high recall), but some “spam” predictions are false alarms (precision lower than recall). Whether that’s acceptable depends on user tolerance for wrongly flagged legitimate emails.
| Metric | Best for | Penalizes | Can be misleading when… |
|---|---|---|---|
| Accuracy | Balanced classes; equal error costs | All errors equally | Classes are imbalanced |
| Precision (PPV) | Reducing false positives | False alarms (FP) | Positive predictions are rare or the threshold changes greatly |
| Recall (Sensitivity) | Reducing false negatives | Misses (FN) | You ignore the cost of false positives |
| F1 | Balancing precision and recall | Imbalance between precision and recall | True negatives matter a lot (F1 ignores TN directly) |
| Specificity (TNR) | Controlling false positives among negatives | False positives (FP) | You mainly care about catching positives (recall) |
| Balanced Accuracy | Imbalanced classes | Low TPR or low TNR | Different error costs require weighting |
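The calculator works from counts, but if you have per-example labels instead, scikit-learn (assumed here, not required by the calculator) produces the same numbers. A minimal sketch with toy labels, assuming the positive class is encoded as 1:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

y_true = [1, 1, 0, 0, 1, 0, 0, 1]   # actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]   # model predictions

# Note: for binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))
print(tn / (tn + fp))  # specificity has no dedicated scorer; compute it from the counts
```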
The positive class is the outcome you’re trying to detect (e.g., “fraud”, “spam”, “disease present”). Metrics like precision, recall, and F1 are defined around this class.
With imbalanced data, a model can predict the majority class most of the time and still achieve high accuracy while performing poorly on the minority (often more important) class.
If the model makes no positive predictions (TP + FP = 0), precision is undefined because the denominator is zero. In practice you may see “N/A” or 0 reported, but the key takeaway is that the model never predicts positive at the chosen threshold.
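A minimal sketch of one way to handle this case, assuming you prefer an explicit NaN over a silent 0:

```python
import math

def safe_precision(tp: int, fp: int) -> float:
    """Precision that reports NaN when the model made no positive predictions."""
    if tp + fp == 0:
        return math.nan  # undefined: no positive predictions at this threshold
    return tp / (tp + fp)
```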
Recall and sensitivity are the same concept in binary classification: TP / (TP + FN). “Sensitivity” is common in medicine; “recall” is common in information retrieval and ML.
No. F1 focuses on the positive class and ignores true negatives directly. If true negatives matter a lot (e.g., you must avoid false alarms across a huge negative population), metrics like specificity, balanced accuracy, or PR/ROC analysis may be more appropriate.