Confusion Matrix Metrics Calculator
A confusion matrix summarizes a binary classifier’s outcomes into four counts—true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This calculator turns those counts into common evaluation metrics (accuracy, precision, recall, F1, and related measures) so you can understand performance beyond a single headline number.
How to use this calculator
- Choose which label you are treating as the positive class (for example, “spam”, “fraud”, or “disease present”).
- Enter TP: positives correctly predicted as positive.
- Enter FP: negatives incorrectly predicted as positive (false alarms).
- Enter TN: negatives correctly predicted as negative.
- Enter FN: positives incorrectly predicted as negative (misses).
- Click Calculate to compute the metrics.
Tip: These fields are counts (non‑negative integers). If you’re working from a dataset, they should sum to the total number of evaluated examples.
Confusion matrix structure
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Formulas
Let N be the total number of samples: N = TP + FP + TN + FN.
Core metrics
- Accuracy: (TP + TN) / N
- Precision (Positive Predictive Value, PPV): TP / (TP + FP)
- Recall (Sensitivity, True Positive Rate, TPR): TP / (TP + FN)
- F1 score: 2·(Precision·Recall) / (Precision + Recall)
Precision written as an equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$
Additional useful metrics (common in practice)
- Specificity (True Negative Rate, TNR): TN / (TN + FP)
- False Positive Rate (FPR): FP / (FP + TN) = 1 − Specificity
- False Negative Rate (FNR): FN / (FN + TP) = 1 − Recall
- Negative Predictive Value (NPV): TN / (TN + FN)
- Balanced accuracy: (Recall + Specificity) / 2
- Prevalence (actual positive rate): (TP + FN) / N
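The formulas above can be sketched in Python. This is a minimal illustration, not the calculator's actual implementation: the names `safe_div` and `confusion_metrics`, and the choice to return `None` for any metric whose denominator is zero, are assumptions made for this example.

```python
from typing import Dict, Optional


def safe_div(num: float, den: float) -> Optional[float]:
    """Return num / den, or None when the denominator is zero (metric undefined)."""
    return num / den if den != 0 else None


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute the metrics listed above from the four confusion-matrix counts."""
    if min(tp, fp, tn, fn) < 0:
        raise ValueError("counts must be non-negative")
    n = tp + fp + tn + fn
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    # F1 is undefined if either input is undefined or if both are zero.
    if precision is None or recall is None:
        f1 = None
    else:
        f1 = safe_div(2 * precision * recall, precision + recall)
    # Balanced accuracy needs both recall and specificity to be defined.
    if recall is None or specificity is None:
        balanced = None
    else:
        balanced = (recall + specificity) / 2
    return {
        "accuracy": safe_div(tp + tn, n),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
        "npv": safe_div(tn, tn + fn),
        "balanced_accuracy": balanced,
        "prevalence": safe_div(tp + fn, n),
    }
```

Calling `confusion_metrics(tp, fp, tn, fn)` returns a dictionary keyed by metric name. Returning `None` rather than 0 keeps "undefined" distinguishable from a genuinely zero score.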
How to interpret the results
Accuracy
Accuracy is the fraction of all predictions that are correct. It’s most informative when classes are reasonably balanced and misclassification costs are similar. With heavy class imbalance, accuracy can look “high” even when the model is poor at finding the minority class.
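A small illustration of the imbalance problem, using made-up counts (1,000 examples, 10 of them positive) and a hypothetical model that predicts negative for every example:

```python
# Hypothetical, made-up counts: 10 actual positives, 990 actual negatives,
# and a model that predicts "negative" every time.
tp, fn = 0, 10
fp, tn = 0, 990

n = tp + fp + tn + fn
accuracy = (tp + tn) / n
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}")  # 0.99
print(f"recall   = {recall:.2f}")    # 0.00
```

The model scores 99% accuracy while finding none of the positives.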
Precision
Precision answers: “When the model predicts positive, how often is it right?” Prioritize precision when false positives are costly (e.g., incorrectly flagging legitimate transactions as fraud).
Recall (Sensitivity)
Recall answers: “Of all actual positives, how many did we catch?” Prioritize recall when false negatives are costly (e.g., missing a disease case).
F1 score
F1 balances precision and recall via the harmonic mean. It drops sharply if either precision or recall is low, so it’s useful when you want a single number and care about both error types for the positive class.
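To see why the harmonic mean drops sharply, compare it with the arithmetic mean for a lopsided pair. The values below are illustrative only, and the helper name `f1_score` and its zero-handling convention are assumptions for this sketch.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Returns 0.0 when both inputs are zero (one common convention;
    strictly speaking the value is undefined in that case).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.9, 0.1  # illustrative values
arithmetic = (precision + recall) / 2
harmonic = f1_score(precision, recall)

print(f"arithmetic mean = {arithmetic:.2f}")  # 0.50
print(f"F1 (harmonic)   = {harmonic:.2f}")    # 0.18
```

The arithmetic mean hides the weak recall; F1 stays close to the lower of the two values.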
Specificity
Specificity measures how well the model avoids false alarms among actual negatives. It’s especially relevant in screening settings where the negative class is large and you want to control false positives.
Worked example
Suppose you are evaluating a spam filter. Out of 95 actual spam messages, it correctly flags 90 and misses 5. Out of 200 legitimate emails, it incorrectly flags 15 as spam and correctly leaves 185 alone:
- TP = 90
- FN = 5
- FP = 15
- TN = 185
Total N = 90 + 5 + 15 + 185 = 295.
- Accuracy = (90 + 185) / 295 ≈ 0.9322
- Precision = 90 / (90 + 15) ≈ 0.8571
- Recall = 90 / (90 + 5) ≈ 0.9474
- F1 ≈ 2·(0.8571·0.9474)/(0.8571+0.9474) ≈ 0.9000
Interpretation: the filter catches most spam (high recall), but some “spam” predictions are false alarms (precision lower than recall). Whether that’s acceptable depends on user tolerance for wrongly flagged legitimate emails.
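As a cross-check on the rounded arithmetic, F1 can also be computed directly from the counts. Substituting Precision = TP / (TP + FP) and Recall = TP / (TP + FN) into the F1 formula and simplifying gives an equivalent form that avoids intermediate rounding:

```latex
\begin{aligned}
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
     = \frac{2\,TP}{2\,TP + FP + FN} \\
    &= \frac{2 \cdot 90}{2 \cdot 90 + 15 + 5}
     = \frac{180}{200} = 0.9
\end{aligned}
```

For this example, F1 is exactly 0.9.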
Metric comparison (what each one emphasizes)
| Metric | Best for | Penalizes | Can be misleading when… |
|---|---|---|---|
| Accuracy | Balanced classes; equal error costs | All errors equally | Classes are imbalanced |
| Precision (PPV) | Reducing false positives | False alarms (FP) | Few positive predictions are made (small denominator), or the decision threshold changes substantially |
| Recall (Sensitivity) | Reducing false negatives | Misses (FN) | You ignore the cost of false positives |
| F1 | Balancing precision and recall | Imbalance between precision and recall | True negatives matter a lot (F1 ignores TN directly) |
| Specificity (TNR) | Controlling false positives among negatives | False positives (FP) | You mainly care about catching positives (recall) |
| Balanced Accuracy | Imbalanced classes | Low TPR or low TNR | Different error costs require weighting |
Assumptions and limitations
- Binary classification: These definitions assume two classes. Multi-class problems typically compute per-class metrics (one-vs-rest) and then average (macro/micro/weighted).
- Positive class matters: All “precision/recall/F1” values are relative to whichever label you treat as positive. Swapping which class is positive changes these metrics.
- Zero-denominator cases: If a denominator is zero (e.g., TP+FP = 0 meaning there are no predicted positives), the corresponding metric is undefined. Many tools report it as 0 or “N/A”; interpret carefully and consider adjusting thresholds or collecting more data.
- Threshold-dependent: For probabilistic models, TP/FP/TN/FN depend on the decision threshold. Comparing models fairly often requires looking across thresholds (ROC/PR curves), not only one point.
- Base-rate effects: Precision/NPV depend strongly on prevalence. A model can have good sensitivity/specificity but still yield low precision when prevalence is very low.
- Context-specific costs: Metrics don’t encode your real-world costs. If false negatives and false positives have very different consequences, consider cost-sensitive evaluation.
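The base-rate effect can be made concrete with a short sketch. It assumes sensitivity and specificity stay fixed while prevalence changes, and derives expected precision from TP ∝ sensitivity × prevalence and FP ∝ (1 − specificity) × (1 − prevalence). The function name `precision_from_rates` and the example rates are hypothetical.

```python
def precision_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Expected precision (PPV) given sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence                # expected share of true positives
    false_pos = (1 - specificity) * (1 - prevalence)   # expected share of false positives
    if true_pos + false_pos == 0:
        raise ValueError("no predicted positives; precision is undefined")
    return true_pos / (true_pos + false_pos)


# Illustrative rates only: the same classifier applied at three prevalence levels.
for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_from_rates(sensitivity=0.95, specificity=0.95, prevalence=prevalence)
    print(f"prevalence = {prevalence:.2f}  precision = {ppv:.3f}")
```

With these illustrative rates, precision falls from 0.95 at 50% prevalence to roughly 0.16 at 1% prevalence, even though sensitivity and specificity never change.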
FAQ
What is the “positive” class?
The positive class is the outcome you’re trying to detect (e.g., “fraud”, “spam”, “disease present”). Metrics like precision, recall, and F1 are defined around this class.
Why can accuracy be misleading?
With imbalanced data, a model can predict the majority class most of the time and still achieve high accuracy while performing poorly on the minority (often more important) class.
What if there are no predicted positives (TP + FP = 0)?
Precision is undefined because you are dividing by zero. In practice you may see “N/A” or 0, but the key takeaway is that the model never predicts positive at the chosen threshold.
What’s the difference between recall and sensitivity?
They are the same concept in binary classification: TP / (TP + FN). “Sensitivity” is common in medicine; “recall” is common in information retrieval and ML.
Is F1 always better than accuracy?
No. F1 focuses on the positive class and ignores true negatives directly. If true negatives matter a lot (e.g., you must avoid false alarms across a huge negative population), metrics like specificity, balanced accuracy, or PR/ROC analysis may be more appropriate.
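A quick way to see that F1 ignores true negatives is to hold TP, FP, and FN fixed and vary only TN. The sketch below uses made-up counts and a hypothetical helper, `f1_and_accuracy`. It computes F1 as 2·TP / (2·TP + FP + FN), which is algebraically equivalent to the precision/recall form.

```python
from typing import Tuple


def f1_and_accuracy(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (F1, accuracy). Assumes tp > 0, so no denominator is zero."""
    f1 = 2 * tp / (2 * tp + fp + fn)  # TN does not appear anywhere in F1
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return f1, accuracy


# Same TP, FP, FN; only TN differs (made-up counts).
small_tn = f1_and_accuracy(tp=50, fp=25, tn=25, fn=25)
large_tn = f1_and_accuracy(tp=50, fp=25, tn=9900, fn=25)

print(f"TN = 25:   F1 = {small_tn[0]:.3f}, accuracy = {small_tn[1]:.3f}")
print(f"TN = 9900: F1 = {large_tn[0]:.3f}, accuracy = {large_tn[1]:.3f}")
```

F1 is identical in both cases, while accuracy moves from 0.600 to 0.995. Neither number alone tells the whole story.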
