A confusion matrix summarizes a binary classifier’s outcomes into four counts—true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This calculator turns those counts into common evaluation metrics (accuracy, precision, recall, F1, and related measures) so you can understand performance beyond a single headline number.
Tip: These fields are counts (non‑negative integers). If you’re working from a dataset, they should sum to the total number of evaluated examples.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
Let N be the total number of samples: N = TP + FP + TN + FN.
Example formula (precision): Precision = TP / (TP + FP).
Accuracy is the fraction of all predictions that are correct. It’s most informative when classes are reasonably balanced and misclassification costs are similar. With heavy class imbalance, accuracy can look “high” even when the model is poor at finding the minority class.
Precision answers: “When the model predicts positive, how often is it right?” Prioritize precision when false positives are costly (e.g., incorrectly flagging legitimate transactions as fraud).
Recall answers: “Of all actual positives, how many did we catch?” Prioritize recall when false negatives are costly (e.g., missing a disease case).
F1 balances precision and recall via the harmonic mean. It drops sharply if either precision or recall is low, so it’s useful when you want a single number and care about both error types for the positive class.
Specificity measures how well the model avoids false alarms among actual negatives. It’s especially relevant in screening settings where the negative class is large and you want to control false positives.
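For concreteness, here is a minimal Python sketch of how these counts translate into the metrics above (the helper name `confusion_metrics` is ours for illustration, not part of the calculator); it returns NaN when a denominator is zero:

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common evaluation metrics from the four confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n if n else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")    # undefined if no positive predictions
    recall = tp / (tp + fn) if (tp + fn) else float("nan")       # a.k.a. sensitivity / TPR
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")  # a.k.a. TNR
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else float("nan"))             # harmonic mean of precision and recall
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }
```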
Suppose you are evaluating a spam filter. Out of 95 actual spam messages, it correctly flags 90 and misses 5. Out of 200 legitimate emails, it incorrectly flags 15 as spam and correctly leaves 185 alone:
Total N = 90 + 5 + 15 + 185 = 295.
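Plugging these counts into the sketch above (TP = 90, FN = 5, FP = 15, TN = 185) gives, roughly:

```python
metrics = confusion_metrics(tp=90, fp=15, tn=185, fn=5)

# accuracy          = (90 + 185) / 295   ≈ 0.932
# precision         = 90 / (90 + 15)     ≈ 0.857
# recall            = 90 / (90 + 5)      ≈ 0.947
# specificity       = 185 / (185 + 15)   = 0.925
# f1                = 2 * 0.857 * 0.947 / (0.857 + 0.947) ≈ 0.900
# balanced accuracy = (0.947 + 0.925) / 2 ≈ 0.936
```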
Interpretation: the filter catches most spam (high recall), but some “spam” predictions are false alarms (precision lower than recall). Whether that’s acceptable depends on user tolerance for wrongly flagged legitimate emails.
| Metric | Best for | Penalizes | Can be misleading when… |
|---|---|---|---|
| Accuracy | Balanced classes; equal error costs | All errors equally | Classes are imbalanced |
| Precision (PPV) | Reducing false positives | False alarms (FP) | Positive predictions are rare or the threshold changes greatly |
| Recall (Sensitivity) | Reducing false negatives | Misses (FN) | You ignore the cost of false positives |
| F1 | Balancing precision and recall | Imbalance between precision and recall | True negatives matter a lot (F1 ignores TN directly) |
| Specificity (TNR) | Controlling false positives among negatives | False positives (FP) | You mainly care about catching positives (recall) |
| Balanced Accuracy | Imbalanced classes | Low TPR or low TNR | Different error costs require weighting |
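The calculator works from counts, but if you have per-example labels instead, scikit-learn (assumed here, not required by the calculator) produces the same numbers. A minimal sketch with toy labels, assuming the positive class is encoded as 1:

```python
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             confusion_matrix, f1_score, precision_score,
                             recall_score)

y_true = [1, 1, 0, 0, 1, 0, 0, 1]   # actual labels (1 = positive class)
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]   # model predictions

# Note: for binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))
print(balanced_accuracy_score(y_true, y_pred))
print(tn / (tn + fp))  # specificity has no dedicated scorer; compute it from the counts
```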
The positive class is the outcome you’re trying to detect (e.g., “fraud”, “spam”, “disease present”). Metrics like precision, recall, and F1 are defined around this class.
With imbalanced data, a model can predict the majority class most of the time and still achieve high accuracy while performing poorly on the minority (often more important) class.
If the model makes no positive predictions (TP + FP = 0), precision is undefined because the denominator is zero. In practice you may see “N/A” or 0 reported, but the key takeaway is that the model never predicts positive at the chosen threshold.
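A minimal sketch of one way to handle this case, assuming you prefer an explicit NaN over a silent 0:

```python
import math

def safe_precision(tp: int, fp: int) -> float:
    """Precision that reports NaN when the model made no positive predictions."""
    if tp + fp == 0:
        return math.nan  # undefined: no positive predictions at this threshold
    return tp / (tp + fp)
```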
Recall and sensitivity are the same concept in binary classification: TP / (TP + FN). “Sensitivity” is common in medicine; “recall” is common in information retrieval and ML.
No. F1 focuses on the positive class and ignores true negatives directly. If true negatives matter a lot (e.g., you must avoid false alarms across a huge negative population), metrics like specificity, balanced accuracy, or PR/ROC analysis may be more appropriate.