Confusion Matrix Metrics Calculator

JJ Ben-Joseph

A confusion matrix summarizes a binary classifier’s outcomes into four counts—true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). This calculator turns those counts into common evaluation metrics (accuracy, precision, recall, F1, and related measures) so you can understand performance beyond a single headline number.

How to use this calculator

  1. Choose which label you are treating as the positive class (for example, “spam”, “fraud”, or “disease present”).
  2. Enter TP: positives correctly predicted as positive.
  3. Enter FP: negatives incorrectly predicted as positive (false alarms).
  4. Enter TN: negatives correctly predicted as negative.
  5. Enter FN: positives incorrectly predicted as negative (misses).
  6. Click Calculate to compute the metrics.

Tip: These fields are counts (non‑negative integers). If you’re working from a dataset, they should sum to the total number of evaluated examples.
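If you are pulling these counts from code rather than typing them in, a minimal Python sketch of that check might look like the following (the function name and variables are illustrative, not part of this calculator):

    # Minimal sketch: validate confusion-matrix counts before computing metrics.
    def validate_counts(tp: int, fp: int, tn: int, fn: int) -> int:
        """Check that all counts are non-negative integers and return the total N."""
        for name, value in {"TP": tp, "FP": fp, "TN": tn, "FN": fn}.items():
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative integer, got {value!r}")
        return tp + fp + tn + fn  # N: total number of evaluated examples

    print(validate_counts(tp=90, fp=15, tn=185, fn=5))  # 295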

Confusion matrix structure

                    Predicted Positive    Predicted Negative
Actual Positive     TP                    FN
Actual Negative     FP                    TN
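
If you need to derive the four cells from raw labels, one way to tally them in Python, following the rows-are-actual / columns-are-predicted layout above, is sketched below (the variable names and the "spam" positive label are placeholders):

    # Tally TP, FP, TN, FN from parallel lists of actual and predicted labels.
    def confusion_counts(y_true, y_pred, positive="spam"):
        tp = fp = tn = fn = 0
        for actual, predicted in zip(y_true, y_pred):
            if actual == positive and predicted == positive:
                tp += 1   # actual positive, predicted positive
            elif actual != positive and predicted == positive:
                fp += 1   # actual negative, predicted positive (false alarm)
            elif actual != positive and predicted != positive:
                tn += 1   # actual negative, predicted negative
            else:
                fn += 1   # actual positive, predicted negative (miss)
        return tp, fp, tn, fn

    print(confusion_counts(["spam", "ham", "spam"], ["spam", "spam", "ham"]))  # (1, 1, 0, 1)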

Formulas

Let N be the total number of samples: N = TP + FP + TN + FN.

Core metrics

Accuracy = (TP + TN) / N
Precision = TP / (TP + FP)
Recall (Sensitivity) = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)

Additional useful metrics (common in practice)

Specificity (TNR) = TN / (TN + FP)
Balanced Accuracy = (Recall + Specificity) / 2
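
As a rough sketch of how these formulas translate to code, the Python function below computes them from the four counts, returning None for the undefined 0/0 cases discussed in the FAQ (the function name and the None convention are my own choices, not the calculator's):

    def metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
        """Compute the metrics listed above from confusion-matrix counts."""
        n = tp + fp + tn + fn

        def ratio(num, den):
            return num / den if den else None  # None marks an undefined metric

        accuracy = ratio(tp + tn, n)
        precision = ratio(tp, tp + fp)
        recall = ratio(tp, tp + fn)           # sensitivity / TPR
        specificity = ratio(tn, tn + fp)      # TNR
        f1 = (ratio(2 * precision * recall, precision + recall)
              if precision is not None and recall is not None else None)
        balanced_accuracy = ((recall + specificity) / 2
                             if recall is not None and specificity is not None else None)

        return {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "specificity": specificity,
            "balanced_accuracy": balanced_accuracy,
        }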

How to interpret the results

Accuracy

Accuracy is the fraction of all predictions that are correct. It’s most informative when classes are reasonably balanced and misclassification costs are similar. With heavy class imbalance, accuracy can look high even when the model is poor at finding the minority class: a model that always predicts “negative” on a dataset with 990 negatives and 10 positives scores 99% accuracy while catching zero positives.

Precision

Precision answers: “When the model predicts positive, how often is it right?” Prioritize precision when false positives are costly (e.g., incorrectly flagging legitimate transactions as fraud).

Recall (Sensitivity)

Recall answers: “Of all actual positives, how many did we catch?” Prioritize recall when false negatives are costly (e.g., missing a disease case).

F1 score

F1 balances precision and recall via the harmonic mean. It drops sharply if either precision or recall is low, so it’s useful when you want a single number and care about both error types for the positive class.

Specificity

Specificity measures how well the model avoids false alarms among actual negatives. It’s especially relevant in screening settings where the negative class is large and you want to control false positives.
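
As a rough illustration of why this matters (the numbers here are invented for the example): with a very large negative class, even high specificity can still translate into many false alarms in absolute terms.

    # Illustration only: a large negative class turns a small false-positive
    # rate into many false alarms (numbers are invented for the example).
    negatives = 1_000_000            # hypothetical number of actual negatives
    specificity = 0.99               # TN / (TN + FP)
    false_positives = round(negatives * (1 - specificity))
    print(false_positives)           # 10000 false alarms at 99% specificity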

Worked example

Suppose you are evaluating a spam filter. Out of 95 actual spam messages, it correctly flags 90 and misses 5. Out of 200 legitimate emails, it incorrectly flags 15 as spam and correctly leaves 185 alone:

That is, TP = 90, FN = 5, FP = 15, and TN = 185, so N = 90 + 5 + 15 + 185 = 295. Plugging these counts into the formulas above gives accuracy ≈ 0.932, precision ≈ 0.857, recall ≈ 0.947, F1 = 0.900, and specificity = 0.925.

Interpretation: the filter catches most spam (high recall), but some “spam” predictions are false alarms (precision lower than recall). Whether that’s acceptable depends on user tolerance for wrongly flagged legitimate emails.
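
A few lines of Python reproduce the numbers quoted above from the example’s counts:

    tp, fn, fp, tn = 90, 5, 15, 185   # spam-filter counts from the example

    precision = tp / (tp + fp)        # 90 / 105  ≈ 0.857
    recall = tp / (tp + fn)           # 90 / 95   ≈ 0.947
    f1 = 2 * precision * recall / (precision + recall)   # = 0.900
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # 275 / 295 ≈ 0.932
    specificity = tn / (tn + fp)                         # 185 / 200 = 0.925

    print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")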

Metric comparison (what each one emphasizes)

Metric               | Best for                                     | Penalizes                               | Can be misleading when…
Accuracy             | Balanced classes; equal error costs          | All errors equally                      | Classes are imbalanced
Precision (PPV)      | Reducing false positives                     | False alarms (FP)                       | Positive predictions are rare or the threshold changes greatly
Recall (Sensitivity) | Reducing false negatives                     | Misses (FN)                             | You ignore the cost of false positives
F1                   | Balancing precision and recall               | Imbalance between precision and recall  | True negatives matter a lot (F1 ignores TN directly)
Specificity (TNR)    | Controlling false positives among negatives  | False positives (FP)                    | You mainly care about catching positives (recall)
Balanced Accuracy    | Imbalanced classes                           | Low TPR or low TNR                      | Different error costs require weighting

Assumptions and limitations

This calculator assumes a binary classification problem with a single chosen positive class, evaluated at one fixed decision threshold. The four counts describe that one operating point only, so threshold-dependent views such as PR or ROC curves (see the FAQ) are outside its scope, and it does not weight different misclassification costs.

FAQ

What is the “positive” class?

The positive class is the outcome you’re trying to detect (e.g., “fraud”, “spam”, “disease present”). Metrics like precision, recall, and F1 are defined around this class.

Why can accuracy be misleading?

With imbalanced data, a model can predict the majority class most of the time and still achieve high accuracy while performing poorly on the minority (often more important) class.

What if there are no predicted positives (TP + FP = 0)?

Precision is undefined because you are dividing by zero. In practice you may see “N/A” or 0, but the key takeaway is that the model never predicts positive at the chosen threshold.
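
In code, one common (but not universal) convention is to return a sentinel rather than divide by zero; a minimal sketch:

    def precision_or_none(tp: int, fp: int):
        """Return precision, or None when there are no predicted positives."""
        predicted_positives = tp + fp
        if predicted_positives == 0:
            return None   # undefined: the model never predicted the positive class
        return tp / predicted_positives

    print(precision_or_none(0, 0))    # None
    print(precision_or_none(90, 15))  # 0.857...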

What’s the difference between recall and sensitivity?

They are the same concept in binary classification: TP / (TP + FN). “Sensitivity” is common in medicine; “recall” is common in information retrieval and ML.

Is F1 always better than accuracy?

No. F1 focuses on the positive class and ignores true negatives directly. If true negatives matter a lot (e.g., you must avoid false alarms across a huge negative population), metrics like specificity, balanced accuracy, or PR/ROC analysis may be more appropriate.

