Kullback–Leibler (KL) divergence measures how one probability distribution diverges from a second, reference distribution. Given discrete distributions $P$ and $Q$ over the same set of outcomes, the KL divergence of $P$ from $Q$, written $D_{\mathrm{KL}}(P \parallel Q)$, is defined as

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x)\,\log\frac{P(x)}{Q(x)}.$$
Intuitively, the formula quantifies the extra information required to encode events sampled from $P$ if we use a code optimized for $Q$. When $P$ and $Q$ are identical, the divergence is zero. As the distributions diverge, the value grows, reflecting the inefficiency of using $Q$ to approximate $P$.
KL divergence appears throughout machine learning and information theory. In variational inference, models minimize KL divergence to find a simpler approximate distribution that mimics a complicated posterior. In reinforcement learning, policies are often constrained by a maximum KL divergence from prior policies to ensure stable updates. The concept also helps compare language models, evaluate generative networks, and track training progress in classification problems.
Because KL divergence is asymmetric, $D_{\mathrm{KL}}(P \parallel Q)$ generally differs from $D_{\mathrm{KL}}(Q \parallel P)$. This asymmetry underscores its interpretation as a measure of relative entropy: the expected extra message length when $P$ is encoded with $Q$'s code.
Enter probabilities for $P$ and $Q$ separated by commas. Each list should contain the same number of values and sum to 1; the script normalizes them if necessary. When you press the compute button, it iterates through the arrays, sums the terms $P(x)\log\frac{P(x)}{Q(x)}$, and displays the result. Probabilities equal to zero in $P$ contribute nothing, because the limit of $P(x)\log\frac{P(x)}{Q(x)}$ approaches zero as $P(x)$ vanishes.
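For readers who want to reproduce the computation outside the calculator, here is a minimal Python sketch of the same steps; the function name `kl_divergence` and the choice of natural logarithms are assumptions for illustration, not the calculator's actual script.

```python
import math

def kl_divergence(p, q):
    """Return D_KL(P || Q) in nats for two equal-length lists of weights.

    Illustrative sketch: both lists are normalized to sum to 1, mirroring
    the calculator's behaviour, and terms with P(x) = 0 are skipped because
    they contribute nothing in the limit.
    """
    if len(p) != len(q):
        raise ValueError("P and Q must have the same number of entries")
    p_total, q_total = sum(p), sum(q)
    p = [x / p_total for x in p]
    q = [x / q_total for x in q]

    divergence = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue            # 0 * log(0 / q) -> 0
        if qx == 0.0:
            return math.inf     # P puts mass where Q assigns none
        divergence += px * math.log(px / qx)
    return divergence

# Identical distributions give zero divergence.
print(kl_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))  # 0.0
```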
Suppose $P$ and $Q$ assign probabilities to the same handful of outcomes but distribute the mass slightly differently. Plugging the two lists into the formula, term by term, yields a single non-negative number. Small values indicate the distributions are close, while large values highlight stark differences.
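As a concrete illustration (the values here are chosen for the example, not taken from the calculator), let $P = (0.5, 0.3, 0.2)$ and $Q = (0.4, 0.4, 0.2)$. Using natural logarithms,

$$D_{\mathrm{KL}}(P \parallel Q) = 0.5\ln\tfrac{0.5}{0.4} + 0.3\ln\tfrac{0.3}{0.4} + 0.2\ln\tfrac{0.2}{0.2} \approx 0.1116 - 0.0863 + 0 \approx 0.025 \text{ nats}.$$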
By experimenting with the inputs, you can see how skewing probability mass increases the divergence. Extreme mismatches quickly produce large values. This sensitivity to improbable events is a hallmark of KL divergence and influences its use in robust statistics. In practice, KL divergence informs algorithms ranging from expectation-maximization to reinforcement learning policy updates, showcasing its broad relevance.
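A quick numerical experiment makes this sensitivity concrete. In the short Python snippet below (the two-outcome distributions are made up for the demonstration), $P$ spreads its mass evenly while $Q$ assigns less and less probability to the second outcome; the divergence grows without bound as that probability shrinks.

```python
import math

# P spreads mass evenly over two outcomes; Q starves the second outcome.
p = [0.5, 0.5]
for eps in (0.5, 0.1, 0.01, 0.001):
    q = [1.0 - eps, eps]
    d = sum(px * math.log(px / qx) for px, qx in zip(p, q))
    print(f"Q = {q}: D_KL(P || Q) = {d:.3f} nats")
```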
Several important properties make KL divergence central to information theory. The value is always non‑negative thanks to Gibbs’ inequality and becomes zero only when the two distributions are identical. Because the logarithm appears inside the summation, very small probabilities can have an outsized impact; assigning positive mass to an event that never occurs under the reference distribution incurs an infinite cost. KL divergence is also asymmetric and fails to satisfy the triangle inequality, so it is not a true metric despite behaving like a distance in some contexts.
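The non-negativity claim (Gibbs' inequality) follows from the elementary bound $\ln t \le t - 1$, applied over the support of $P$ (stated here with natural logarithms; other bases only rescale the value):

$$-D_{\mathrm{KL}}(P \parallel Q) = \sum_{x} P(x)\ln\frac{Q(x)}{P(x)} \le \sum_{x} P(x)\left(\frac{Q(x)}{P(x)} - 1\right) = \sum_{x} Q(x) - \sum_{x} P(x) \le 0,$$

so $D_{\mathrm{KL}}(P \parallel Q) \ge 0$, with equality exactly when $P = Q$.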
One intuitive interpretation comes from coding theory. If you encode symbols produced by $P$ using a code optimized for $Q$, each symbol on average requires $D_{\mathrm{KL}}(P \parallel Q)$ extra nats or bits, depending on the logarithm base. The total expected message length equals the entropy of $P$ plus this divergence, linking the concept closely to cross‑entropy.
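In symbols, with $H(P) = -\sum_x P(x)\log P(x)$ denoting the entropy of $P$, the expected code length satisfies

$$H(P, Q) = -\sum_{x} P(x)\log Q(x) = H(P) + D_{\mathrm{KL}}(P \parallel Q),$$

so minimizing cross-entropy over $Q$ for a fixed $P$ is the same as minimizing the KL divergence.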
Our calculator also reports the cross‑entropy $H(P, Q)$, which equals $H(P) + D_{\mathrm{KL}}(P \parallel Q)$. Cross‑entropy measures the expected number of code units needed when events drawn from $P$ are encoded with probabilities from $Q$. To offer a symmetric alternative, the script computes the Jensen–Shannon (JS) divergence, which averages the KL divergence in both directions after forming the midpoint distribution $M = \tfrac{1}{2}(P + Q)$:

$$\mathrm{JS}(P, Q) = \tfrac{1}{2}\,D_{\mathrm{KL}}(P \parallel M) + \tfrac{1}{2}\,D_{\mathrm{KL}}(Q \parallel M).$$
Consider two distributions with disjoint support: both directions of the KL divergence are infinite, yet the JS divergence stays finite and never exceeds $\ln 2$ nats (one bit).
Zero probabilities cause complications, so the calculator normalizes inputs and ensures all entries are non‑negative. If any location has $Q(x) = 0$ while $P(x) > 0$, that term, and therefore the KL divergence as a whole, is infinite.
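Below is a short Python sketch of the symmetric quantity; the helper names `kl` and `js_divergence` are illustrative rather than the calculator's actual code. It shows the JS divergence staying finite, bounded by $\ln 2$, even when the KL divergence is infinite.

```python
import math

def kl(p, q):
    # D_KL(p || q) in nats; inf if q assigns zero mass where p does not.
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def js_divergence(p, q):
    # Jensen-Shannon divergence: average KL to the midpoint distribution M.
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]
print(kl(p, q))             # inf: P puts mass where Q has none
print(js_divergence(p, q))  # ln(2) ~= 0.693, finite and bounded
```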
The drop‑down menu chooses between natural and base‑2 logarithms. The latter expresses results in bits, which is handy for information‑theoretic interpretations. After computation, a Copy Result button appears so you can paste the output into reports or notebooks.
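The two bases differ only by a constant factor, so a result computed in nats can be converted to bits afterwards:

$$D_{\text{bits}} = \frac{D_{\text{nats}}}{\ln 2} \approx 1.4427\, D_{\text{nats}}.$$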
KL divergence underlies numerous algorithms. Minimizing $D_{\mathrm{KL}}(q \parallel p)$ between a tractable approximation $q$ and an intractable posterior $p$ is the objective at the heart of variational inference; minimizing cross‑entropy when training a classifier is equivalent to minimizing the divergence between the empirical label distribution and the model's predictions; and capping the divergence between successive policies keeps reinforcement‑learning updates stable.
Compute the divergence and curl of a vector field at a point.
Compute the flux of a vector field through a rectangular box and compare it with the triple integral of its divergence to illustrate Gauss's divergence theorem.
Compute Shannon entropy from probability values to quantify uncertainty in data.