(Preview) Cross Entropy From First Principles


Intro

Cross entropy and Kullback-Leibler (KL) divergence are central concepts in machine learning for measuring and comparing probability distributions. They appear in introductory ML courses, research papers, AI textbooks, ML frameworks, and interview questions (e.g., at OpenAI, Anthropic, Google, Meta). However, they're often introduced without the principles behind them, and sometimes with hand-wavy explanations. In this article, we're going to derive cross entropy and KL divergence from first principles so that you can understand them more deeply. Along the way, we'll also cover a very brief introduction to language models so that you can start to build an intuitive sense of how large language models like ChatGPT, Claude, Gemini, and Llama work (the full explanation of LLMs will come in a future post).


Tips for Reading This Article

  • You might want to break this article up into multiple reading sessions, but the hope is that by the end of it, you will have a strong intuitive sense for one of the most important concepts in machine learning.
  • There are many interactive widgets in this article. Play with them to build intuition for the concepts!

Language Models & Probability Distributions 101

Suppose you see the phrase below:

[Figure: the phrase "an apple a day keeps the doctor awa__"]

What letter in the English alphabet would complete that phrase? English speakers know it's the letter y – "an apple a day keeps the doctor away".

You believe with high probability that the next letter is y because you have a language model in your head that has read many, many English sentences throughout your lifetime, giving you an intuitive sense for which letter or word should come next.

Given some input text, a language model does not directly spit out the next letter. Instead, it produces a probability distribution over the letters of the English alphabet. What is a probability distribution, you ask? A probability distribution tells us how likely each of the 26 characters in the English alphabet is to be the next character, given the input. For example, given the input "an apple a day keeps the doctor awa", how likely is the next letter to be e? How about y?
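To make "a probability distribution over the alphabet" concrete, here is a minimal Python sketch. It is not a real language model; the per-letter scores are made up purely for illustration, and a softmax turns them into probabilities that sum to 1:

```python
import math
import string

# Hypothetical scores (logits) a tiny character-level model might assign to each
# letter after reading "an apple a day keeps the doctor awa". Made-up numbers.
logits = {letter: 0.0 for letter in string.ascii_lowercase}
logits["e"] = 2.0   # the model leans toward "e"
logits["a"] = 1.0
logits["y"] = 0.5   # "y" gets a smaller score

# Softmax: exponentiate each score and normalize so the values sum to 1.
exp_scores = {letter: math.exp(score) for letter, score in logits.items()}
total = sum(exp_scores.values())
probs = {letter: value / total for letter, value in exp_scores.items()}

print(f"P(next letter = 'e') = {probs['e']:.3f}")
print(f"P(next letter = 'y') = {probs['y']:.3f}")
print(f"Sum of all 26 probabilities = {sum(probs.values()):.3f}")  # 1.000
```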

Comparing Ideal and Predicted Distributions

Below are two probability distributions. The first is the ideal distribution you would like your language model to produce after seeing "an apple a day keeps the doctor awa". In other words, you want your model to predict with 100% certainty that the next letter is y.

[Figure: the ideal distribution – probability 1.0 on the letter y and 0 on every other letter]

Below is the second probability distribution. It is the predicted distribution after feeding "an apple a day keeps the doctor awa" into the language model.

[Figure: the predicted distribution – probability spread across the alphabet, e.g., 0.13 for e and only 0.02 for y]

We see that the model believes there's a probability of 0.13, or a 13% chance, that the next letter should be e. The probability that it should be y is only 0.02, or 2%.

Clearly, the language model is not predicting y with 100% certainty. How do we tell the language model, "hey, your predicted probability distribution is wrong! It should associate y with 100% probability"? It'd be wonderful if we could take the two distributions and feed both into some magic function that tells us how different they are. That's what Kullback-Leibler divergence measures. This article will demystify that magic.

[Figure: the ideal and predicted distributions fed into a function that outputs how different they are]
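As a concrete (and purely illustrative) sketch, here is how the two distributions above could be written down in Python. The ideal distribution is "one-hot" – all of its probability sits on y – while the predicted values use the example numbers quoted above, with the remaining probability mass spread evenly over the other letters just to make it a valid distribution:

```python
import string

letters = list(string.ascii_lowercase)

# Ideal distribution: 100% probability on "y", 0 everywhere else (a one-hot distribution).
ideal = {letter: 0.0 for letter in letters}
ideal["y"] = 1.0

# Predicted distribution: 0.13 for "e" and 0.02 for "y" as in the example above;
# the leftover 0.85 is spread evenly over the remaining 24 letters for illustration.
predicted = {letter: (1.0 - 0.13 - 0.02) / 24 for letter in letters}
predicted["e"] = 0.13
predicted["y"] = 0.02

# Both are valid probability distributions: non-negative values that sum to 1.
print(sum(ideal.values()), round(sum(predicted.values()), 6))  # 1.0 1.0
```

Later in the article, we'll build up the "magic function" that takes these two dictionaries and returns a single number measuring how different they are.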

In the rest of this article, we'll learn about information theory concepts like surprise, entropy, cross entropy, and Kullback-Leibler divergence and apply them to machine learning, generative AI, and language models.


Surprise

To arrive at the concept of KL divergence, we first need to understand one of its building blocks: surprise. We'll interpret this quantity from two different angles. The good news is that once you understand surprise, understanding entropy, cross entropy, and KL divergence will be much easier. These concepts, in turn, will help you understand how many modern machine learning models and LLMs train on and learn from data.

First Interpretation

Surprise, like its everyday English counterpart, shows up in many situations. Let's say you showed up at the North Pole and measured the temperature to be 0°C. You wouldn't be too surprised. A temperature of 0°C has a high probability at the North Pole, so it matches your expectations.

What if the temperature were 35°C? That would be very surprising! This temperature has an extremely low probability at the North Pole.

What if the temperature were 10°C? That would also be surprising, but not as surprising as 35°C. A 10°C reading has a low probability, but not as low as 35°C.

How can we quantify how much surprise we would feel for a given temperature at the North Pole? That's exactly what the quantity called surprise measures. The higher the temperature reading at the North Pole, the lower its probability, and the more surprised you would be.

Second Interpretation

This leads us to a deeper way of thinking about surprise. If you were to show up at the North Pole and measure the temperature to be 0°C, you wouldn't be too surprised. The temperature is what you would have expected. In other words, you wouldn't have learned anything new. But if the temperature were 35°C, that would alter your perception of the North Pole. You would realize your previous mental model of the climate at the North Pole was inaccurate, and you would have to update it with the new information you've gained from observing the unexpected 35°C.

[Figures: low surprise for an expected temperature; high surprise for an unexpected temperature]

We can interpret surprise as a measure of information gained from observing an event. The lower the probability of an event, the more information we've gained because we would have to update our mental model. The higher the probability of an event, the less information we've gained because we wouldn't have learned too much.

Mathematically, surprise has a precise relationship to probability that we'll explore next, setting the foundation for understanding entropy and the other concepts that follow.

Relating Surprise to Probability

Intuitively, you might be getting a sense that the lower the probability associated with an event (e.g., 35°C at the North Pole, winning the lottery, etc.), the higher the surprise. Put another way, there's an inverse relationship between probability and surprise. Drag the circle in the interactive playground below to build intuition for this inverse relationship (note that the unit of surprise is bits, just like KL divergence; we'll dive into this relationship later).

[Interactive widget: drag the circle to explore the inverse relationship between probability and surprise]

First Attempt at Math Definition of Surprise

Because there is an inverse relationship between surprise and probability ($\color{cyan}{p}$), let's start off with a simple equation:

$$\text{\color{magenta}{surprise}} = \frac{1}{\color{cyan}{p}}$$

Let's plug in various values for $\color{cyan}{p}$:

| $\color{magenta}{\text{Surprise}}$ | $\color{cyan}{\text{Probability}}$ ($\color{cyan}{p}$) | Comment |
| --- | --- | --- |
| 100 | 0.01 | Extremely surprising |
| 10 | 0.1 | Very surprising |
| 4 | 0.25 | Higher surprise |
| 2 | 0.5 | Moderate surprise |
| 1 | 1.0 | No surprise for certain events |

Notice that low probabilities are associated with high values of surprise. In fact, it seems surprise is unbounded. After all, the rarer an event, the higher the surprise.

However, you might have noticed an event with probability 1.0 has a surprise value of 1. This seems arbitrary. Why would p = 1 map to surprise = 1? If an event occurs with 100% certainty, intuitively there should be no surprise associated with it. In other words, p = 1 should map to a surprise value of 0.

Second Attempt at Math Definition of Surprise

Therefore, we need to modify our definition of surprise so that events with 100% probability have zero surprise.

To summarize, we're looking for a mathematical function that satisfies the following properties:

  • Decreases as the probability increases from 0 to 1
  • Evaluates to zero when the input probability is 1 – events that occur with 100% certainty have zero surprise.

One mathematical function that satisfies both properties is the logarithm of the reciprocal of the probability:

$$\text{\color{magenta}{surprise}} = \log_2\left(\frac{1}{\text{\color{cyan}{probability}}}\right)$$

Let's recalculate the surprise values using our new logarithmic definition:

| $\color{magenta}{\text{Surprise}}$ (bits) | $\color{cyan}{\text{Probability}}$ ($\color{cyan}{p}$) | Comment |
| --- | --- | --- |
| 6.64 | 0.01 | Extremely surprising |
| 3.32 | 0.1 | Very surprising |
| 2.00 | 0.25 | Higher surprise |
| 1.00 | 0.5 | Moderate surprise |
| 0.00 | 1.0 | No surprise for certain events |

Now we see that events with 100% probability correctly have zero surprise, which aligns with our intuition!
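If you'd like to verify these numbers yourself, here's a small Python sketch that reproduces both tables, contrasting the naive reciprocal definition with the logarithmic one:

```python
import math

probabilities = [0.01, 0.1, 0.25, 0.5, 1.0]

print(f"{'p':>5} | {'1/p':>8} | {'log2(1/p) in bits':>18}")
for p in probabilities:
    naive_surprise = 1 / p           # first attempt: reciprocal of probability
    log_surprise = math.log2(1 / p)  # second attempt: log of the reciprocal
    print(f"{p:>5} | {naive_surprise:>8.2f} | {log_surprise:>18.2f}")

# p = 1.0 yields 1.00 under the naive definition but 0.00 bits under the
# logarithmic one, matching the intuition that certain events carry no surprise.
```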

Why log base 2? The choice of logarithm base 2 (rather than base 10 or natural log) comes from information theory, where we measure information in "bits." While the deep mathematical justification is fascinating, it requires diving into topics beyond our current scope. For those curious about the rigorous foundations, I recommend Information Theory: A Tutorial Introduction (https://amzn.to/3Ypwrez).
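As a quick sanity check on the "bits" interpretation: observing the outcome of a single fair coin flip, an event with probability 0.5, carries exactly one bit of surprise:

$$\log_2\left(\frac{1}{0.5}\right) = \log_2(2) = 1 \text{ bit}$$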

Recap on Surprise

Congratulations! We just covered a lot of theory in a short span of reading. There are entire college-level courses on this topic, so don't worry if you're still digesting it all.

To recap, surprise is a mathematical way to quantify how unexpected an event is when it occurs. Events are associated with probabilities: rare events produce more surprise, while common events generate relatively little. Surprise can also be seen as the amount of information we gain by observing the outcome — the more unexpected the event, the more information it gives us.

Quiz