(Preview) Cross Entropy From First Principles


Intro

Cross entropy and Kullback-Leibler (KL) divergence are central concepts in machine learning for measuring and comparing probability distributions. They appear in introductory ML courses, research papers, AI textbooks, ML frameworks, and interview questions (e.g., at OpenAI, Anthropic, Google, Meta). However, they're often introduced without the principles behind them, and sometimes with hand-wavy explanations. In this article, we're going to derive cross entropy and KL divergence from first principles so that you can understand them more deeply. Along the way, we'll also cover a very brief introduction to language models so that you can start to build an intuitive sense of how large language models like ChatGPT, Claude, Gemini, and Llama work (the full explanation of LLMs will come in a future post).


Tips for Reading This Article

  • You might want to break this article up into multiple reading sessions, but the hope is that by the end of it, you will have a strong intuitive sense for one of the most important concepts in machine learning.
  • There are many interactive widgets in this article. Play with them to build intuition for the concepts!

Language Models & Probability Distributions 101

Suppose you see the phrase below:

[Figure: the phrase "an apple a day keeps the doctor awa__"]

What letter in the English alphabet would complete that phrase? English speakers know it's the letter y – "an apple a day keeps the doctor away".

You believe with high probability that the next letter is y because you have a language model in your head that has read many, many English sentences throughout your lifetime, giving you an intuitive sense for which letter or word should come next.

Given some input text, a language model does not directly spit out the next letter. Instead, it produces a probability distribution over the letters of the English alphabet. What is a probability distribution, you ask? A probability distribution tells us how likely each of the 26 characters in the English alphabet is to be the next character, given the input. For example, given the input "an apple a day keeps the doctor awa", how likely is the next letter to be e? How about y?
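To make "a probability distribution over the alphabet" concrete, here is a minimal Python sketch. It is not a real language model; the per-letter scores are made up purely for illustration, and a softmax turns them into probabilities that sum to 1:

```python
import math
import string

# Hypothetical scores (logits) a tiny character-level model might assign to each
# letter after reading "an apple a day keeps the doctor awa". Made-up numbers.
logits = {letter: 0.0 for letter in string.ascii_lowercase}
logits["e"] = 2.0   # the model leans toward "e"
logits["a"] = 1.0
logits["y"] = 0.5   # "y" gets a smaller score

# Softmax: exponentiate each score and normalize so the values sum to 1.
exp_scores = {letter: math.exp(score) for letter, score in logits.items()}
total = sum(exp_scores.values())
probs = {letter: value / total for letter, value in exp_scores.items()}

print(f"P(next letter = 'e') = {probs['e']:.3f}")
print(f"P(next letter = 'y') = {probs['y']:.3f}")
print(f"Sum of all 26 probabilities = {sum(probs.values()):.3f}")  # 1.000
```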

Comparing Ideal and Predicted Distributions

Below are two probability distributions. The first is the ideal distribution you would like your language model to produce after seeing "an apple a day keeps the doctor awa". In other words, you want your model to predict with 100% certainty that the next letter is y.

[Figure: the ideal distribution – probability 1.0 on the letter y and 0 on every other letter]

Below is the second probability distribution. It is the predicted distribution after feeding "an apple a day keeps the doctor awa" into the language model.

[Figure: the predicted distribution – probability spread across the alphabet, e.g., 0.13 for e and only 0.02 for y]

We see that the model believes there's a probability of 0.13, or a 13% chance, that the next letter should be e. The probability that it should be y is only 0.02, or 2%.

Clearly, the language model is not predicting y with 100% certainty. How do we tell the language model, "hey, your predicted probability distribution is wrong! It should associate y with 100% probability"? It'd be wonderful if we could take the two distributions and feed both into some magic function that tells us how different they are. That's what Kullback-Leibler divergence measures. This article will demystify that magic.

[Figure: the ideal and predicted distributions fed into a function that outputs how different they are]
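As a concrete (and purely illustrative) sketch, here is how the two distributions above could be written down in Python. The ideal distribution is "one-hot" – all of its probability sits on y – while the predicted values use the example numbers quoted above, with the remaining probability mass spread evenly over the other letters just to make it a valid distribution:

```python
import string

letters = list(string.ascii_lowercase)

# Ideal distribution: 100% probability on "y", 0 everywhere else (a one-hot distribution).
ideal = {letter: 0.0 for letter in letters}
ideal["y"] = 1.0

# Predicted distribution: 0.13 for "e" and 0.02 for "y" as in the example above;
# the leftover 0.85 is spread evenly over the remaining 24 letters for illustration.
predicted = {letter: (1.0 - 0.13 - 0.02) / 24 for letter in letters}
predicted["e"] = 0.13
predicted["y"] = 0.02

# Both are valid probability distributions: non-negative values that sum to 1.
print(sum(ideal.values()), round(sum(predicted.values()), 6))  # 1.0 1.0
```

Later in the article, we'll build up the "magic function" that takes these two dictionaries and returns a single number measuring how different they are.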

In the rest of this article, we'll learn about information theory concepts like surprise, entropy, cross entropy, and Kullback-Leibler divergence and apply them to machine learning, generative AI, and language models.


Surprise

To arrive at the concept of KL divergence, we first need to understand one of its building blocks: surprise. We'll interpret this quantity from two different angles. The good news is that once you understand surprise, understanding entropy, cross entropy, and KL divergence will be much easier. These concepts, in turn, will help you understand how many modern machine learning models and LLMs train on and learn from data.

First Interpretation

Surprise, like its everyday English counterpart, shows up in many situations. Let's say you showed up at the North Pole and measured the temperature to be 0°C. You wouldn't be too surprised. A temperature of 0°C has a high probability at the North Pole, so it matches your expectations.

What if the temperature were 35°C? That would be very surprising! This temperature has an extremely low probability at the North Pole.

What if the temperature were 10°C? That would also be surprising, but not as surprising as 35°C. A 10°C reading has a low probability, but not as low as 35°C.

How can we quantify how much surprise we would feel for a given temperature at the North Pole? That's exactly what the quantity called surprise measures. The higher the temperature reading at the North Pole, the lower its probability, and the more surprised you would be.

Second Interpretation

This leads us to a deeper way of thinking about surprise. If you were to show up at the North Pole and measure the temperature to be 0°C, you wouldn't be too surprised. The temperature is what you would have expected. In other words, you wouldn't have learned anything new. But if the temperature were 35°C, that would alter your perception of the North Pole. You would realize your previous mental model of the climate at the North Pole was inaccurate, and you would have to update it with the new information you've gained from observing the unexpected 35°C.

[Figures: low surprise for an expected temperature; high surprise for an unexpected temperature]

We can interpret surprise as a measure of information gained from observing an event. The lower the probability of an event, the more information we've gained because we would have to update our mental model. The higher the probability of an event, the less information we've gained because we wouldn't have learned too much.

Mathematically, surprise has a precise relationship to probability that we'll explore next, setting the foundation for understanding entropy and the other concepts that follow.

Relating Surprise to Probability

Intuitively, you might be getting a sense that the lower the probability associated with an event (e.g., 35°C at the North Pole, winning the lottery, etc.), the higher the surprise. Put another way, there's an inverse relationship between probability and surprise. Drag the circle in the interactive playground below to build intuition for this inverse relationship (note that the unit of surprise is bits, just like KL divergence; we'll dive into this relationship later).

[Interactive widget: drag the circle to explore the inverse relationship between probability and surprise]

First Attempt at Math Definition of Surprise

Because there is an inverse relationship between surprise and probability ($\color{cyan}{p}$), let's start off with a simple equation:

$$\text{\color{magenta}{surprise}} = \frac{1}{\color{cyan}{p}}$$

Let's plug in various values for $\color{cyan}{p}$:

| $\color{magenta}{\text{Surprise}}$ | $\color{cyan}{\text{Probability}}$ ($\color{cyan}{p}$) | Comment |
| --- | --- | --- |
| 100 | 0.01 | Extremely surprising |
| 10 | 0.1 | Very surprising |
| 4 | 0.25 | Higher surprise |
| 2 | 0.5 | Moderate surprise |
| 1 | 1.0 | No surprise for certain events |

Notice that low probabilities are associated with high values of surprise. In fact, it seems surprise is unbounded. After all, the rarer an event, the higher the surprise.

However, you might have noticed an event with probability 1.0 has a surprise value of 1. This seems arbitrary. Why would p = 1 map to surprise = 1? If an event occurs with 100% certainty, intuitively there should be no surprise associated with it. In other words, p = 1 should map to a surprise value of 0.

Second Attempt at Math Definition of Surprise

Therefore, we need to modify our definition of surprise so that events with 100% probability have zero surprise.

To summarize, we're looking for a mathematical function that satisfies the following properties:

  • Decreases as the probability increases from 0 to 1
  • Evaluates to zero when the input probability is 1 – events that occur with 100% certainty have zero surprise.

One mathematical function that satisfies both properties is the logarithm of the reciprocal of the probability:

$$\text{\color{magenta}{surprise}} = \log_2\left(\frac{1}{\text{\color{cyan}{probability}}}\right)$$

Let's recalculate the surprise values using our new logarithmic definition:

| $\color{magenta}{\text{Surprise}}$ (bits) | $\color{cyan}{\text{Probability}}$ ($\color{cyan}{p}$) | Comment |
| --- | --- | --- |
| 6.64 | 0.01 | Extremely surprising |
| 3.32 | 0.1 | Very surprising |
| 2.00 | 0.25 | Higher surprise |
| 1.00 | 0.5 | Moderate surprise |
| 0.00 | 1.0 | No surprise for certain events |

Now we see that events with 100% probability correctly have zero surprise, which aligns with our intuition!
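If you'd like to verify these numbers yourself, here's a small Python sketch that reproduces both tables, contrasting the naive reciprocal definition with the logarithmic one:

```python
import math

probabilities = [0.01, 0.1, 0.25, 0.5, 1.0]

print(f"{'p':>5} | {'1/p':>8} | {'log2(1/p) in bits':>18}")
for p in probabilities:
    naive_surprise = 1 / p           # first attempt: reciprocal of probability
    log_surprise = math.log2(1 / p)  # second attempt: log of the reciprocal
    print(f"{p:>5} | {naive_surprise:>8.2f} | {log_surprise:>18.2f}")

# p = 1.0 yields 1.00 under the naive definition but 0.00 bits under the
# logarithmic one, matching the intuition that certain events carry no surprise.
```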

Why log base 2? The choice of logarithm base 2 (rather than base 10 or natural log) comes from information theory, where we measure information in "bits." While the deep mathematical justification is fascinating, it requires diving into topics beyond our current scope. For those curious about the rigorous foundations, I recommend Information Theory: A Tutorial Introduction (https://amzn.to/3Ypwrez).
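As a quick sanity check on the "bits" interpretation: observing the outcome of a single fair coin flip, an event with probability 0.5, carries exactly one bit of surprise:

$$\log_2\left(\frac{1}{0.5}\right) = \log_2(2) = 1 \text{ bit}$$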

Recap on Surprise

Congratulations! We just covered a lot of theory in a short span of reading. There are entire college-level courses on this topic, so don't worry if you're still digesting it all.

To recap, surprise is a mathematical way to quantify how unexpected an event is when it occurs. Events are associated with probabilities: rare events produce more surprise, while common events generate relatively little. Surprise can also be seen as the amount of information we gain by observing the outcome — the more unexpected the event, the more information it gives us.

Quiz