Fundamentals of Information Theory

Yamac Eren Ay
5 min read · Jul 12, 2024


In today’s world, understanding the flow and representation of information is key. Concepts from information theory, such as entropy, mutual information, and KL-divergence, offer deep insights into how information is quantified, shared, and compared. Before diving into formal definitions, let’s explore these concepts through a more intuitive, visual approach.

Information Theory and Bit Encoding

Think of information theory in terms of bit encodings. Imagine you’re a data engineer tasked with minimizing the cost of sending messages from server A to server B. The costs scale linearly with the average number of bits used to encode the message. Consider the message “MMiiississippiii,” which contains four distinct letters. Let’s count their occurrences: ‘M’ and ‘p’ appear twice, ‘s’ appears four times, and ‘i’ appears eight times.
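
In Python, for instance, these counts and the corresponding probabilities can be obtained in a few lines; the snippet below is just a quick sketch using the standard library:

    from collections import Counter

    message = "MMiiississippiii"

    # Count how often each letter occurs and turn the counts into probabilities.
    counts = Counter(message)
    total = len(message)
    for letter, count in counts.most_common():
        print(f"{letter!r}: {count} occurrences, probability {count / total}")
    # 'i': 8 (0.5), 's': 4 (0.25), 'M': 2 (0.125), 'p': 2 (0.125)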

Encoding Schemes

Breadth-First Encoding

One way to encode the message is to use equal-length codes for all letters:

  • ‘M’ gets “00”
  • ‘i’ gets “01”
  • ‘s’ gets “10”
  • ‘p’ gets “11”

This results in an average of 2 bits per letter, or 32 bits in total, and the encoded message looks like this: “00000101011010011010011111010101”. Can we do better?
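
As a sanity check, here is a minimal sketch of this fixed-length scheme in Python (the variable names are my own):

    fixed_code = {"M": "00", "i": "01", "s": "10", "p": "11"}

    message = "MMiiississippiii"
    encoded = "".join(fixed_code[ch] for ch in message)

    print(encoded)               # the 32-bit string shown above
    print(len(encoded), "bits")  # 2 bits per letter * 16 letters = 32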

Depth-First Encoding

Now, let’s take advantage of the fact that some letters occur more frequently than others. The most frequent letters should get the shortest codes:

  • ‘i’ gets “0”
  • ‘s’ gets “01”
  • ‘p’ gets “011”
  • ‘M’ gets “0111”

This encoding is 2 bits shorter than before, at 30 bits in total: “011101110000101001010011011000”. Ever heard of Huffman encoding?
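
The same sketch, now with the variable-length codes above. Note that these codewords are not prefix-free, but each one is a ‘0’ followed by a distinct number of ‘1’s, so the bit stream is still uniquely decodable:

    depth_code = {"i": "0", "s": "01", "p": "011", "M": "0111"}

    message = "MMiiississippiii"
    encoded = "".join(depth_code[ch] for ch in message)

    print(encoded)               # the 30-bit string shown above
    print(len(encoded), "bits")  # 8*1 + 4*2 + 2*3 + 2*4 = 30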

[Figure: Huffman encoding example]

Huffman Encoding

Similar to depth-first encoding, we take advantage of the fact that some letters occur more frequently than others. But we can do even better: recursively merge the two least frequent groups of letters and prepend one routing bit per merge. Using Huffman encoding:

  • ‘i’ gets “0”
  • ‘s’ gets “10”
  • ‘p’ gets “110”
  • ‘M’ gets “111”

This encoding scheme reduces the message to 28 bits: “1111110001010010100110110000”, which is another 2 bits shorter than the depth-first encoded message.
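
Below is a rough sketch of how such a codebook could be built with Python’s heapq, following the merge-the-two-least-frequent-groups idea described above; depending on how ties are broken, the codewords may differ from the ones listed, but the total length stays at 28 bits:

    import heapq
    from collections import Counter

    def huffman_code(message):
        """Build a codebook by repeatedly merging the two least frequent groups of letters."""
        counts = Counter(message)
        # Heap entries: (frequency, tie_breaker, {letter: partial_codeword}).
        heap = [(freq, i, {letter: ""}) for i, (letter, freq) in enumerate(counts.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            freq_a, _, group_a = heapq.heappop(heap)
            freq_b, _, group_b = heapq.heappop(heap)
            # Prepend one routing bit per merge: '0' for one group, '1' for the other.
            merged = {letter: "0" + code for letter, code in group_a.items()}
            merged.update({letter: "1" + code for letter, code in group_b.items()})
            heapq.heappush(heap, (freq_a + freq_b, tie, merged))
            tie += 1
        return heap[0][2]

    message = "MMiiississippiii"
    code = huffman_code(message)
    encoded = "".join(code[ch] for ch in message)
    print(code)                  # one valid codebook; the lengths are 1, 2, 3 and 3 bits
    print(len(encoded), "bits")  # 28 bits = 16 letters * 1.75 bits per letter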

[Figure: entropy of a Bernoulli distribution]

Entropy: Measuring Uncertainty

To understand why Huffman encoding is more efficient, let’s calculate the likelihood of each letter and the number of bits it should get (its negative log-likelihood with base 2):

  • ‘i’: likelihood 1/2, so −log₂(1/2) = 1 bit
  • ‘s’: likelihood 1/4, so −log₂(1/4) = 2 bits
  • ‘M’: likelihood 1/8, so −log₂(1/8) = 3 bits
  • ‘p’: likelihood 1/8, so −log₂(1/8) = 3 bits

Entropy is the average number of bits needed to encode a letter: the number of encoding bits for each letter, weighted by its probability and summed over all letters. It is defined as:

H(p) = − Σₓ p(x) log₂ p(x)

Now, the entropy of our letter distribution works out to H(p) = 1/2 · 1 + 1/4 · 2 + 1/8 · 3 + 1/8 · 3 = 1.75 bits per letter, i.e. 28 bits for the whole 16-letter message, which is exactly what Huffman encoding achieves.
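
In code, the entropy is just this weighted sum of per-letter code lengths; a minimal sketch:

    import math

    # Letter probabilities of "MMiiississippiii".
    p = {"i": 0.5, "s": 0.25, "M": 0.125, "p": 0.125}

    # H(p) = -sum_x p(x) * log2(p(x)): the average number of bits per letter.
    entropy = -sum(prob * math.log2(prob) for prob in p.values())
    print(entropy)       # 1.75 bits per letter
    print(entropy * 16)  # 28.0 bits for the whole 16-letter message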

[Figure: KL-divergence as a measure of distance between probability distributions]

Relative Entropy (KL-Divergence): Difference of Distributions

What if we used a different encoding scheme optimized for a different distribution? Suppose we have an optimal encoding for a string with the letter distribution:

[Figure: another probability distribution, q]

Using this encoding for our original message, the average encoding length becomes:

[Figure: cross-entropy of the two distributions]

This is known as cross-entropy:

H(p, q) = − Σₓ p(x) log₂ q(x)

Relative entropy, or KL-Divergence, measures the difference between cross-entropy and entropy:

D_KL(p || q) = H(p, q) − H(p) = Σₓ p(x) log₂ ( p(x) / q(x) )

In our case:

[Figure: the KL-divergence in our example]

The KL-divergence is exactly the average number of extra bits per letter we pay for representing the message with an encoding optimized for the wrong distribution.
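
To make this concrete, here is a small sketch that computes cross-entropy, entropy, and KL-divergence in Python; the distribution q below is an illustrative stand-in, not the exact distribution from the figure above:

    import math

    # True letter distribution of the message.
    p = {"i": 0.5, "s": 0.25, "M": 0.125, "p": 0.125}
    # A hypothetical distribution q that the alternative encoding was optimized for.
    q = {"i": 0.25, "s": 0.25, "M": 0.25, "p": 0.25}

    cross_entropy = -sum(p[x] * math.log2(q[x]) for x in p)        # H(p, q)
    entropy       = -sum(p[x] * math.log2(p[x]) for x in p)        # H(p)
    kl_divergence = sum(p[x] * math.log2(p[x] / q[x]) for x in p)  # D_KL(p || q)

    print(cross_entropy, entropy)   # 2.0 and 1.75 bits per letter
    print(kl_divergence)            # 0.25 extra bits per letter
    print(cross_entropy - entropy)  # identical to the KL-divergence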

[Figure: mutual information expressed using (conditional) entropy]

Mutual Information: Quantifying Shared Information

Now, let’s simplify the notation and move from distributions to random variables, assuming p is the only distribution:

[Figure: notation for random variables]

Consider a joint distribution p(X, Y) and the product of marginal distributions p(X) p(Y). Mutual information I(X; Y) is the relative entropy between the joint distribution and the product of marginals:

I(X; Y) = D_KL( p(X, Y) || p(X) p(Y) )

Mutual information quantifies how much knowing one variable X reduces the uncertainty about the other variable Y (and vice versa). It is zero if X and Y are independent, and it peaks at the entropy H(X) when X and Y are identical.
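
As a quick illustration, mutual information can be computed directly from this definition; the joint distribution below is a made-up example of two dependent binary variables:

    import math

    # A toy joint distribution p(X, Y) over two dependent binary variables.
    p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

    # Marginal distributions p(X) and p(Y).
    p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
    p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

    # I(X; Y) = D_KL( p(X, Y) || p(X) p(Y) ).
    mi = sum(v * math.log2(v / (p_x[x] * p_y[y])) for (x, y), v in p_xy.items() if v > 0)
    print(mi)  # ~0.278 bits; it would be exactly 0 if X and Y were independent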

[Figure: conditional mutual information in the case I(X; Y | Z) = 0]

Introducing Conditionality

These concepts can be extended to conditional probability distributions. For example:

  • Conditional entropy H(X | Z) measures the remaining uncertainty about X once Z is known. It equals H(X) − I(X; Z); recall that I(X; Z) is at most H(X), so conditioning on Z can never increase the uncertainty about X.
  • Conditional mutual information I(X; Y | Z) measures the mutual information between X and Y given Z. If I(X; Y | Z) = 0, then Z separates X and Y: once Z is known, X and Y become conditionally independent (see the sketch after this list).
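
The sketch below illustrates both points on a made-up joint distribution p(X, Y, Z) in which X and Y are conditionally independent noisy copies of Z; the helper functions and the numbers are my own illustration:

    import math

    def H(dist):
        """Shannon entropy (base 2) of a {outcome: probability} dictionary."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def marginal(joint, keep):
        """Marginalize a joint {tuple: probability} dictionary onto the given positions."""
        out = {}
        for outcome, p in joint.items():
            key = tuple(outcome[i] for i in keep)
            out[key] = out.get(key, 0.0) + p
        return out

    # Z is a fair coin; X and Y are independent noisy copies of Z.
    p_xyz = {}
    for z in (0, 1):
        for x in (0, 1):
            for y in (0, 1):
                p_xyz[(x, y, z)] = 0.5 * (0.9 if x == z else 0.1) * (0.8 if y == z else 0.2)

    # H(X | Z) = H(X, Z) - H(Z); it never exceeds H(X).
    H_X_given_Z = H(marginal(p_xyz, (0, 2))) - H(marginal(p_xyz, (2,)))
    print(H_X_given_Z, "<=", H(marginal(p_xyz, (0,))))  # ~0.47 <= 1.0

    # I(X; Y | Z) = H(X | Z) - H(X | Y, Z); here Z separates X and Y, so it is ~0.
    H_X_given_YZ = H(p_xyz) - H(marginal(p_xyz, (1, 2)))
    print(H_X_given_Z - H_X_given_YZ)  # ~0 up to floating-point error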

Final Words

Understanding entropy, mutual information, and KL-divergence at this intuitive level helps in making more informed decisions, optimizing information models, and handling uncertainty effectively. Thank you for reading!
