Fundamentals of Information Theory
In today’s world, understanding the flow and representation of information is key. Concepts from information theory, such as entropy, mutual information, and KL-divergence, offer deep insights into how information is quantified, shared, and compared. Before diving into formal definitions, let’s explore these concepts through a more intuitive, visual approach.
Information Theory and Bit Encoding
Think of information theory in terms of bit encodings. Imagine you’re a data engineer tasked with minimizing the cost of sending messages from server A to server B. The cost scales linearly with the average number of bits used to encode the message. Consider the message “MMiiississippiii,” which contains four distinct letters. Let’s count their occurrences: ‘M’ and ‘p’ appear twice, ‘s’ appears four times, and ‘i’ appears eight times.
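As a quick check, here is a minimal Python sketch that tallies the letters (the variable names are just for illustration):

```python
from collections import Counter

# Count how often each letter occurs in the message.
message = "MMiiississippiii"
counts = Counter(message)
print(counts)  # Counter({'i': 8, 's': 4, 'M': 2, 'p': 2})
```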
Encoding Schemes
Breadth-First Encoding
One way to encode the message is to use equal-length codes for all letters:
- ‘M’ gets “00”
- ‘i’ gets “01”
- ‘s’ gets “10”
- ‘p’ gets “11”
This results in an average of 2 bits per letter, or 32 bits in total, and the encoded message looks like this: “00000101011010011010011111010101”. Can we do better?
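To sanity-check the fixed-length scheme, here is a small Python sketch that applies the table above (the dictionary name is my own):

```python
message = "MMiiississippiii"

# Fixed-length code: every letter costs 2 bits.
fixed_code = {"M": "00", "i": "01", "s": "10", "p": "11"}

encoded = "".join(fixed_code[ch] for ch in message)
print(encoded)       # 00000101011010011010011111010101
print(len(encoded))  # 32 bits: 16 letters x 2 bits each
```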
Depth-First Encoding
Now, let’s take advantage of the fact that some letters occur more frequently than others. The most frequent letters should get the shortest codewords:
- ‘i’ gets “0”
- ‘s’ gets “01”
- ‘p’ gets “011”
- ‘M’ gets “0111”
This encoding is 2 bits shorter than before, at 30 bits in total: “011101110000101001010011011000”. Ever heard of Huffman encoding?
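Counting the bits per letter confirms the saving:

$$8 \cdot 1 + 4 \cdot 2 + 2 \cdot 3 + 2 \cdot 4 \;=\; 30 \text{ bits,}$$

where the four terms correspond to ‘i’, ‘s’, ‘p’, and ‘M’, respectively.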
Huffman Encoding
Similar to the depth-first encoding, we will take advantage of the fact that some letters occur more frequently than others. We can do even better by recursively merging the two least frequent groups of letters and prepending one routing bit per branch at each merge. Using Huffman encoding:
- ‘i’ gets “0”
- ‘s’ gets “10”
- ‘p’ gets “110”
- ‘M’ gets “111”
This encoding scheme reduces the message length to 28 bits: “1111110001010010100110110000”, which is another 2 bits shorter than the depth-first encoded message.
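Below is a minimal Python sketch of this merge-the-two-least-frequent procedure, using only the standard library. The function name and the tie-breaking are my own choices, so the exact 0/1 labels may differ from the table above, but the total length comes out the same.

```python
import heapq
from collections import Counter

def huffman_code(message: str) -> dict:
    """Build a Huffman code by repeatedly merging the two least frequent groups."""
    counts = Counter(message)
    # Each heap entry: (total count, tie-breaker, {letter: partial codeword}).
    heap = [(count, i, {letter: ""}) for i, (letter, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        count_a, _, group_a = heapq.heappop(heap)  # least frequent group
        count_b, _, group_b = heapq.heappop(heap)  # next least frequent group
        # Route: prepend '0' to every codeword in one group, '1' in the other, then merge.
        merged = {letter: "0" + code for letter, code in group_a.items()}
        merged.update({letter: "1" + code for letter, code in group_b.items()})
        heapq.heappush(heap, (count_a + count_b, tie, merged))
        tie += 1
    return heap[0][2]

message = "MMiiississippiii"
code = huffman_code(message)
encoded = "".join(code[ch] for ch in message)
print(code)
print(len(encoded))  # 28 bits
```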
Entropy: Measuring Uncertainty
To understand why Huffman encoding is this efficient, let’s calculate the probability of each letter and its ideal code length (the negative log-likelihood with base 2):
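From the letter counts above (16 letters in total):
- ‘i’: probability 8/16 = 1/2, so −log2(1/2) = 1 bit
- ‘s’: probability 4/16 = 1/4, so −log2(1/4) = 2 bits
- ‘p’: probability 2/16 = 1/8, so −log2(1/8) = 3 bits
- ‘M’: probability 2/16 = 1/8, so −log2(1/8) = 3 bits
These are exactly the codeword lengths that the Huffman encoding above assigns.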
Entropy is the average number of bits needed to encode a letter: the sum of the code lengths of all letters, weighted by their probabilities. It is defined as:
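$$H(p) \;=\; -\sum_{x} p(x)\,\log_2 p(x) \;=\; \sum_{x} p(x)\,\bigl(-\log_2 p(x)\bigr)$$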
For our letter distribution, the entropy works out to:
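$$H(p) \;=\; \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 3 + \tfrac{1}{8}\cdot 3 \;=\; 1.75 \text{ bits per letter}$$

Over the 16 letters of the message, that is 16 × 1.75 = 28 bits, exactly the length of the Huffman-encoded message.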
Relative Entropy (KL-Divergence): Difference of Distributions
What if we used a different encoding scheme optimized for a different distribution? Suppose we have an optimal encoding for a string with the letter distribution:
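The specific distribution is not essential to the argument; purely for illustration, let’s assume the uniform distribution here (this particular choice is mine, not a given):

$$q(\text{‘i’}) = q(\text{‘s’}) = q(\text{‘p’}) = q(\text{‘M’}) = \tfrac{1}{4}$$

An optimal code for q spends −log2(1/4) = 2 bits on every letter, which is exactly the equal-length code from the beginning.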
Using this encoding for our original message, the average encoding length becomes:
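Under the uniform q assumed above:

$$\sum_{x} p(x)\,\bigl(-\log_2 q(x)\bigr) \;=\; \tfrac{1}{2}\cdot 2 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{8}\cdot 2 + \tfrac{1}{8}\cdot 2 \;=\; 2 \text{ bits per letter}$$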
This is known as cross-entropy:
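$$H(p, q) \;=\; -\sum_{x} p(x)\,\log_2 q(x)$$

Here p is the true distribution of the letters and q is the distribution the code was optimized for.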
Relative entropy, or KL-divergence, is the difference between the cross-entropy and the entropy:
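$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; H(p, q) - H(p) \;=\; \sum_{x} p(x)\,\log_2 \frac{p(x)}{q(x)}$$

It is always non-negative and equals zero exactly when p = q.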
In our case:
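Still with the uniform q assumed above:

$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; 2 - 1.75 \;=\; 0.25 \text{ bits per letter}$$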
In other words, this is the average number of extra bits per letter we pay for using an encoding that was optimized for the wrong distribution.
Mutual Information: Quantifying Shared Information
Now, let’s simplify the notation for random variables, writing everything in terms of a single distribution p:
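For a random variable X (or a pair X, Y) distributed according to p, we simply write:

$$H(X) \;=\; -\sum_{x} p(x)\,\log_2 p(x), \qquad H(X, Y) \;=\; -\sum_{x, y} p(x, y)\,\log_2 p(x, y)$$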
Consider a joint distribution p(X, Y) and the product of marginal distributions p(X) p(Y). Mutual information I(X; Y) is the relative entropy between the joint distribution and the product of marginals:
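$$I(X; Y) \;=\; D_{\mathrm{KL}}\bigl(p(X, Y) \,\|\, p(X)\,p(Y)\bigr) \;=\; \sum_{x, y} p(x, y)\,\log_2 \frac{p(x, y)}{p(x)\,p(y)}$$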
Mutual information quantifies how much knowing one variable X reduces the uncertainty about the other variable Y (and vice versa; it is symmetric). It is zero if X and Y are independent, and it reaches its maximum, the entropy H(X), when X and Y are identical.
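As a small numerical illustration, here is a Python sketch (the function names and toy joint tables are my own) that computes mutual information from a joint probability table via the equivalent identity I(X; Y) = H(X) + H(Y) − H(X, Y):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_xy):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for a joint probability table p_xy."""
    p_x = p_xy.sum(axis=1)  # marginal distribution of X (rows)
    p_y = p_xy.sum(axis=0)  # marginal distribution of Y (columns)
    return entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# X and Y are identical fair coin flips: I(X; Y) equals H(X).
p_identical = np.array([[0.5, 0.0],
                        [0.0, 0.5]])
print(mutual_information(p_identical))   # 1.0 bit

# X and Y are independent fair coin flips: I(X; Y) is zero.
p_independent = np.full((2, 2), 0.25)
print(mutual_information(p_independent)) # 0.0 bits
```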
Introducing Conditionality
These concepts can also be extended to conditional probability distributions. For example:
- Conditional entropy H(X | Z) measures the remaining uncertainty about X once Z is known. It equals H(X) − I(X; Z); since I(X; Z) is non-negative and at most H(X), conditioning on Z can never increase the uncertainty about X (see the identities after this list).
- Conditional mutual information I(X; Y | Z) measures the mutual information between X and Y given Z. If I(X; Y | Z) = 0, then Z separates X and Y: once Z is known, X and Y are conditionally independent.
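In identity form:

$$H(X \mid Z) \;=\; H(X) - I(X; Z), \qquad I(X; Y \mid Z) \;=\; H(X \mid Z) - H(X \mid Y, Z)$$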
Final Words
Understanding entropy, mutual information, and KL-divergence at this intuitive level helps in making more informed decisions, optimizing information-based models, and handling uncertainty effectively. Thank you for reading!