Probability Distributions 2
The Subtle Art of Knowing What You Don’t Know — Part 2: Parametrized Density Estimation
In my previous article, we talked about non-parametric density estimation:
However, we didn’t really address the performance issue: in fact, estimating every point of a distribution can be nerve-wracking if we consider an arbitrarily large time interval. Luckily, using our domain knowledge, we can assume that the data follows a specific underlying distribution, characterized by only a few parameters.
Now, let’s take a look at a fun visual case, where we try to predict the shape of a given item as accurately as possible.
For example, in a parallel universe, 15 “imperfect” shapes are given in total, categorized as 6 circles, 6 diamonds and 3 trapezoids. What is the probability that a given shape is …
- circle? 6 / 15
- diamond? 6 / 15
- trapezoid? 3 / 15
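To keep the bookkeeping honest throughout this article, here is a tiny Python sketch of the prior as relative frequencies; the use of exact fractions is my own choice for illustration:

```python
from fractions import Fraction

# Shape counts from the example: 6 circles, 6 diamonds, 3 trapezoids
counts = {"circle": 6, "diamond": 6, "trapezoid": 3}
total = sum(counts.values())

# Prior P(Y): the relative frequency of each class
prior = {shape: Fraction(n, total) for shape, n in counts.items()}
print(prior)  # circle: 2/5, diamond: 2/5, trapezoid: 1/5
```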
It was a no-brainer. If you were given a chance to vote for a shape, which shape(s) would it be? If you say either circle or diamond, you are right.
Now, as it turns out, we’re given a new update that the underlying shape “has 4 corners”, which we somehow have to incorporate into our prior belief. The likelihood P(X1 = “has 4 corners” | Y) is given for each Y as below:
- circle? 1 / 13
- diamond? 6 / 13
- trapezoid? 6 / 13
If you don’t understand why the circle doesn’t have zero probability, recall that I mentioned that all shapes are imperfect: hypothetically, there can be shapes having 4 corners that are classified as a circle, but also shapes not having 4 corners that are classified as a diamond or a trapezoid.
So, how can we fuse our prior belief with the incoming update and get a new belief in return? We can multiply the prior P(Y) by the likelihood P(X1 | Y) to obtain a likelihood-prior product. The sum of all likelihood-prior products over the shapes yields a normalization factor P(X1), which expresses how plausible the evidence is that the shape “has 4 corners“. If we divide all likelihood-prior products by the normalization factor, we get a new probability distribution P(Y | X1), which stands for our posterior (~ updated) belief. This is exactly Bayes’ theorem: P(Y | X1) = P(X1 | Y) · P(Y) / P(X1). For each Y:
- circle? 1 / 10
- diamond? 6 / 10
- trapezoid? 3 / 10
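To make the update concrete, here is a minimal Python sketch; the bayes_update helper is a name of my own choosing, not anything from a particular library:

```python
from fractions import Fraction

def bayes_update(prior, likelihood):
    """One belief update: posterior = likelihood * prior / evidence."""
    # Likelihood-prior product for each class
    products = {y: likelihood[y] * prior[y] for y in prior}
    # Normalization factor: the evidence
    evidence = sum(products.values())
    # Posterior: the normalized likelihood-prior products
    return {y: products[y] / evidence for y in prior}

prior = {"circle": Fraction(2, 5), "diamond": Fraction(2, 5), "trapezoid": Fraction(1, 5)}
has_4_corners = {"circle": Fraction(1, 13), "diamond": Fraction(6, 13), "trapezoid": Fraction(6, 13)}

posterior = bayes_update(prior, has_4_corners)
print(posterior)  # circle: 1/10, diamond: 3/5 (= 6/10), trapezoid: 3/10
```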
Now, let’s repeat the same approach for a new update: the shape “is vertically symmetric” (i.e., it is symmetric along the x-axis). The likelihood P(X2 = “is vertically symmetric” | Y) is given for each Y as below:
- circle? 9 / 19
- diamond? 7 / 19
- trapezoid? 3 / 19
Take the updated belief from the previous step as our new prior, i.e., P(Y’) = P(Y | X1). Now, pay attention: the sum of the likelihood-prior products is in fact not necessarily P(X2), but P(X2 | X1), because from the previous step we implicitly assume that the given shape “has 4 corners”. Only if both updates are independent of each other, so that knowing the shape “has 4 corners” tells us nothing about its “symmetry along the x-axis”, do we get P(X2 | X1) = P(X2). Then, for each Y’, the updated belief P(Y’ | X2) looks like:
- circle? 3 / 20
- diamond? 14 / 20
- trapezoid? 3 / 20
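The same sketch carries over; feeding the previous posterior back in as the prior reproduces these numbers:

```python
from fractions import Fraction

# New prior P(Y') = P(Y | X1), i.e., the posterior from the previous update
prior = {"circle": Fraction(1, 10), "diamond": Fraction(6, 10), "trapezoid": Fraction(3, 10)}
is_symmetric = {"circle": Fraction(9, 19), "diamond": Fraction(7, 19), "trapezoid": Fraction(3, 19)}

products = {y: is_symmetric[y] * prior[y] for y in prior}
evidence = sum(products.values())  # this is P(X2 | X1), not P(X2) in general
posterior = {y: products[y] / evidence for y in prior}
print(posterior)  # circle: 3/20, diamond: 7/10 (= 14/20), trapezoid: 3/20
```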
Finally, see how these parts interact in the bigger frame:
We can repeat such updates as many times as we like. At the n-th belief update P(Xn = “another piece of information” | Y), we expect the normalization factor P(Xn | X1, …, X{n-1}) and the posterior P(Y | X1, …, Xn). Interesting remarks:
- The product of all evidence terms P(X1), P(X2 | X1), …, P(Xn | X1, …, X{n-1}) equals P(X1, X2, …, Xn) by the chain rule of probability, i.e., the evidence as a joint distribution over all updates.
- Obviously, if a variable Xn matches one of X1, …, X{n-1}, then the total evidence stays the same: P(X1, …, X{n-1}, Xn) = P(X1, …, X{n-1}).
- We assumed that the updates depend solely on the classes (and not on the other updates), so the product of all likelihoods P(X1 | Y), …, P(Xn | Y) can be written as P(X1, …, Xn | Y). This conditional independence is exactly the naive Bayes assumption.
- Equivalently, we could make all updates at once by multiplying the prior P(Y) with the joint likelihood P(X1, …, Xn | Y), then dividing by the evidence P(X1, …, Xn) to get the posterior P(Y | X1, …, Xn), as the sketch below demonstrates.
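Here is a small self-contained check of that last remark, assuming the conditional independence holds; it reproduces the posterior from the two sequential updates in one shot:

```python
from fractions import Fraction

prior = {"circle": Fraction(2, 5), "diamond": Fraction(2, 5), "trapezoid": Fraction(1, 5)}
x1 = {"circle": Fraction(1, 13), "diamond": Fraction(6, 13), "trapezoid": Fraction(6, 13)}
x2 = {"circle": Fraction(9, 19), "diamond": Fraction(7, 19), "trapezoid": Fraction(3, 19)}

# Joint likelihood under conditional independence: P(X1, X2 | Y) = P(X1 | Y) * P(X2 | Y)
products = {y: x1[y] * x2[y] * prior[y] for y in prior}
evidence = sum(products.values())  # P(X1, X2)
posterior = {y: products[y] / evidence for y in prior}
print(posterior)  # circle: 3/20, diamond: 7/10, trapezoid: 3/20, same as before
```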
Comparing Beliefs using MAP and MLE
If we are interested in whether a class Y1 or another class Y2 is more likely, we can compare the posteriors P(Y1 | X1, …, Xn) ?=? P(Y2 | X1, …, Xn). In this simple comparison, the evidence cancels out, since it is constant with respect to Y.
So we’re left with a simpler problem: the most plausible hypothesis is the one that maximizes the likelihood-prior product P(X1, …, Xn | Y) · P(Y). In fact, this is the fundamental idea behind Maximum A-Posteriori (MAP) estimation!
Moreover, each prior distribution yields an entropy (~ a measure of uncertainty), defined as H(p) = -Σ p(i) log p(i), i.e., the expected number of bits (or nats, if the natural logarithm is used) for sending a message over a channel from A to B. The less we know beforehand, the higher the entropy of the prior we should choose; this is known as the principle of maximum entropy.
For example, we started with the initial belief that the shapes are distributed as p = [2 / 5, 2 / 5, 1 / 5], which yields an entropy H(p) ≈ 1.05 (in nats). In the extreme case, assuming another prior p = [1, 0, 0] with all shapes assumed to be circles, we get an entropy H(p) = 0. If we assumed all shapes are equally likely (p = [1 / 3, 1 / 3, 1 / 3]), which happens to be the most educated guess if we don’t know anything, we obtain the highest entropy possible: H(p) ≈ 1.10. For more intuition behind entropy, you can refer to my article below:
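These values are easy to verify; a short sketch using the natural logarithm:

```python
import math

def entropy(p):
    """Shannon entropy in nats; terms with p(i) = 0 contribute nothing."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

print(entropy([2/5, 2/5, 1/5]))    # ~1.0549, our initial belief
print(entropy([1, 0, 0]))          # 0.0, total certainty
print(entropy([1/3, 1/3, 1/3]))    # ~1.0986, the uniform prior maximizes entropy
```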
So, if we start from an initial belief where we don’t know anything, we have to assume the classes are uniformly distributed (with P(Y1) = P(Y2) in case of a likelihood-prior product comparison), and thus the prior cancels out in the comparison: P(X1, …, Xn | Y1) ?=? P(X1, …, Xn | Y2). This approach is called Maximum Likelihood Estimation (MLE). I rather think of MLE as a special case of MAP with a uniform prior, whereas MAP lets us start from an arbitrary prior belief. Without further ado, let’s dive into a very well-known use case of MLE.
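As a side-by-side illustration on our shape example (again a sketch, reusing the fractions from above), both decision rules pick the same class here, though a strongly skewed prior could make them disagree:

```python
from fractions import Fraction

prior = {"circle": Fraction(2, 5), "diamond": Fraction(2, 5), "trapezoid": Fraction(1, 5)}
x1 = {"circle": Fraction(1, 13), "diamond": Fraction(6, 13), "trapezoid": Fraction(6, 13)}
x2 = {"circle": Fraction(9, 19), "diamond": Fraction(7, 19), "trapezoid": Fraction(3, 19)}

# MAP: maximize the likelihood-prior product (the evidence cancels out)
map_scores = {y: x1[y] * x2[y] * prior[y] for y in prior}
print(max(map_scores, key=map_scores.get))  # 'diamond'

# MLE: the uniform prior cancels as well, so maximize the likelihood alone
mle_scores = {y: x1[y] * x2[y] for y in prior}
print(max(mle_scores, key=mle_scores.get))  # 'diamond'
```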
MLE Use Case
A Bernoulli experiment is a randomized experiment that can be characterized by a (perfectly) binary roulette, where the player repeatedly bets on red or black. The strong assumption made here is that the probability that the next outcome is red (or black) remains constant over time: p (or 1 - p). It can be written as a general likelihood function P(Y = y), where y ∈ {0, 1} indicates a red occurrence: P(Y = y) = p^y · (1 - p)^(1 - y)
If we had to repeat this experiment for m games, we would get P(Y = y1) * … * P(Y = ym) = ∏ P(Y = yi). For example, the likelihood of “first red twice, then black once” is p * p * (1 - p) with y1 = y2 = 1 and y3 = 0. So, can we somehow choose p to maximize the probability ∏ P(Y = yi)?
In nearly all typical use cases, the likelihood function is a product of success probabilities (up to a constant factor). Each success probability lies in the range [0, 1], so with each multiplication the value gets closer to zero, which leads to (possible) numerical instability issues: computers cannot process (astronomically small) floating-point numbers that fall below machine precision, so the product eventually underflows to zero.
So, a better idea is to calculate the logarithm of the product of single likelihoods, which translates the whole maximization problem into a sum of log-likelihoods. Equivalently, we can negate the whole term and get the negative log-likelihood, which then reads as an error: -Σ [yi * log p + (1 - yi) * log(1 - p)]. Welcome, binary cross-entropy (BCE)!
The term is differentiable with respect to p, so we can find the minimum of BCE by setting its first derivative equal to zero: d/dp BCE(p) = -Σ [yi / p - (1 - yi) / (1 - p)] = 0, which solves to p = (1 / m) Σ yi. Finally, we find out that the probability that “red wins” is best estimated by the empirical average of red occurrences. Logically!
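A quick numerical sanity check (pure Python, with a crude grid search standing in for a proper optimizer) confirms the closed-form result:

```python
import math

# Observed outcomes from the example above: red, red, black
y = [1, 1, 0]

def bce(p, ys):
    """Negative log-likelihood (binary cross-entropy) of a Bernoulli(p) model."""
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in ys)

# Minimize over a fine grid of p values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = min(grid, key=lambda p: bce(p, y))

print(p_hat)            # 0.667, the grid point closest to the true minimum
print(sum(y) / len(y))  # 0.666..., the empirical average, as derived above
```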
Discussion
This article aims to help you understand how a belief update works in general, and how we can find the most believable hypothesis by comparing the posteriors, which leads to the discovery of MAP and MLE (the latter with the additional assumption that we start without knowing anything). Additionally, I wanted to quickly demonstrate the power of MLE using a very simple case, the Bernoulli experiment, where we intend to estimate the success probability of an event.
So we know how to estimate a likelihood distribution using well-known differentiable parametric estimation methods, but we haven’t yet discussed how the prior and posterior can be estimated (given a likelihood function). Feel free to navigate to part 3 of my series: