Probability Distributions 3

The Subtle Art of Knowing What You Don’t Know — Part 3: A Closer Look at the Likelihood-Prior Relationship

Yamac Eren Ay
6 min read · Jul 19, 2024

In my first article, we talked about some non-parametric density estimation methods and found out that estimating the density from all observed points can be computationally expensive.

In my second article, we tackled this problem by using parametric density estimation methods, which — in exchange for additional assumptions about the underlying distribution — offer better generalization and more efficient computation.

In this article, we finally want to learn how to choose and update a prior distribution for a given likelihood, and wrap everything up.

Let’s say we are given a likelihood function. A prior distribution is conjugate to that likelihood if the prior and the posterior belong to the same family of distributions, call it F. The belief can then be updated simply by adjusting the parameters of F, and these parameters typically act as summary statistics of the history of belief updates. F itself is a distribution over the parameters of the likelihood function, which are the quantities being estimated.

Binomial and Beta Distribution

In order to digest all this, let’s continue the Bernoulli experiment from the previous article, but this time we play multiple (n) games and count the number of red occurrences. So, the likelihood follows a Binomial distribution with the following density function:
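$$P(X = k \mid n, p) = \binom{n}{k}\, p^k\, (1 - p)^{n - k}$$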

The likelihood of one particular sequence with k red occurrences is given by the factor p^k (1 - p)^(n - k), and the binomial coefficient in front counts how many such sequences exist. The Binomial distribution answers how likely it is that exactly k success events happen, given a number of trials n and a success probability p.

Example Binomial Distribution, https://stats.stackexchange.com/questions/176425/why-is-a-binomial-distribution-bell-shaped

Assume that a Binomial likelihood admits a prior (and hence a posterior) following a Beta distribution.

  • Intuitively, this distribution needs two parameters: a and b, tracking the number of past red and black occurrences respectively. So, if the prior follows Beta(P | a, b), then the posterior after k red and n - k black outcomes must follow Beta(P | a + k, b + (n - k)).
  • This distribution must also take the success probability p as input, which is the only variable in the likelihood function to be tested.

As it turns out, the Beta distribution defined below satisfies this condition:
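$$\text{Beta}(p \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, p^{\,a-1} (1 - p)^{\,b-1} = \frac{p^{\,a-1} (1 - p)^{\,b-1}}{B(a, b)}$$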

Notice the correspondence between the Beta and the Binomial distribution? Γ(a) can be regarded as an analytical continuation of (a - 1)!, and B(a, b) acts as a pseudo-binomial coefficient whose normalization turns the term into a proper probability distribution. You might also wonder: why do we take a - 1 and b - 1 rather than a and b?

Because the prior is in fact a way to represent the following information: given a - 1 red and b - 1 black occurrences so far, when does the next red / black occur?
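To see why the update rule holds, multiply the Binomial likelihood by the Beta prior and ignore all factors that do not depend on p:

$$\text{posterior} \;\propto\; \underbrace{p^{k}\,(1-p)^{\,n-k}}_{\text{likelihood}} \;\times\; \underbrace{p^{\,a-1}\,(1-p)^{\,b-1}}_{\text{prior}} \;=\; p^{\,a+k-1}\,(1-p)^{\,b+(n-k)-1}$$

which is exactly the unnormalized density of Beta(P | a + k, b + (n - k)).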

Example Beta Distribution, https://byjus.com/maths/beta-distribution/

On another note, a Beta prior can be used with a Bernoulli likelihood as well, since the Bernoulli distribution is a special case of the Binomial distribution with n = 1. So, one and the same prior can be updated given different types of likelihoods.
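To make the update rule concrete, here is a minimal sketch in Python (assuming scipy; the prior parameters and observed counts are hypothetical values chosen for illustration):

```python
from scipy import stats

# Hypothetical prior: Beta(a=2, b=2), i.e. a - 1 = 1 past red and
# b - 1 = 1 past black occurrence.
a, b = 2, 2

# Hypothetical observation: k = 7 reds out of n = 10 games.
n, k = 10, 7

# Conjugate update: the posterior is again a Beta distribution.
posterior = stats.beta(a + k, b + (n - k))

# Posterior mean estimate of the success probability p.
print(posterior.mean())  # (a + k) / (a + b + n) = 9 / 14 ≈ 0.643
```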

Geometric Distribution

Example Geometric Distribution, https://de.mathworks.com/help/stats/geometric-distribution.html

Can we somehow measure the number of trials between two success events, given a fixed success probability p as in the Bernoulli distribution?

Yes: this is called the Geometric distribution, and it tells how likely it is that the next success happens on exactly the n-th trial, given a success probability p. The formula is pretty self-explanatory:
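$$P(N = n \mid p) = (1 - p)^{\,n-1}\, p$$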

Poisson, Gamma and Exponential Distribution

Now, we consider another likelihood-prior pair, which is crucial in neuroscience: the Poisson distribution as likelihood and the Gamma distribution as prior. As you may recall from the first article, we can count the number of neural spikes in regular time intervals to get a more general estimate of spike rates. But why a Poisson distribution (and not a Binomial distribution)?

Example Poisson Distribution, https://calcworkshop.com/discrete-probability-distribution/poisson-distribution/

Theoretically, if you make the time bins extremely small so that there are infinitely many bins (n), the per-bin success probability (p) of a spike goes to zero as well. Crucially, the expected spike count λ = n · p remains constant even as p approaches zero and n grows without bound, because the time interval τ is fixed. In this limit, the Binomial distribution converges to the Poisson distribution, which is given by the formula below:
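$$P(X = k \mid \lambda) = \frac{\lambda^k\, e^{-\lambda}}{k!}$$

To see this limit at work numerically, here is a minimal sketch (assuming scipy; λ and the bin counts are hypothetical values chosen for illustration):

```python
from scipy import stats

lam = 5.0  # fixed expected spike count over the interval (hypothetical)

# As the number of bins n grows with p = lam / n, Binomial(k | n, p)
# approaches Poisson(k | lam); compare both pmfs at k = 3.
for n in [10, 100, 10_000]:
    p = lam / n
    print(n, stats.binom.pmf(3, n, p), stats.poisson.pmf(3, lam))
```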

Now, we’re looking for a Gamma distribution, which tells us the total time ∆t it takes for k spikes to happen at an instantaneous rate r. The instantaneous rate r is the slope of the expected spike count with respect to time, as given below:
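$$r = \frac{\mathrm{d}\,\mathbb{E}[X(t)]}{\mathrm{d}t} = \frac{\lambda}{\tau}$$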

Remember that Poisson tells us how many spikes we expect in a fixed time interval τ with rate λ, so we can rewrite λ = r τ. With this in mind, let’s construct the Gamma distribution step by step.

To start, imagine the special case of the Gamma distribution with k = 1, which should tell us the time ∆t between consecutive spikes at the same instantaneous rate r.

  • It is easy to show that Poisson(X = 0 | r ∆t) = e^(-r ∆t) is the likelihood that no spike occurs within a time interval of length ∆t at an instantaneous spike rate r.

Multiplying this term by the instantaneous spike rate r yields the Exponential distribution, which is specified by the following formula:
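$$\text{Exponential}(\Delta t \mid r) = r\, e^{-r\, \Delta t}$$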

It is no coincidence that this looks like a continuous analogue of the Geometric distribution: if each trial corresponds to a tiny time bin of width δ with success probability p = r δ, the Geometric probability (1 - p)^(n-1) p approaches the Exponential density r e^(-r ∆t) (up to the bin width δ) as δ → 0 with n δ = ∆t.

Example Exponential Distribution, https://byjus.com/maths/exponential-distribution/

Now, let’s define the Gamma distribution for arbitrary values of k by following the same argument:

  • Poisson(X = k - 1 | r ∆t) tells how likely it is that exactly k - 1 spikes occur within the waiting time ∆t at instantaneous rate r.
  • The k-th spike must then arrive immediately afterwards; over an infinitesimal extra wait, the Exponential(T = ∆t | r) density contributes just its rate factor r.

The Gamma distribution must be the product of both factors, which — after a few intermediate steps — looks like:
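$$\text{Gamma}(\Delta t \mid k, r) = \text{Poisson}(X = k - 1 \mid r\, \Delta t) \cdot r = \frac{r^k\, \Delta t^{\,k-1}\, e^{-r\, \Delta t}}{(k - 1)!}$$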

Now we can use the Gamma distribution as a prior, this time as a distribution over the spike rate λ itself rather than over the waiting time. Given a prior following Gamma(λ | k, r) and a new observation with m spikes in n unit time bins, the posterior follows Gamma(λ | k + m, r + n). All in all, we can represent beliefs and perform belief updates using a Gamma prior whenever the likelihood is Poisson distributed.
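As with the Beta-Binomial pair, the update is a one-liner in code. Here is a minimal sketch (assuming scipy and numpy; the prior parameters, true rate, and bin count are hypothetical values chosen for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical prior over the spike rate λ: Gamma(shape k=2, rate r=1).
k, r = 2.0, 1.0

# Hypothetical observation: spike counts in n unit-width time bins,
# simulated from a Poisson with true rate 4.0.
rng = np.random.default_rng(0)
n = 20
m = int(rng.poisson(lam=4.0, size=n).sum())  # total spike count

# Conjugate update: the posterior is Gamma(k + m, r + n).
# Note: scipy parameterizes the Gamma by shape and scale = 1 / rate.
posterior = stats.gamma(a=k + m, scale=1.0 / (r + n))

print(posterior.mean())  # posterior mean (k + m) / (r + n), near 4.0
```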

Gamma vs. Poisson Distribution, https://bcss.org.my/tut/bayes-with-jags-a-tutorial-for-wildlife-researchers/simple-model-with-one-parameter/

Discussion

I hate to say that I took many detours and explained the content somewhat chaotically. That’s because all of these concepts are densely connected with each other, and it’s nearly impossible to fit them all into one article series. Or maybe it’s because I wrote this entire series in a single day, right before an exam week. There are also topics I didn’t mention that you might find interesting.

I urge you to take a closer look at the corresponding distributions, and explore the hidden connections yourself. It’s in fact more fun than you might imagine. Stay tuned for more content like this, and thank you for your support!
