Probability Distributions 1

The Subtle Art of Knowing What You Don’t Know — Part 1: Non-Parametric Density Estimation

Yamac Eren Ay
7 min read · Jul 19, 2024
https://en.wikipedia.org/wiki/Black_box

Functions are essential for modelling real-life mechanisms. Basically, we feed features X into these black boxes f and expect to get the desired labels Y in return, without further questioning what happens under the hood. In the best case, we happen to know such causal, physical mechanisms, for example the temperature drop being proportional to altitude on Earth. But in most cases, we have to (approximately) learn such functions f : X → Y using domain expertise, inferential statistics, machine learning or whatever fancy optimization technique we need.

https://mriquestions.com/softmax.html

Then the question arises: what if we’re not only interested in the input-output mapping, but also in the (un-)certainty of the mapping itself? This leads us to probabilistic learning, where instead of learning deterministic functions, we learn probability distributions, specifically P(Y | X), which can be read as “Given X, what is the probability of Y?”. In other words, learning the probability distribution not only provides the best label estimate given the features but also quantifies the certainty of the predictions.

There are two different ways to estimate a probability distribution:

  1. Non-parametric estimation intends to learn all (!) points of a probability distribution as accurately as possible.
  2. Parametric estimation learns just a few characteristic parameters describing a distribution, which I will cover in my next article.

In this article, I will illustrate a few non-parametric estimation techniques in the case of neural spikes. Very simplistically, neurons are special cells which receive a (weighted) sum of inputs from other neurons (through synapses) and return a binary output, YES (1) or NO (0), in the form of neural spikes. If the input is larger than the threshold, a spike is triggered and a YES response is sent to other neurons; otherwise the neuron remains unchanged and a NO response is sent by default.

Left: Neuron model (khanacademy.org), Right: Action potential (moleculardevices.com)

So, for example, let’s sample a stimulus x (= spike data consisting of 1s and 0s) by observing the output of a neuron over n = 1000 time steps, which looks like this:
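As a minimal sketch of how such a spike train could be generated (the Bernoulli firing model and the rate p = 0.05 are illustrative assumptions, chosen only for this example):

```python
import numpy as np

# Hypothetical example: sample a binary spike train over n = 1000 time steps.
# The Bernoulli firing probability p = 0.05 is an arbitrary assumption.
rng = np.random.default_rng(42)
n, p = 1000, 0.05
spikes = rng.binomial(1, p, size=n)   # 1 = spike (YES), 0 = no spike (NO)

print(spikes[:20])
print(spikes.sum(), "spikes over", n, "time steps")
```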

Even though this distribution looks extremely exact and correct, there is a huge problem: it does not generalize well. In fact, neural spikes are fired probabilistically, and there might be other, equally “exact” spike trains which follow the same ground truth but have spikes at completely different time points.

So we have to look for a potentially smoother distribution, which smartly takes the spikes’ noisy nature into account and estimates the ground-truth. What if we split the data into (time) bins of size k = 10, 50, 200 and calculate the average number of spikes in each bin?
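Here is a minimal sketch of that binning step, reusing the hypothetical spike train from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(42)
spikes = rng.binomial(1, 0.05, size=1000)   # hypothetical spike train, as above

# Split the spike train into disjoint bins of size k and average within each bin.
for k in (10, 50, 200):
    binned = spikes.reshape(-1, k).mean(axis=1)   # works because 1000 is divisible by k
    print(f"k = {k:3d}: {binned.size} bins, first few averages = {binned[:3]}")
```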

As you can see, the choice of bin size plays a key role here. With increasing k, the distribution becomes more and more robust to noise, but at the cost of losing information across bins, so it pays off to search for a good bin size!

Now, more on the information loss. Without counting the occurrences, we would normally have full access to the information of whether a spike occurs before another spike. However, if we split the data into separate bins, the spikes across bins “lose contact”, which makes the whole picture incomplete. To get around this, we could apply the moving average method instead: slide the bin by d steps to the right and calculate the average of each bin. This way, far fewer neighbor relationships are lost!
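A small sketch of this moving-average idea, with window size k and step size d (the helper function below is purely illustrative):

```python
import numpy as np

def moving_average(x, k, d=1):
    """Average x over windows of size k, sliding the window by d steps."""
    starts = range(0, len(x) - k + 1, d)
    return np.array([x[s:s + k].mean() for s in starts])

rng = np.random.default_rng(42)
spikes = rng.binomial(1, 0.05, size=1000)

smooth = moving_average(spikes, k=50, d=1)    # overlapping windows keep neighbors in contact
binned = moving_average(spikes, k=50, d=50)   # d = k recovers the disjoint bins from before
print(smooth.size, binned.size)               # 951 windows vs. 20 bins
```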

This looks nicer, and the precision of the spikes seems to be less affected by the choice of bin size if we pick a step size d = 1. On another note, we can recover the previous approach by simply setting the step size d = k. But this approach gives equal attention to all neighbors of a spike, which might be a problem if the bin size is chosen too large. Is there any way to make the contribution of distant, less related spikes less significant? For that we need a more general approach called convolution, which can express arbitrary weighted summations on the stimulus, including both previous approaches.

Towards Smarter Convolution Filters

Convolution can be defined as “the integral of the product of the two functions after one is reflected about the y-axis and shifted”. In our case, these two functions are the stimulus x and the filter h (both extended to infinitely large domains). Furthermore, the stimulus is an array, so we can also write it as a summation Σ instead.
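A standard way to write this discrete convolution of the stimulus x with the filter h is

(x * h)[t] = \sum_{\tau = -\infty}^{\infty} x[\tau] \, h[t - \tau],

i.e., for every output time t we weight each stimulus value x[τ] by the reflected and shifted filter h[t - τ].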

https://www.quora.com/Why-is-the-sum-of-two-random-variables-a-convolution

As a matter of fact, you can represent the moving average as a convolution with a uniform filter: h(t) = 1 / k if t is contained in the bin of size k, and 0 otherwise. But the main point is: these are just examples of filters, and we could pick any filter which suits our needs and works!
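To check this equivalence concretely (a sketch under the same illustrative assumptions as before, using NumPy's np.convolve):

```python
import numpy as np

rng = np.random.default_rng(42)
spikes = rng.binomial(1, 0.05, size=1000).astype(float)

k = 50
h = np.ones(k) / k            # uniform (box) filter: h(t) = 1/k inside the bin, 0 outside

# Convolving with the box filter reproduces the moving average with step size d = 1.
smoothed = np.convolve(spikes, h, mode="valid")
moving_avg = np.array([spikes[s:s + k].mean() for s in range(len(spikes) - k + 1)])

print(np.allclose(smoothed, moving_avg))   # True
```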

Given any reference time point t, we want to construct the smartest possible filter h, one which is minimally biased. For simplicity, assume that the desired filter has a fixed standard deviation σ. The most educated guess is then to let the filter follow the distribution which maximizes the entropy. For more on entropy, you can read my previous article.
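To spell out why this points towards a bell curve: among all densities p with a fixed variance σ², the differential entropy satisfies

h(p) = -\int p(t) \ln p(t) \, dt \le \tfrac{1}{2} \ln\!\left(2 \pi e \sigma^2\right),

and the bound is attained exactly by the Gaussian N(µ, σ²). So the maximum-entropy filter with fixed σ is already Gaussian-shaped.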

Presumably, there are infinitely many filters which pay more attention to past spikes (see the upper left distribution in the plot below). Symmetrically, there are just as many filters which do the same for future spikes (see the upper right distribution below), so the sum of all these latent distributions must be symmetric and centered at µ = t. The Central Limit Theorem tells us that if you sum (and appropriately normalize) many independent contributions, the result converges to a Gaussian distribution, which looks like a bell curve:

https://www.analyticsvidhya.com/blog/2019/05/statistics-101-introduction-central-limit-theorem/

If you still don’t have a clear picture, I recommend watching a random walk simulation. What do you see at the end?
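If you don't have such a simulation at hand, here is a minimal sketch of one (all parameters are arbitrary choices): sum many independent ±1 steps and look at where the walks end up.

```python
import numpy as np

# A minimal random-walk illustration of the Central Limit Theorem.
# The number of walks and steps are arbitrary choices for this sketch.
rng = np.random.default_rng(0)
n_walks, n_steps = 10_000, 1_000

steps = rng.choice([-1, 1], size=(n_walks, n_steps))   # independent +/-1 steps
endpoints = steps.sum(axis=1)                          # where each walk ends up

# The endpoint distribution is approximately Gaussian with mean 0
# and standard deviation sqrt(n_steps): a bell curve.
print(endpoints.mean(), endpoints.std(), np.sqrt(n_steps))
```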

Gaussian Filter

Basically, if the latent variables shifting to the left and the ones shifting to the right are equally “powerful”, then we expect a symmetric bell-curve distribution with the reference time point as its mean. Below is the Gaussian filter centered at the time point x (i.e., a zero-mean distribution in the offset from x):

Gaussian Kernel
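Written out, the Gaussian kernel of width σ centered at the time point x is

h(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(t - x)^2}{2\sigma^2}\right).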

If you fix the kernel width σ (sigma) to higher values, you will get smoother representations. In the plot below, note that with increasing sigma, the surface graph gets blurrier.
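A sketch of this smoothing step with a hand-built Gaussian kernel (the cut-off at ±4σ is a common convention, chosen here purely for illustration):

```python
import numpy as np

def gaussian_kernel(sigma, truncate=4.0):
    """Discrete zero-mean Gaussian kernel, cut off at +/- truncate * sigma and normalized."""
    radius = int(truncate * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2 * sigma**2))
    return kernel / kernel.sum()

rng = np.random.default_rng(42)
spikes = rng.binomial(1, 0.05, size=1000).astype(float)   # hypothetical spike train

# Larger sigma -> wider kernel -> smoother (blurrier) rate estimate.
for sigma in (2, 10, 50):
    rate = np.convolve(spikes, gaussian_kernel(sigma), mode="same")
    print(f"sigma = {sigma:2d}: peak of the smoothed estimate = {rate.max():.3f}")
```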

How to Compute Convolution Efficiently?

As efficiently computable as it seems for a few possible outcomes, this can be quite tricky when applied to a large time interval, because the number of single computation steps grows quadratically with the number of time points. How can we get around the quadratic complexity O(n * n)?

Assume that the time domain contains mixed signals of different sources from the frequency domain, and we want to decompose the signal into its ingredients. The Fast Fourier Transform (FFT) is the algorithm which performs this decomposition within a time complexity of O(n * log n). The inverse FFT works the same way but in the opposite direction (from frequency to time). Why do we care about the frequency domain?

In fact, convolution in the time domain corresponds to multiplication in the frequency domain, so luckily there is no need to double-loop over the entire time domain! If you want to learn more about convolution and its connection to the Fast Fourier Transform, I definitely recommend this article:
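A minimal sketch of this trick using NumPy's FFT routines (the spike train and the uniform filter are the same illustrative ones as before):

```python
import numpy as np

rng = np.random.default_rng(42)
spikes = rng.binomial(1, 0.05, size=1000).astype(float)   # hypothetical spike train
kernel = np.ones(50) / 50                                  # uniform filter with k = 50

# Direct convolution: roughly O(n * k) multiply-adds.
direct = np.convolve(spikes, kernel, mode="full")

# FFT-based convolution: transform, multiply pointwise, transform back -- O(n log n).
n = len(spikes) + len(kernel) - 1                  # length of the full convolution
via_fft = np.fft.irfft(np.fft.rfft(spikes, n) * np.fft.rfft(kernel, n), n)

print(np.allclose(direct, via_fft))                # True: both give the same result
```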

Discussion

Given random spikes or data points, we could indeed estimate a non-parametric density function using clever tricks, in the following order: first, we counted the spikes in regular time bins; then we switched to the moving average to tackle the loss of neighborhood information; and finally we used convolution filters as the most general approach, finding that the Gaussian filter should work best if we don’t know the underlying distribution.

The main point is: we can approximate the real distribution up to an arbitrary precision, but it does not necessarily generalize well. For making further guesses on unseen data, we might need simple characteristics such as the slope, the area under the curve or the mean. This is the perfect time to end this article and advertise my next one, where I illustrate parametric estimation with a quick detour into probability theory. If you enjoyed this article, I’d be happy to receive your feedback!
