Theory PDG Notes: MLE, MAP, Conjugate Prior, Integrating over Parameters, Tractability, Bayesian, and Probability Theory
There are several terms in probability theory that keep popping up in our machine learning paper discussions. Michael Hefenbrock and I sat down to get the hang of what exactly is behind a few of these terms. This blog contains the notes we took during our long, late-night “theory PDG” sessions. Big shout out to Michael (a.k.a. mh) for sharing his knowledge and being incredibly good at explaining things all the way down to rock bottom.
Probability Theory Basics
There are two rules in probability theory that are relevant in the following. Everything else is derived from them:
Rule I: $\displaystyle p(a)=\int p(a,b)\,db$
Rule II: $\displaystyle p(a\mid b)=\frac{p(a,b)}{p(b)}$
Bayes' rule is the first derivation, based on Rule II:
Bayes Rule: $\displaystyle p(a\mid b)=\frac{p(b\mid a)\cdot p(a)}{p(b)}$
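As a tiny sanity check, both rules (and Bayes' rule) can be verified on a discrete toy joint distribution. This is just an illustrative sketch of ours; the table values are arbitrary:

```python
import numpy as np

# Toy joint distribution p(a, b) over a, b in {0, 1}; rows index a, columns b.
p_ab = np.array([[0.1, 0.3],
                 [0.2, 0.4]])

p_a = p_ab.sum(axis=1)  # Rule I: p(a) = sum_b p(a, b)
p_b = p_ab.sum(axis=0)  # Rule I: p(b) = sum_a p(a, b)

p_a_given_b = p_ab / p_b           # Rule II: p(a|b) = p(a, b) / p(b)
p_b_given_a = p_ab / p_a[:, None]  # Rule II: p(b|a) = p(a, b) / p(a)

# Bayes' rule: p(a|b) = p(b|a) * p(a) / p(b)
bayes = p_b_given_a * p_a[:, None] / p_b
print(np.allclose(p_a_given_b, bayes))  # True
```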
Maximum Likelihood Estimation (MLE)
Method for estimating the parameter(s) of a distribution.
We are given a set of points $\mathcal{D}=\{y_n\}_{n=1}^N$. We choose to model the points with a normal distribution $\mathcal{N}(\mu,\sigma)$ so we can assess the probability of a new point $y$ as $p(y;\mu,\sigma)=\mathcal{N}(y;\mu,\sigma)$. The goal is to be able to compute $p(y;\mu,\sigma)$ for any $y$. To do that we need to estimate $\mu$ and $\sigma$ (the parameters of the normal distribution), and that is what we use MLE for.
Note that the normal distribution is our modeling choice. The underlying data-generating distribution might very well be something else, that we do not know about.
The data is all the information we have, so we want $p(\mathcal{D};\mu,\sigma)$ to be high (hence the name “maximum likelihood”). We assume the data is i.i.d. given the parameters $\mu$ and $\sigma$. We can therefore decompose $p(\mathcal{D};\mu,\sigma)$ into a product over all data points:
$$p(\mathcal{D};\mu,\sigma)=\prod_{n=1}^Np(y_n;\mu,\sigma)\,.$$
What we actually look for are the parameters. We are interested in the particular set of parameters for which $p(\mathcal{D};\mu,\sigma)$ is maximized. Let $\mu^\star$ and $\sigma^\star$ denote this maximizer. We search for
\begin{align}
\mu^\star=&\arg\max_\mu p(\mathcal{D};\mu,\sigma)\\
=&\arg\max_\mu\prod_{n=1}^Np(y_n;\mu,\sigma)\\
=&\arg\max_\mu\log\prod_{n=1}^Np(y_n;\mu,\sigma)\\
=&\arg\max_\mu\sum_{n=1}^N\log p(y_n;\mu,\sigma)
\end{align}
We can apply a $\log$ to the product because the $\log$ is strictly increasing and therefore does not change the maximizer. Dealing with a sum rather than a product is, however, much more convenient.
The same formulation can be written down for the parameter $\sigma^\star$, to find its optimal value.
In order to actually get a value for $\mu^\star$ we replace $p(y_n;\mu,\sigma)$ with the definition of our model. Above, we chose a normal distribution, so we plug in the definition of its density:
\begin{align}
\mu^\star=&\arg\max_\mu\sum_{n=1}^N\log p(y_n;\mu,\sigma)\\
=&\arg\max_\mu\sum_{n=1}^N\log \left({\displaystyle {\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {1}{2}}\left({\frac {y_n-\mu }{\sigma }}\right)^{2}}}\right)
\end{align}
We can then compute the derivative with respect to $\mu$ and search for its root.
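Concretely, the $\log$ splits the normal density into a constant and a quadratic term, so setting the derivative to zero gives
$$\frac{\partial}{\partial\mu}\sum_{n=1}^N\left(-\log\left(\sigma\sqrt{2\pi}\right)-\frac{(y_n-\mu)^2}{2\sigma^2}\right)=\sum_{n=1}^N\frac{y_n-\mu}{\sigma^2}\overset{!}{=}0\quad\Longrightarrow\quad N\mu=\sum_{n=1}^Ny_n\,.$$
This yields the solution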
$$
\mu^\star=\frac{1}{N}\sum_{n=1}^Ny_n\,,
$$
which is simply the mean of all data points.
Depending on the model there may be a closed-form solution (as above) for the MLE. If not, one can resort to other methods like gradient descent to find the most likely parameter value.
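As a quick numerical sanity check, here is a minimal Python sketch (the toy data and all names are our own illustration): it compares the closed-form MLE, the sample mean, against a direct numerical maximization of the log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=1000)  # toy data; true mu = 2.0

# Closed-form MLE for mu: the sample mean.
mu_closed_form = y.mean()

# Numerical MLE: minimize the negative log-likelihood over mu (sigma fixed).
def neg_log_likelihood(mu, sigma=1.5):
    return -norm.logpdf(y, loc=mu, scale=sigma).sum()

res = minimize_scalar(neg_log_likelihood, bounds=(-10, 10), method="bounded")
print(mu_closed_form, res.x)  # both should be close to 2.0
```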
Maximum A Posteriori (MAP) Estimation
Like MLE, MAP is an approach for estimating the parameters of a distribution; the probabilistic formulation differs slightly, because it includes a prior.
Previously, with MLE, we modeled $p(\mathcal{D};\mu,\sigma)$, that is, the probability of the data given the parameters. We then looked for the parameters for which this probability is the highest. Now we model the probability $p(\mu\mid\mathcal{D})$, that is, the probability of the parameters given the data. Note that $\sigma$ is also a parameter; it’s left out for readability.
Again, analogous to MLE, we seek the argument that maximizes the probability. Mathematically speaking, that is
$$\mu^\star=\arg\max_\mu p(\mu\mid\mathcal{D})\,.$$
Following Bayes' rule,
$$p(\mu\mid\mathcal{D})=\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\,,$$
we can plug the rewritten version of $p(\mu\mid\mathcal{D})$ into the optimization objective from above:
\begin{align}
\mu^\star=&\arg\max_\mu p(\mu\mid\mathcal{D})\\
=&\arg\max_\mu\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\\
=&\arg\max_\mu p(\mathcal{D}\mid\mu)\cdot p(\mu)
\end{align}
Note that the denominator $p(\mathcal{D})$ can be dropped because it does not depend on $\mu$.
Just like we did with MLE, we first expand $p(\mathcal{D}\mid\mu)$ into a product over all data samples available to us, and second, apply a $\log$ to ease the optimization, without affecting the position of the maximum (the best parameter configuration).
\begin{align}
\mu^\star=&\arg\max_\mu p(\mathcal{D}\mid\mu)\cdot p(\mu)\\
=&\arg\max_\mu\prod_{n=1}^Np(y_n\mid\mu)\cdot p(\mu)\\
=&\arg\max_\mu\log\left(\prod_{n=1}^N p(y_n\mid\mu)\cdot p(\mu)\right)\\
=&\arg\max_\mu\sum_{n=1}^N\log p(y_n\mid\mu)+\log p(\mu)\\
\end{align}
This is very similar to MLE, the only difference being a new component in the optimization problem that we need to consider: the probability of the parameters, $p(\mu)$, also known as the prior.
This is where we have to make an explicit assumption, for example by saying our parameter is normally distributed. This is in contrast to MLE, where the corresponding assumption (a flat prior) is made implicitly.
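To make the role of the prior concrete, here is a minimal sketch assuming a normal likelihood with known $\sigma$ and a normal prior $p(\mu)=\mathcal{N}(\mu_0,\tau^2)$; the prior parameters $\mu_0$ and $\tau$ are illustrative choices of ours, not part of the derivation above.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=20)  # few points: the prior still matters

mu0, tau, sigma = 0.0, 0.5, 1.0  # assumed prior mean/std and known likelihood std

def neg_log_posterior(mu):
    # sum of log-likelihoods of the data plus the log-prior of mu
    return -(norm.logpdf(y, loc=mu, scale=sigma).sum()
             + norm.logpdf(mu, loc=mu0, scale=tau))

map_est = minimize_scalar(neg_log_posterior, bounds=(-10, 10), method="bounded").x
mle_est = y.mean()
print(map_est, mle_est)  # MAP is pulled from the MLE toward the prior mean mu0
```

With few data points the prior dominates; as $N$ grows, the MAP estimate approaches the MLE.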
Bayesian Modeling
Instead of just estimating a single value per parameter (like above just a scalar for the mean of our normal distribution), we now model the parameters as random variables themselves.
The parameter $\mu$ stands for a random variable, which is defined by a parameterized distribution we choose.
Terminology
- Parameters: $\mu$
- Data: $\mathcal{D}$
- Prior: $p(\mu)$
- Posterior: generally anything after having seen the data, i.e., $p(\cdot\mid\mathcal{D})$, for example the updated distribution over the parameters $p(\mu\mid\mathcal{D})$
- Likelihood: model of the data given the parameters, i.e., $p(y\mid\mu,\mathcal{D})=p(y\mid\mu)$
- Posterior predictive distribution: $p(y\mid\mathcal{D})$
- Evidence: probability of the data $p(\mathcal{D})$
- Marginal: simplified form in which a random variable was integrated away, e.g., $p(y)$, where $\mu$ was integrated out
Section outline. This section is a little bit nested. Its outline is
- the definition of the goal, namely the posterior predictive distribution $p(y\mid\mathcal{D})$,
- an analysis of its first component, the posterior distribution of the parameters $p(\mu\mid\mathcal{D})$, which itself is a fraction consisting of the two parts
  - numerator $p(\mathcal{D}\mid\mu)\cdot p(\mu)$ with likelihood and prior (expanded in Appendix A), and
  - denominator $p(\mathcal{D})$.
- After drilling into them, we return to the posterior predictive distribution to determine the probability of a data point (with an example in Appendix B).
What we are interested in is modeling the probability of a new (test) data point $y$ given all the information we have, namely the data $\mathcal{D}$. It is called the posterior predictive distribution.
$$
\begin{align}
p(y\mid\mathcal{D})=&\int p(y\mid\mu,\mathcal{D})\cdot p(\mu\mid\mathcal{D})\,d\mu\\
=&\int p(y\mid\mu)\cdot p(\mu\mid\mathcal{D})\,d\mu
\end{align}
$$
We constructed the integral using Rule I. The choice of $\mu$ is in principle arbitrary (one could have integrated over anything), but here it is a helpful one. In the second equality we drop $\mathcal{D}$ from the first term: given the parameter, the data provides no additional information about $y$.
For example: We know the mean and standard deviation of a normal distribution. Gathering additional samples from it will not provide us with additional information.
There are now two components in the integral. The first is the likelihood $p(y\mid\mu)$, and it is something we can compute, because it is our modeling choice for how the $y$s are distributed.
For example: $p(y\mid\mu)=\mathcal{N}(y;\mu,\sigma=1)$, with $\sigma=1$ fixed for brevity.
As for the second part, the so-called posterior distribution of the parameters $p(\mu\mid\mathcal{D})$, we can again apply Bayes' rule as done in MAP above:
$$p(\mu\mid\mathcal{D})=\frac{p(\mathcal{D}\mid\mu)\cdot p(\mu)}{p(\mathcal{D})}\,.$$
This time around we cannot just discard $p(\mathcal{D})$, because we are not looking for the $\arg\max_\mu$ but the actual probability $p(\mu\mid\mathcal{D})$.
There are two difficulties: One is the numerator, one the denominator.
Numerator $p(\mathcal{D}\mid\mu)\cdot p(\mu)$. Because the data is i.i.d. given the parameters, we can break it up into a product of the likelihoods of the individual data samples. (It is called a likelihood because the unknown is the quantity we condition on, $\mu$; if the unknown were the quantity in front, it would be a probability.)
$$p(\mathcal{D}\mid\mu)\cdot p(\mu)=\prod_{n=1}^Np(y_n\mid\mu)\cdot p(\mu)$$
The product of these two distributions can easily turn into a new distribution that is hard to handle or entirely unknown. In some cases, though, for example for two normal distributions, we know what the product is (again normally distributed, up to normalization).
More generally, if the prior is conjugate to the likelihood, the resulting posterior is the same kind of distribution as the prior. Some examples (likelihood and prior → posterior):
- Normal and normal → normal
- Bernoulli and beta → beta
- Multinoulli and Dirichlet → Dirichlet
An exemplary multiplication of Bernoulli likelihood and beta prior is in Appendix A.
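A minimal Python sketch of the Bernoulli-and-beta case with toy data of our own (the per-observation update rule is the multiplication from Appendix A applied once per data point):

```python
from scipy.stats import beta as beta_dist

alpha, b = 1.0, 1.0          # Beta(1, 1) prior, i.e., uniform over mu
data = [1, 0, 1, 1, 0, 1]    # toy Bernoulli observations (coin flips)

# Conjugacy: each observation y adds y to alpha and (1 - y) to beta.
alpha_post = alpha + sum(data)
beta_post = b + len(data) - sum(data)

posterior = beta_dist(alpha_post, beta_post)
print(alpha_post, beta_post, posterior.mean())  # Beta(5, 3), posterior mean 0.625
```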
Denominator $p(\mathcal{D})$. This term is just some constant; it does not depend on $\mu$. So the part that actually determines the shape of $p(\mu\mid\mathcal{D})$ is the numerator.
In the following, $\mathcal{D}:=\{y\}$, i.e., there is only a single data point. Suppose we use a Bernoulli likelihood and a beta prior for modeling the problem. We can reshape $p(\mu\mid\mathcal{D})$ to separate the parts which depend on the parameters from those which don't:
\begin{align}
p(\mu\mid\mathcal{D})=&\frac{\mu^{y}(1-\mu)^{1-y}\cdot\frac{\mu^{\alpha-1}(1-\mu)^{\beta-1}}{B(\alpha,\beta)}}{p(\mathcal{D})}\\
=&\underbrace{\frac{1}{p(\mathcal{D})\cdot B(\alpha,\beta)}}_{\text{numbers}}\cdot\underbrace{\mu^{y}(1-\mu)^{1-y}\cdot\mu^{\alpha-1}(1-\mu)^{\beta-1}}_{\text{depends on }\mu}\\
=&\underbrace{\frac{1}{p(\mathcal{D})\cdot B(\alpha,\beta)}}_{\text{numbers}}\cdot\underbrace{\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}}_{\text{depends on }\mu}
\end{align}
Every probability distribution has to integrate to 1. That helps us determine the unknown component $p(\mathcal{D})$ of the term above.
\begin{align}
\int p(\mu\mid\mathcal{D})\,d\mu=&1\\
\int \frac{1}{p(\mathcal{D})\cdot B(\alpha,\beta)}\cdot\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu=&1\\
\frac{1}{p(\mathcal{D})\cdot B(\alpha,\beta)}\cdot\int\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu=&1\\
\int\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu=&\underbrace{p(\mathcal{D})}_\text{unknown}\cdot B(\alpha,\beta)\\
\int\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu=&B(\alpha+y,\beta+1-y)
\end{align}
The last equality holds because we know the left side is the integral of an unnormalized $\operatorname{Beta}(\alpha+y,\beta+1-y)$ density, which therefore equals the corresponding normalization constant $B(\alpha+y,\beta+1-y)$. And $p(\mathcal{D})$ is just a part of it.
$$p(\mathcal{D})=\frac{B(\alpha+y,\beta+1-y)}{B(\alpha,\beta)}$$
This is an example of computing $p(\mathcal{D})$. It is generally difficult and often intractable.
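Here is a quick numerical check of this formula (a sketch; $\alpha$, $\beta$, and $y$ are arbitrary illustrative values): integrate the unnormalized numerator over $\mu$ and compare with $B(\alpha+y,\beta+1-y)/B(\alpha,\beta)$.

```python
from scipy.integrate import quad
from scipy.special import beta as B

a, b, y = 2.0, 3.0, 1  # illustrative prior parameters alpha, beta and one data point

# p(D) = integral over mu of likelihood(y | mu) * prior(mu)  (Rule I on p(D, mu))
def numerator(mu):
    likelihood = mu**y * (1 - mu)**(1 - y)
    prior = mu**(a - 1) * (1 - mu)**(b - 1) / B(a, b)
    return likelihood * prior

p_data, _ = quad(numerator, 0.0, 1.0)
print(p_data, B(a + y, b + 1 - y) / B(a, b))  # both approx. 0.4
```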
Now we can jump back to the posterior predictive distribution, where we aim to determine the probability of a test point $y^*$ given the data. The above paragraphs helped determine the $p(\mu\mid\mathcal{D})$ part of it, and the likelihood is given by the model, which is our choice and easy to evaluate. Appendix B contains an example with the modeling choice $p(y\mid\mu)=\operatorname{Ber}(\mu)=\mu^{y}(1-\mu)^{1-y}$.
The Appendix examples are for a conjugate prior (Bernoulli likelihood and beta prior) for which a closed-form solution is available. In cases where there is no closed-form solution, one can solve
$$p(y^*\mid\mathcal{D})=\int p(y^*\mid\mu)\cdot p(\mu\mid\mathcal{D})\,d\mu=\mathbb{E}_{p(\mu\mid\mathcal{D})}\left[p(y^*\mid\mu)\right]$$
by sampling from $p(\mu\mid\mathcal{D})$ to approximate the expectation (Monte Carlo). When sampling, the exact computation of $p(\mathcal{D})$ is conveniently no longer required.
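A minimal Monte Carlo sketch for the Bernoulli-and-beta example from above (here we can sample the Beta posterior directly; in the general case one would obtain posterior samples with, e.g., MCMC, which likewise never needs $p(\mathcal{D})$):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, y = 2.0, 3.0, 1  # prior parameters alpha, beta and one observed point

# By conjugacy, p(mu | D) = Beta(a + y, b + 1 - y); draw samples from it.
mu_samples = rng.beta(a + y, b + 1 - y, size=100_000)

# p(y* = 1 | D) = E_{p(mu | D)}[mu], approximated by the sample average.
p_ystar_is_1 = mu_samples.mean()
print(p_ystar_is_1)  # approx. 0.5, the mean of Beta(3, 3)
```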
Appendix
A. Example of Conjugate Likelihood and Prior
Our prior is a beta distribution, e.g., $\operatorname{Beta}(1,1)$ (i.e., $\alpha=\beta=1$, the uniform distribution): $$p(\mu)=\operatorname{Beta}(\alpha,\beta)=\frac{\mu^{\alpha-1}(1-\mu)^{\beta-1}}{B(\alpha,\beta)}$$
with ${\displaystyle B (\alpha ,\beta )={\frac {\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}}}$.
The likelihood is $$p(y\mid\mu)=\operatorname{Ber}(\mu)=\mu^{y}(1-\mu)^{1-y}$$
This gives the numerator
$$\mu^{y}(1-\mu)^{1-y}\cdot\frac{\mu^{\alpha-1}(1-\mu)^{\beta-1}}{B(\alpha,\beta)}\,.$$
For a data point $y=1$ it is
\begin{align}
&\mu\cdot\frac{\mu^{\alpha-1}(1-\mu)^{\beta-1}}{B(\alpha,\beta)}\\
=&\frac{1}{B(\alpha,\beta)}\cdot\mu\cdot\mu^{\alpha-1}(1-\mu)^{\beta-1}\\
=&\frac{1}{B(\alpha,\beta)}\cdot\mu^{\underbrace{(\alpha+1)}_{\text{new }\alpha}-1}(1-\mu)^{\beta-1}\\
\end{align}
It is apparent how the multiplication with the Bernoulli likelihood did not change the distribution type. We still have a beta distribution, and its parameter $\alpha$ has changed slightly to become $\alpha+1$. Had our data point been $y=0$, $\alpha$ would have stayed the same, whereas $\beta$ would have been incremented.
B. Computing the Posterior Predictive Distribution
The modeling choice is $p(y\mid\mu)=\operatorname{Ber}(\mu)=\mu^{y}(1-\mu)^{1-y}$.
$$
\begin{align}
p(y^*\mid\mathcal{D})=&\int p(y^*\mid\mu)\cdot p(\mu\mid\mathcal{D})\,d\mu=\mathbb{E}_{p(\mu\mid\mathcal{D})}\left[p(y^*\mid\mu)\right]\\
=&\int p(y^*\mid\mu)\cdot\frac{1}{p(\mathcal{D})\cdot B(\alpha,\beta)}\cdot\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu\\
=&\int p(y^*\mid\mu)\cdot\frac{1}{B(\alpha+y,\beta+1-y)}\cdot\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu\\
=&\int \mu^{y^*}(1-\mu)^{1-y^*}\cdot\frac{1}{B(\alpha+y,\beta+1-y)}\cdot\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu\\
=&\frac{1}{B(\alpha+y,\beta+1-y)}\cdot\int \mu^{y^*}(1-\mu)^{1-y^*}\cdot\mu^{(\alpha+y)-1}(1-\mu)^{(\beta+1-y)-1}\,d\mu\\
=&\frac{1}{B(\alpha+y,\beta+1-y)}\cdot\int \mu^{(\alpha+y+y^*)-1}(1-\mu)^{(\beta+1-y+1-y^*)-1}\,d\mu\\
=&\frac{1}{B(\alpha+y,\beta+1-y)}\cdot B(\alpha+y+y^*,\beta+1-y+1-y^*)\\
=&\frac{B(\alpha+y+y^*,\beta+1-y+1-y^*)}{B(\alpha+y,\beta+1-y)}
\end{align}
$$
$$
\begin{align}
p(y^*\mid\mathcal{D})=&\begin{cases}
\underbrace{\frac{B(\alpha+y,\beta-y+2)}{B(\alpha+y,\beta+1-y)}}_{1-\mu?} & \text{if }y^*=0\\
\underbrace{\frac{B(\alpha+y+1,\beta+1-y)}{B(\alpha+y,\beta+1-y)}}_{\mu?} & \text{if }y^*=1
\end{cases}\\
=&\operatorname{Ber}\left(\frac{B(\alpha+y+1,\beta+1-y)}{B(\alpha+y,\beta+1-y)}\right)
\end{align}
$$
Note: Here $y$ is the data that we learned from (a single point).
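In hindsight, there is a shortcut for the underbraced hunches: with the Beta-function recurrence $B(a+1,b)=\frac{a}{a+b}\,B(a,b)$, the Bernoulli parameter simplifies to the posterior mean of $\mu$,
$$p(y^*=1\mid\mathcal{D})=\frac{B(\alpha+y+1,\beta+1-y)}{B(\alpha+y,\beta+1-y)}=\frac{\alpha+y}{\alpha+\beta+1}=\mathbb{E}\left[\mu\mid\mathcal{D}\right]\,,$$
and analogously $p(y^*=0\mid\mathcal{D})=\frac{\beta+1-y}{\alpha+\beta+1}$, so the two cases sum to one. The late-night verification below arrives at the same conclusion the long way.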
At this point we almost gave up. What we hope for here (and successfully validate in the following) is that the two cases sum to one:
$$
\begin{align}
\frac{B(\alpha+y,\beta-y+2)}{B(\alpha+y,\beta-y+1)}+\frac{B(\alpha+y+1,\beta-y+1)}{B(\alpha+y,\beta-y+1)}\overset{?}{=}&1\\\\
\frac{B(\alpha+y,\beta-y+2)+B(\alpha+y+1,\beta-y+1)}{B(\alpha+y,\beta-y+1)}\overset{?}{=}&1\\
\end{align}
$$
With the definition of $B$
$$ {\displaystyle \mathrm {B} (\alpha ,\beta )={\frac {\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}}} $$
we get the fraction
$$
\begin{align}
\frac{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+2)}{\Gamma(\alpha+y+\beta-y+2)}+\frac{\Gamma(\alpha+y+1)\Gamma(\beta-y+1)}{\Gamma(\alpha+y+1+\beta-y+1)}}{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+1)}{\Gamma(\alpha+y+\beta-y+1)}}\overset{?}{=}&1\\\\
\frac{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+2)}{\Gamma(\alpha+\beta+2)}+\frac{\Gamma(\alpha+y+1)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+2)}}{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+1)}}\overset{?}{=}&1\\\\
\frac{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+2)+\Gamma(\alpha+y+1)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+2)}}{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+1)}}\overset{?}{=}&1\\
\end{align}
$$
With the definition $\Gamma(n)=(n-1)!$ (for integer $n$) and the recurrence
$$\Gamma (n+1)=\Gamma(n)\cdot n\;(=n!\text{ for integer }n)$$
we continue to pursue our goals at 00:43 am…
$$
\begin{align}
\frac{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+2)+\Gamma(\alpha+y+1)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+2)}}{\frac{\Gamma(\alpha+y)\Gamma(\beta-y+1)}{\Gamma(\alpha+\beta+1)}}\overset{?}{=}&1\\\\
\frac{
(\Gamma(\alpha+y)
\Gamma(\beta-y+2)+\Gamma(\alpha+y+1)\Gamma(\beta-y+1))\cdot\Gamma(\alpha+\beta+1)}
{\Gamma(\alpha+\beta+2)\cdot\Gamma(\alpha+y)\cdot\Gamma(\beta-y+1)}\overset{?}{=}&1\\\\
\frac{
(\Gamma(\alpha+y)
\Gamma(\beta-y+2)+\Gamma(\alpha+y+1)\Gamma(\beta-y+1))}
{(\alpha+\beta+1)\cdot\Gamma(\alpha+y)\cdot\Gamma(\beta-y+1)}\overset{?}{=}&1\\\\
\frac{
(
\Gamma(\beta-y+2)+(\alpha+y)\Gamma(\beta-y+1))}
{(\alpha+\beta+1)\cdot\Gamma(\beta-y+1)}\overset{?}{=}&1\\\\
\frac{\beta-y+1+\alpha+y}{\alpha+\beta+1}\overset{?}{=}&1\\\\
\frac{\beta+1+\alpha}{\alpha+\beta+1}\overset{!}{=}&1\\\\
\end{align}
$$
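For the skeptical (or the tired), the identity can also be checked numerically in a few lines; $\alpha$, $\beta$, and $y$ below are arbitrary test values:

```python
from scipy.special import beta as B

a, b, y = 2.7, 1.3, 1  # arbitrary alpha, beta > 0 and y in {0, 1}

# Check B(a+y, b-y+2) + B(a+y+1, b-y+1) == B(a+y, b-y+1)
lhs = B(a + y, b - y + 2) + B(a + y + 1, b - y + 1)
rhs = B(a + y, b - y + 1)
print(abs(lhs - rhs) < 1e-12)  # True
```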
The feature image of the blog is from Monkey Learn’s Word Cloud generator.