# Bayesian Finite Mixture Models

## Motivation

I have lately been looking at Bayesian modelling, which lets me approach modelling problems from another perspective, especially when it comes to building hierarchical models. I think it will also be useful to approach a problem with both frequentist and Bayesian methods to see how the models perform. These notes are based on Bayesian Analysis with Python, which I highly recommend as a starting book for learning applied Bayesian statistics.

## Mixture Models

In statistics, mixture modelling is a common approach to model building: a more complex model is built by combining simpler distributions. For instance,

• A combination of two Gaussians to describe a bimodal distribution.
• Many Gaussians to describe an arbitrary distribution.

We can use a mixture model for sub-populations or for complicated distributions that cannot be modelled with a single simpler distribution.
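To make the bimodal case concrete, here is a minimal sketch that draws samples from a two-Gaussian mixture with NumPy. The mixing proportion (0.4) and the component parameters are hypothetical values picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 5_000

# Hypothetical sub-populations: 40% from N(0, 1) and 60% from N(6, 1.5).
from_first = rng.random(n) < 0.4
samples = np.where(
    from_first,
    rng.normal(loc=0.0, scale=1.0, size=n),
    rng.normal(loc=6.0, scale=1.5, size=n),
)

# Count samples near each mode and in the trough between them.
near_first_mode = np.sum((samples > -0.5) & (samples < 0.5))
near_second_mode = np.sum((samples > 5.5) & (samples < 6.5))
in_trough = np.sum((samples > 2.0) & (samples < 3.0))
```

The counts near 0 and near 6 should be much larger than the count in between, which is what a histogram of `samples` would show as two peaks.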

## Finite Mixture Models

In finite mixture models, as the name suggests, we mix a known number of models together, with a weight associated with each model. The probability density of the observed data is a weighted sum of the probability densities of K subgroups of the data, where K is the number of models.

$p(y|\theta) = \sum_{i=1}^{K} w_{i}p_{i}(y|\theta_{i})$

Here, $$w_{i}$$ is the weight of each group, and the weights must sum to 1. The components $$p_{i}(y|\theta_{i})$$ can be anything from Gaussian or Poisson distributions all the way to neural networks. We need to fix K in advance, either because we know it beforehand or by making an educated guess.
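The weighted sum can be evaluated directly. The sketch below computes the density of a Gaussian mixture with `scipy.stats`; the weights and component parameters are hypothetical, and `mixture_pdf` is a helper name introduced here for illustration, not part of any library.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical weights and component parameters, chosen only for illustration.
weights = np.array([0.3, 0.5, 0.2])      # w_i, must sum to 1
means = np.array([50.0, 200.0, 300.0])   # component means
sds = np.array([2.0, 100.0, 10.0])       # component standard deviations


def mixture_pdf(y, weights, means, sds):
    """Weighted sum of Gaussian component densities evaluated at y."""
    y = np.atleast_1d(y)[:, None]                              # shape (n, 1)
    component_pdfs = stats.norm(loc=means, scale=sds).pdf(y)   # shape (n, K)
    return component_pdfs @ weights                            # shape (n,)


grid = np.linspace(-400, 800, 20001)
density = mixture_pdf(grid, weights, means, sds)

# Because the weights sum to 1, the mixture is itself a valid density:
# a Riemann sum over a wide enough grid should be close to 1.
area = np.sum(density) * (grid[1] - grid[0])
```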

## Categorical Distribution

Just as we use the Bernoulli distribution to model two outcomes (0 or 1), we can use the Categorical distribution to model K outcomes.
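As a rough illustration, the snippet below draws from a Categorical distribution with NumPy. The probabilities are hypothetical; in a mixture model, a draw like this would indicate which of the K components a data point belongs to.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical outcome probabilities for K = 3 categories; they must sum to 1.
probs = np.array([0.2, 0.5, 0.3])

# Each draw is an integer label in {0, 1, 2}.
draws = rng.choice(len(probs), size=10_000, p=probs)

# For a large sample, the empirical frequencies should be close to probs.
freqs = np.bincount(draws, minlength=len(probs)) / draws.size
```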

## Dirichlet distribution

The Dirichlet distribution is a generalization of the Beta distribution. We use the Beta distribution for two outcomes, one with probability p and the other with probability 1-p, so a draw can be seen as a two-element vector (p, 1-p). To extend this to three outcomes, we use a vector like (p, q, r) whose elements sum to 1. For K outcomes, the Dirichlet distribution is parameterized by a vector $$\alpha$$ of length K and returns probability vectors of length K. For a more intuitive explanation, check out the post Visualizing Dirichlet Distributions.
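A small sketch with NumPy shows both properties: Dirichlet draws are probability vectors that sum to 1, and the two-outcome case matches the Beta distribution. The concentration values used here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Concentration vector alpha of length K = 3.
# All ones gives a uniform distribution over probability vectors.
alpha = np.ones(3)
samples = rng.dirichlet(alpha, size=5)   # shape (5, 3), one vector per row
row_sums = samples.sum(axis=1)           # every row sums to 1

# K = 2: the first coordinate of Dirichlet(a, b) follows Beta(a, b),
# so both sample means should be close to a / (a + b) = 2 / 7.
two_outcomes = rng.dirichlet([2.0, 5.0], size=20_000)
beta_draws = rng.beta(2.0, 5.0, size=20_000)
dirichlet_mean = two_outcomes[:, 0].mean()
beta_mean = beta_draws.mean()
```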

## How to choose K

One of the main concerns with finite mixture models is how to decide on K. Generally, one starts with a low K and increases it gradually, evaluating the model at each step. In Bayesian modelling, we evaluate models using posterior predictive checks and information criteria such as WAIC or LOO.

## Example

Let’s take a look at an example by first generating samples from 3 Gaussian distributions:

    import arviz as az
    import matplotlib.pyplot as plt
    import numpy as np
    import operator
    import pandas as pd
    import pymc3 as pm
    import scipy.stats as stats
    import seaborn as sns

    data_size = 1000
    # Draw samples from three Gaussian sub-populations
    y0 = stats.norm(loc=50, scale=2).rvs(data_size)
    y1 = stats.norm(loc=200, scale=100).rvs(data_size)
    y2 = stats.norm(loc=300, scale=10).rvs(data_size)
    # Pool the samples into one dataset (concatenate, not an element-wise sum)
    y_data = pd.Series(np.concatenate([y0, y1, y2]))

    # Plot each sub-population on its own axis
    # (distplot is deprecated since seaborn 0.11 in favour of histplot/displot)
    fig, ax = plt.subplots(3, 1, figsize=(10, 5))
    sns.distplot(y0, ax=ax[0])
    sns.distplot(y1, ax=ax[1])
    sns.distplot(y2, ax=ax[2])


    # Plot the pooled data
    fig, ax = plt.subplots(figsize=(10, 5))
    sns.distplot(y_data, ax=ax)


Let’s fit models with 2, 3 and 4 clusters:

    clusters = [2, 3, 4]
    models = []
    traces = []
    for cluster in clusters:
        with pm.Model() as model:
            # Mixture weights, one per component
            p = pm.Dirichlet('p', a=np.ones(cluster))
            # Component means, centred on the overall data mean
            means = pm.Normal('means',
                              mu=y_data.mean(),
                              sd=10, shape=cluster)
            # Standard deviation shared by all components
            sd = pm.HalfNormal('sd', sd=100)
            y = pm.NormalMixture('y', w=p, mu=means, sd=sd, observed=y_data)
            trace = pm.sample()
        traces.append(trace)
        models.append(model)


Comparing the WAIC and LOO scores, we can see that the model with 3 clusters has the lowest score.

    # Compare the three models using WAIC
    cmp_df = az.compare({
        "2_clusters": traces[0],
        "3_clusters": traces[1],
        "4_clusters": traces[2]
    }, ic="waic")
    cmp_df


    # Compare the three models using LOO
    cmp_df = az.compare({
        "2_clusters": traces[0],
        "3_clusters": traces[1],
        "4_clusters": traces[2]
    }, ic="loo")
    cmp_df


## Next

In the next post, we will look into non-finite mixture models.

Written on June 22, 2020