1. Background
Learning about the restricted Boltzmann machine (RBM) is not easy for people who don’t have much background in statistics. Even for people who have learned basic machine learning/data mining, it is not very straightforward. The students in my Neural Network and Deep Learning class struggled quite a bit in understanding the material.
2. Resources
There are many excellent online resources that introduce RBM. Among them, I found the following two to be the most useful for preparing my lecture:
Fischer and Igel’s tutorial on RBM:
http://image.diku.dk/igel/paper/AItRBM-proof.pdf
It is very thorough and provides the necessary background.
Ghodsi’s lecture video and slides:
https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/dbn2.pdf
The lecture slides provide a clean and concise summary of the necessary equations, and the lecture video is very clear. However, several equations were not sufficiently explained due to time constraints.
To me, they are the best resources available for learning RBM.
3. Big picture
The purpose of an RBM is to model your training data’s probability distribution. For high-dimensional data, a histogram-like non-parametric representation is not tractable, so we need a parameterized generative model to represent the data’s probability distribution. For people who are familiar with machine learning, you might think of the expectation-maximization (EM) algorithm, which deals with data that can be assumed to come from a mixture of distributions (Gaussian in most cases). For people who are familiar with basic statistics, you might think of maximum-likelihood estimation (MLE). When you learn it for the first time in college, the application probably was to estimate the mean and variance of a Gaussian distribution given a set of data. However, MLE is much more general: the concept is to search for the set of distribution model parameters that maximizes the likelihood function (the probability of those data given the parameter values). The maximizing distribution is the best model of the data, since it agrees the most with the data. RBM is an approach that performs MLE, especially for cases where you don’t know the formula of the distribution.
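To make the MLE idea concrete, here is a minimal sketch in Python (my own illustration, assuming a Gaussian model and arbitrary example data): maximizing the likelihood over the mean and variance yields the familiar sample estimates.

```python
import numpy as np

# MLE for a Gaussian: the closed-form maximizers of the log likelihood
# are the sample mean and the (biased) sample variance.
data = np.array([2.1, 1.9, 2.4, 2.0, 1.6])  # arbitrary example data

mu_mle = data.mean()                      # argmax over the mean
var_mle = ((data - mu_mle) ** 2).mean()   # argmax over the variance

print(mu_mle, var_mle)
```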
4. Probability distribution representation
Well, if you don’t know the formula of the distribution model, but you don’t want to use bean counting for high-dimensional data, RBM is your tool. First, you generate a few hidden/latent variables from thin air; then you say the probability distribution of your data is a function of those hidden variables. Since they are latent variables, they can be anything, so your claim is always true. For an RBM, the hidden variables are tied to the data, so you can learn them. Now you have your parameterized probability distribution representation (at least potentially).
5. Construct RBM
For now, we only deal with binary data. Each data point $\mathbf{v}$ in a set of training data $S$ is in a $D$-dimensional space, and the value in each dimension can only be $0$ or $1$. For example, we may have a dataset $S$ that has four three-dimensional data points, such as $(1,0,1)$, $(0,1,1)$, $(1,1,0)$, and $(0,0,1)$. To construct an RBM to estimate the probability density function of the dataset $S$, we would need three visible nodes corresponding to the three dimensions. Assume we have two hidden/latent variables; then we would need two hidden nodes. The nodes of the visible layer and the hidden layer are fully connected across the two layers to form a graph called an RBM. It is a typical complete bipartite graph, as shown below.
To be more precise, there are weights ($w_{ij}$) associated with the edges and biases ($b_i$ and $c_j$) associated with the visible and hidden nodes, respectively. The weights and the biases are the parameters for the probability density model $p(\mathbf{v}; \theta)$, where $\theta = \{W, \mathbf{b}, \mathbf{c}\}$.
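As a concrete sketch of this construction in NumPy (the initial values below are placeholders I chose; any small random weights would do):

```python
import numpy as np

# A tiny RBM matching the example: D = 3 visible nodes, H = 2 hidden nodes.
D, H = 3, 2

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D, H))  # weights w_ij on the bipartite edges
b = np.zeros(D)                         # visible biases b_i
c = np.zeros(H)                         # hidden biases c_j

# The binary training set S: four three-dimensional 0/1 data points.
S = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
```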
The workflow of this structure is two-directional. Given $\mathbf{v}$, the RBM computes the probability of $\mathbf{h}$. For example, given $\mathbf{v} = (1, 0, 1)$, the RBM computes the probabilities $p(h_1 = 1 \mid \mathbf{v})$ and $p(h_2 = 1 \mid \mathbf{v})$. Through the other direction, given $\mathbf{h}$, the RBM computes the probability of $\mathbf{v}$. For example, given $\mathbf{h} = (1, 0)$, the RBM computes the probabilities $p(v_1 = 1 \mid \mathbf{h})$, $p(v_2 = 1 \mid \mathbf{h})$, and $p(v_3 = 1 \mid \mathbf{h})$.
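These conditionals factorize over the nodes, and (as derived in the Fischer and Igel tutorial) each one is a logistic sigmoid of the node’s weighted input plus its bias. A minimal sketch, reusing placeholder parameters like those in the construction above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Placeholder parameters for the 3-visible / 2-hidden example RBM.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))  # weights w_ij
b = np.zeros(3)                          # visible biases b_i
c = np.zeros(2)                          # hidden biases c_j

v = np.array([1, 0, 1])
p_h_given_v = sigmoid(c + v @ W)  # p(h_j = 1 | v) for j = 1, 2

h = np.array([1, 0])
p_v_given_h = sigmoid(b + W @ h)  # p(v_i = 1 | h) for i = 1, 2, 3

print(p_h_given_v, p_v_given_h)
```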
There are many ways to use a trained RBM. People stack them together to form a deep network called a deep belief network (DBN). An RBM can also be used by itself to remove noise, obtain features, deal with missing data, or generate new data points. For some of these applications it is very similar to an autoencoder.
6. Train RBM
To train an RBM, as with any other optimization problem, the routine is to define a loss or goal function, compute its gradient, and then use gradient descent/ascent to find the right coefficients.
Formulate goal function
Since the goal is to obtain the training data’s probability distribution, the natural goal formulation is the maximum likelihood over all data points:

$\max_{\theta} \prod_{\mathbf{v} \in S} p(\mathbf{v}; \theta)$,

where $S$ represents the training data set and $\mathbf{v}$ is one data point in the data set. So the goal function is $\prod_{\mathbf{v} \in S} p(\mathbf{v}; \theta)$. With log likelihood, it becomes

$\max_{\theta} \sum_{\mathbf{v} \in S} \ln p(\mathbf{v}; \theta)$.
$p(\mathbf{v}; \theta)$ can be computed by marginalizing over $\mathbf{h}$:

$p(\mathbf{v}; \theta) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}; \theta)$.
Since an RBM is a probability network, the joint probability is defined as

$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}$,

where $E(\mathbf{v}, \mathbf{h})$ is the network’s global energy function and

$Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$

is a partition function that normalizes the total probability to 1.
From Hopfield nets, the global energy function is defined as

$E(\mathbf{v}, \mathbf{h}) = -\sum_{i} b_i v_i - \sum_{j} c_j h_j - \sum_{i} \sum_{j} v_i w_{ij} h_j$,

where $b_i$ and $c_j$ are the biases and $w_{ij}$ are the weights of the network.
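This energy function translates directly into code; a minimal sketch, with variable names following the symbols above:

```python
import numpy as np

def energy(v, h, W, b, c):
    # E(v, h) = -sum_i b_i*v_i - sum_j c_j*h_j - sum_ij v_i*w_ij*h_j
    return -(b @ v) - (c @ h) - (v @ W @ h)
```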
Now, the maximum log likelihood can be computed as the following:

$\ln p(\mathbf{v}; \theta) = \ln \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \ln \sum_{\mathbf{v}', \mathbf{h}} e^{-E(\mathbf{v}', \mathbf{h})}$ (1)
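For a model as small as the 3-visible / 2-hidden example, both sums in equation (1) can be evaluated exactly by enumerating every binary configuration. A brute-force sketch (feasible only because the model is tiny; practical RBM training approximates these sums instead):

```python
import numpy as np
from itertools import product

def energy(v, h, W, b, c):
    return -(b @ v) - (c @ h) - (v @ W @ h)

def log_likelihood(v, W, b, c):
    D, H = W.shape
    hs = [np.array(h) for h in product([0, 1], repeat=H)]
    vs = [np.array(x) for x in product([0, 1], repeat=D)]
    # First term of (1): ln sum_h exp(-E(v, h))
    first = np.log(sum(np.exp(-energy(v, h, W, b, c)) for h in hs))
    # Second term of (1): ln Z = ln sum_{v', h} exp(-E(v', h))
    log_Z = np.log(sum(np.exp(-energy(x, h, W, b, c)) for x in vs for h in hs))
    return first - log_Z

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
print(log_likelihood(np.array([1, 0, 1]), W, np.zeros(3), np.zeros(2)))
```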
Compute the derivative of the goal function (for this case),