**1. Background**

Learning the restricted Boltzmann machine (RBM) is not easy for people who don’t have much background in statistics. Even for people who have learned basic machine learning or data mining, it is not very straightforward. The students in my Neural Network and Deep Learning class struggled quite a bit to understand the material.

**2. Resources**

There are many excellent online resources that introduce RBM. Among them, I found the following two were the most useful for preparing my lecture:

Fischer and Igel’s tutorial on RBM:

http://image.diku.dk/igel/paper/AItRBM-proof.pdf

It is very thorough and provides necessary background.

Ghodsi’s lecture video and slides:

https://uwaterloo.ca/data-science/sites/ca.data-science/files/uploads/files/dbn2.pdf

The lecture slides provide a clean and concise summary of the necessary equations. The lecture video is very clear. However, several equations were not sufficiently explained due to time constraints.

To me, they are the best resources I could find for learning RBM.

**3. Big picture**

The purpose of an RBM is to estimate the probability distribution of your training data. For high-dimensional data, a histogram-like non-parametric representation is not tractable, because the number of bins grows exponentially with the number of dimensions. We need a parameterized generative model to represent the data’s probability distribution. If you are familiar with machine learning, you would think of the *expectation-maximization (EM)* algorithm. It deals with data that can be assumed to come from a mixture of distributions (Gaussian in most cases). If you are familiar with basic statistics, you would think of *maximum likelihood estimation (MLE)*. When you first learned it in college, the application was probably to estimate the mean and variance of a Gaussian distribution given a set of data. However, MLE is much more general. The concept is to search for the set of model parameters that maximizes the likelihood function (the *probability* of the data given the parameter values). The model with those parameters is taken as the distribution of the data, since it agrees with the data the most. RBM is an approach that performs MLE, especially when you don’t know the formula of the distribution.
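As a reminder of the textbook case mentioned above, here is a minimal sketch of MLE for a one-dimensional Gaussian. The data values and the function names `gaussian_mle` and `gaussian_log_likelihood` are made up for illustration. The closed-form estimates are the sample mean and the variance with a divisor of n.

```python
import math

def gaussian_mle(data):
    """Return the maximum-likelihood mean and variance of a 1-D Gaussian."""
    n = len(data)
    mean = sum(data) / n
    # The MLE of the variance divides by n (not n - 1).
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, var

def gaussian_log_likelihood(data, mean, var):
    """Log of the probability density of the data under N(mean, var)."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)
        for x in data
    )

data = [2.0, 4.0, 4.0, 6.0]  # hypothetical toy data
mean, var = gaussian_mle(data)  # mean = 4.0, var = 2.0 for this toy data

# Any other parameter values give a lower log likelihood on the same data.
best = gaussian_log_likelihood(data, mean, var)
worse = gaussian_log_likelihood(data, mean + 0.5, var)
```

Here the maximizing parameters have a formula. The rest of this post is about the case where they do not, and the search has to be done numerically.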

**4. Probability distribution representation**

Well, if you don’t know the formula of the distribution model, but you don’t want to use bean counting in a high-dimensional data space, RBM is your tool. First, you generate a few hidden/latent variables out of thin air. Then, you say the probability distribution of your data is a function of those hidden variables. Since they are latent variables, they can be anything, so your claim is always true. For RBM, however, the hidden variables come from the data, so you can learn them. Now you have your parameterized probability distribution representation (at least potentially).

**5. Construct RBM**

For now, we only deal with binary data. Each data point in a set of training data is in a D-dimensional space, and the value in each dimension can only be 0 or 1. For example, we may have a dataset $X$ that has four three-dimensional binary data points. To construct an RBM to estimate the probability density function of the dataset $X$, we would need three visible nodes in $\mathbf{v}$, corresponding to the three dimensions. Assume we have two hidden/latent variables, so we would need two hidden nodes in $\mathbf{h}$. The nodes of the visible layer and the hidden layer are fully connected to form a graph called an RBM. It is a typical complete bipartite graph, as shown below.

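To be more precise, there are weights ($w_{ij}$) associated with the edges and biases ($b_j$ and $c_i$) associated with the visible and hidden nodes, where $i$ indexes the hidden nodes and $j$ indexes the visible nodes. The weights and the biases are the parameters of the probability density model $p(\mathbf{v} \mid \theta)$, where $\theta = \{W, \mathbf{b}, \mathbf{c}\}$.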
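To make the parameter count concrete, here is a minimal sketch that allocates the parameters for the three-visible, two-hidden example. The function name `init_rbm`, the small random Gaussian initialization of the weights, and the layout `W[i][j]` (hidden node `i`, visible node `j`) are assumptions made for illustration, not something prescribed by the text.

```python
import random

def init_rbm(n_visible, n_hidden, scale=0.01, seed=0):
    """Create the weights and biases of an RBM (illustrative layout)."""
    rng = random.Random(seed)
    # One weight per edge of the complete bipartite graph: W[i][j] connects
    # hidden node i to visible node j.
    W = [[rng.gauss(0.0, scale) for _ in range(n_visible)] for _ in range(n_hidden)]
    b = [0.0] * n_visible  # one bias per visible node
    c = [0.0] * n_hidden   # one bias per hidden node
    return W, b, c

W, b, c = init_rbm(n_visible=3, n_hidden=2)

# 3 * 2 edge weights + 3 visible biases + 2 hidden biases = 11 parameters
n_params = sum(len(row) for row in W) + len(b) + len(c)
```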

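The workflow of this structure is bidirectional. Given $\mathbf{v}$, the RBM computes the probability of $\mathbf{h}$. For example, given a visible vector $\mathbf{v}$, the RBM computes the probabilities $p(h_1 = 1 \mid \mathbf{v})$ and $p(h_2 = 1 \mid \mathbf{v})$. In the other direction, given $\mathbf{h}$, the RBM computes the probability of $\mathbf{v}$. For example, given a hidden vector $\mathbf{h}$, the RBM computes the probabilities $p(v_1 = 1 \mid \mathbf{h})$, $p(v_2 = 1 \mid \mathbf{h})$, and $p(v_3 = 1 \mid \mathbf{h})$.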
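For binary units with the standard RBM energy function, both directions reduce to a logistic (sigmoid) function of a weighted sum. The sketch below assumes the layout `W[i][j]` (hidden node `i`, visible node `j`), visible biases `b`, and hidden biases `c`. All parameter values and the input vectors are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_h_given_v(v, W, c):
    """Return [p(h_i = 1 | v)] for every hidden node i."""
    return [
        sigmoid(c[i] + sum(W[i][j] * v[j] for j in range(len(v))))
        for i in range(len(c))
    ]

def p_v_given_h(h, W, b):
    """Return [p(v_j = 1 | h)] for every visible node j."""
    return [
        sigmoid(b[j] + sum(W[i][j] * h[i] for i in range(len(h))))
        for j in range(len(b))
    ]

# Hypothetical parameters for 3 visible and 2 hidden nodes.
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.8, 0.0]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.4]

ph = p_h_given_v([1, 0, 1], W, c)  # two probabilities, one per hidden node
pv = p_v_given_h([1, 0], W, b)     # three probabilities, one per visible node
```

Because there are no edges within a layer, each node's probability can be computed independently of the other nodes in the same layer.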

There are many ways to use a trained RBM. People stack them to form a deep network called a deep belief network (DBN). An RBM can also be used by itself to remove noise, extract features, deal with missing data, or generate new data points. For some applications, it is very similar to an autoencoder.

**6. Train RBM**

To train an RBM, as with any other optimization problem, the routine is to define a loss or goal function and then compute its gradient. Then use gradient descent/ascent to find the right coefficients.
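The routine can be sketched on a much simpler model than an RBM. The example below, which is an illustration and not the RBM training procedure itself, fits a single Bernoulli parameter by gradient ascent on the log likelihood. The data, the learning rate, the number of steps, and all function names are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_likelihood(theta, data):
    """Goal function: log probability of binary data with p(x = 1) = sigmoid(theta)."""
    p = sigmoid(theta)
    return sum(math.log(p) if x == 1 else math.log(1.0 - p) for x in data)

def gradient(theta, data):
    """Derivative of the log likelihood with respect to theta: sum of (x - p)."""
    p = sigmoid(theta)
    return sum(x - p for x in data)

def gradient_ascent(data, learning_rate=0.1, n_steps=2000):
    theta = 0.0
    for _ in range(n_steps):
        # Ascent: step in the direction that increases the goal function.
        theta += learning_rate * gradient(theta, data)
    return theta

data = [1, 1, 1, 0]  # hypothetical binary observations
theta = gradient_ascent(data)
p_hat = sigmoid(theta)  # approaches the fraction of ones in the data
```

The same three pieces (goal function, gradient, update loop) appear in RBM training. The difference is that the RBM gradient is much harder to compute.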

**Formulate goal function**

Since the goal is to obtain the training data’s probability distribution, the natural goal formulation is the maximum likelihood over all data points:

$$\max_{\theta} \prod_{\mathbf{v} \in X} p(\mathbf{v} \mid \theta),$$

where $X$ represents the training data set and $\mathbf{v}$ is one data point in the data set.

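So the goal function is $\prod_{\mathbf{v} \in X} p(\mathbf{v})$. With the log likelihood, it becomes $\sum_{\mathbf{v} \in X} \log p(\mathbf{v})$. $p(\mathbf{v})$ can be computed by marginalizing over $\mathbf{h}$:

$$p(\mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}, \mathbf{h}).$$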

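Since an RBM is a probabilistic network, the joint probability is defined as

$$p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})},$$

where $E(\mathbf{v}, \mathbf{h})$ is the network’s global energy function and $Z = \sum_{\mathbf{v}, \mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}$ is a partition function that normalizes the total probability to 1.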

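From Hopfield nets, the global energy function is defined as

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i} \sum_{j} w_{ij} h_i v_j - \sum_{j} b_j v_j - \sum_{i} c_i h_i,$$

where $w_{ij}$, $b_j$, and $c_i$ are the weights and biases of the network, with $i$ indexing the hidden nodes and $j$ indexing the visible nodes.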
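For a network as small as the three-visible, two-hidden example, the energy, the partition function, and the joint probability can all be computed exactly by enumerating every binary state. The sketch below assumes the common convention of one weight per edge plus one bias per node, stored as `W[i][j]` (hidden node `i`, visible node `j`), `b` (visible biases), and `c` (hidden biases). The parameter values are made up.

```python
import itertools
import math

def energy(v, h, W, b, c):
    """Global energy: minus the weighted interactions minus the bias terms."""
    interaction = sum(
        W[i][j] * h[i] * v[j] for i in range(len(h)) for j in range(len(v))
    )
    visible_bias = sum(b[j] * v[j] for j in range(len(v)))
    hidden_bias = sum(c[i] * h[i] for i in range(len(h)))
    return -interaction - visible_bias - hidden_bias

def partition_function(W, b, c):
    """Sum of exp(-E) over all 2^(n_visible + n_hidden) joint states."""
    return sum(
        math.exp(-energy(v, h, W, b, c))
        for v in itertools.product([0, 1], repeat=len(b))
        for h in itertools.product([0, 1], repeat=len(c))
    )

def joint_probability(v, h, W, b, c):
    return math.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)

# Hypothetical parameters for 3 visible and 2 hidden nodes.
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.8, 0.0]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.4]

p = joint_probability((1, 0, 1), (1, 0), W, b, c)
```

The enumeration has 32 terms here, but it doubles with every added node, which is why the partition function cannot be computed this way for realistic networks.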

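Now, the log likelihood to maximize can be written, for one data point $\mathbf{v}$, as follows,

$$\log p(\mathbf{v}) = \log \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} = \log \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})} - \log \sum_{\mathbf{v}', \mathbf{h}} e^{-E(\mathbf{v}', \mathbf{h})} \tag{1}$$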
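The log likelihood can be checked numerically on a tiny network by brute-force enumeration: marginalize over the hidden states for the data term, and over all joint states for the partition function. The sketch below assumes the layout `W[i][j]` (hidden node `i`, visible node `j`) with visible biases `b` and hidden biases `c`. The four data points and the parameter values are made up for illustration.

```python
import itertools
import math

def energy(v, h, W, b, c):
    interaction = sum(
        W[i][j] * h[i] * v[j] for i in range(len(h)) for j in range(len(v))
    )
    return (-interaction
            - sum(b[j] * v[j] for j in range(len(v)))
            - sum(c[i] * h[i] for i in range(len(h))))

def binary_states(n):
    return list(itertools.product([0, 1], repeat=n))

def log_prob_visible(v, W, b, c):
    """log p(v) = log sum_h exp(-E(v, h)) - log Z."""
    hidden_states = binary_states(len(c))
    log_unnormalized = math.log(
        sum(math.exp(-energy(v, h, W, b, c)) for h in hidden_states)
    )
    log_Z = math.log(
        sum(math.exp(-energy(u, h, W, b, c))
            for u in binary_states(len(b)) for h in hidden_states)
    )
    return log_unnormalized - log_Z

def log_likelihood(data, W, b, c):
    """Sum of log p(v) over all data points."""
    return sum(log_prob_visible(v, W, b, c) for v in data)

# Hypothetical dataset and parameters (3 visible, 2 hidden nodes).
data = [(0, 1, 1), (1, 0, 1), (1, 1, 0), (0, 0, 0)]
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.8, 0.0]]
b = [0.0, 0.1, -0.1]
c = [0.2, -0.4]

ll = log_likelihood(data, W, b, c)
```

With all parameters set to zero, every visible vector has probability 1/8, so the log likelihood of four points is 4 log(1/8). Training aims to move the parameters away from that baseline toward values that give the observed points higher probability.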

Compute derivative of the goal function (for this case),