# Harmoniums and Contrastive Divergence (Hinton; 1999) In NGC-Learn, it is possible to construct other forms of learning from the very base learning/plasticity components already in-built into the base library. Notably, a class of learning and inference systems that adapt through a process known as contrastive Hebbian learning (CHL) can be constructed and simulated with ngc-learn. In this walkthrough, we will design a simple Harmonium, also known as the restricted Boltzmann machine (RBM). We will specifically focus on learning its synaptic connections with an algorithmic recipe known as contrastive divergence (CD), which can be considered to be a stochastic form of CHL. After going through this exhibit, you will: 1. Learn how to construct a model context that emulates the structure of an RBM and simulate its inference/reconstruction process to calculate approximate synaptic weight gradients in accordance to contrastive divergence (including an extension of it called persistent contrastive divergence). 2. Simulate fantasized image samples using a block Gibbs sampler that is defined by (re-)using a portion of the model's message-passing structure. Note that the folders of interest to this walkthrough are: + `ngc-museum/exhibits/harmonium/`: this contains the necessary simulation scripts (which can be found [here](https://github.com/NACLab/ngc-museum/tree/main/exhibits/harmonium)); + `ngc-museum/data/mnist/`: this contains the zipped copy of the MNIST digit image arrays ## On the Harmonium Probabilistic Graphical Model A harmonium is a generative model implemented as a stochastic, two-layer neural system (a type of probabilistic graphic model; PGM) that attempts to learn a probability distribution over sensory input $\mathbf{x}$, i.e., the goal of a harmonium is to learn $p(\mathbf{x})$, the underlying probability/likelihood of a given (training) dataset. Fundamentally, the approach to estimating $p(\mathbf{x})$ that carried out by a harmonium is through the optimization of an energy function $E(\mathbf{x})$ (a concept motivated by statistical mechanics), where the system searches for an internal configuration, i.e., the values of its synapses, that assigns low energy (values) to sample patterns that come from the true data distribution $p(\mathbf{x})$ and high energy (values) to patterns that do not (or those that do not come from the training dataset). ```{eval-rst} .. table:: :align: center +-----------------------------------------------------------------+ | .. image:: ../images/museum/harmonium/rbm_arch.jpg | | :scale: 65% | | :align: center | +-----------------------------------------------------------------+ ``` The most common, simplest harmonium is one where input nodes (one per dimension of the data observation space) are modeled as binary/Boolean sensors -- or "visible units" $\mathbf{z}^0$ (observed variables) that are clamped to actual data patterns -- connected to a layer of (stochastic) binary latent feature detectors -- or "hidden units" $\mathbf{z}^1$ (unobserved or latent variables). Notably, the synaptic connections between the latent and visible units are symmetric. Furthermore, as a result of a key restriction imposed on the harmonium's network structure, i.e., no lateral connections between the neurons/units within $\mathbf{z}^0$ as well as those within $\mathbf{z}^1$, computing the latent and visible states is as straightforward as the following: $$ p(\mathbf{z}^1 | \mathbf{z}^0) &= sigmoid(\mathbf{W} \cdot \mathbf{z}^0 + \mathbf{b}), \; \mathbf{z}^1 \sim p(\mathbf{z}^1 | \mathbf{z}^0) \\ p(\mathbf{z}^0 | \mathbf{z}^1) &= sigmoid(\mathbf{W}^T \cdot \mathbf{z}^1 + \mathbf{c}), \; \mathbf{z}^0 \sim p(\mathbf{z}^0 | \mathbf{z}^1) $$ where $\mathbf{c}$ is the visible bias vector, $\mathbf{b}$ is the latent bias vector, and $\mathbf{W}$ is the synaptic weight matrix that connects $\mathbf{z}^0$ to $\mathbf{z}^1$ (and its transpose $\mathbf{W}^T$ is used to make predictions of the input itself). Note that $\cdot$ means matrix/vector multiplication and $\sim$ denotes that we would sample from a probability (vector). In the above harmonium's case, samples will be drawn treating the conditionals such as $p(\mathbf{z}^1 | \mathbf{z}^0)$ as multivariate Bernoulli distributions. $\mathbf{z}^0$ would typically be clamped/set to the actual sensory input data $\mathbf{x}$. The energy function of the harmonium's joint configuration $(\mathbf{z}^0,\mathbf{z}^1)$ (similar to that of a Hopfield network) is specified as follows: $$ E(\mathbf{z}^0,\mathbf{z}^1) = -\sum_i \mathbf{c}_i \mathbf{z}^0_i - \sum_j \mathbf{b}_j \mathbf{z}^1_j - \sum_i \sum_j \mathbf{z}^0_i \mathbf{W}_{ij} \mathbf{z}^1_j . $$ Notice that, in the equation above, we sum over vector dimension indices, e.g., $\mathbf{z}^0_i$ retrieves the $i$th scalar element of (vector) $\mathbf{z}^0$ while $\mathbf{W}_{ij}$ retrieves the scalar element at position $(i,j)$ within matrix $\mathbf{W}$. With this energy function, one can write out the probability that a harmonium PGM assigns to a data point as: $$ p(\mathbf{z}^0 = \mathbf{x}) = \frac{1}{Z} \exp( -E(\mathbf{z}^0,\mathbf{z}^1) ) $$ where $Z$ is the normalizing constant (or, in statistical mechanics, the partition function) needed to obtain proper probability values[^1]. When one works through the derivation of the gradient of the log probability $\log p(\mathbf{x})$ with respect to the synapses such as $\mathbf{W}$, they get a (contrastive) Hebbian-like update rule as follows: $$ \Delta \mathbf{W}_{ij} = <\mathbf{z}^0_i \mathbf{z}^1_j>_{data} - <\mathbf{z}^0_i \mathbf{z}^1_j>_{model} $$ where the angle brackets $< >$ tell us that we need to take the expectation of the values within the brackets under a certain distribution (such as the data distribution denoted by the subscript $data$). The above rule can also be considered to be a stochastic form of a general recipe known as contrastive Hebbian learning (CHL) [4]. Technically, to compute the update above, obtaining the first term $<\mathbf{z}^0_i \mathbf{z}^1_j>_{data}$ is easy since we only need to take the product of a data point and its corresponding latent state under the harmonium. However, obtaining the second term $<\mathbf{z}^0_i \mathbf{z}^1_j>_{model}$ is very costly, since we would need to initialize the value of $\mathbf{z}^0$ to a random initial state and then run a (block) Gibbs sampler for many iterations to accurately approximate the second term. Fortunately, it was shown in work such as [3], that learning a harmonium is still possible by replacing the term $<\mathbf{z}^0_i \mathbf{z}^1_j>_{model}$ with $<\mathbf{z}^0_i \mathbf{z}^1_j>_{recon}$, which is simply computed by using the first term's latent state $\mathbf{z}^1$ to reconstruct the input and then using this reconstruction once more in order to obtain its corresponding binary latent state. This is known as "contrastive divergence" (CD-1), and, although this approximation has been shown to not actual follow the gradient of any known objective function, it works well in practice when learning a harmonium-based generative model. Finally, the vectorized form of the CD-1 update is: $$ \Delta \mathbf{W} = \Big[ (\mathbf{z}^0_{pos})^T \cdot \mathbf{z}^1_{pos} \Big] - \Big[ (\mathbf{z}^0_{neg})^T \cdot \mathbf{z}^1_{neg} \Big] $$ where the first term (in brackets) is labeled as the "positive phase" (or the positive, data-dependent statistics -- where $\mathbf{z}^0_{pos}$ denotes the positive phase sample of $\mathbf{z}^0$) while the second term is labeled as the "negative phase" (or the negative, data-independent statistics -- where $\mathbf{z}^0_{neg}$ denotes the negative phase sample of $\mathbf{z}^0$). Note that simpler rules of a similar form can be worked out for the latent/visible bias vectors as well. In NGC-Learn, to simulate the above harmonium PGM and its CD-1 update, we will model the positive and negative phases as simulated co-models, each responsible for producing the relevant statistics that we will require in order to adjust synapses. Additionally, we will find that we can further re-purpose the created co-models to construct a block Gibbs sampler for confabulating "fantasized" data patterns from a harmonium that has been fit to data. ## Boltzmann Machines: Positive and Negative Co-Models We begin by first specifying the structure of the harmonium system that we would like to simulate. In NGC shorthand, the above positive and negative phase graphs would simply be (under one complete generative model): ``` z0 -(z0-z1)-> z1 z1 -(z1-z0) -> z0 Note: z1-z0 = (z0-z1)^T (transpose-tied synapses) ``` In order to construct the desired harmonium, particularly the structure needed to implement CD-1, we will need to break up the model into its key "phases", i.e., a positive phase and a negative phase. We will model each phase as its own simulated nodes-and-cables structure within one single model context, allowing us to craft a general approach that permits a CD-based learning. Notably, we will use the negative-phase co-model to emulate the crucial MCMC sampling step to synthesize data from the trained RBM. Building the positive phase of our harmonium can be done as follows: ```python with Context("Circuit") as self.circuit: ## set up positive-phase graph self.z0 = BernoulliStochasticCell("z0", n_units=obs_dim, is_stoch=False) self.z1 = BernoulliStochasticCell("z1", n_units=hid_dim, key=subkeys[0]) self.W1 = HebbianSynapse( "W1", shape=(obs_dim, hid_dim), eta=0., weight_init=dist.gaussian(mean=0., std=sigma), bias_init=dist.constant(value=0.), w_bound=0., optim_type="sgd", sign_value=1., key=subkeys[1] ) ## wire up z0 to z1 via synaptic project W1 self.z0.s >> self.W1.inputs self.W1.outputs >> self.z1.inputs ``` To gather the rest of the statistics that we require, we will need to build the negative phase of our model (which is responsible for "dreaming up" or "confabulating" data samples from its internal model of the world). Constructing the negative-phase co-model, under the same model `Context` above can be done as follows: ```python ## set up negative-phase graph self.z0neg = BernoulliStochasticCell("z0neg", n_units=obs_dim, key=subkeys[3]) self.z1neg = BernoulliStochasticCell("z1neg", n_units=hid_dim, key=subkeys[4]) self.E1 = DenseSynapse( ## E1 = W1.T "E1", shape=(hid_dim, obs_dim), weight_init=dist.gaussian(mean=0., std=sigma), bias_init=dist.constant(value=0.), resist_scale=1., key=subkeys[2] ) self.E1.weights.set(self.W1.weights.get().T) self.V1 = HebbianSynapse( ## V1 = W1 "V1", shape=(obs_dim, hid_dim), eta=0., weight_init=dist.gaussian(mean=0., std=sigma), bias_init=None, w_bound=0., optim_type="sgd", sign_value=1., key=subkeys[1] ) self.V1.weights.set(self.W1.weights.get()) self.V1.biases.set(self.W1.biases.get()) ## wire up z1 to z0(neg) via E1=(W1)^T, and z0(neg) to z1(neg) via V1=W1 self.z1.s >> self.E1.inputs self.E1.outputs >> self.z0neg.inputs self.z0neg.p >> self.V1.inputs ## drive hiddens by probs of visibles self.V1.outputs >> self.z1neg.inputs ``` The above chunk of code effectively sets up the propagation of information from the latent neurons `z1` back down to `z0` (obtaining the negative phase values of `z0`, i.e., `z0neg`) and then the propagation of the reconstructed values back up to `z1` one last time (obtaining the negative phase values of `z1`, i.e., `z0neg`). To build a CHL-based form of plasticity, allowing us to build the CD-1 learning process, we will then need to wire up a set of 2-factor Hebbian rules like so: ```python ## set up contrastive Hebbian learning rule (pos-stats - neg-stats) self.z0.s >> self.W1.pre ## positive-phase pre-synaptic term self.z1.p >> self.W1.post ## positive-phase post-synaptic term self.z0neg.p >> self.V1.pre ## negative-phase pre-synaptic term self.z1neg.p >> self.V1.post ## negative-phase pre-synaptic term ``` the results of these two Hebbian rules are then used in an exhibit-specific function (`_update_via_CHL()`) written in the [`Harmonium` class](https://github.com/NACLab/ngc-museum/blob/v3/exhibits/harmonium/harmonium.py). While we observe that our "negative phase" co-model allows us to emulate the CD learning recipe[^2], technically, the negative phase of a harmonium should be run for a very high value of steps (approaching infinity) in order to obtain a proper sample from the PGM's equilibrium/steady state distribution. However, this would be extremely costly to simulate and, as early studies [3] observed, often only a few or even a single step of this Markov chain proved to work quite well, approximating the contrastive divergence objective (the learning algorithm's namesake) instead of direct maximum likelihood. Note that the full code, containing the snippets above, can be found in the Model Museum `Harmonium` model structure class. One could further generalize our CD-1 framework to variations, such as "persistent" CD (where we, instead of running `z1` back down through `E1` synapses, we inject random noise instead (to sample the harmonium's latent prior), or even an algorithm known as parallel tempering, where we would maintain multiple co-models and draw samples from all of them to obtain negative-phase statistics. Finally, within the `Harmonium` class, we have written a routine for drawing samples from the model directly, i.e., we implement a block Gibbs sampler in order synthesize data from the RBM's current set of parameters. ## Using the Harmonium to Dream Up Handwritten Digits We finally take the harmonium that we have constructed above and fit it to some MNIST digits. Specifically, we will leverage the [Harmonium](https://github.com/NACLab/ngc-museum/blob/v3/exhibits/harmonium/harmonium.py), model in the Model Museum since it implements all of the above core mechanisms (and more) internally. In the script `sim_harmonium.py`, you will find a general training that will fit our harmonium to the MNIST database (unzip the file `mnist.zip` in the `ngc-museum/exhibits/data/` directory if you have not already) by cycling through it several times, saving the final (best) resulting to disk within the `exp/` sub-directory. Go ahead and execute the training process as follows: ```console $ python sim_harmonium.py ``` which will fit/adapt your harmonium to MNIST. Note that the model exhibit code that you will run uses a special extension of CD learning known as persistent CD (PCD); our PCD implementation[^3] essentially obtains negative-phase statistics by maintaining a set of Gibbs sampling chains that are never reset but instead sampled from each time the model parameters are to updated [6] (this extension that improves the quality of the samples produced by the RBM). This should produce per-training iteration output, printed to I/O, similar to the following: ```console --- Initial RBM Synaptic Stats --- W1: min -0.0494 ; max 0.0445 mu -0.0000 ; norm 4.4734 b1: min -4.0000 ; max -4.0000 mu -4.0000 ; norm 64.0000 c0: min -15.2663 ; max 0.1887 mu -4.0560 ; norm 148.4289 -1| Test: err(X) = 66.7563 0| Dev: |d.E(X)| = 10.0093 err(X) = 64.7762 1| Dev: |d.E(X)| = 2.5509 err(X) = 57.7121 2| Dev: |d.E(X)| = 5.0427 err(X) = 53.9887 3| Dev: |d.E(X)| = 5.1724 err(X) = 52.6923 4| Dev: |d.E(X)| = 5.0167 err(X) = 51.1648 5| Dev: |d.E(X)| = 3.4010 err(X) = 49.9060 6| Dev: |d.E(X)| = 1.2844 err(X) = 48.9477 7| Dev: |d.E(X)| = 3.8469 err(X) = 48.2278 8| Dev: |d.E(X)| = 3.2666 err(X) = 47.3158 9| Dev: |d.E(X)| = 0.7140 err(X) = 46.4883 10| Dev: |d.E(X)| = 3.5822 err(X) = 45.7021 11| Dev: |d.E(X)| = 1.9054 err(X) = 45.2206 ... ... 93| Dev: |d.E(X)| = 0.3789 err(X) = 27.3184 94| Dev: |d.E(X)| = 0.5906 err(X) = 27.2172 95| Dev: |d.E(X)| = 0.0461 err(X) = 27.2518 96| Dev: |d.E(X)| = 1.9164 err(X) = 27.1477 97| Dev: |d.E(X)| = 2.3997 err(X) = 27.0035 98| Dev: |d.E(X)| = 2.9253 err(X) = 27.1244 99| Dev: |d.E(X)| = 1.2569 err(X) = 26.9761 --- Final RBM Synaptic Stats --- W1: min -1.1823 ; max 0.7636 mu -0.0087 ; norm 57.4068 b1: min -4.0943 ; max -2.8031 mu -3.5501 ; norm 56.9961 c0: min -16.0370 ; max -0.8244 mu -4.6686 ; norm 158.2293 ``` You will find, after the training script has finished executing, several outputs in the `exp/filters/` model sub-directory that is created for you. Concretely, you will find a grid-plot of the (first `100` of the) harmonium's acquired filters (or "receptive fields"), much as we did for the sparse coding exhibit, that will look similar to the following: Interestingly enough, we see that our harmonium/RBM has extracted what appears to be rough stroke features, which is what it uses when sampling its binary latent feature detectors to compose final synthesized image patterns (each binary feature detector serves as Boolean function that emits a decision of `1` if the filter is to be used and a `0` if not). In particular, we remark notice that the filters that our harmonium has acquired are a bit more prominent due to the fact our exhibit employs some weight decay (specifically, Gaussian/L2 decay -- with intensity `l2_lambda=0.01` -- to the `W1` synaptic matrix of our RBM). Weight decay of this form is particularly useful to not only mitigate against the harmonium overfitting to its training data but also to ensure that the Markov chain inherent to its negative-phase mixes more effectively [5] (which ensures better-quality samples from the block Gibbs sampler, which we will use next). Finally, you will also find in the `exp/filters/` model sub-folder another grid-plot containing some (about `100`) of the RBM's reconstructions of held-out development data. This plot should look similar to the one below: ### Sampling the Harmonium Once the training process has completed, you can then run the following to sample from trained model using block Gibbs sampling: ```console $ python sample_harmonium.py ``` which will take your trained harmonium's negative-phase co-model and use it to synthesize some digit patterns. You should see inside the `exp/samples/` sub-directory three sample-image grids (i.e., `samples_0.jpg`, `samples_1.jpg`, and `samples_2.jpg`) similar to what is shown below: ```{eval-rst} .. image:: ../images/museum/harmonium/samples_0.jpg :width: 30% .. image:: ../images/museum/harmonium/samples_1.jpg :width: 30% .. image:: ../images/museum/harmonium/samples_2.jpg :width: 30% ``` Furthermore, you will see three corresponding GIFs that have been generated for you that visualize how each of the three simulated sampling Markov chains change with time (i.e., these are the files: `markov_chain_0.gif`, `markov_chain_1.gif`, and `markov_chain_2.gif`). It is important to understand that the three grids of samples shown above come from particular points in the block Gibbs sampling process. (Note that one reads these sample grid plots left-column to right-column, and top-row to bottom-row; this way of reading the plot follows the ordering of samples extracted from the specific Markov chain sequence.) Note that, although each chain is run for many total steps, the `sample_harmonium.py` script "thins" out each Markov chain by only pulling out a fantasized pattern every `20` steps (further "burning" in each chain before collecting samples). We remark that higher-quality samples can be obtained if one modifies the earlier harmonium to learn with more advanced forms of CD-learning, such as parallel tempering. ### Final Notes The harmonium that we have built in this exhibit is a classical Bernoulli harmonium/RBM, which is a neural PGM that assumes that the input data features are binary in nature. If one wants to model data that is continuous/real-valued, then the harmonium model above would need to be extended to utilize visible units that follow a continuous distribution; for instance, if one modeled a multivariate Gaussian distribution, this would yield a Gaussian restricted Boltzmann machine (GRBM). ## References [1] Smolensky, P. "Information Processing in Dynamical Systems: Foundations of Harmony Theory" (Chapter 6). Parallel distributed processing: explorations in the microstructure of cognition 1 (1986).
[2] Hinton, Geoffrey. Products of Experts. International conference on artificial neural networks (1999).
[3] Hinton, Geoffrey E. "Training products of experts by maximizing contrastive likelihood." Technical Report, Gatsby computational neuroscience unit (1999).
[4] Movellan, Javier R. "Contrastive Hebbian learning in the continuous Hopfield model." Connectionist models. Morgan Kaufmann, 1991. 10-17.
[5] Hinton, Geoffrey E. "A practical guide to training restricted Boltzmann machines." Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg. 599-619 (2012).
[6] Tieleman, Tijmen. "Training restricted Boltzmann machines using approximations to the likelihood gradient." Proceedings of the 25th international conference on Machine learning. (2008). [^1]: In fact, it is intractable to compute the partition function $Z$ for any reasonably-sized harmonium; fortunately, we will not need to calculate $Z$ in order to learn and sample from a Harmonium. [^2]: In general, CD-1 means contrastive divergence where the negative phase is only run for one single step, i.e., `K=1`. The more general form of CD is known as CD-K, the K-step CD algorithm where `K > 1`. (Sometimes, CD-1 is just referred to as just "CD".) [^3]: Note that we have slightly modified the PCD algorithm to include a "chain-swapping" mechanism taken from the statistical approach known as "parallel tempering". In our implementation, we randomy swap the states of the set of Gibbs chains we maintain under a fixed mixing probability `p_mix`; we found that this somewhat improved the quality of our model's confabulations more consistently.