r"""
.. _apps-training-tutorial:
Training variational GBS distributions
======================================
*Technical details are available in the API documentation:*
:doc:`code/api/strawberryfields.apps.train`
Many quantum algorithms rely on the ability to train the parameters of quantum circuits, a
strategy inspired by the success of neural networks in machine learning. Training is often
performed by evaluating gradients of a cost function with respect to circuit parameters,
then employing gradient-based optimization methods. In this demonstration, we outline the
theoretical principles for training Gaussian Boson Sampling (GBS) circuits, first
introduced in Ref. [[#banchi2020training]_]. We then explain how to employ Strawberry Fields
to perform the training by looking at basic examples in stochastic optimization and unsupervised
learning. Let's go! 🚀
Theory
------
As explained in more detail in :doc:`/concepts/gbs`, for a GBS device, the probability
:math:`\Pr(S)` of observing an output :math:`S=(s_1, s_2, \ldots, s_m)`, where :math:`s_i` is the
number of photons detected in the :math:`i`-th mode, is given by
.. math::
\Pr(S) = \frac{1}{\mathcal{N}} \frac{|\text{Haf}(A_S)|^2}{
s_1!\ldots s_m!},
where :math:`\mathcal{N}` is a normalization constant and :math:`A` is an arbitrary symmetric
matrix with eigenvalues bounded between :math:`-1` and :math:`1`. The matrix
:math:`A` can also be rescaled by a constant factor, which is equivalent to fixing a total
mean photon number in the distribution.
We want to *train* this distribution to perform a specific task. For example, we may wish to
reproduce the statistical properties of a given dataset, or to optimize the circuit to sample
specific patterns with high probability. The strategy is to identify a parametrization of
the distribution and compute gradients that allow us to optimize the parameters using
gradient-based techniques. We refer to these circuits as variational GBS circuits, or VGBS for short.
Gradient formulas can be generally challenging to calculate, but there is a particular
strategy known as the WAW parametrization (wow! what a great name) that leads to gradients that
are simpler to compute. It involves transforming the symmetric matrix :math:`A` as
.. math::
A \rightarrow A_W = W A W,
where :math:`W = \text{diag}(\sqrt{w_1}, \sqrt{w_2}, \ldots, \sqrt{w_m})` is a diagonal weight matrix and :math:`m`
is the number of modes. This parametrization is useful because the hafnian of :math:`A_W`
factorizes into two separate components
.. math::
\text{Haf}(A_W) = \text{Haf}(A)\text{det}(W),
a property that can be cleverly exploited to compute gradients more efficiently. More broadly,
it is convenient to embed trainable parameters :math:`\theta = (\theta_1, \ldots, \theta_d)` into
the weights :math:`w_k`. Several choices are possible, but here we focus on an exponential embedding
.. math::
w_k = \exp(-\theta^T f^{(k)}),
where each :math:`f^{(k)}` is a :math:`d`-dimensional vector. The simplest case occurs by setting
:math:`d=m` and choosing these vectors to satisfy :math:`\theta^T f^{(k)} = \theta_k` such that
:math:`w_k = \exp(-\theta_k)`.
Training tasks
^^^^^^^^^^^^^^
In stochastic optimization, we are given a function :math:`h(S)` and the goal is optimize the
parameters to sample from a distribution :math:`P_{\theta}(S)` that minimizes the
expectation value
.. math::
C (\theta) = \sum_{S} h(S) P_{\theta}(S).
As shown in [[#banchi2020training]_], the gradient of the cost function :math:`C (\theta)` is given
by
.. math::
\partial_{\theta} C (\theta) = \sum_{S} h(S) P_{\theta}(S)
\sum_{k=1}^{m} (s_k - \langle s_{k} \rangle) \partial_{\theta} \log w_{k},
where :math:`\langle s_k\rangle` denotes the average photon number in mode :math:`k`. This gradient
is an
expectation value with respect to the GBS distribution, so it can be estimated by generating
samples from the device. Convenient!
In a standard unsupervised learning scenario, data are assumed to be sampled from an unknown
distribution and a common goal is to learn that distribution. More precisely, the goal is
to use the data to train a model that can sample from a similar distribution, thus being able to
generate new data. Training can be performed by minimizing the Kullback-Leibler (KL) divergence,
which up to additive constants can be written as:
.. math::
KL(\theta) = -\frac{1}{T}\sum_S \log[P_{\theta}(S)].
In this case :math:`S` is an element of the data, :math:`P(S)` is the probability of observing that
element when sampling from the GBS distribution, and :math:`T` is the total number of elements
in the data. For the GBS distribution in the WAW parametrization, the gradient of the KL
divergence can be written as
.. math::
\partial_\theta KL(\theta) = - \sum_{k=1}^m\frac{1}{w_k}(\langle s_k\rangle_{\text{data}}-
\langle s_k\rangle_{\text{GBS}})\partial_\theta w_k.
Remarkably, this gradient can be evaluated without a quantum computer because the
mean photon numbers :math:`\langle s_k\rangle_{\text{GBS}}` over the GBS distribution can be
efficiently computed classically [[#banchi2020training]_]. As we'll show later, this leads to
very fast training. This is true even if sampling the distribution remains classically
intractable! 🤯
Stochastic optimization
-----------------------
We're ready to start using Strawberry Fields to train GBS distributions. The main functions needed
can be found in the :mod:`~strawberryfields.apps.train` module, so let's start by importing it.
"""
from strawberryfields.apps import train
##############################################################################
# We explore a basic example where the goal is to optimize the distribution to favour photons
# appearing in a specific subset of modes, while minimizing the number of photons in the
# remaining modes. This can be achieved with the following cost function
import numpy as np
def h(s):
not_subset = [k for k in range(len(s)) if k not in subset]
return sum(s[not_subset]) - sum(s[subset])
##############################################################################
# The function is defined with respect to the subset of modes for which we want to
# observe many photons. This subset can be specified later on. Then, for a given sample ``s``,
# we want to *maximize* the total number of photons in the subset. This can be achieved by
# minimizing its negative value, hence the term ``-sum(s[subset])``. Similarly, for modes
# outside of the specified subset, we want to minimize their total sum, which explains the
# appearance of ``sum(s[not_subset])``.
#
# It's now time to define the variational circuit. We'll train a distribution based on
# a simple lollipop 🍭 graph with five nodes:
import networkx as nx
from strawberryfields.apps import plot
graph = nx.lollipop_graph(3, 2)
A = nx.to_numpy_array(graph)
plot.graph(graph)
##############################################################################
# Defining a variational GBS circuit consists of three steps: (i) specifying the embedding,
# (ii) building the circuit, and (iii) defining the cost function with respect to the circuit and
# embedding. We'll go through each step one at a time. For the embedding of trainable parameters,
# we'll use the simple form :math:`w_k = \exp(-\theta_k)` outlined above, which can be accessed
# through the :class:`~strawberryfields.apps.train.Exp` class. Its only input is the number of
# modes in the device, which is equal to the number of nodes in the graph.
nr_modes = len(A)
weights = train.Exp(nr_modes)
##############################################################################
# Easy! The GBS distribution is determined by the symmetric matrix :math:`A` --- which we
# train using the WAW parametrization --- and by the total mean photon number. There is freedom
# in choosing :math:`A`, but here we'll just use the graph's adjacency matrix. The total
# mean photon number is a hyperparameter of the distribution: in general, different choices may
# lead to different results in training. In fact, the mean photon number may change during
# training as a consequence of the weights being optimized. Finally, GBS devices can operate either
# with photon number-resolving detectors or threshold detectors, so there is an option to specify
# which one we intend to use. We'll stick to detectors that can count photons.
n_mean = 6
vgbs = train.VGBS(A, n_mean, weights, threshold=False)
##############################################################################
# The last step before training is to define the cost function with respect to our
# previous choices. Since this is a stochastic optimization task, we employ the
# :class:`~strawberryfields.apps.train.Stochastic` class and input our previously defined cost
# function ``h``.
cost = train.Stochastic(h, vgbs)
##############################################################################
# During training, we'll calculate gradients and evaluate the average of this cost function with
# respect to the GBS distribution. Both of these actions require estimating expectation values, so
# the number of samples in the estimation also needs to be specified. The parameters also need to be
# initialized. There is freedom in this choice, but here we'll set them all to zero. Finally, we'll
# aim to increase the number of photons in the "candy" part of the lollipop graph,
# which corresponds to the subset of modes ``[0, 1, 2]``.
np.random.seed(1969) # for reproducibility
d = nr_modes
params = np.zeros(d)
subset = [0, 1, 2]
nr_samples = 100
print('Initial mean photon numbers = ', vgbs.mean_photons_by_mode(params))
##############################################################################
# If training is successful, we should see the mean photon numbers of the first three modes
# increasing, while those of the last two modes become close to zero. We perform training over
# 200 steps of gradient descent with a learning rate of 0.01:
nr_steps = 200
rate = 0.01
for i in range(nr_steps):
params -= rate * cost.grad(params, nr_samples)
if i % 50 == 0:
print('Cost = {:.3f}'.format(cost.evaluate(params, nr_samples)))
print('Final mean photon numbers = ', vgbs.mean_photons_by_mode(params))
##############################################################################
# Great! The cost function decreases smoothly and there is a clear increase in the mean photon
# numbers of the target modes, with a corresponding decrease in the remaining modes.
#
# The transformed matrix :math:`A_W = W A W` also needs to have eigenvalues bounded between
# -1 and 1, so continuing training indefinitely can lead to unphysical distributions when
# the weights become too large. It's important to monitor this behaviour. We can confirm that the
# trained model is behaving according to plan by generating some samples. Although we still
# observe a couple of photons in the last two modes, most of the detections happen in the first
# three modes that we are targeting, just as intended.
Aw = vgbs.A(params)
samples = vgbs.generate_samples(Aw, n_samples=10)
print(samples)
##############################################################################
# Unsupervised learning
# ---------------------
# We are going to train a circuit based on the pre-generated datasets in
# the :mod:`~strawberryfields.apps.data` module. Gradients in this setting can be calculated
# efficiently, so we can study a larger graph with 30 nodes.
#
from strawberryfields.apps import data
data_pl = data.Planted()
data_samples = data_pl[:1000] # we use only the first one thousand samples
A = data_pl.adj
nr_modes = len(A)
##############################################################################
# As before, we use the exponential embedding, but this time we define the VGBS circuit in
# threshold mode, since this is how the data samples were generated. The cost function is the
# Kullback-Liebler divergence, which depends on the data samples and can be accessed
# using :class:`~strawberryfields.apps.train.KL`:
weights = train.Exp(nr_modes)
n_mean = 1
vgbs = train.VGBS(A, n_mean, weights, threshold=True)
cost = train.KL(data_samples, vgbs)
##############################################################################
# We initialize parameters to zero and perform a longer optimization over one thousand
# steps with a learning rate of 0.15. This will allow us to reach a highly-trained model.
# Gradients can be computed efficiently, but evaluating the cost function is challenging
# because it requires calculating GBS probabilities, which generally take exponential time.
# Instead, we'll keep track of the differences in mean photon numbers per mode for the data and
# model distributions.
from numpy.linalg import norm
params = np.zeros(nr_modes)
steps = 1000
rate = 0.15
for i in range(steps):
params -= rate * cost.grad(params)
if i % 100 == 0:
diff = cost.mean_n_data - vgbs.mean_clicks_by_mode(params)
print('Norm of difference = {:.5f}'.format(norm(diff)))
##############################################################################
# Wow! WAW! We reach almost perfect agreement between the data and the trained model. We can also
# generate a few samples, this time creating photon patterns that, in general,
# are not originally in the training data.
Aw = vgbs.A(params)
samples = vgbs.generate_samples(Aw, n_samples=10)
print(samples)
##############################################################################
# The examples we have covered are introductory tasks aimed at mastering the basics of training
# variational GBS circuits. These ideas are new and there is much left to explore in terms of
# their scope and extensions. For example, the original paper [[#banchi2020training]_] studies how
# GBS devices can be trained to find solutions to max clique problems. What new applications come
# to your mind?
#
# References
# ----------
#
# .. [#banchi2020training]
#
# Leonardo Banchi, Nicolás Quesada, and Juan Miguel Arrazola. Training Gaussian Boson
# Sampling Distributions. arXiv:2004.04770. 2020.