Welcome to MXFusion’s documentation!¶
MXFusion is a library for integrating probabilistic modelling with deep learning.
MXFusion helps you rapidly build and test new methods at scale, by focusing on the modularity of probabilistic models and their integration with modern deep learning techniques.
Installation¶
Dependencies / Prerequisites¶
MXFusion’s primary dependencies are MXNet >= 1.2 and NetworkX >= 2.1. See requirements.
Supported Architectures / Versions¶
MXFusion is tested on Python 3.5+ on macOS and Amazon Linux.
pip¶
If you just want to use MXFusion and not modify the source, you can install through pip:
pip install mxfusion
From source¶
To install MXFusion from source, after cloning the repository run the following from the top-level directory:
pip install .
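To verify the installation (a quick, optional check, not part of the official install steps), you can import the package from Python:

import mxfusion
import mxnet as mx
print(mx.__version__)  # should print an MXNet version >= 1.2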
Tutorials!¶
Below is a list of tutorial / example notebooks demonstrating MXFusion’s functionality.
Getting Started¶
Zhenwen Dai (2018.10.22)
# Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License").
# You may not use this file except in compliance with the License.
# A copy of the License is located at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# or in the "license" file accompanying this file. This file is distributed
# on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
# express or implied. See the License for the specific language governing
# permissions and limitations under the License.
# ==============================================================================
Introduction¶
MXFusion is a probabilistic programming language. It provides a convenient interface for designing probabilistic models and applying them to real world problems.
Probabilistic models describe the relationships in data through probabilistic distributions of random variables. Probabilistic modeling is typically done by stating your prior belief about the data in terms of a probabilistic model and performing inference with the observations of some of the random variables.
[1]:
import warnings
warnings.filterwarnings('ignore')
import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
A Simple Example¶
Let’s start with a toy example about estimating the mean and variance of a set of data. For simplicity, we generate 100 data points with a given mean and variance following a normal distribution.
[2]:
import numpy as np
np.random.seed(0)
mean_groundtruth = 3.
variance_groundtruth = 5.
N = 100
data = np.random.randn(N)*np.sqrt(variance_groundtruth) + mean_groundtruth
Let’s visualize our data by building a histogram.
[3]:
%matplotlib inline
from pylab import *
_=hist(data, 10)

Now, let’s pretend that we do not know the mean and variance that are used to generate the above data.
We still believe that the data come from a normal distribution, which is our model. It is formulated as

\(y_n \sim \mathcal{N}(\mu, s), \quad n = 1, \ldots, N,\)

where \(\mu\) is the mean and \(s\) is the variance of the normal distribution.
In MXFusion, the above model can be defined as follows:
[4]:
from mxfusion import Variable, Model
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions import Normal
from mxfusion.common import config
config.DEFAULT_DTYPE = 'float64'
m = Model()
m.mu = Variable()
m.s = Variable(transformation=PositiveTransformation())
m.Y = Normal.define_variable(mean=m.mu, variance=m.s, shape=(N,))
In the above definition, we start by instantiating a model from the class Model. The variables \(\mu\) and \(s\) are created from the class Variable. Both of them are assigned as members of the model instance m. This is how variables are organized in MXFusion. The variable s is created by passing a PositiveTransformation instance to the transformation argument. This constrains the value of the variable s to be positive through a “soft-plus” transformation. The variable Y is created from a normal distribution by specifying the mean, the variance and its shape.
Note that, in this example, the mean and variance variables are both scalars, with the shape (1,), while the random variable Y has the shape (100,). This indicates that the mean and variance are broadcast to the shape of the random variable, just like the broadcasting rule in NumPy array operations. In this case, it means that the individual entries of the random variable Y follow a scalar normal distribution with the same mean and variance.
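For intuition, the same broadcasting behaviour can be sketched in plain NumPy (an illustrative snippet, not MXFusion code): a scalar mean and variance are expanded to the shape of the 100-dimensional random variable.

import numpy as np
mu = np.array([3.0])   # shape (1,)
s = np.array([5.0])    # shape (1,)
# Broadcasting expands the scalar parameters to the shape of Y, i.e. (100,)
samples = mu + np.sqrt(s) * np.random.randn(100)
print(samples.shape)   # (100,)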
To list the content that is defined in the model instance, just print the model instance as follows:
[5]:
print(m)
Y ~ Normal(mean=mu, variance=s)
After defining the probabilistic model, we want to estimate the mean and variance of the normal distribution in our model conditioned on the data that we generated. In MXFusion, this is done by creating an inference algorithm and passing it into the creation of an Inference instance. An inference algorithm represents a specific algorithm for probabilistic inference. In this example, we perform a maximum likelihood estimate by using the MAP class. The Inference class takes care of the initialization of parameters and the execution of inference.
In the following code, we create a MAP inference algorithm by specifying the model and the set of observed variables. Then, we create a GradBasedInference instance from the instantiated MAP inference algorithm.
The execution of inference is done by calling the run method. The run method takes all observed data (specified when creating the inference algorithm) as keyword arguments, where the keys are the names of the member variables of the model and the values are the corresponding MXNet NDArrays. In this example, we only observe the variable Y, so we pass “Y” as the key and the generated data as the value. We also specify the configuration parameters for the gradient optimizer, such as the learning rate, the maximum number of iterations and whether to print the optimization progress. The default optimizer is Adam.
[6]:
from mxfusion.inference import GradBasedInference, MAP
import mxnet as mx
infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.Y]))
infr.run(Y=mx.nd.array(data, dtype='float64'), learning_rate=0.1, max_iter=2000, verbose=True)
Iteration 201 loss: 226.00307424753382
Iteration 401 loss: 223.62512496838366
Iteration 601 loss: 223.23108001422028
Iteration 801 loss: 223.16242835266598
Iteration 1001 loss: 223.15215875874755
Iteration 1201 loss: 223.15098355232075
Iteration 1401 loss: 223.15088846215527
Iteration 1601 loss: 223.15088333213185
Iteration 1801 loss: 223.15088315658113
Iteration 2000 loss: 223.15088315295884
After optimization, the estimated parameters are stored in an instance of the class InferenceParameters, which can be accessed from an Inference instance via infr.params.
We collect the estimated mean and variance and compare them with the generating parameters.
[7]:
mean_estimated = infr.params[m.mu].asnumpy()
variance_estimated = infr.params[m.s].asnumpy()
print('The estimated mean and variance: %f, %f.' % (mean_estimated, variance_estimated))
print('The true mean and variance: %f, %f.' % (mean_groundtruth, variance_groundtruth))
The estimated mean and variance: 3.133735, 5.079126.
The true mean and variance: 3.000000, 5.000000.
The estimated parameters are close to the generating parameters, but still off by a small amount. This difference is due to the small size of the dataset we used; with more data, the estimates would move closer to the true values.
A Bayesian model¶
In the above example, we obtained a maximum likelihood estimate from the observed data. Due to the limited number of data points, the estimated parameters are not the same as the true parameters. An interesting question is whether we can estimate how big this difference might be. One approach to providing such an estimate is Bayesian inference.
Following the above example, we need to assume prior distributions for the mean and variance of the normal distribution. We give the mean a normal prior with a relatively large variance, indicating that we do not have much knowledge about this parameter.
[8]:
m = Model()
m.mu = Normal.define_variable(mean=mx.nd.array([0], dtype='float64'),
                              variance=mx.nd.array([100], dtype='float64'), shape=(1,))
Then, we need to specify a prior distribution for the variance. This is a bit more complicated, as the variance needs to be positive. In principle, one can use a distribution over positive values such as the Gamma distribution. To enable inference with the reparameterization trick, we instead assume a random variable \(\hat{s}\) with a normal distribution and let the variance \(s\) be a function of \(\hat{s}\),

\(s = \log\left(1 + e^{\hat{s}}\right),\)

which is the same “soft-plus” transformation used above.
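As a quick numerical check of this transformation (a NumPy sketch, not part of the model definition), the soft-plus function maps any real value of \(\hat{s}\) to a positive value of \(s\):

import numpy as np
s_hat = np.array([-5., 0., 5.])
s = np.log1p(np.exp(s_hat))   # soft-plus: log(1 + exp(s_hat))
print(s)                      # all entries are strictly positive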
To implement the above prior in MXFusion, we first create the variable s_hat with a normal distribution. Then, we define a function in the MXNet Gluon syntax (also called a Gluon block) for the “soft-plus” transformation. The MXNet function is brought into the MXFusion environment by a wrapper called MXFusionGluonFunction, in which we specify the number of outputs. We pass the variable s_hat as the input to the function and get the variable s as the return value.
[9]:
from mxfusion.components.functions import MXFusionGluonFunction
m.s_hat = Normal.define_variable(mean=mx.nd.array([5], dtype='float64'),
                                 variance=mx.nd.array([100], dtype='float64'),
                                 shape=(1,), dtype='float64')
trans_mxnet = mx.gluon.nn.HybridLambda(lambda F, x: F.Activation(x, act_type='softrelu'))
m.trans = MXFusionGluonFunction(trans_mxnet, num_outputs=1, broadcastable=True)
m.s = m.trans(m.s_hat)
We define the variable Y following a normal distribution with the mean mu and the variance s.
[10]:
m.Y = Normal.define_variable(mean=m.mu, variance=m.s, shape=(N,), dtype='float64')
print(m)
s_hat ~ Normal(mean=Variable(66591), variance=Variable(582f6))
s = GluonFunctionEvaluation(hybridlambda0_input_0=s_hat)
mu ~ Normal(mean=Variable(230f5), variance=Variable(e5c1f))
Y ~ Normal(mean=mu, variance=s)
Inference for the above model is more complex, as the exact inference is intractable. We use variational inference with a Gaussian mean field posterior.
We construct the variational posterior by calling the function create_Gaussian_meanfield, which defines a Gaussian distribution for both the mean and the variance as the variational posterior. The content in the generated posterior can be listed by printing the posterior.
[11]:
from mxfusion.inference import create_Gaussian_meanfield
q = create_Gaussian_meanfield(model=m, observed=[m.Y])
print(q)
s_hat ~ Normal(mean=Variable(987b5), variance=Variable(bf47c))
mu ~ Normal(mean=Variable(9c88b), variance=Variable(3ce74))
Then, we create an instance of StochasticVariationalInference with both the model and the variational posterior. We also need to specify the number of samples used during inference, as it uses the Monte Carlo method to approximate the integral in the variational lower bound. The execution of inference follows the same interface as before.
[12]:
from mxfusion.inference import StochasticVariationalInference
infr = GradBasedInference(inference_algorithm=StochasticVariationalInference(
    model=m, posterior=q, num_samples=10, observed=[m.Y]))
infr.run(Y=mx.nd.array(data, dtype='float64'), learning_rate=0.1, verbose=True)
Iteration 201 loss: 234.52798823187217
Iteration 401 loss: 231.30663180752714
Iteration 601 loss: 230.14185436095775
Iteration 801 loss: 230.02993763459781
Iteration 1001 loss: 229.6937452192292
Iteration 1201 loss: 229.72574317072662
Iteration 1401 loss: 229.65264175269712
Iteration 1601 loss: 229.67285188868293
Iteration 1801 loss: 229.52906307607037
Iteration 2000 loss: 229.64091981034755
Let’s check the resulting posterior distribution.
[13]:
mu_mean = infr.params[q.mu.factor.mean].asscalar()
mu_std = np.sqrt(infr.params[q.mu.factor.variance].asscalar())
s_hat_mean = infr.params[q.s_hat.factor.mean].asscalar()
s_hat_std = np.sqrt(infr.params[q.s_hat.factor.variance].asscalar())
s_15 = np.log1p(np.exp(s_hat_mean - s_hat_std))
s_50 = np.log1p(np.exp(s_hat_mean))
s_85 = np.log1p(np.exp(s_hat_mean + s_hat_std))
print('The mean and standard deviation of the mean parameter is %f(%f). ' % (mu_mean, mu_std))
print('The 15th, 50th and 85th percentile of the variance parameter is %f, %f and %f.'%(s_15, s_50, s_85))
The mean and standard deviation of the mean parameter is 3.120117(0.221690).
The 15th, 50th and 85th percentile of the variance parameter is 4.604521, 5.309114 and 6.016289.
The true parameters sit within one standard deviation of the estimated posterior distribution for both the mean and the variance. The width of the posterior gives a good indication of how much we can trust the estimated parameters.
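If we want more than the summary statistics above, we can also draw Monte Carlo samples from the fitted mean-field posterior (a short sketch reusing the quantities computed in the previous cell) to obtain, for example, credible intervals:

n_samples = 10000
mu_samples = np.random.randn(n_samples) * mu_std + mu_mean
s_samples = np.log1p(np.exp(np.random.randn(n_samples) * s_hat_std + s_hat_mean))
print('95%% credible interval of the mean: (%f, %f)' % tuple(np.percentile(mu_samples, [2.5, 97.5])))
print('95%% credible interval of the variance: (%f, %f)' % tuple(np.percentile(s_samples, [2.5, 97.5])))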
Probabilistic PCA Tutorial¶
This tutorial will demonstrate Probabilistic PCA, a factor analysis technique.
The maths and notation follow Machine Learning: A Probabilistic Perspective.
Installation¶
Follow the installation instructions in the README file to get set up.
Probabilistic Modeling Introduction¶
Probabilistic models can be categorized into directed graphical models (DGMs, also known as Bayes nets) and undirected graphical models (UGMs). Most popular probabilistic models are DGMs, so MXFusion only supports the definition of DGMs unless there is a strong customer need for UGMs in the future.
A DGM can be fully defined using 3 basic components: deterministic functions, probabilistic distributions, and random variables. We show the interface for defining a model using each of the three components below.
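As a preview (a minimal, hypothetical sketch using the API introduced later in this notebook), all three kinds of components can appear in one small model:

import mxnet as mx
import mxnet.gluon.nn as nn
from mxfusion import Model, Variable
from mxfusion.components.distributions import Normal
from mxfusion.components.functions import MXFusionGluonFunction
from mxfusion.components.functions.operators import broadcast_to

toy = Model()
toy.x = Variable(shape=(10, 1))                        # a random variable (here, an input)
double = nn.HybridLambda(lambda F, x: 2 * x)           # a deterministic MXNet function
toy.f = MXFusionGluonFunction(double, num_outputs=1)   # wrapped for use in the model
toy.r = toy.f(toy.x)                                   # a deterministic function evaluation
toy.y = Normal.define_variable(mean=toy.r,             # a probabilistic distribution
                               variance=broadcast_to(mx.nd.array([1.]), (10, 1)),
                               shape=(10, 1))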
First let’s import the basic libraries we’ll need to train our model and visualize some data.
[1]:
import warnings
warnings.filterwarnings('ignore')
import mxfusion as mf
import mxnet as mx
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Data Generation¶
We’ll use components of the log spiral function as the function to learn, because it’s 2-dimensional and easy to visualize.
[2]:
def log_spiral(a,b,t):
    x = a * np.exp(b*t) * np.cos(t)
    y = a * np.exp(b*t) * np.sin(t)
    return np.vstack([x,y]).T
We parameterize the function with 100 data points and plot the resulting function.
[3]:
N = 100
D = 100
K = 2
a = 1
b = 0.1
t = np.linspace(0,6*np.pi,N)
r = log_spiral(a,b,t)
[4]:
r.shape
[4]:
(100, 2)
[5]:
plt.plot(r[:,0], r[:,1],'.')
[5]:
[<matplotlib.lines.Line2D at 0x1a1cbf4d68>]

We now project our \(K\)-dimensional r into a high-dimensional \(D\) space using a matrix of random weights \(W\). Now that r is embedded in a \(D\)-dimensional space, the goal of PPCA will be to recover r in its original low-dimensional \(K\) space.
[6]:
w = np.random.randn(K,D)
x_train = np.dot(r,w) + np.random.randn(N,D) * 1e-3
[7]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=2)
# new_r = pca.fit_transform(x_train)
# plt.plot(new_r[:,0], new_r[:,1],'.')
You can explore the higher-dimensional data manually by changing dim1 and dim2 in the following cell.
[8]:
dim1 = 79
dim2 = 11
plt.scatter(x_train[:,dim1], x_train[:,dim2], color='blue', alpha=0.1)
plt.axis([-10, 10, -10, 10])
plt.title("Simulated data set")
plt.show()

MXFusion Model Definition¶
Import MXFusion and MXNet modelling components
[9]:
from mxfusion.models import Model
import mxnet.gluon.nn as nn
from mxfusion.components import Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.functions.operators import broadcast_to
The primary data structure in MXFusion is the Model. Models hold ModelComponents, such as Variables, Distributions, and Functions, which are what define a probabilistic model.
The model we’ll be defining for PPCA is:
\(p(z)\) ~ \(N(\mathbf{\mu}, \mathbf{\Sigma})\)
\(p(x | z,\theta)\) ~ \(N(\mathbf{Wz} + \mu, \Psi)\)
where:
\(z \in \mathbb{R}^{N \times K}, \mathbf{\mu} \in \mathbb{R}^K, \mathbf{\Sigma} \in \mathbb{R}^{N \times K \times K}, x \in \mathbb{R}^{N \times D}\)
\(\Psi \in \mathbb{R}^{N \times D \times D}, \Psi = [\Psi_1, \dots, \Psi_N], \Psi_i = \sigma^2\mathbf{I}\)
\(z\) here is our latent variable of interest, \(x\) is the observed data, and all other variables are parameters or constants of the model.
First we create an MXFusion Model object to build our PPCA model on.
[10]:
m = Model()
We attach Variable objects to our model to collect them in a centralized place. Internally, these are organized into a factor graph which is used during Inference.
[11]:
m.w = Variable(shape=(K,D), initial_value=mx.nd.array(np.random.randn(K,D)))
Because the mean of \(x\)’s distribution is composed of the dot product of \(z\) and \(W\), we need a dot product function. First we create a dot product function in MXNet and then wrap it into MXFusion using the MXFusionGluonFunction class. m.dot can then be called like a normal Python function and will apply to the variables it is called on.
[12]:
dot = nn.HybridLambda(function='dot')
m.dot = mf.functions.MXFusionGluonFunction(dot, num_outputs=1, broadcastable=False)
Now we define m.z, which has an identity matrix covariance cov and zero mean. m.z and sigma_2 are then used to define m.x. Note that both sigma_2 and cov will be added implicitly into the Model because they are inputs to m.x.
[13]:
cov = mx.nd.broadcast_to(mx.nd.expand_dims(mx.nd.array(np.eye(K,K)), 0),shape=(N,K,K))
m.z = mf.distributions.MultivariateNormal.define_variable(mean=mx.nd.zeros(shape=(N,K)), covariance=cov, shape=(N,K))
m.sigma_2 = Variable(shape=(1,), transformation=PositiveTransformation())
m.x = mf.distributions.Normal.define_variable(mean=m.dot(m.z, m.w), variance=broadcast_to(m.sigma_2, (N,D)), shape=(N,D))
Posterior Definition¶
Now that we have our model, we need to define a posterior with parameters for the inference algorithm to optimize. When constructing a Posterior, we pass in the Model it is defined over and ModelComponent’s from the original Model are accessible and visible in the Posterior.
The covariance matrix must remain positive definite throughout the optimization process in order for the Cholesky decomposition to succeed when drawing samples or computing the log pdf of q.z. To satisfy this, we pass the covariance matrix parameters through a Gluon function that forces them into a symmetric matrix, for which suitable initialization values should maintain positive definiteness throughout the optimization procedure.
[14]:
from mxfusion.inference import BatchInferenceLoop, GradBasedInference, StochasticVariationalInference
class SymmetricMatrix(mx.gluon.HybridBlock):
    def hybrid_forward(self, F, x, *args, **kwargs):
        return F.sum((F.expand_dims(x, 3)*F.expand_dims(x, 2)), axis=-3)
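A quick way to see what this block computes (a small check, not part of the training code): for an input of shape (N, K, K) it returns, for each n, the product of x[n] transposed with x[n], which is symmetric and positive semi-definite by construction.

import numpy as np
import mxnet as mx
x_test = mx.nd.array(np.random.randn(3, K, K))      # K = 2, as defined above
out = SymmetricMatrix()(x_test).asnumpy()
x_np = x_test.asnumpy()
print(np.allclose(out[0], out[0].T))                # True: the result is symmetric
print(np.allclose(out[0], x_np[0].T.dot(x_np[0])))  # True: equals x[0].T @ x[0]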
While this model has an analytical solution, we will run Variational Inference to find the posterior to demonstrate inference in a setting where the answer is known.
We place a multivariate normal prior over \(z\) because that is \(z\)’s prior in the model and we don’t need to approximate anything in this case. Because the form we’re optimizing over is the true model, the optimization is convex and will always converge to the same answer given by classical PCA given enough iterations.
[15]:
q = mf.models.Posterior(m)
sym = mf.components.functions.MXFusionGluonFunction(SymmetricMatrix(), num_outputs=1, broadcastable=False)
cov = Variable(shape=(N,K,K), initial_value=mx.nd.broadcast_to(mx.nd.expand_dims(mx.nd.array(np.eye(K,K) * 1e-2), 0),shape=(N,K,K)))
q.post_cov = sym(cov)
q.post_mean = Variable(shape=(N,K), initial_value=mx.nd.array(np.random.randn(N,K)))
q.z.set_prior(mf.distributions.MultivariateNormal(mean=q.post_mean, covariance=q.post_cov))
We now take our posterior and model, along with an observation pattern (in our case only m.x is observed), and create an inference algorithm. This inference algorithm is combined with a gradient loop to create the Inference method infr.
[16]:
observed = [m.x]
alg = StochasticVariationalInference(num_samples=3, model=m, posterior=q, observed=observed)
infr = GradBasedInference(inference_algorithm=alg, grad_loop=BatchInferenceLoop())
The inference method is then initialized with our training data and we run the optimization for a while until convergence.
[17]:
infr.initialize(x=mx.nd.array(x_train))
[18]:
infr.run(max_iter=1000, learning_rate=1e-2, x=mx.nd.array(x_train))
Once training completes, we retrieve the posterior mean of \(z\) (our trained low-dimensional representation of the data) from the inference method and plot it. As shown, the plot recovers (up to rotation) the original 2D data quite well.
[19]:
post_z_mean = infr.params[q.z.factor.mean].asnumpy()
[20]:
plt.plot(post_z_mean[:,0], post_z_mean[:,1],'.')
[20]:
[<matplotlib.lines.Line2D at 0x1a1fc3e668>]

Bayesian Neural Network (VI) for classification (under Development)¶
[1]:
import warnings
warnings.filterwarnings('ignore')
import mxfusion as mf
import mxnet as mx
import numpy as np
import mxnet.gluon.nn as nn
import mxfusion.components
import mxfusion.inference
Generate Synthetic Data¶
[2]:
import GPy
%matplotlib inline
from pylab import *
np.random.seed(4)
k = GPy.kern.RBF(1, lengthscale=0.1)
x = np.random.rand(200,1)
y = np.random.multivariate_normal(mean=np.zeros((200,)), cov=k.K(x), size=(1,)).T>0.
plot(x[:,0], y[:,0], '.')
[2]:
[<matplotlib.lines.Line2D at 0x1a2bc4aeb8>]

[3]:
D = 10
net = nn.HybridSequential(prefix='nn_')
with net.name_scope():
    net.add(nn.Dense(D, activation="tanh", flatten=False, in_units=1))
    net.add(nn.Dense(D, activation="tanh", flatten=False, in_units=D))
    net.add(nn.Dense(2, flatten=False, in_units=D))
net.initialize(mx.init.Xavier(magnitude=1))
[4]:
from mxfusion.components.variables.var_trans import PositiveTransformation
from mxfusion.inference import VariationalPosteriorForwardSampling
from mxfusion.components.functions.operators import broadcast_to
from mxfusion.components.distributions import Normal, Categorical
from mxfusion import Variable, Model
from mxfusion.components.functions import MXFusionGluonFunction
[5]:
m = Model()
m.N = Variable()
m.f = MXFusionGluonFunction(net, num_outputs=1, broadcastable=False)
m.x = Variable(shape=(m.N,1))
m.r = m.f(m.x)
for _,v in m.r.factor.parameters.items():
    v.set_prior(Normal(mean=broadcast_to(mx.nd.array([0]), v.shape),
                       variance=broadcast_to(mx.nd.array([1.]), v.shape)))
m.y = Categorical.define_variable(log_prob=m.r, shape=(m.N,1), num_classes=2)
print(m)
Model (cf188)
Variable(ca45c) = BroadcastToOperator(data=Variable(8dec8))
Variable(383ef) = BroadcastToOperator(data=Variable(faf0a))
Variable(44754) ~ Normal(mean=Variable(383ef), variance=Variable(ca45c))
Variable(03371) = BroadcastToOperator(data=Variable(ad532))
Variable(88468) = BroadcastToOperator(data=Variable(2110c))
Variable(84fc2) ~ Normal(mean=Variable(88468), variance=Variable(03371))
Variable(1dc39) = BroadcastToOperator(data=Variable(f3d0f))
Variable(77d1c) = BroadcastToOperator(data=Variable(121a5))
Variable(4e7d9) ~ Normal(mean=Variable(77d1c), variance=Variable(1dc39))
Variable(dbd5d) = BroadcastToOperator(data=Variable(68ad8))
Variable(51f11) = BroadcastToOperator(data=Variable(1ac45))
Variable(dccd6) ~ Normal(mean=Variable(51f11), variance=Variable(dbd5d))
Variable(47fa5) = BroadcastToOperator(data=Variable(e966f))
Variable(90359) = BroadcastToOperator(data=Variable(f19f8))
Variable(daaa7) ~ Normal(mean=Variable(90359), variance=Variable(47fa5))
Variable(c310f) = BroadcastToOperator(data=Variable(1378f))
Variable(a21a3) = BroadcastToOperator(data=Variable(91998))
Variable(44c15) ~ Normal(mean=Variable(a21a3), variance=Variable(c310f))
r = GluonFunctionEvaluation(nn_input_0=x, nn_dense0_weight=Variable(44c15), nn_dense0_bias=Variable(daaa7), nn_dense1_weight=Variable(dccd6), nn_dense1_bias=Variable(4e7d9), nn_dense2_weight=Variable(84fc2), nn_dense2_bias=Variable(44754))
y ~ Categorical(log_prob=r)
[6]:
from mxfusion.inference import BatchInferenceLoop, create_Gaussian_meanfield, GradBasedInference, StochasticVariationalInference, MAP
[7]:
observed = [m.y, m.x]
q = create_Gaussian_meanfield(model=m, observed=observed)
alg = StochasticVariationalInference(num_samples=5, model=m, posterior=q, observed=observed)
infr = GradBasedInference(inference_algorithm=alg, grad_loop=BatchInferenceLoop())
[8]:
infr.initialize(y=mx.nd.array(y), x=mx.nd.array(x))
[9]:
for v_name, v in m.r.factor.parameters.items():
    infr.params[q[v].factor.mean] = net.collect_params()[v_name].data()
    infr.params[q[v].factor.variance] = mx.nd.ones_like(infr.params[q[v].factor.variance])*1e-6
[10]:
infr.run(max_iter=500, learning_rate=1e-1, y=mx.nd.array(y), x=mx.nd.array(x), verbose=True)
Iteration 51 loss: 661.41601562593755
Iteration 101 loss: 301.83178710937525
Iteration 151 loss: 166.81152343757812
Iteration 201 loss: 159.75297546386725
Iteration 251 loss: 154.61776733398438
Iteration 301 loss: 147.10989379882812
Iteration 351 loss: 153.09896850585938
Iteration 401 loss: 131.58213806152344
Iteration 451 loss: 147.08862304687556
Iteration 500 loss: 136.80494689941406
[11]:
# for uuid, v in infr.inference_algorithm.posterior.variables.items():
# if uuid in infr.params.param_dict:
# print(v.name, infr.params[v])
[12]:
xt = np.linspace(0,1,100)[:,None]
[13]:
infr2 = VariationalPosteriorForwardSampling(10, [m.x], infr, [m.r])
res = infr2.run(x=mx.nd.array(xt))
[14]:
yt = res[0].asnumpy()
[15]:
yt_mean = yt.mean(0)
yt_std = yt.std(0)
for i in range(yt.shape[0]):
    plot(xt[:,0],1./(1+np.exp(yt[i,:,0]-yt[i,:,1])),'k',alpha=0.2)
plot(x[:,0],y[:,0],'.')
[15]:
[<matplotlib.lines.Line2D at 0x1a2d368ac8>]

Bayesian Neural Network (VI) for regression¶
Zhenwen Dai (2018-8-21)¶
[1]:
import warnings
warnings.filterwarnings('ignore')
import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
import mxfusion as mf
import mxnet as mx
import numpy as np
import mxnet.gluon.nn as nn
import mxfusion.components
import mxfusion.inference
Generate Synthetic Data¶
[2]:
import GPy
%matplotlib inline
from pylab import *
np.random.seed(0)
k = GPy.kern.RBF(1, lengthscale=0.1)
x = np.random.rand(1000,1)
y = np.random.multivariate_normal(mean=np.zeros((1000,)), cov=k.K(x), size=(1,)).T
plot(x[:,0], y[:,0], '.')
[2]:
[<matplotlib.lines.Line2D at 0x10e9cf6d8>]

Model definition¶
[3]:
D = 50
net = nn.HybridSequential(prefix='nn_')
with net.name_scope():
    net.add(nn.Dense(D, activation="tanh", in_units=1))
    net.add(nn.Dense(D, activation="tanh", in_units=D))
    net.add(nn.Dense(1, flatten=True, in_units=D))
net.initialize(mx.init.Xavier(magnitude=3))
[4]:
from mxfusion.components.variables.var_trans import PositiveTransformation
from mxfusion.inference import VariationalPosteriorForwardSampling
from mxfusion.components.functions.operators import broadcast_to
from mxfusion.components.distributions import Normal
from mxfusion import Variable, Model
from mxfusion.components.functions import MXFusionGluonFunction
[5]:
m = Model()
m.N = Variable()
m.f = MXFusionGluonFunction(net, num_outputs=1,broadcastable=False)
m.x = Variable(shape=(m.N,1))
m.v = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=mx.nd.array([0.01]))
m.r = m.f(m.x)
for v in m.r.factor.parameters.values():
    v.set_prior(Normal(mean=broadcast_to(mx.nd.array([0]), v.shape),
                       variance=broadcast_to(mx.nd.array([1.]), v.shape)))
m.y = Normal.define_variable(mean=m.r, variance=broadcast_to(m.v, (m.N,1)), shape=(m.N,1))
print(m)
Model (a0f1e)
Variable(b08ec) = BroadcastToOperator(data=Variable(6c09d))
Variable(1cdae) = BroadcastToOperator(data=Variable(20ed6))
Variable(f6567) ~ Normal(mean=Variable(1cdae), variance=Variable(b08ec))
Variable(11427) = BroadcastToOperator(data=Variable(cb51c))
Variable(6068d) = BroadcastToOperator(data=Variable(a397c))
Variable(0d566) ~ Normal(mean=Variable(6068d), variance=Variable(11427))
Variable(a2806) = BroadcastToOperator(data=Variable(37171))
Variable(64e44) = BroadcastToOperator(data=Variable(58b81))
Variable(591da) ~ Normal(mean=Variable(64e44), variance=Variable(a2806))
Variable(04dac) = BroadcastToOperator(data=Variable(56e87))
Variable(a1d30) = BroadcastToOperator(data=Variable(e500b))
Variable(1caf4) ~ Normal(mean=Variable(a1d30), variance=Variable(04dac))
Variable(7a6fd) = BroadcastToOperator(data=Variable(39c80))
Variable(2bf77) = BroadcastToOperator(data=Variable(2d483))
Variable(c555f) ~ Normal(mean=Variable(2bf77), variance=Variable(7a6fd))
Variable(9c33c) = BroadcastToOperator(data=Variable(19481))
Variable(507a4) = BroadcastToOperator(data=Variable(56583))
Variable(5c091) ~ Normal(mean=Variable(507a4), variance=Variable(9c33c))
Variable(120d7) = BroadcastToOperator(data=v)
r = GluonFunctionEvaluation(nn_input_0=x, nn_dense0_weight=Variable(5c091), nn_dense0_bias=Variable(c555f), nn_dense1_weight=Variable(1caf4), nn_dense1_bias=Variable(591da), nn_dense2_weight=Variable(0d566), nn_dense2_bias=Variable(f6567))
y ~ Normal(mean=r, variance=Variable(120d7))
Inference with Meanfield¶
[6]:
from mxfusion.inference import BatchInferenceLoop, create_Gaussian_meanfield, GradBasedInference, StochasticVariationalInference
[7]:
observed = [m.y, m.x]
q = create_Gaussian_meanfield(model=m, observed=observed)
alg = StochasticVariationalInference(num_samples=3, model=m, posterior=q, observed=observed)
infr = GradBasedInference(inference_algorithm=alg, grad_loop=BatchInferenceLoop())
[8]:
infr.initialize(y=mx.nd.array(y), x=mx.nd.array(x))
[9]:
for v_name, v in m.r.factor.parameters.items():
    infr.params[q[v].factor.mean] = net.collect_params()[v_name].data()
    infr.params[q[v].factor.variance] = mx.nd.ones_like(infr.params[q[v].factor.variance])*1e-6
[10]:
infr.run(max_iter=2000, learning_rate=1e-2, y=mx.nd.array(y), x=mx.nd.array(x), verbose=True)
Iteration 201 loss: 15813.8652343755
Iteration 401 loss: 11816.2539062575
Iteration 601 loss: 8878.53613281255
Iteration 801 loss: 6882.62353515625
Iteration 1001 loss: 4587.8847656255
Iteration 1201 loss: 3141.453613281255
Iteration 1401 loss: 2384.0412597656255
Iteration 1601 loss: 1506.3929443359375
Iteration 1801 loss: 1371.0905761718755
Iteration 2000 loss: 1076.2847900390625
Use prediction to visualize the resulting BNN¶
[11]:
xt = np.linspace(0,1,100)[:,None]
[12]:
infr2 = VariationalPosteriorForwardSampling(10, [m.x], infr, [m.r])
res = infr2.run(x=mx.nd.array(xt))
[13]:
yt = res[0].asnumpy()
[14]:
yt_mean = yt.mean(0)
yt_std = yt.std(0)
for i in range(yt.shape[0]):
    plot(xt[:,0],yt[i,:,0],'k',alpha=0.2)
plot(x[:,0],y[:,0],'.')
[14]:
[<matplotlib.lines.Line2D at 0x10f6b75c0>]

Variational Auto-Encoder (VAE)¶
Zhenwen Dai (2019-05-29)¶
Variational auto-encoder (VAE) is a latent variable model that uses a latent variable to generate data represented in vector form. Consider a latent variable \(x\) and an observed variable \(y\). The plain VAE is defined as

\(p(x) = \mathcal{N}(x|0, \mathbf{I}), \qquad p(y|x) = \mathcal{N}\left(y|f(x), \sigma^2\mathbf{I}\right),\)
where \(f\) is the deep neural network (DNN), often referred to as the decoder network.
The variational posterior of VAE is defined as

\(q(x|y) = \mathcal{N}\left(x|g_{\mu}(y), s\mathbf{I}\right),\)

where \(g_{\mu}\) is the encoder network that generates the mean of the variational posterior of \(x\). For simplicity, we assume that all the data points share the same variance \(s\) in the variational posterior. This can be extended by generating the variance from the encoder network as well.
[1]:
import warnings
warnings.filterwarnings('ignore')
import mxfusion as mf
import mxnet as mx
import numpy as np
import mxnet.gluon.nn as nn
import mxfusion.components
import mxfusion.inference
%matplotlib inline
from pylab import *
Load a toy dataset¶
[2]:
import GPy
data = GPy.util.datasets.oil_100()
Y = data['X']
label = data['Y'].argmax(1)
[3]:
N, D = Y.shape
Model Definition¶
We first define the encoder and decoder DNNs with MXNet Gluon blocks. Both DNNs have two hidden layers with tanh non-linearity.
[4]:
Q = 2
[5]:
H = 50
encoder = nn.HybridSequential(prefix='encoder_')
with encoder.name_scope():
    encoder.add(nn.Dense(H, in_units=D, activation="tanh", flatten=False))
    encoder.add(nn.Dense(H, in_units=H, activation="tanh", flatten=False))
    encoder.add(nn.Dense(Q, in_units=H, flatten=False))
encoder.initialize(mx.init.Xavier(magnitude=3))
[6]:
H = 50
decoder = nn.HybridSequential(prefix='decoder_')
with decoder.name_scope():
    decoder.add(nn.Dense(H, in_units=Q, activation="tanh", flatten=False))
    decoder.add(nn.Dense(H, in_units=H, activation="tanh", flatten=False))
    decoder.add(nn.Dense(D, in_units=H, flatten=False))
decoder.initialize(mx.init.Xavier(magnitude=3))
Then, we define the VAE model in MXFusion. Note that, for simplicity of implementation, we use scalar normal distributions defined over the individual entries of a matrix instead of multivariate normal distributions with diagonal covariance matrices.
[8]:
from mxfusion.components.variables.var_trans import PositiveTransformation
from mxfusion import Variable, Model, Posterior
from mxfusion.components.functions import MXFusionGluonFunction
from mxfusion.components.distributions import Normal
from mxfusion.components.functions.operators import broadcast_to
m = Model()
m.N = Variable()
m.decoder = MXFusionGluonFunction(decoder, num_outputs=1,broadcastable=True)
m.x = Normal.define_variable(mean=broadcast_to(mx.nd.array([0]), (m.N, Q)),
variance=broadcast_to(mx.nd.array([1]), (m.N, Q)), shape=(m.N, Q))
m.f = m.decoder(m.x)
m.noise_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=mx.nd.array([0.01]))
m.y = Normal.define_variable(mean=m.f, variance=broadcast_to(m.noise_var, (m.N, D)),
shape=(m.N, D))
print(m)
Model (37a04)
Variable (b92c2) = BroadcastToOperator(data=Variable noise_var (a50d4))
Variable (39c2c) = BroadcastToOperator(data=Variable (e1aad))
Variable (b7150) = BroadcastToOperator(data=Variable (a57d4))
Variable x (53056) ~ Normal(mean=Variable (b7150), variance=Variable (39c2c))
Variable f (ad606) = GluonFunctionEvaluation(decoder_input_0=Variable x (53056), decoder_dense0_weight=Variable (b9b70), decoder_dense0_bias=Variable (d95aa), decoder_dense1_weight=Variable (73dc2), decoder_dense1_bias=Variable (b85dd), decoder_dense2_weight=Variable (7a61c), decoder_dense2_bias=Variable (eba91))
Variable y (23bca) ~ Normal(mean=Variable f (ad606), variance=Variable (b92c2))
We also define the variational posterior following the equation above.
[9]:
q = Posterior(m)
q.x_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=mx.nd.array([1e-6]))
q.encoder = MXFusionGluonFunction(encoder, num_outputs=1, broadcastable=True)
q.x_mean = q.encoder(q.y)
q.x.set_prior(Normal(mean=q.x_mean, variance=broadcast_to(q.x_var, q.x.shape)))
print(q)
Posterior (4ec05)
Variable x_mean (86d22) = GluonFunctionEvaluation(encoder_input_0=Variable y (23bca), encoder_dense0_weight=Variable (51b3d), encoder_dense0_bias=Variable (c0092), encoder_dense1_weight=Variable (ad9ef), encoder_dense1_bias=Variable (83db0), encoder_dense2_weight=Variable (78b82), encoder_dense2_bias=Variable (b856d))
Variable (6dc84) = BroadcastToOperator(data=Variable x_var (19d07))
Variable x (53056) ~ Normal(mean=Variable x_mean (86d22), variance=Variable (6dc84))
Variational Inference¶
Variational inference is done via creating an inference object and passing in the stochastic variational inference algorithm.
[11]:
from mxfusion.inference import BatchInferenceLoop, StochasticVariationalInference, GradBasedInference
observed = [m.y]
alg = StochasticVariationalInference(num_samples=3, model=m, posterior=q, observed=observed)
infr = GradBasedInference(inference_algorithm=alg, grad_loop=BatchInferenceLoop())
SVI is a gradient-based algorithm. We can run the algorithm by providing the data and specifying the parameters for the gradient optimizer (the default gradient optimizer is Adam).
[13]:
infr.run(max_iter=2000, learning_rate=1e-2, y=mx.nd.array(Y), verbose=True)
Iteration 200 loss: 1720.556396484375
Iteration 400 loss: 601.11962890625
Iteration 600 loss: 168.620849609375
Iteration 800 loss: -48.67474365234375
Iteration 1000 loss: -207.34835815429688
Iteration 1200 loss: -354.17742919921875
Iteration 1400 loss: -356.26409912109375
Iteration 1600 loss: -561.263427734375
Iteration 1800 loss: -697.8665161132812
Iteration 2000 loss: -753.83203125
Plot the training data in the latent space¶
Finally, we may be interested in visualizing the latent space of our dataset. We can do that by calling the encoder network.
[17]:
from mxfusion.inference import TransferInference
q_x_mean = q.encoder.gluon_block(mx.nd.array(Y)).asnumpy()
[18]:
for i in range(3):
    plot(q_x_mean[label==i,0], q_x_mean[label==i,1], '.')

Gaussian Process Regression¶
Zhenwen Dai (2019-05-29)
Introduction¶
A Gaussian process (GP) is a Bayesian non-parametric model used for various machine learning problems such as regression and classification. This notebook shows how to use a Gaussian process regression model in MXFusion.
[1]:
import warnings
warnings.filterwarnings('ignore')
import os
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
Toy data¶
We generate some synthetic data for our regression example. The data set is generated from a sine function with some additive Gaussian noise.
[2]:
import numpy as np
%matplotlib inline
from pylab import *
np.random.seed(0)
X = np.random.uniform(-3.,3.,(20,1))
Y = np.sin(X) + np.random.randn(20,1)*0.05
The generated data are visualized as follows:
[3]:
plot(X, Y, 'rx', label='data points')
_=legend()

Gaussian process regression with Gaussian likelihood¶
Denote a set of input points \(X \in \mathbb{R}^{N \times Q}\). A Gaussian process is often formulated as a multi-variate normal distribution conditioned on the inputs:

\(p(F|X) = \mathcal{N}\left(F|0, K(X, X)\right),\)

where \(K(X, X)\) is the covariance matrix computed from the kernel function on the inputs.
For a regression problem, \(F\) is often referred to as the noise-free output and we usually assume an additional probability distribution as the observation noise. In this case, we assume the noise distribution to be Gaussian:

\(p(Y|F) = \mathcal{N}\left(Y|F, \sigma^2\mathbf{I}\right).\)
The following code defines the above GP regression in MXFusion. First, we change the default data dtype to double precision to avoid any potential numerical issues.
[4]:
from mxfusion.common import config
config.DEFAULT_DTYPE = 'float64'
In the code below, the variable Y is defined following the probabilistic module GPRegression. A probabilistic module in MXFusion is a pre-built probabilistic model with dedicated inference algorithms for computing the log-pdf and drawing samples. In this case, GPRegression defines the above GP regression model with a Gaussian likelihood. It knows that the log-likelihood after marginalizing \(F\) is closed-form and exploits this property when computing the log-pdf.
The model is defined with the input variable X of shape (m.N, 1), where the value of m.N is discovered when data are given during inference. A positive noise variance variable m.noise_var is defined with an initial value of 0.01. For the GP, we define an RBF kernel with input dimensionality one and initial variance and lengthscale of one. We define the variable m.Y following the GP regression distribution with the specified kernel, input variable and noise variance.
[5]:
from mxfusion import Model, Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions.gp.kernels import RBF
from mxfusion.modules.gp_modules import GPRegression
m = Model()
m.N = Variable()
m.X = Variable(shape=(m.N, 1))
m.noise_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=0.01)
m.kernel = RBF(input_dim=1, variance=1, lengthscale=1)
m.Y = GPRegression.define_variable(X=m.X, kernel=m.kernel, noise_var=m.noise_var, shape=(m.N, 1))
In the above model, we have not defined any prior distributions over the hyper-parameters. To use the model for regression, we typically do a maximum likelihood estimate of all the hyper-parameters conditioned on the input and output variables. In MXFusion, this is done by first creating an inference algorithm, which is MAP in this case, by specifying the observed variables. Then, we create an inference body for gradient-based inference methods, which is called GradBasedInference.
The inference is triggered by calling the run method, in which all the observed data are given as keyword arguments and any necessary configuration parameters are specified.
[6]:
import mxnet as mx
from mxfusion.inference import GradBasedInference, MAP
infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.X, m.Y]))
infr.run(X=mx.nd.array(X, dtype='float64'), Y=mx.nd.array(Y, dtype='float64'),
max_iter=100, learning_rate=0.05, verbose=True)
Iteration 10 loss: -13.09287954321266
Iteration 20 loss: -15.971970034359586
Iteration 30 loss: -16.725359053995163
Iteration 40 loss: -16.835084442759314
Iteration 50 loss: -16.850332113428053
Iteration 60 loss: -16.893812683762203
Iteration 70 loss: -16.900137667771077
Iteration 80 loss: -16.901158761459012
Iteration 90 loss: -16.903085976668137
Iteration 100 loss: -16.903135093930537
All the inference outcomes are stored in the params attribute of the inference body. The inferred value of a parameter can be accessed by passing the reference of the queried parameter to the params attribute. For example, to get the value of m.noise_var, we can call infr.params[m.noise_var]. The estimated parameters from the above experiment are as follows:
[7]:
print('The estimated variance of the RBF kernel is %f.' % infr.params[m.kernel.variance].asscalar())
print('The estimated length scale of the RBF kernel is %f.' % infr.params[m.kernel.lengthscale].asscalar())
print('The estimated variance of the Gaussian likelihood is %f.' % infr.params[m.noise_var].asscalar())
The estimated variance of the RBF kernel is 0.616992.
The estimated length scale of the RBF kernel is 1.649073.
The estimated variance of the Gaussian likelihood is 0.002251.
We can compare the estimated values with the same model implemented in GPy. The estimated values from GPy are very close to the ones from MXFusion.
[8]:
import GPy
m_gpy = GPy.models.GPRegression(X, Y, kernel=GPy.kern.RBF(1))
m_gpy.optimize()
print(m_gpy)
Name : GP regression
Objective : -16.903456670910902
Number of Parameters : 3
Number of Optimization Parameters : 3
Updates : True
Parameters:
GP_regression. | value | constraints | priors
rbf.variance | 0.6148038604494702 | +ve |
rbf.lengthscale | 1.6500299722611123 | +ve |
Gaussian_noise.variance | 0.002270049772204339 | +ve |
Prediction¶
The above section shows how to estimate the model hyper-parameters of a GP regression model. This is often referred to as training. After training, we are often interested in using the inferred model to predict on unseen inputs. The GP module offers two types of predictions: predicting the mean and variance of the output variable, or drawing samples from the predictive posterior distribution.
Mean and variance of the posterior distribution¶
To estimate the mean and variance of the predictive posterior distribution, we use the inference algorithm ModulePredictionAlgorithm, which takes the model, the observed variables and the target variables of prediction as input arguments. We use TransferInference as the inference body, which allows us to take the inference outcome from the previous inference. This is done by passing the inference parameters infr.params into the infr_params argument.
[9]:
from mxfusion.inference import TransferInference, ModulePredictionAlgorithm
infr_pred = TransferInference(ModulePredictionAlgorithm(model=m, observed=[m.X], target_variables=[m.Y]),
infr_params=infr.params)
To visualize the fitted model, we make predictions on 100 points evenly spanned from -5 to 5. We estimate the mean and variance of the noise-free output \(F\).
[10]:
xt = np.linspace(-5,5,100)[:, None]
res = infr_pred.run(X=mx.nd.array(xt, dtype='float64'))[0]
f_mean, f_var = res[0].asnumpy()[0], res[1].asnumpy()[0]
The resulting figure is shown as follows:
[11]:
plot(xt, f_mean[:,0], 'b-', label='mean')
plot(xt, f_mean[:,0]-2*np.sqrt(f_var), 'b--', label='2 x std')
plot(xt, f_mean[:,0]+2*np.sqrt(f_var), 'b--')
plot(X, Y, 'rx', label='data points')
ylabel('F')
xlabel('X')
_=legend()

Posterior samples of Gaussian process¶
Apart from getting the mean and variance at every location, we may need to draw samples from the posterior GP. As the output variables at different locations are correlated with each other, each sample gives us some idea of a potential function from the posterior GP distribution.
To draw samples from the posterior distribution, we need to change the prediction inference algorithm attached to the GP module. The default prediction algorithm estimates the mean and variance of the output variable as shown above. We can attach another inference algorithm as the prediction algorithm. In the following code, we attach the GPRegressionSamplingPrediction algorithm as the prediction algorithm. The targets and conditionals arguments specify the target variables and the conditional variables of the algorithm. After specifying a name in the alg_name argument, such as gp_predict, we can access this inference algorithm with the specified name like gp.gp_predict. In the following code, we set the diagonal_variance attribute to False in order to draw samples from the full covariance matrix. To avoid numerical issues, we set a small jitter to help the matrix inversion. Then, we create the inference body in the same way as in the above example.
[12]:
from mxfusion.inference import TransferInference, ModulePredictionAlgorithm
from mxfusion.modules.gp_modules.gp_regression import GPRegressionSamplingPrediction
gp = m.Y.factor
gp.attach_prediction_algorithms(targets=gp.output_names, conditionals=gp.input_names,
algorithm=GPRegressionSamplingPrediction(
gp._module_graph, gp._extra_graphs[0], [gp._module_graph.X]),
alg_name='gp_predict')
gp.gp_predict.diagonal_variance = False
gp.gp_predict.jitter = 1e-8
infr_pred = TransferInference(ModulePredictionAlgorithm(model=m, observed=[m.X], target_variables=[m.Y], num_samples=5),
infr_params=infr.params)
We draw five samples on the 100 evenly spanned input locations.
[13]:
xt = np.linspace(-5,5,100)[:, None]
y_samples = infr_pred.run(X=mx.nd.array(xt, dtype='float64'))[0].asnumpy()
We visualize the individual samples each with a different color.
[14]:
for i in range(y_samples.shape[0]):
    plot(xt, y_samples[i,:,0])

Gaussian process with a mean function¶
In the previous example, we created a GP regression model without a mean function (the mean of the GP is zero). It is very easy to extend a GP model with a mean function. First, we create a mean function in MXNet (a neural network). For simplicity, we create a 1D linear function as the mean function.
[21]:
mean_func = mx.gluon.nn.Dense(1, in_units=1, flatten=False)
mean_func.initialize(mx.init.Xavier(magnitude=3))
We create the GP regression model in a similar way as above. The differences are: (1) we create a wrapper of the mean function, m.mean_func, in the model definition; (2) we evaluate the mean function with the input of our GP model, which gives the mean of the GP; (3) we pass the resulting mean into the mean argument of the GP module.
[23]:
from mxfusion.components.functions import MXFusionGluonFunction

m = Model()
m.N = Variable()
m.X = Variable(shape=(m.N, 1))
m.mean_func = MXFusionGluonFunction(mean_func, num_outputs=1, broadcastable=True)
m.mean = m.mean_func(m.X)
m.noise_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=0.01)
m.kernel = RBF(input_dim=1, variance=1, lengthscale=1)
m.Y = GPRegression.define_variable(X=m.X, kernel=m.kernel, noise_var=m.noise_var, mean=m.mean, shape=(m.N, 1))
[24]:
import mxnet as mx
from mxfusion.inference import GradBasedInference, MAP
infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.X, m.Y]))
infr.run(X=mx.nd.array(X, dtype='float64'), Y=mx.nd.array(Y, dtype='float64'),
max_iter=100, learning_rate=0.05, verbose=True)
Iteration 10 loss: -6.288699675985622
Iteration 20 loss: -13.938366520031717
Iteration 30 loss: -16.238146742572965
Iteration 40 loss: -16.214515784955303
Iteration 50 loss: -16.302410205174386
Iteration 60 loss: -16.423765889507315
Iteration 70 loss: -16.512277794947106
Iteration 80 loss: -16.5757306621185
Iteration 90 loss: -16.6410597628529
Iteration 100 loss: -16.702913078848557
[25]:
from mxfusion.inference import TransferInference, ModulePredictionAlgorithm
infr_pred = TransferInference(ModulePredictionAlgorithm(model=m, observed=[m.X], target_variables=[m.Y]),
infr_params=infr.params)
[26]:
xt = np.linspace(-5,5,100)[:, None]
res = infr_pred.run(X=mx.nd.array(xt, dtype='float64'))[0]
f_mean, f_var = res[0].asnumpy()[0], res[1].asnumpy()[0]
[27]:
plot(xt, f_mean[:,0], 'b-', label='mean')
plot(xt, f_mean[:,0]-2*np.sqrt(f_var), 'b--', label='2 x std')
plot(xt, f_mean[:,0]+2*np.sqrt(f_var), 'b--')
plot(X, Y, 'rx', label='data points')
ylabel('F')
xlabel('X')
_=legend()

The effect of the mean function is not noticeable, because there is no linear trend in our data. We can print the values of the estimated parameters of the linear mean function.
[36]:
print("The weight is %f and the bias is %f." %(infr.params[m.mean_func.parameters['dense1_weight']].asnumpy(),
infr.params[m.mean_func.parameters['dense1_bias']].asscalar()))
The weight is 0.021969 and the bias is 0.079038.
Variational sparse Gaussian process regression¶
In MXFusion, we also have a variational sparse GP implemented as a module. A sparse GP model can be created in a similar way to the plain GP model.
[39]:
from mxfusion import Model, Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions.gp.kernels import RBF
from mxfusion.modules.gp_modules import SparseGPRegression
m = Model()
m.N = Variable()
m.X = Variable(shape=(m.N, 1))
m.noise_var = Variable(shape=(1,), transformation=PositiveTransformation(), initial_value=0.01)
m.kernel = RBF(input_dim=1, variance=1, lengthscale=1)
m.Y = SparseGPRegression.define_variable(X=m.X, kernel=m.kernel, noise_var=m.noise_var, shape=(m.N, 1), num_inducing=50)
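The hyper-parameters of the sparse GP module can then be estimated in the same way as for the plain GP model. Below is a minimal sketch reusing the toy data and the MAP/GradBasedInference pattern from the earlier sections (assuming the sparse GP module supports the same MAP-style training):

import mxnet as mx
from mxfusion.inference import GradBasedInference, MAP

infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.X, m.Y]))
infr.run(X=mx.nd.array(X, dtype='float64'), Y=mx.nd.array(Y, dtype='float64'),
         max_iter=100, learning_rate=0.05, verbose=True)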
Developer Tutorials¶
Writing a new Distribution¶
To write and use a new Distribution class in MXFusion, fill out the Distribution interface and either the Univariate or Multivariate interface, depending on the type of distribution you are creating.
There are 4 primary methods to fill out for a Distribution in MXFusion:

* __init__ - The constructor for the Distribution. It takes in any parameters the distribution needs. It also defines names for the input variable[s] that the distribution takes and the output variable[s] it produces.
* log_pdf - Returns the logarithm of the probability density function for the distribution. This is called at Inference time as necessary to perform the Inference algorithm.
* draw_samples - Returns samples drawn from the distribution. This is called at Inference time as necessary to perform the Inference algorithm.
* define_variable - Used to generate random variables drawn from the Distribution during model definition.

log_pdf and draw_samples are implemented using MXNet functions to compute on the input variables, which at Inference time are MXNet arrays or MXNet symbolic variables.
This notebook will take the Normal distribution as a reference.
File Structure¶
Code for distributions lives in the mxfusion/components/distributions directory.
If you’re implementing the FancyNew distribution then you should create a file called mxfusion/components/distributions/fancy_new.py for the class to live in.
Interface Implementation¶
Since this example is for a univariate Normal distribution, our class extends the UnivariateDistribution class.
The Normal distribution’s constructor takes in objects for its mean and variance, specifications for data type and context, and a random number generator if not the default.
In addition, a distribution can take in additional parameters used for computation that aren’t inputs. We refer to these additional parameters as the Distribution’s attributes. The difference between an input and an attribute is primarily that inputs are dynamic at inference time, while attributes are static throughout a given inference run.
In this case, minibatch_ratio is a static attribute, as it doesn’t change for a given minibatch size during inference.
The mean and variance can be either Variables or MXNet arrays if they are constants.
As mentioned above, you define names for the input and output variable[s] of the distribution here. These names are used when printing and generally inspecting the model, so give meaningful names. We prefer names like mean and variance to ones like location and scale or Greek letters like mew and sigma.
[ ]:
class Normal(UnivariateDistribution):
    """
    The one-dimensional normal distribution. The normal distribution can be defined over a scalar random variable or an array of random variables. In case of an array of random variables, the mean and variance are broadcasted to the shape of the output random variable (array).

    :param mean: Mean of the normal distribution.
    :type mean: Variable
    :param variance: Variance of the normal distribution.
    :type variance: Variable
    :param rand_gen: the random generator (default: MXNetRandomGenerator)
    :type rand_gen: RandomGenerator
    :param dtype: the data type for float point numbers
    :type dtype: numpy.float32 or numpy.float64
    :param ctx: the mxnet context (default: None/current context)
    :type ctx: None or mxnet.cpu or mxnet.gpu
    """
    def __init__(self, mean, variance, rand_gen=None, minibatch_ratio=1.,
                 dtype=None, ctx=None):
        self.minibatch_ratio = minibatch_ratio
        if not isinstance(mean, Variable):
            mean = Variable(value=mean)
        if not isinstance(variance, Variable):
            variance = Variable(value=variance)

        inputs = [('mean', mean), ('variance', variance)]
        input_names = ['mean', 'variance']
        output_names = ['random_variable']
        super(Normal, self).__init__(inputs=inputs, outputs=None,
                                     input_names=input_names,
                                     output_names=output_names,
                                     rand_gen=rand_gen, dtype=dtype, ctx=ctx)
If your distribution’s __init__ function only takes in parameters that get passed onto its super constructor, you don’t need to implement replicate_self. If it does take additional parameters (as the Normal distribution does for minibatch_ratio), those parameters need to be copied over to the replicant Distribution before returning, as below.
[ ]:
def replicate_self(self, attribute_map=None):
    """
    Replicates this Factor, using new inputs, outputs, and a new uuid.
    Used during model replication to functionally replicate a factor into a new graph.

    :param inputs: new input variables of the factor
    :type inputs: a dict of {'name' : Variable} or None
    :param outputs: new output variables of the factor.
    :type outputs: a dict of {'name' : Variable} or None
    """
    replicant = super(Normal, self).replicate_self(attribute_map)
    replicant.minibatch_ratio = self.minibatch_ratio
    return replicant
log_pdf and draw_samples are relatively straightforward to implement. These are the meaningful parts of the Distribution that you’re implementing, putting the math into code using MXNet operators for the compute.
If it’s a distribution that isn’t well documented on Wikipedia, please add a link to a paper or other resource that explains what it’s doing and why.
[ ]:
def log_pdf(self, mean, variance, random_variable, F=None):
    """
    Computes the logarithm of the probability density function (PDF) of the normal distribution.

    :param mean: the mean of the normal distribution
    :type mean: MXNet NDArray or MXNet Symbol
    :param variance: the variance of the normal distributions
    :type variance: MXNet NDArray or MXNet Symbol
    :param random_variable: the random variable of the normal distribution
    :type random_variable: MXNet NDArray or MXNet Symbol
    :param F: the MXNet computation mode (mxnet.symbol or mxnet.ndarray)
    :returns: log pdf of the distribution
    :rtypes: MXNet NDArray or MXNet Symbol
    """
    F = get_default_MXNet_mode() if F is None else F
    logvar = np.log(2 * np.pi) / -2 + F.log(variance) / -2
    logL = F.broadcast_add(logvar, F.broadcast_div(F.square(
        F.broadcast_minus(random_variable, mean)), -2 * variance))
    return logL
[ ]:
def draw_samples(self, mean, variance, rv_shape, num_samples=1, F=None):
    """
    Draw samples from the normal distribution.

    :param mean: the mean of the normal distribution
    :type mean: MXNet NDArray or MXNet Symbol
    :param variance: the variance of the normal distributions
    :type variance: MXNet NDArray or MXNet Symbol
    :param rv_shape: the shape of each sample
    :type rv_shape: tuple
    :param num_samples: the number of drawn samples (default: one)
    :int num_samples: int
    :param F: the MXNet computation mode (mxnet.symbol or mxnet.ndarray)
    :returns: a set of samples of the normal distribution
    :rtypes: MXNet NDArray or MXNet Symbol
    """
    F = get_default_MXNet_mode() if F is None else F
    out_shape = (num_samples,) + rv_shape
    return F.broadcast_add(F.broadcast_mul(self._rand_gen.sample_normal(
        shape=out_shape, dtype=self.dtype, ctx=self.ctx),
        F.sqrt(variance)), mean)
define_variable is just a helper function for end users. All it does is take in parameters for the distribution, create a distribution based on those parameters, and return the output variables of that distribution.
[ ]:
@staticmethod
def define_variable(mean=0., variance=1., shape=None, rand_gen=None,
                    minibatch_ratio=1., dtype=None, ctx=None):
    """
    Creates and returns a random variable drawn from a normal distribution.

    :param mean: Mean of the distribution.
    :param variance: Variance of the distribution.
    :param shape: the shape of the random variable(s)
    :type shape: tuple or [tuple]
    :param rand_gen: the random generator (default: MXNetRandomGenerator)
    :type rand_gen: RandomGenerator
    :param dtype: the data type for float point numbers
    :type dtype: numpy.float32 or numpy.float64
    :param ctx: the mxnet context (default: None/current context)
    :type ctx: None or mxnet.cpu or mxnet.gpu
    :returns: the random variables drawn from the normal distribution.
    :rtypes: Variable
    """
    normal = Normal(mean=mean, variance=variance, rand_gen=rand_gen,
                    dtype=dtype, ctx=ctx)
    normal._generate_outputs(shape=shape)
    return normal.random_variable
Using your new distribution¶
At this point, you should be ready to start testing your new Distribution’s functionality by importing it like any other MXFusion component.
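For example, the reference Normal distribution used throughout this notebook is imported and used in a model as below; a newly written distribution follows the same pattern once it is exported from its module (a hedged sketch with placeholder values):

import mxnet as mx
from mxfusion import Model, Variable
from mxfusion.components.distributions import Normal  # replace with your new distribution

m = Model()
m.mu = Variable()
m.y = Normal.define_variable(mean=m.mu, variance=mx.nd.array([1.]), shape=(100,))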
Testing¶
Before submitting your new code as a pull request, please write unit tests that verify it works as expected. This should include numerical checks against edge cases. See the existing test cases for the Normal or Categorical distributions for example tests.
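As a hypothetical example of such a numerical check (it verifies the log-pdf formula used above against scipy.stats, assuming SciPy is available in the test environment; the existing MXFusion tests exercise the distribution through the library itself):

import numpy as np
from scipy.stats import norm

def test_normal_log_pdf_formula():
    # The same formula as in log_pdf above, written in NumPy for comparison.
    mean, variance = 1.5, 4.0
    rv = np.random.randn(10) * 2.0 + 1.5
    logvar = np.log(2 * np.pi) / -2 + np.log(variance) / -2
    log_pdf = logvar + np.square(rv - mean) / (-2 * variance)
    assert np.allclose(log_pdf, norm.logpdf(rv, loc=mean, scale=np.sqrt(variance)))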
Design Overview¶
Topical Guides¶
Working in MXFusion breaks up into two primary phases. Model definition involves defining the variables, distributions, and functions that make up your model. Inference then takes in real values and learns parameters for your model or gives predictions over the data.
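In code, the two phases look like the following condensed sketch of the Getting Started example earlier in this documentation (the data here are synthetic placeholders):

import mxnet as mx
import numpy as np
from mxfusion import Model, Variable
from mxfusion.components.variables import PositiveTransformation
from mxfusion.components.distributions import Normal
from mxfusion.inference import GradBasedInference, MAP

# Phase 1: model definition
m = Model()
m.mu = Variable()
m.s = Variable(transformation=PositiveTransformation())
m.Y = Normal.define_variable(mean=m.mu, variance=m.s, shape=(100,))

# Phase 2: inference with observed data
data = np.random.randn(100) * 2 + 3
infr = GradBasedInference(inference_algorithm=MAP(model=m, observed=[m.Y]))
infr.run(Y=mx.nd.array(data), max_iter=200)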