Gaussian Process Regression (GPR) is a powerful Bayesian machine learning technique that provides not just predictions but also uncertainty estimates. In this post, we’ll explore how GPR works, from the foundational concepts to practical applications, with a focus on the multivariate case.
1. The Essence of Gaussian Processes
At its core, a Gaussian Process (GP) defines a distribution over functions. Rather than parameterizing the function directly (as in linear regression or neural networks), we specify how function values correlate with each other. This correlation is defined by a kernel function.
The key insight: any finite collection of function values from a GP follows a multivariate Gaussian distribution.
Let's consider a simple regression problem. We want to predict $y = f(x)$ at a new input $x^*$ given observed data pairs $(X, y)$, where $X = \{1, 3, 5\}$ and $y = \{\sin(1), \sin(3), \sin(5)\}$.
2. The Mathematical Framework
2.1 The Kernel Function
The kernel (or covariance function) is the heart of GP regression. A common choice is the Radial Basis Function (RBF):
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \tag{1}$$
Where:
- $\sigma_f^2$ controls the output variance
- $\ell$ is the length scale parameter, controlling how “smooth” the function is
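To make equation (1) concrete, here is a minimal NumPy sketch of the RBF kernel (the function name and default hyperparameter values are illustrative choices, not part of any library API):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    """Equation (1): k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 * l^2))."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    sq_dist = (x1[:, None] - x2[None, :]) ** 2           # pairwise squared distances
    return sigma_f**2 * np.exp(-sq_dist / (2 * length_scale**2))

# Nearby inputs are strongly correlated; distant inputs are nearly independent.
print(rbf_kernel([0.0], [0.1]))   # close to sigma_f^2 = 1
print(rbf_kernel([0.0], [5.0]))   # close to 0
```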
2.2 From Kernel to Covariance Matrix
Using this kernel, we can compute the covariance between any two function values. For our observed data points $X = \{1, 3, 5\}$, we get a 3×3 covariance matrix $K(X, X)$.
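With $\sigma_f^2 = 1$ and $\ell = 1$ (arbitrary illustrative values), the sketch below builds that 3×3 matrix and, echoing the key insight from Section 1, treats $(f(1), f(3), f(5))$ as a single draw from a 3-dimensional Gaussian prior:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    # Equation (1), repeated so this snippet runs on its own
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

X_train = np.array([1.0, 3.0, 5.0])
K = rbf_kernel(X_train, X_train)                 # the 3x3 covariance matrix K(X, X)
print(np.round(K, 3))

# Under a zero-mean GP prior, (f(1), f(3), f(5)) ~ N(0, K); drawing a few
# joint samples makes the "any finite collection is Gaussian" insight tangible.
prior_samples = np.random.multivariate_normal(np.zeros(3), K, size=5)
print(prior_samples.shape)                       # (5, 3)
```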
2.3 The Joint Distribution
The foundation of GP regression is the joint Gaussian distribution between observed values and the prediction point:
\[
\begin{bmatrix}
y \\
f(x^\ast)
\end{bmatrix}
\sim \mathcal{N} \left(
\begin{bmatrix}
m(X) \\
m(x^\ast)
\end{bmatrix},
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & k(X, x^\ast) \\
k(X, x^\ast)^T & k(x^\ast, x^\ast)
\end{bmatrix}
\right)
\tag{2}
\]
Where:
- $m(X)$ and $m(x^*)$ are the prior mean functions (often set to zero)
- $\sigma_n^2$ represents observation noise
- $k(X, x^*)$ is the vector of covariances between the test point and all training points
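Continuing the toy example, here is a sketch of how the blocks of equation (2) fit together for a single hypothetical test input $x^* = 2$ (the noise level $\sigma_n = 0.1$ and the unit kernel hyperparameters are likewise assumed values):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

X_train = np.array([1.0, 3.0, 5.0])
x_star = np.array([2.0])                          # hypothetical test input
sigma_n = 0.1                                     # assumed observation-noise std

K_xx = rbf_kernel(X_train, X_train) + sigma_n**2 * np.eye(3)   # K(X, X) + sigma_n^2 I
k_xs = rbf_kernel(X_train, x_star)                             # k(X, x*), shape (3, 1)
k_ss = rbf_kernel(x_star, x_star)                              # k(x*, x*), shape (1, 1)

# The 4x4 joint covariance of [y, f(x*)] from equation (2), with zero prior mean
joint_cov = np.block([[K_xx,   k_xs],
                      [k_xs.T, k_ss]])
joint_mean = np.zeros(4)
print(np.round(joint_cov, 3))
```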
2.4 Making Predictions: The Posterior Distribution
From the joint distribution, we can derive the posterior distribution conditioning on our observations. This gives us both a predictive mean and variance:
Posterior Mean:
\[
\mu(x^\ast) = m(x^\ast) + k(X, x^\ast)^T [K(X,X) + \sigma_n^2 I]^{-1} (y - m(X)) \tag{3}
\]
Posterior Variance:
\[
\sigma^2(x^\ast) = k(x^\ast, x^\ast) - k(X, x^\ast)^T [K(X,X) + \sigma_n^2 I]^{-1} k(X, x^\ast) \tag{4}
\]
These equations (3) and (4) are the cornerstone of GP prediction, giving us not just the expected value but also uncertainty bounds.
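Equations (3) and (4) translate almost line for line into NumPy. The sketch below applies them to the toy data from Section 1, with a zero prior mean, assumed hyperparameters ($\sigma_f = 1$, $\ell = 1$, $\sigma_n = 0.1$), and hypothetical test inputs:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

def gp_posterior(x_star, X, y, sigma_n, sigma_f=1.0, length_scale=1.0):
    """Posterior mean (equation 3) and variance (equation 4) with zero prior mean."""
    K = rbf_kernel(X, X, sigma_f, length_scale) + sigma_n**2 * np.eye(len(X))
    k_s = rbf_kernel(X, x_star, sigma_f, length_scale)            # k(X, x*)
    k_ss = rbf_kernel(x_star, x_star, sigma_f, length_scale)      # k(x*, x*)
    mean = k_s.T @ np.linalg.solve(K, y)                          # equation (3)
    var = np.diag(k_ss - k_s.T @ np.linalg.solve(K, k_s))         # equation (4)
    return mean, var

X_train = np.array([1.0, 3.0, 5.0])
y_train = np.sin(X_train)
x_star = np.array([2.0, 4.0])                                     # hypothetical test inputs
mu, var = gp_posterior(x_star, X_train, y_train, sigma_n=0.1)
print(np.round(mu, 3), np.round(np.sqrt(var), 3))                 # predictive mean and std
```

In practice the inverse in equations (3) and (4) is handled once via a Cholesky factorization rather than with repeated calls to np.linalg.solve, but the structure of the computation is the same.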
3. Visualization: Making Mathematics Tangible
[Interactive Gaussian Process Regression visualization: an embedded demo that plots the fitted GP and reports its log marginal likelihood and mean squared error as you adjust the hyperparameters.]
In this visualization, you can observe:
- How the length scale parameter ($\ell$ in equation 1) affects the smoothness of the function
- How the signal variance ($\sigma_f^2$ in equation 1) affects the overall amplitude of variations
- How the noise variance ($\sigma_n^2$ in equation 2) affects the width of the confidence band
You can see equation (3) and equation (4) in action - the blue line is the posterior mean (equation 3), while the shaded region represents the uncertainty based on the posterior variance (equation 4).
4. The Multivariate Gaussian Distribution
To fully understand GPR, we need to grasp the multivariate Gaussian distribution. The density function for an n-dimensional Gaussian is:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right] \tag{5}$$
Where:
- $\mu$ is the mean vector
- $\Sigma$ is the covariance matrix
- $|\Sigma|$ is the determinant of $\Sigma$
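Equation (5) is straightforward to check numerically. Here is a minimal sketch (with arbitrary example numbers) comparing a direct implementation against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Equation (5): density of an n-dimensional Gaussian."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Arbitrary 2-dimensional example
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```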
The key properties that make multivariate Gaussians so useful for GPs are:
- Marginalization: If you integrate out some variables, what remains is still Gaussian
- Conditioning: If you condition on some variables, the result is still Gaussian
The second property is precisely what we leverage in equations (3) and (4) to derive our predictions.
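For reference, the textbook conditioning identity behind this step is: if

\[
\begin{bmatrix} a \\ b \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
\begin{bmatrix} A & C \\ C^T & B \end{bmatrix}
\right),
\quad \text{then} \quad
a \mid b \sim \mathcal{N}\left( \mu_a + C B^{-1} (b - \mu_b),\; A - C B^{-1} C^T \right).
\]

Identifying $a = f(x^\ast)$, $b = y$, $A = k(x^\ast, x^\ast)$, $B = K(X,X) + \sigma_n^2 I$, and $C = k(X, x^\ast)^T$ in equation (2) recovers equations (3) and (4) directly.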
5. Working Example: Implementation in Python
Let's look at how to implement GPR in Python using scikit-learn:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import matplotlib.pyplot as plt
# Generate synthetic data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
# Define kernel
kernel = C(1.0) * RBF(1.0)
# Create GP regressor
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1**2, n_restarts_optimizer=10)
# Fit model
gp.fit(X, y)
# Make predictions
X_pred = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred, sigma = gp.predict(X_pred, return_std=True)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(X_pred, y_pred, 'b-', label='GP Mean')
plt.fill_between(X_pred.ravel(), y_pred - 2 * sigma, y_pred + 2 * sigma, alpha=0.2, color='blue', label='95% Confidence')
plt.plot(X_pred, np.sin(X_pred), 'r--', label='True Function')
plt.scatter(X, y, c='k', marker='x', label='Observations')
plt.title('Gaussian Process Regression with 100 Observations')
plt.legend()
plt.show()
This code implements exactly what we've been discussing:
- It uses the RBF kernel from equation (1)
- It forms the joint distribution outlined in equation (2)
- It computes the posterior mean and variance as in equations (3) and (4)
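As a sanity check, you can re-derive scikit-learn's predictions directly from equations (3) and (4) using the fitted kernel. The sketch below repeats the setup from the script above so it runs on its own (the data are random, so exact numbers vary between runs):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Same toy setup as the script above
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
gp = GaussianProcessRegressor(kernel=C(1.0) * RBF(1.0), alpha=0.1**2,
                              n_restarts_optimizer=10).fit(X, y)

X_pred = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred, sigma = gp.predict(X_pred, return_std=True)

# Equations (3) and (4) with the fitted kernel gp.kernel_ and the noise term gp.alpha
K = gp.kernel_(X) + gp.alpha * np.eye(len(X))                     # K(X, X) + sigma_n^2 I
k_star = gp.kernel_(X, X_pred)                                    # k(X, x*)
mu_manual = k_star.T @ np.linalg.solve(K, y)                      # equation (3)
var_manual = np.diag(gp.kernel_(X_pred) - k_star.T @ np.linalg.solve(K, k_star))  # equation (4)

print(np.max(np.abs(mu_manual - y_pred)))                              # should be ~0
print(np.max(np.abs(np.sqrt(np.clip(var_manual, 0, None)) - sigma)))   # should be ~0
```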
6. Connecting All the Pieces
The beauty of Gaussian Processes lies in how elegantly the mathematical theory connects to practical implementation:
- From kernel to covariance: The kernel function encodes our assumptions about function smoothness and scale, populating the covariance matrix.
- From joint to conditional distribution: Because multivariate Gaussians are closed under conditioning, we get analytic expressions for the posterior.
- From math to code: The implementation is remarkably concise because the heavy lifting is done by the matrix algebra in equations (3) and (4).
7. Beyond Basics: Advanced Topics
While we've focused on the fundamentals, there's much more to explore:
- Kernel selection: Different kernels encode different assumptions about the function (periodicity, discontinuities, etc.)
- Hyperparameter optimization: Learning optimal values for $\ell$, $\sigma_f^2$, and $\sigma_n^2$ by maximizing the log marginal likelihood (see the sketch after this list)
- Sparse GPs: Approximation methods for scaling to large datasets
- Multi-output GPs: Handling vector-valued outputs with correlated components
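To make the hyperparameter-optimization point concrete, here is a small sketch that evaluates the zero-mean log marginal likelihood $\log p(y \mid X) = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2}\log|K_y| - \tfrac{n}{2}\log 2\pi$ (with $K_y = K + \sigma_n^2 I$) on a grid of length scales for the toy data; the grid search stands in for the gradient-based optimizers real libraries use:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

def log_marginal_likelihood(X, y, sigma_f, length_scale, sigma_n):
    """log p(y | X) for a zero-mean GP: -1/2 y^T K_y^-1 y - 1/2 log|K_y| - n/2 log(2 pi)."""
    K_y = rbf_kernel(X, X, sigma_f, length_scale) + sigma_n**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K_y)
    return -0.5 * y @ np.linalg.solve(K_y, y) - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

# Toy data from Section 1; sigma_f and sigma_n are held fixed for simplicity
X_train = np.array([1.0, 3.0, 5.0])
y_train = np.sin(X_train)
length_scales = np.linspace(0.1, 3.0, 30)
lml = [log_marginal_likelihood(X_train, y_train, 1.0, l, 0.1) for l in length_scales]
print(length_scales[int(np.argmax(lml))])   # length scale with the highest log marginal likelihood
```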
Conclusion
Gaussian Process Regression offers a principled approach to uncertainty quantification in machine learning. By leveraging the elegant properties of multivariate Gaussians, we obtain not just predictions but complete predictive distributions.
The interactive visualization above (between equations (4) and (5)) demonstrates how kernel parameters affect our predictions - play with it to build your intuition!
Next time you face a regression problem where uncertainty matters, consider reaching for this powerful Bayesian technique.