Gaussian Process Regression (GPR) is a powerful Bayesian machine learning technique that provides not just predictions but also uncertainty estimates. In this post, we’ll explore how GPR works, from the foundational concepts to practical applications, with a focus on the multivariate case.
1. The Essence of Gaussian Processes
At its core, a Gaussian Process (GP) defines a distribution over functions. Rather than parameterizing the function directly (as in linear regression or neural networks), we specify how function values correlate with each other. This correlation is defined by a kernel function.
The key insight: any finite collection of function values from a GP follows a multivariate Gaussian distribution.
Let's consider a simple regression problem. We want to predict $y = f(x)$ at a new input $x^*$ given observed data pairs $(X, y)$, where $X = \{1, 3, 5\}$ and $y = \{\sin(1), \sin(3), \sin(5)\}$.
2. The Mathematical Framework
2.1 The Kernel Function
The kernel (or covariance function) is the heart of GP regression. A common choice is the Radial Basis Function (RBF):
$$k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) \tag{1}$$
Where:
- $\sigma_f^2$ controls the output variance
- $\ell$ is the length scale parameter, controlling how “smooth” the function is
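To make equation (1) concrete, here is a minimal NumPy sketch of the RBF kernel (the function name and default hyperparameter values are illustrative choices, not part of any library API):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    """Equation (1): k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 * l^2))."""
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    sq_dist = (x1[:, None] - x2[None, :]) ** 2           # pairwise squared distances
    return sigma_f**2 * np.exp(-sq_dist / (2 * length_scale**2))

# Nearby inputs are strongly correlated; distant inputs are nearly independent.
print(rbf_kernel([0.0], [0.1]))   # close to sigma_f^2 = 1
print(rbf_kernel([0.0], [5.0]))   # close to 0
```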
2.2 From Kernel to Covariance Matrix
Using this kernel, we can compute the covariance between any two function values. For our observed data points $X = \{1, 3, 5\}$, we get a 3×3 covariance matrix $K(X, X)$.
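With $\sigma_f^2 = 1$ and $\ell = 1$ (arbitrary illustrative values), the sketch below builds that 3×3 matrix and, echoing the key insight from Section 1, treats $(f(1), f(3), f(5))$ as a single draw from a 3-dimensional Gaussian prior:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    # Equation (1), repeated so this snippet runs on its own
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

X_train = np.array([1.0, 3.0, 5.0])
K = rbf_kernel(X_train, X_train)                 # the 3x3 covariance matrix K(X, X)
print(np.round(K, 3))

# Under a zero-mean GP prior, (f(1), f(3), f(5)) ~ N(0, K); drawing a few
# joint samples makes the "any finite collection is Gaussian" insight tangible.
prior_samples = np.random.multivariate_normal(np.zeros(3), K, size=5)
print(prior_samples.shape)                       # (5, 3)
```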
2.3 The Joint Distribution
The foundation of GP regression is the joint Gaussian distribution between observed values and the prediction point:
\[
\begin{bmatrix}
y \\
f(x^\ast)
\end{bmatrix}
\sim \mathcal{N} \left(
\begin{bmatrix}
m(X) \\
m(x^\ast)
\end{bmatrix},
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & k(X, x^\ast) \\
k(X, x^\ast)^T & k(x^\ast, x^\ast)
\end{bmatrix}
\right)
\tag{2}
\]
Where:
- $m(X)$ and $m(x^*)$ are the prior mean functions (often set to zero)
- $\sigma_n^2$ represents observation noise
- $k(X, x^*)$ is the vector of covariances between the test point and all training points
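Continuing the toy example, here is a sketch of how the blocks of equation (2) fit together for a single hypothetical test input $x^* = 2$ (the noise level $\sigma_n = 0.1$ and the unit kernel hyperparameters are likewise assumed values):

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

X_train = np.array([1.0, 3.0, 5.0])
x_star = np.array([2.0])                          # hypothetical test input
sigma_n = 0.1                                     # assumed observation-noise std

K_xx = rbf_kernel(X_train, X_train) + sigma_n**2 * np.eye(3)   # K(X, X) + sigma_n^2 I
k_xs = rbf_kernel(X_train, x_star)                             # k(X, x*), shape (3, 1)
k_ss = rbf_kernel(x_star, x_star)                              # k(x*, x*), shape (1, 1)

# The 4x4 joint covariance of [y, f(x*)] from equation (2), with zero prior mean
joint_cov = np.block([[K_xx,   k_xs],
                      [k_xs.T, k_ss]])
joint_mean = np.zeros(4)
print(np.round(joint_cov, 3))
```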
2.4 Making Predictions: The Posterior Distribution
From the joint distribution, we can derive the posterior distribution conditioning on our observations. This gives us both a predictive mean and variance:
Posterior Mean:
\[
\mu(x^\ast) = m(x^\ast) + k(X, x^\ast)^T [K(X,X) + \sigma_n^2 I]^{-1} (y - m(X)) \tag{3}
\]
Posterior Variance:
\[
\sigma^2(x^\ast) = k(x^\ast, x^\ast) - k(X, x^\ast)^T [K(X,X) + \sigma_n^2 I]^{-1} k(X, x^\ast) \tag{4}
\]
These equations (3) and (4) are the cornerstone of GP prediction, giving us not just the expected value but also uncertainty bounds.
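Equations (3) and (4) translate almost line for line into NumPy. The sketch below applies them to the toy data from Section 1, with a zero prior mean, assumed hyperparameters ($\sigma_f = 1$, $\ell = 1$, $\sigma_n = 0.1$), and hypothetical test inputs:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

def gp_posterior(x_star, X, y, sigma_n, sigma_f=1.0, length_scale=1.0):
    """Posterior mean (equation 3) and variance (equation 4) with zero prior mean."""
    K = rbf_kernel(X, X, sigma_f, length_scale) + sigma_n**2 * np.eye(len(X))
    k_s = rbf_kernel(X, x_star, sigma_f, length_scale)            # k(X, x*)
    k_ss = rbf_kernel(x_star, x_star, sigma_f, length_scale)      # k(x*, x*)
    mean = k_s.T @ np.linalg.solve(K, y)                          # equation (3)
    var = np.diag(k_ss - k_s.T @ np.linalg.solve(K, k_s))         # equation (4)
    return mean, var

X_train = np.array([1.0, 3.0, 5.0])
y_train = np.sin(X_train)
x_star = np.array([2.0, 4.0])                                     # hypothetical test inputs
mu, var = gp_posterior(x_star, X_train, y_train, sigma_n=0.1)
print(np.round(mu, 3), np.round(np.sqrt(var), 3))                 # predictive mean and std
```

In practice the inverse in equations (3) and (4) is handled once via a Cholesky factorization rather than with repeated calls to np.linalg.solve, but the structure of the computation is the same.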
3. Visualization: Making Mathematics Tangible
[Interactive Gaussian Process Regression visualization: an embedded demo that plots the fitted GP and reports its log marginal likelihood and mean squared error as you adjust the hyperparameters.]
In this visualization, you can observe:
- How the length scale parameter ($\ell$ in equation 1) affects the smoothness of the function
- How the signal variance ($\sigma_f^2$ in equation 1) affects the overall amplitude of variations
- How the noise variance ($\sigma_n^2$ in equation 2) affects the width of the confidence band
You can see equation (3) and equation (4) in action - the blue line is the posterior mean (equation 3), while the shaded region represents the uncertainty based on the posterior variance (equation 4).
4. The Multivariate Gaussian Distribution
To fully understand GPR, we need to grasp the multivariate Gaussian distribution. The density function for an n-dimensional Gaussian is:
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right] \tag{5}$$
Where:
- $\mu$ is the mean vector
- $\Sigma$ is the covariance matrix
- $|\Sigma|$ is the determinant of $\Sigma$
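Equation (5) is straightforward to check numerically. Here is a minimal sketch (with arbitrary example numbers) comparing a direct implementation against scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Equation (5): density of an n-dimensional Gaussian."""
    n = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

# Arbitrary 2-dimensional example
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
x = np.array([0.5, 0.5])

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```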
The key properties that make multivariate Gaussians so useful for GPs are:
- Marginalization: If you integrate out some variables, what remains is still Gaussian
- Conditioning: If you condition on some variables, the result is still Gaussian
The second property is precisely what we leverage in equations (3) and (4) to derive our predictions.
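For reference, the textbook conditioning identity behind this step is: if

\[
\begin{bmatrix} a \\ b \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix},
\begin{bmatrix} A & C \\ C^T & B \end{bmatrix}
\right),
\quad \text{then} \quad
a \mid b \sim \mathcal{N}\left( \mu_a + C B^{-1} (b - \mu_b),\; A - C B^{-1} C^T \right).
\]

Identifying $a = f(x^\ast)$, $b = y$, $A = k(x^\ast, x^\ast)$, $B = K(X,X) + \sigma_n^2 I$, and $C = k(X, x^\ast)^T$ in equation (2) recovers equations (3) and (4) directly.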
5. Working Example: Implementation in Python
Let's look at how to implement GPR in Python using scikit-learn:
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import matplotlib.pyplot as plt
# Generate synthetic data
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
# Define kernel
kernel = C(1.0) * RBF(1.0)
# Create GP regressor
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.1**2, n_restarts_optimizer=10)
# Fit model
gp.fit(X, y)
# Make predictions
X_pred = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred, sigma = gp.predict(X_pred, return_std=True)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(X_pred, y_pred, 'b-', label='GP Mean')
plt.fill_between(X_pred.ravel(), y_pred - 2 * sigma, y_pred + 2 * sigma, alpha=0.2, color='blue', label='95% Confidence')
plt.plot(X_pred, np.sin(X_pred), 'r--', label='True Function')
plt.scatter(X, y, c='k', marker='x', label='Observations')
plt.title('Gaussian Process Regression with 100 Observations')
plt.legend()
plt.show()
This code implements exactly what we've been discussing:
- It uses the RBF kernel from equation (1)
- It forms the joint distribution outlined in equation (2)
- It computes the posterior mean and variance as in equations (3) and (4)
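As a sanity check, you can re-derive scikit-learn's predictions directly from equations (3) and (4) using the fitted kernel. The sketch below repeats the setup from the script above so it runs on its own (the data are random, so exact numbers vary between runs):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# Same toy setup as the script above
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
gp = GaussianProcessRegressor(kernel=C(1.0) * RBF(1.0), alpha=0.1**2,
                              n_restarts_optimizer=10).fit(X, y)

X_pred = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred, sigma = gp.predict(X_pred, return_std=True)

# Equations (3) and (4) with the fitted kernel gp.kernel_ and the noise term gp.alpha
K = gp.kernel_(X) + gp.alpha * np.eye(len(X))                     # K(X, X) + sigma_n^2 I
k_star = gp.kernel_(X, X_pred)                                    # k(X, x*)
mu_manual = k_star.T @ np.linalg.solve(K, y)                      # equation (3)
var_manual = np.diag(gp.kernel_(X_pred) - k_star.T @ np.linalg.solve(K, k_star))  # equation (4)

print(np.max(np.abs(mu_manual - y_pred)))                              # should be ~0
print(np.max(np.abs(np.sqrt(np.clip(var_manual, 0, None)) - sigma)))   # should be ~0
```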
6. Connecting All the Pieces
The beauty of Gaussian Processes lies in how elegantly the mathematical theory connects to practical implementation:
- From kernel to covariance: The kernel function encodes our assumptions about function smoothness and scale, populating the covariance matrix.
- From joint to conditional distribution: Because multivariate Gaussians are closed under conditioning, we get analytic expressions for the posterior.
- From math to code: The implementation is remarkably concise because the heavy lifting is done by the matrix algebra in equations (3) and (4).
7. Beyond Basics: Advanced Topics
While we've focused on the fundamentals, there's much more to explore:
- Kernel selection: Different kernels encode different assumptions about the function (periodicity, discontinuities, etc.)
- Hyperparameter optimization: Learning optimal values for $\ell$, $\sigma_f^2$, and $\sigma_n^2$ by maximizing the log marginal likelihood (see the sketch after this list)
- Sparse GPs: Approximation methods for scaling to large datasets
- Multi-output GPs: Handling vector-valued outputs with correlated components
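To make the hyperparameter-optimization point concrete, here is a small sketch that evaluates the zero-mean log marginal likelihood $\log p(y \mid X) = -\tfrac{1}{2} y^T K_y^{-1} y - \tfrac{1}{2}\log|K_y| - \tfrac{n}{2}\log 2\pi$ (with $K_y = K + \sigma_n^2 I$) on a grid of length scales for the toy data; the grid search stands in for the gradient-based optimizers real libraries use:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma_f=1.0, length_scale=1.0):
    return sigma_f**2 * np.exp(-(x1[:, None] - x2[None, :])**2 / (2 * length_scale**2))

def log_marginal_likelihood(X, y, sigma_f, length_scale, sigma_n):
    """log p(y | X) for a zero-mean GP: -1/2 y^T K_y^-1 y - 1/2 log|K_y| - n/2 log(2 pi)."""
    K_y = rbf_kernel(X, X, sigma_f, length_scale) + sigma_n**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K_y)
    return -0.5 * y @ np.linalg.solve(K_y, y) - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)

# Toy data from Section 1; sigma_f and sigma_n are held fixed for simplicity
X_train = np.array([1.0, 3.0, 5.0])
y_train = np.sin(X_train)
length_scales = np.linspace(0.1, 3.0, 30)
lml = [log_marginal_likelihood(X_train, y_train, 1.0, l, 0.1) for l in length_scales]
print(length_scales[int(np.argmax(lml))])   # length scale with the highest log marginal likelihood
```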
Conclusion
Gaussian Process Regression offers a principled approach to uncertainty quantification in machine learning. By leveraging the elegant properties of multivariate Gaussians, we obtain not just predictions but complete predictive distributions.
The interactive visualization above (between equations (4) and (5)) demonstrates how kernel parameters affect our predictions - play with it to build your intuition!
Next time you face a regression problem where uncertainty matters, consider reaching for this powerful Bayesian technique.