An Introduction to Gaussian Process Models
by
Thomas Beckers, t.beckers@tum.de
Download as pdf

Within the past two decades, Gaussian process regression has been increasingly used for modeling dynamical systems due to some beneficial properties such as the bias variance trade-off and the strong connection to Bayesian mathematics. As data-driven method, a Gaussian process is a powerful tool for nonlinear function regression without the need of much prior knowledge. In contrast to most of the other techniques, Gaussian Process modeling provides not only a mean prediction but also a measure for the model fidelity. In this article, we give an introduction to Gaussian processes and its usage in regression tasks of dynamical systems.

1 Introduction
2 Gaussian Processes
  2.1 Gaussian Process Regression
  2.2 Multi-output Regression
  2.3 Kernel-based View
  2.4 Reproducing Kernel Hilbert Space
  2.5 Model Error
3 Model Selection
  3.1 Kernel Functions
  3.2 Hyperparameter Optimization
4 Gaussian Process Dynamical Models
  4.1 Gaussian Process State Space Models
  4.2 Gaussian Process Nonlinear Output Error Models
5 Summary
6 Conditional Distribution
References

1 Introduction

A Gaussian process (GP) is a stochastic process that is in general a collection of random variables indexed by time or space. Its special property is that any finite collection of these variables follows a multivariate Gaussian distribution. Thus, the GP is a distribution over infinitely many variables and, therefore, a distribution over functions with a continuous domain. Consequently, it describes a probability distribution over an infinite dimensional vector space. For engineering applications, the GP has gained increasing attention as supervised machine learning technique, where it is used as prior probability distribution over functions in Bayesian inference. The inference of continuous variables leads to Gaussian process regression (GPR) where the prior GP model is updated with training data to obtain a posterior GP distribution. Historically, GPR was used for the prediction of time series, at first presented by Wiener and Kolmogorov in the 1940’s. Afterwards, it became increasingly popular in geostatistics in the 1970’s, where GPR is known as kriging. Recently, it came back in the area of machine learning [Rad96; WR96], especially boosted by the rapidly increasing computational power.
In this article, we present background information about GPs and GPR, mainly based on [Ras06], focusing on the application in control. We start with an introduction of GPs, explain the role of the underlying kernel function and show its relation to reproducing kernel Hilbert spaces. Afterwards, the embedding in dynamical systems and the interpretation of the model uncertainty as error bounds is presented. Several examples are included for an intuitive understanding in addition to the formal notation.

2 Gaussian Processes

Let (Ω_ss, F_σ, P) be a probability space with the sample space Ω_ss, the corresponding σ-algebra F_σ and the probability measure P. The index set is given by Z ⊆ ℝ^n_z with positive integer n_z. Then, a function f_GP(z, ω_ss), which is a measurable function of ω_ss ∈ Ω_ss with index z ∈Z, is called a stochastic process. The function f_GP(z, ω_ss) is a random variable on Ω_ss if z ∈ Z is specified. It is simplified written as f_GP(z). A GP is a stochastic process which is fully described by a mean function m: Z → ℝ and covariance function k: Z ×Z → ℝ such that

with z, z^′ ∈Z. The covariance function is a measure for the correlation of two states (z, z^′) and is called kernel in combination with GPs. Even though no analytic description of the probability density function of the GP exists in general, the interesting property is that any finite collection of its random variables {f_GP(z₁), …, f_GP(z_{n_GP})} follows a n_GP-dimensional multivariate Gaussian distribution. As a GP defines a distribution over functions, each realization is also a function over the index set Z.

Example 1. A GP f_GP(t_c) ∼GP (m (tc),k(tc,t′c)) with time t_c ∈ ℝ_≥0, where

describes a time-dependent electric current signal with Gaussian white noise with a standard deviation of 0.1 A and a mean of 1 A.

2.1 Gaussian Process Regression

The GP can be utilized as prior probability distribution in Bayesian inference, which allows to perform function regression. Following the Bayesian methodology, new information is combined with existing information: using Bayes’ theorem, the prior is combined with new data to obtain a posterior distribution. The new information is expressed as training data set D = {X, Y }. It contains the input values X = [x_dat^{1}, x_dat^{2}, …, x_dat^{n_D}] ∈ Z^1×n_D and output values Y = [ỹ_dat^{1}, ỹ_dat^{2}, …, ỹ_dat^{n_D}]^⊤ ∈ ℝ^n_D, where

for all i = 1, …, n_D. The output data might be corrupted by Gaussian noise ν ∼N(0, σ_n²).

Remark 1. Note that we always use the standard notation X for the input training data and Y for the output training data throughout this report.

As any finite subset of a GP follows a multivariate Gaussian distribution, we can write the joint distribution

for any arbitrary test point z^∗ ∈Z. The function m: Z → ℝ denotes the mean function. The matrix function K : Z^1×n_D ×Z^1×n_D → ℝ^n_D×n_D is called the covariance or Gram matrix with

where each element of the matrix represents the covariance between two elements of the training data X. The expression X_:,l denotes the l-th column of X. For notational simplification, we shorten K(X, X) to K when necessary. The vector-valued kernel function k: Z×Z^1×n_D → ℝ^n_D calculates the covariance between the test input z^∗ and the input training data X, i.e.,

To obtain the posterior predictive distribution of f_GP(z^∗), we condition on the test point z^∗ and the training data set D given by

Thus, the conditional posterior Gaussian distribution is defined by the mean and the variance

A detailed derivation of the posterior mean and variance based on the joint distribution (5) can be found in Section 6. Analyzing (9) we can make the following observations:
i) The mean prediction can be written as

with α = (K + σ_n²I_{n_D})⁻¹ ( )
Y − [m (X:,1),...,m (X:,nD)]⊤ ∈ ℝ^n_D. That formulation highlights the data-driven characteristic of the GPR as the posterior mean is a sum of kernel functions and its number grows with the number n_D of training data.
ii) The variance does not depend on the observed data, but only on the inputs, which is a property of the Gaussian distribution. The variance is the difference between two terms: The first term k(z^∗, z^∗) is simply the prior covariance from which a (positive) term is subtracted, representing the information the observations contain about the function. The uncertainty of the prediction, expressed in the variance, holds only for f_GP(z^∗) and does not consider the noise in the training data. For this purpose, an additional noise term σ_n²I_{n_D} can be added to the variance in (9). Finally, (9) clearly shows the strong dependence of the posterior mean and variance on the kernel k that we will discuss in depth in Section 3.

Example 2. We assume a GP with zero mean and a kernel function given by

as prior distribution. The training data set D is assumed to be

where the output is corrupted by Gaussian noise with σ_n = 0.0498 standard deviation and the test point is assumed to be z^∗ = 5. According to (6) to (9) the Gram matrix K(X, X) is calculated as

and the kernel vector k(z^∗, X) and k(z^∗, z^∗) are obtained to be

Finally, with (9), we compute the predicted mean and variance for f_GP(z^∗)

which is equivalent to a 2σ-standard deviation of 0.0775. Figure 1 shows the prior distribution (left), the posterior distribution with two training points (black crosses) in the middle, and the posterior distribution given the full training set D (right). The solid red line is the mean function and the gray shaded area indicates the 2σ-standard deviation. Five realizations (dashed lines) visualize the character of the distribution over functions.

Figure 1: The prior distribution of a GP is updated with data that leads to the posterior distribution.

2.2 Multi-output Regression

So far, the GP regression allows functions with scalar outputs as in (9). For the extension to vector-valued outputs, multiple approaches exist: i) Extending the kernel to multivariate outputs [ÁRL12], ii) adding the output dimension as training data [Ber+17] or iii) using separated GPR for each output [Ras06]. While the first two approaches set a prior on the correlation between the output dimensions, the latter disregards a correlation without loss of generality. Following the approach in iii), the previous definition of the training set D is extended to a vector-valued output with

where n_ydat ∈ ℕ is the dimension of the output and the vector-valued GP is defined by

Following (5) to (9), we obtain for the predicted mean and variance

for each output dimension i ∈{1, …, n_ydat} with respect to the kernels k¹, …, k^n_ydat. The variable σ_n,i denotes the standard deviation of the Gaussian noise that corrupts the i-th dimension of the output measurements. The n_ydat components of f_GP|z^∗, D are combined into a multi-variable Gaussian distribution with

where Σ(f_GP|z^∗, D) denotes the posterior variance matrix. This formulation allows to use a GP prior on vector-valued functions to perform predictions for test points z^∗. This approach treats each output dimension separately which is mostly sufficient and easy-to-handle. An alternative approach is to include the dimension as additional input, e.g., as in [Ber+17], with the benefit of a single GP at the price of loss of interpretability. For highly correlated output data, a multi-output kernel might be beneficial, see [ÁRL12].

Remark 2. Without specific knowledge about a trend in the data, the prior mean functions m¹, …, m^n_ydat are often set to zero, see [Ras06]. Therefore, we set the mean functions to zero for the remainder of the report if not stated otherwise.

2.3 Kernel-based View

In Section 2.1, we target the GPR from a Bayesian perspective. However, for some applications of GPR a different point of view is beneficial; namely from the kernel perspective. In the following, we derive GPR from linear regression that is extended with a kernel transformation. In general, the prediction of parametric models is based on a parameter vector w which is typically learned using a set of training data points. In contrast, non-parametric models typically maintain at least a subset of the training data points in memory in order to make predictions for new data points. Many linear models can be transformed into a dual representation where the prediction is based on a linear combination of kernel functions. The idea is to transform the data points of a model to an often high-dimensional feature space where a linear regression can be applied to predict the model output, as depicted in Fig. 2. For a nonlinear feature map ϕ: Z →F, where F is a n_ϕ ∈ ℕ ∪{∞} dimensional Hilbert space, the kernel function is given by the inner product k(z, z^′) = ⟨ϕ(z), ϕ(z^′)⟩, ∀z, z^′ ∈ Z.

Figure 2: The mapping ϕ transforms the data points into a feature space where linear regressors can be applied to predict the output.

Thus, the kernel implicitly encodes the way the data points are transformed into a higher dimensional space. The formulation as inner product in a feature space allows to extend many standard regression methods. Also the GPR can be derived using a standard linear regression model

where z ∈ Z is the input vector, w ∈ ℝ^n_z the vector of weights with n_z = dim(Z) and f_lin: Z → ℝ the unknown function. The observed value ỹ_dat^{i} ∈ ℝ for the input x_dat^{i} ∈Z is corrupted by Gaussian noise ν ∼N(0, σ_n²) for all i = 1, …, n_D. The analysis of this model is analogous to the standard linear regression, i.e., we put a prior on the weights such that w ∼ N(0, Σ_p) with Σ_p ∈ ℝ^n_z×n_z. Based on n_D collected training data points as defined in Section 2.1, that leads to the well known linear Bayesian regression

where A_lin = σ_n⁻²XX^⊤ + Σ_p⁻¹. Now, using the feature map ϕ(z) instead of z directly, leads to f_GP(z) = ϕ(z)^⊤w with w ∼N(0, Σ_p), Σ_p ∈ ℝ^n_ϕ×n_ϕ. As long as the projections are fixed functions, i.e., independent of the parameters w, the model is still linear in the parameters and, thus, analytically tractable. In particular, the Bayesian regression (17) with the mapping ϕ(z) can be written as

with the matrix A_GP ∈ ℝ^n_ϕ×n_ϕ given by

This equation can be simplified and rewritten to

with k(z, z^′) = ϕ(z)^⊤Σ_pϕ(z^′) that equals (9). The fact that in (20) the feature map ϕ(z) is not needed is known as the kernel trick. This trick is also used in other kernel-based models, e.g., support vector machines (SVM), see [SC08] for more details.

2.4 Reproducing Kernel Hilbert Space

Even though a kernel neither uniquely defines the feature map nor the feature space, one can always construct a canonical feature space, namely the reproducing kernel Hilbert space (RKHS) given a certain kernel. After the introduction of the theory, illustrative examples for an intuitive understanding are presented. We will now formally present this construction procedure, starting with the concept of Hilbert spaces, following [BLG16]: A Hilbert space F represents all possible realizations of some class of functions, for example all functions of continuity degree i, denoted by Cⁱ. Moreover, a Hilbert space is a vector space such that any function f_F ∈F must have a non-negative norm, ∥f_F∥_F > 0 for f_F≠0. All functions f_F must additionally be equipped with an inner-product in F. Simply speaking, a Hilbert space is an infinite dimensional vector space, where many operations behave like in the finite case. The properties of Hilbert spaces have been explored in great detail in literature, e.g., in [DM+05]. An extremely useful property of Hilbert spaces is that they are equivalent to an associated kernel function [Aro50]. This equivalence allows to simply define a kernel, instead of fully defining the associated vector space. Formally speaking, if a Hilbert space H is a RKHS, it will have a unique positive definite kernel k: Z ×Z → ℝ, which spans the space H.

Theorem 1 (Moore-Aronszajn [Aro50]). Every positive definite kernel k is associated with a unique RKHS H.

Theorem 2 ([Aro50]). Let F be a Hilbert space, Z a non-empty set and ϕ: Z →F. Then, the inner product ⟨ϕ(z), ϕ(z^′)⟩_F : = k(z, z^′) is positive definite.

Importantly, any function f_H in H can be represented as a weighted linear sum of this kernel evaluated over the space H, as

with α_i ∈ ℝ for all i = {1, …, n_ϕ}, where n_ϕ ∈ ℕ ∪{∞} is the dimension of the feature space F. Thus, the RKHS is equipped with the inner-product

with f_H^′(⋅) = ∑_j=1^n_ϕα_j^′k ( ′{j} )
xdat ,⋅ ∈H, α_j^′ ∈ ℝ. Now, the reproducing character manifests as

According to [SHS06], the RKHS is then defined as

where ϕ(z) is the feature map constructing the kernel through k(z, z^′) = ⟨ϕ(z), ϕ(z^′)⟩_F.

Example 3. We want to find the RKHS for the polynomial kernel with degree 2 that is given by

for any z, z^′ ∈ ℝ². First, we have to find a feature map ϕ such that the kernel corresponds to the inner product k(z, y) = ⟨ϕ(z), ϕ(y)⟩. A possible candidate for the feature map is

We know that the RKHS contains all linear combinations of the form

with α, c, x_dat^{i} ∈ ℝ³. Therefore, a possible candidate for the RKHS H is given by

Next, it must be checked if the proposed Hilbert space is the related RKHS to the polynomial kernel with degree 2. This is achieved in two steps: i) Checking if the space is a Hilbert space and ii) confirming the reproducing property. First, we can easily proof that this is a Hilbert space rewriting f_H(z) = z^⊤Sz with symmetric matrix S ∈ ℝ^2×2 and using the fact that H is euclidean and isomorphic to S. Second, the condition for an RKHS must be fulfilled, i.e., the reproducing property f_H(z) = ⟨f_H(⋅), k(⋅, z)⟩_H. Since we can write

property (23) is fulfilled and, thus, H is the RKHS for the polynomial kernel with degree 2. Note that, even though the mapping ϕ is not unique for the kernel k, the relation of k and the RKHS H is unique.

Given a function f_H ∈ H defined by n_D observations, its RKHS norm is defined as

with α ∈ ℝ^n_D and K(X, X) given by (6). We can also use the feature map such that

As there is a unique relation between the RKHS H and the kernel k, the norm ∥f_H∥_H can equivalently be written as ∥f_H∥_k. The norm of a function in the RKHS indicates how fast the function varies over Z with respect to the geometry defined by the kernel. Formally, it can be written as

with the distance d(z, z^′)² = k(z, z) − 2k(z, z^′) + k(z^′, z^′). A function with finite RKHS norm is also element of the RKHS. A more detailed discussion about RKHS and norms is given in [Wah90].

Example 4. We want to find the RKHS norm of a function f_H that is an element of the RKHS of the polynomial kernel with degree 2 that is given by

Let the function be

Hence, function (29) with (30) and (31) corresponds to

Now, we have two possibilities how to calculate the RKHS norm. First, the RKHS-norm of f_H is calculated using (26) by

with X = [x_dat^{1}, x_dat^{2}, x_dat^{3}]. Alternatively, we can use (27) that results in ∥f_H∥_H = ∥c∥, where c is defined by (25). Thus, the norm is computed as

Example 5. In this example, we visualize the meaning of the RKHS norm. Figure 3 shows different quadratic functions with the same RKHS norm (top left and top right), a smaller RKHS norm (bottom left) and a larger RKHS norm (bottom right). An identical norm indicates a similar variation of the functions, whereas a higher norm leads to a more varying function.

Figure 3: Functions with different RKHS-norms: ∥f₁∥_H² = ∥f₂∥_H² = 4∥f₃∥_H² = 1
2

∥f₄∥_H².

In summary, we investigate the unique relation between the kernel and its RKHS. The reproducing property allows us to write the inner-product as a tractable function which implicitly defines a higher (or even infinite) feature dimensional space. The RKHS-norm of a function is a Lipschitz-like indicator based on the metric defined by the kernel. This view of the RKHS is related to the kernel trick in machine learning. In the next section, the RKHS-norm is exploited to determine the error between the prediction of GPR and the actual data-generating function.

2.5 Model Error

One of the most interesting properties of GPR is the uncertainty description encoded in the predicted variance. This uncertainty is beneficial to quantify the error between the actual underlying data generating process and the GPR. In this section, we assume that there is an unknown function f_uk: ℝ^n_z → ℝ that generates the training data. In detail, the data set D = {X, Y } consists of

where the data is generated by

for all i = {1, …, n_D}. Without any assumptions on f_uk it is obviously not possible to quantify the model error. Loosely speaking, the prior distribution of the GPR with kernel k must be suitable to learn the unknown function. More technically, f_uk must be an element of the RKHS spanned by the kernel as described in (24). This leads to the following assumption.

Assumption 1. The function f_uk has a finite RKHS norm with respect to the kernel k, i.e., ∥f_uk∥_H < ∞, where H is the RKHS spanned by k.

This sounds paradoxical as f_uk is assumed to be unknown. However, there exist kernels that can approximate any continuous function arbitrarily exact. Thus, for any continuous function, an arbitrarily close function is element of the RKHS of an universal kernel. For more details, we refer to Section 2.4. More infomation about the model error of misspecified Gaussian Process model can be found in [BUH18]
We classify the error quantification in three different approaches: i) the robust approach, ii) the scenario approach, and iii) the information-theoretical approach. The different techniques are presented in the following and visualized in Fig. 4. For the remainder of this section, we assume that a GPR is trained with the data set (32) and Assumption 1 holds.

2.5.1 Robust approach

The robust approach exploits the fact that the prediction of the GPR is Gaussian distributed. Thus, for any z^∗ ∈ ℝ^n_z, the model error is bounded by

with high probability where c ∈ ℝ_>0 adjusts the probability. However, for multiple test points z₁^∗, z₂^∗, … ∈ ℝ^n_z, this approach neglects any correlation between f_GP(z₁^∗), f_GP(z₂^∗), …. Figure 4 shows how for a given z₁^∗ and z₂^∗, the variance is exploited as upper bound. Thus, any prediction is handled independently, which leads to a very conservative bound, see [UBH18].

2.5.2 Scenario approach

Instead of using the mean and the variance as in the robust approach, the scenario approach deals with the samples of the GPR directly. In contrast to the other methods, there is no direct model error quantification but rather a sample based quantification. The idea is to draw a large number n_scen ∈ ℕ of sample functions f_GP¹, f_GP², …, f_GP^n_scen over n_s ∈ ℕ sampling points. The sampling is performed by drawing multiple instances from f_GP given by the multivariate Gaussian distribution

where X^∗ = [z₁^∗, ⋅⋅⋅ , z_{n_s}^∗] contains the sampling points. Each sample can then be used in the application instead of the unknown function. For a large number of samples it is assumed that the unknown function is close to one of these samples. However, the crux of this approach is to determine, for a given model error c ∈ ℝ_>0, the required number of samples n_scen and probability δ_scen > 0 such that

for all z^∗ ∈ Z. In Fig. 4, five different samples of a GP model are drawn as example.

2.5.3 Information-theoretical approach

Alternatively, the work in [Sri+12] derives an upper bound for samples of the GPR on a compact set with a specific probability. In contrast to the robust approach, the correlation between the function values are considered. We restate here the theorem of [Sri+12].

Theorem 3 ([Sri+12]). Given Assumption 1, the model error Δ ∈ ℝ

is bounded for all z on a compact set Ω ⊂ ℝ^n_z with a probability of at least δ ∈ (0, 1) by

where β ∈ ℝ is defined as

The variable γ_max ∈ ℝ is the maximum of the information gain

with Gram matrix K(z, z^′) and the input elements z, z^′ ∈{x_dat^{1}, …, x_dat^{n_D+1}}.

To compute this bound, the RKHS norm of f_uk must be known. That is in application usually not the case. However, often the norm can be upper bounded and thus, the bound in Theorem 3 can be upper bounded. For this purpose, the relation of the RKHS norm to the Lipschitz constant given by (28) is beneficial as the Lipschitz constant is more likely to be known. In general, the computation of the information gain is a non-convex optimization problem. However, the information capacity γ_max has a sub-linear dependency on the number of training points for many commonly used kernel functions [Sri+12]. Therefore, even though β is increasing with the number of training data, it is possible to learn the true function f_uk arbitrarily exactly [Ber+16]. In contrast to the other approaches, Theorem 3 allows to bound the error for any test point in a compact set. In [BKH19], we exploit this approach in GP model based control tasks. The right illustration of Fig. 4 visualizes the information-theoretical bound.

Figure 4: Different approaches to quantify the model error: Robust approach (left), scenario approach (middle), information-theoretical approach (right).

3 Model Selection

Equation (9) clearly shows the immense impact of the kernel on the posterior mean and variance. However, this is not surprising as the kernel is an essential part of the prior model. For practical applications that leads to the question how to choose the kernel. Additionally, most kernels depend on a set of hyperparameters that must be defined. Thus in order to turn GPR into a powerful practical tool it is essential to develop methods that address the model selection problem. We see the model selection as the determination of the kernel and its hyperparameters. We only focus on kernels that are defined on Z ⊆ ℝ^n_z. In the next two subsections, we present different kernels and explain the role of the hyperparameters and their selection, mainly based on [Ras06].

Remark 3. The selection of the kernel functions seems to be similar to the model selection for parametric models. However, there are two major differences: i) the selection is fully covered by the Bayesian methodology and ii) many kernels allow to model a wide range of different functions whereas parametric models a typically limited to very specific types of functions.

3.1 Kernel Functions

The value of the kernel function k(z, z^′) is an indicator of the interaction of two states (z, z^′). Thus, an essential part of GPR is the selection of the kernel function and the estimation of its free parameters φ₁, φ₂, …, φ_{n_φ}, called hyperparameters. The number n_φ of hyperparameters depends on the kernel function. The choice of the kernel function and the determination of the corresponding hyperparameters can be seen as degrees of freedom of the regression. First of all, we start with the general properties of a function to be qualified as a kernel for GPR. A necessary and sufficient condition for the function k: Z ×Z → ℝ to be a valid kernel is that the Gram matrix, see (6), is positive semidefinite for all possible input values [SC04].

Remark 4. As shown in Section 2.4, the kernel function must be positive definite to span an unique RKHS. That seems to be contradictory to the required positive semi-definiteness of the Gram matrix. The solution is the definition of positive definite kernels as it is equivalent to a positive semi-definite Gram matrix. In detail, a symmetric function k: Z×Z → ℝ is a positive definite kernel on Z if

holds for any n_D ∈ ℕ, x_dat^{1}, …, x_dat^{n_D} ∈Z and c₁, …, c_n ∈ ℝ. Thus, there exists a positive semi-definite matrix A_G ∈ ℝ^n_D×n_D such that

holds for any n_D ∈ ℕ and z ∈Z.

The set of functions k which fulfill this condition is denoted with K. Kernel functions can be separated into two classes, the stationary and non-stationary kernels. A stationary kernel is a function of the distance z −z^′. Thus, it is invariant to translations in the input space. In contrast, non-stationary kernels depend directly on z,z^′ and are often functions of a dot product z^⊤z. In the following, we list some common kernel functions with their basic properties. Even though the number of presented kernels is limited, new kernels can be constructed easily as K is closed under specific operations such as addition and scalar multiplication. At the end, we summarize the equation of each kernel in Table 1 and provide a comparative example.

3.1.1 Constant Kernel

The equation for the constant kernel is given by

This kernel is mostly used in addition to other kernel functions. It depends one a single hyperparameter φ₁ ∈ ℝ_≥0.

3.1.2 Linear Kernel

The equation for the linear kernel is given by

The linear kernel is a dot-product kernel and thus, non-stationary. The kernel can be obtained from Bayesian linear regression as shown in Section 2.3. The linear kernel is often used in combination with the constant kernel to include a bias.

3.1.3 Polynomial Kernel

The equation for the polynomial kernel is given by

The polynomial kernel has an additional parameter p ∈ ℕ, that determines the degree of the polynomial. Since a dot product is contained, the kernel is also non-stationary. The prior variance grows rapidly for ∥z∥ > 1 such that the usage for some regression problems is limited. It depends on a single hyperparameter φ₁ ∈ ℝ_≥0.

3.1.4 Matérn Kernel

The equation for the Matérn kernel is given by

with p = p + 1
2 , p ∈ ℕ. The Matérn kernel is a very powerful kernel and presented here with the most common parameterization for p. Functions drawn from a GP model with Matérn kernel are p-times differentiable. The more general equation of this stationary kernel can be found in [Bis06]. This kernel is an universal kernel which is explained in the following.

Lemma 1 ([SC08, Lemma 4.55]). Consider the RKHS H(Z_c) of an universal kernel on any prescribed compact subset Z_c ∈Z. Given any positive number 𝜀 and any function f_C ∈ C¹(Z_C), there is a function f_H ∈H(Z_c) such that ∥f_C − f_H∥_{Z_c} ≤ 𝜀.

Intuitively speaking, a GPR with an universal kernel can approximate any continuous function arbitrarily exact on a compact set. For p →∞, it results in the squared exponential kernel. The two hyperparameters are φ₁ ∈ ℝ_≥0 and φ₂ ∈ ℝ_>0.

3.1.5 Squared Exponential Kernel

The equation for the squared exponential kernel is given by

Probably the most widely used kernel function for GPR is the squared exponential kernel, see [Ras06]. The hyperparameter φ₁ describes the signal variance which determines the average distance of the data-generating function from its mean. The lengthscale φ₂ defines how far it is needed to move along a particular axis in input space for the function values to become uncorrelated. Formally, the lengthscale determines the number of expected upcrossings of the level zero in a unit interval by a zero-mean GP. The squared exponential kernel is infinitely differentiable, which means that the GPR exhibits a smooth behavior. As limit of the Matérn kernel, it is also an universal kernel, see [MXZ06].

Example 6. Figure 5 shows the power for regression of universal kernel functions. In this example, a GPR with squared exponential kernel is used for different training data sets. The hyperparameter are optimized individually for each training data set by means of the likelihood, see Section 3.2. Note that all presented regressions are based on the same GP model, i.e. the same kernel function, but with different data sets. That highlights again the superior flexibility of GPR.

Figure 5: Examples for the flexibility of the regression that all are based on the same GP model.

3.1.6 Rational Quadratic Kernel

The equation for the rational quadratic kernel is given by

This kernel is equivalent to summing over infinitely many squared exponential kernels with different lengthscales. Hence, GP priors with this kernel are expected to see functions which vary smoothly across many lengthscales. The parameter p determines the relative weighting of large-scale and small-scale variations. For p →∞, the rational quadratic kernel is identical to the squared exponential kernel.

3.1.7 Squared Exponential ARD Kernel

The equation for the squared exponential ARD kernel is given by

The automatic relevance determination (ARD) extension to the squared exponential kernel allows for independent lenghtscales φ₂, …, φ_{1+n_z} ∈ ℝ_>0 for each dimension of z, z^′ ∈ ℝ^n_z. The individual lenghtscales are typically larger for dimensions which are irrelevant as the covariance will become almost independent of that input. A more detailed discussion about the advantages of different kernels can be found, for instance, in [Mac97] and [Bis06].

Example 7. In this example, we use three GPRs with the same set of training data

but with different kernels, namely the squared exponential (47), the linear (44), and the polynomial (45) kernel. Figure 6 shows the different shapes of the regressions with the posterior mean (red), the posterior variance (gray shaded) and the training points (black). Even for this simple data set, the flexibility of the squared exponential kernel is already visible.

Figure 6: GPR with different kernels: squared exponential (left), linear (middle) and polynomial with degree 2 (right).

Kernel name

k(z, z^′) =

Constant

2 φ 1

Linear

⊤ ′ 2 z z + φ1

Polynomial p ∈ ℕ

( ⊤ ′ 2)p z z + φ1

Matérn p = p + 0.5, p ∈ ℕ

( √ --- ′ ) ∑p ( √ --- ′ )p−i φ2 exp − --2ˇp∥z-−-z-∥- -p!-- -(p-+-i)!- --8ˇp∥z-−--z∥- 1 φ2 (2p)!i=0i!(p − i)! φ2

Squared exponential

( ) 2 ∥z-−-z-′∥2- φ1 exp − 2φ2 2

Rational quadratic

( ′2 )−p φ21 1 + ∥z-−-z-∥-- 2p φ22

Squared exponential ARD

\text{[math]}

Table 1: Overview of some commonly used kernel functions.

3.2 Hyperparameter Optimization

In addition to the selection of a kernel function, values for any hyperparameter must be determined to perform the regression. The number of hyperparameters depends on the kernel function used. We concatenate all hyperparameters in a vector φ with size n_φ ∈ ℕ, where φ ∈ Φ ⊆ ℝ^n_φ. The hyperparameter set Φ is introduced to cover the different spaces of the individual hyperparameters as defined in the following.

Definition 1. The set Φ is called a hyperparameter set for a kernel function k if and only if the set Φ is a domain for the hyperparameters φ of k.

Often, the signal noise σ_n², see (5), is also treated as hyperparameter. For a better understanding, we keep the signal noise separated from the hyperparameters. There exist several techniques that allow computing the hyperparameters and the signal noise with respect to one optimality criterion. From a Bayesian perspective, we want to find the vector of hyperparameters φ which are most likely for the output data Y given the inputs X and a GP model. For this purpose, one approach is to optimize the log marginal likelihood function of the GP. Another idea is to split the training set into two disjoint sets, one which is actually used for training, and the other, the validation set, which is used to monitor performance. This approach is known as cross-validation. In the following, these two techniques for the selection of hyperparameters are presented.

3.2.1 Log Marginal Likelihood Approach

A very common method for the optimization of the hyperparameters is by means of the negative log marginal likelihood function, often simply named as (neg. log) likelihood function. It is marginal since it is obtained through marginalization over the function f_GP. The marginal likelihood is the likelihood that the output data Y ∈ ℝ^n_D fits to the input data X with the hyperparameters φ. It is given by

A detailed derivation can be found in [Ras06]. The three terms of the marginal likelihood in (51) have the following roles:

Y ^⊤(K + σ_n²I_{n_D})⁻¹Y is the only term that depends on the output data Y and represents the data-fit.
log |K +σ_n²I_{n_D}| penalizes the complexity depending on the kernel function and the input data X.
log 2π is a normalization constant.

Remark 5. For the sake of notational simplicity, we suppress the dependency on the hyperparameters of the kernel function k whenever possible.

The optimal hyperparameters φ^∗ ∈ Φ and signal noise σ_n^∗ in the sense of the likelihood are obtained as the minimum of the negative log marginal likelihood function

Since an analytic solution of the derivation of (51) is impossible, a gradient based optimization algorithm is typically used to minimize the function. However, the negative log likelihood is non-convex in general such that there is no guarantee to find the optimum φ^∗, σ_n^∗. In fact, every local minimum corresponds to a particular interpretation of the data. In the following example, we visualize how the hyperparameters affect the regression.

Example 8. A GPR with the squared exponential kernel is trained on eight data points. The signal variance is fixed to φ₁ = 2.13. First, we visualize the influence of the lengthscale. For this purpose, the signal noise is fixed to σ_n = 0.21. Figure 7 shows the posterior mean of the regression and the neg. log likelihood function. On the left side are three posterior means for different lengthscales. A short lengthscale results in overfitting whereas a large lengthscale smooths out the training data (black crosses). The dotted red function represents the mean with optimized lengthscale by a descent gradient algorithm with respect to (52). The right plot shows the neg. log likelihood over the signal variance φ₁ and lengthscale φ₂. The minimum is here at φ^∗ = [2.13, 1.58]^⊤.

Figure 7: Left: Regression with different lengthscales: φ₂ = 0.67 (cyan, solid), φ₂ = 7.39 (brown, dashed), and φ₂ = 1.58 (red, dotted). Right: Neg. log likelihood function over signal variance φ₁ and lengthscale φ₂.

Next, the meaning of different interpretations of the data is visualized by varying the signal noise σ_n and the lengthscale φ₂. The right plot of Fig. 8 shows two minima of the negative log likelihood function. The lower left minimum at log(σ_n) = 0.73 and log(φ₂) = −1.51 interprets the data as slightly noisy which leads to the dotted red posterior mean in the left plot. In contrast, the upper right minimum at log(σ_n) = 5 and log(φ₂) = −0.24 interprets the data as very noisy without a trend, which manifests as the cyan posterior mean in the left plot. Depending on the initial value, a gradient based optimizer would terminate in one of these minima.

Figure 8: Left: Different interpretation of the data: Noisy data without a trend (cyan, solid) and slightly noisy data (red, dotted). Right: Negative log likelihood function over signal noise σ_n and lengthscale φ₂.

3.2.2 Cross-validation Approach

This approach works with a separation of the data set D in two classes: one for training and one for validation. Cross-validation is almost always used in the k_cv-fold cross-validation setting: the k_cv-fold cross-validation data is split into k_cv disjoint, equally sized subsets; validation is done on a single subset and training is done using the union of the remaining k_cv − 1 subsets, the entire procedure is repeated k_cv times, each time with a different subset for validation. Here, without loss of generality, we present the leave-one-out cross-validation, which means k_cv = n_D. The predictive log probability when leaving out a training point {x_dat^{i}, ỹ_dat^{i}} is given by

where μ_−i = μ(f_GP(x_dat^{i})|x_dat^{i}, X_:,−i, Y _−i) and var _−i = var(f_GP(x_dat^{i})|x_dat^{i}, X_:,−i, Y _−i). The −i index indicates X and Y without the element x_dat^{i} and ỹ_dat^{i}, respectively. Thus, (53) is the probability for the output y_dat^{i} at x_dat^{i} but without the training point {x_dat^{i}, ỹ^{i}}. Accordingly, the leave-one-out log predictive probability L_LOO ∈ ℝ is

In comparison to the log likelihood approach (52), the cross-validation is in general more computationally expensive but might find a better representation of the data set, see [GE79] for discussion and related approaches.

4 Gaussian Process Dynamical Models

So far, we consider GPR in non-dynamical settings where only an input-to-output mapping is considered. However, Gaussian process dynamical models (GPDMs) have recently become a versatile tool in system identification because of their beneficial properties such as the bias variance trade-off and the strong connection to Bayesian mathematics, see [FCR14]. In many works, where GPs are applied to dynamical model, only the mean function of the process is employed, e.g., in [WHB05] and [Cho+13]. This is mainly because GP models are often used to replace deterministic parametric models. However, GPDMs contain a much richer description of the underlying dynamics, but also the uncertainty about the model itself when the full probabilistic representation is considered. Therefore, one main aspect of GPDMs is to distinguish between recurrent structures and non-recurrent structures. A model is called recurrent if parts of the regression vector depend on the outputs of the model. Even though recurrent models become more complex in terms of their behavior, they allow to model sequences of data, see [Sjö+95]. If all states are fed back from the model itself, we get a simulation model, which is a special case of the recurrent structure. The advantage of such a model is its property to be independent from the real system. Thus, it is suitable for simulations, as it allows multi-step ahead predictions. In this report, we focus on two often-used recurrent structures: the Gaussian process state space model (GP-SSM) and the Gaussian process nonlinear error output (GP-NOE) model.

4.1 Gaussian Process State Space Models

Gaussian process state space models are structured as a discrete-time system. In this case, the states are the regressors, which is visualized in Fig. 9. This approach allows to be more efficient, since the regressors are less restricted in their internal structure as in input-output models. Thus, a very efficient model in terms of number of regressors might be possible. The mapping from the states to the output is often be assumed to be known. The situation, where the output mapping describes a known sensor model, is such an example. It is mentioned in [Fri+13] that using too flexible models for both, the state mapping f and the output mapping, can result in problems of non-identifiability. Therefore, we focus on a known output mapping. The mathematical model of the GP-SSM is thus given by

where ξ_t ∈ ℝ^n_ξ, n_ξ = n_x + n_u is the concatenation of the state vector xt ∈X ⊆ ℝ^n_x and the input ut ∈ U ⊆ ℝ^n_u such that ξ_t = [xt; u_t]. The mean function is given by continuous functions m¹, …, m^n_x: ℝ^n_ξ → ℝ. The output mapping is parametrized by a known vector γ_y ∈ ℝ^n_γ with n_γ ∈ ℕ. The system identification task for the GP-SSM mainly focuses on f in particular. It can be described as finding the state-transition probability conditioned on the observed training data.

Remark 6. The potentially unknown number of regressors can be determined using established nonlinear identification techniques as presented in [KL99], or exploiting embedded techniques such as automatic relevance determination [Koc16]. A mismatch leads to similar issues as in parametric system identification.

Figure 9: Structure of a GP-SSM with 𝔓 as backshift operator, such that 𝔓⁻¹x t+1 = xt

4.2 Gaussian Process Nonlinear Output Error Models

The GP-NOE model uses the past n_in ∈ ℕ_>0 input values ut ∈U and the past n_in ∈ ℕ_>0 output values yt ∈ ℝ^n_y of the model as the regressors. Figure 10 shows the structure of GP-NOE, where the outputs are feedbacked. Analogously to the GP-SSM, the mathematical model of the GP-NOE is given by

where ζt ∈ ℝ^n_ζ, n_ζ = n_inn_y + n_inn_u is the concatenation of the past outputs yt and inputs ut such that ζt = [y_{t−n_in+1}; …; y_t; u_{t−n_in+1}; …; u_t]. The mean function is given by continuous functions m¹, …, m^n_y: ℝ^n_ζ → ℝ. In contrast to nonlinear autoregressive exogenous models, that focus on one-step ahead prediction, a NOE model is more suitable for simulations as it considers the multi-step ahead prediction [Nel13]. However, the drawback is a more complex training procedure that requires a nonlinear optimization scheme due to their recurrent structure [Koc16].

Figure 10: Structure of a GP-NOE model with 𝔓 as backshift operator (𝔓⁻¹y t+1 = yt)

Remark 7. It is always possible to convert an identified input-output model into a state-space model, see[PL70]. However, focusing on state-space models only would preclude the development of a large number of useful identification results.

Remark 8. Control relevant properties of GP-SSMs and GO-NOE models are discussed in [BH16b; BH16a; BH20].

5 Summary

In this article, we introduce the GP and its usage in GPR. Based on the property, that every finite subset of a GP follows a multi-variate Gaussian distribution, a closed-form equation can be derived to predict the mean and variance for a new test point. The GPR can intrinsically handle noisy output data if it is Gaussian distributed. As GPR is a data-driven method, only little prior knowledge is necessary for the regression. Further, the complexity of GP models scales with the number of training points. A degree of freedom in the modeling is the selection of the kernel function and its hyperparameters. We present an overview of common kernels and the necessary properties to be a valid kernel function. For the hyperparameter determination, two approaches based on numerical optimization are shown. The kernel of the GP is uniquely related to a RKHS, which determines the shape of the samples of the GP. Based on this, we compare different approaches for the quantification of the model error that quantifies the error between the GPR and the actual data-generating function. Finally, we introduce how GP models can be used as dynamical systems in GP-SSMs and GP-NOE models.

6 Conditional Distribution

Let ν₁ ∈ ℝ^n_ν₁, ν₂ ∈ ℝ^n_ν₂ with n_ν₁, n_ν₁ ∈ ℕ be probability variables, which are multivariate Gaussian distribution

with mean μ₁ ∈ ℝ^n_ν₁, μ₂ ∈ ℝ^n_ν₂ and variance Σ₁₁ ∈ ℝ^{n_ν₁×n_ν₁}, Σ₁₂ ∈ ℝ^{n_ν₂×n_ν₁}, Σ₂₂ ∈ ℝ^{n_ν₂×n_ν₂}. The task is to determine the conditional probability

The joined probability p(ν₁, ν₂) is a multivariate Gaussian distribution with

where x = [x₁; x₂], x₁ ∈ ℝ^n_ν₁, x₂ ∈ ℝ^n_ν₂. The marginal distribution of ν₁ is defined by the mean μ₁ and the variance Σ₁₁ such that

The division of the joint distribution by the marginal distribution results again in a Gaussian distribution with

where the first part ∗ can be rewritten as

Thus, the covariance matrix Σ_22|1 of the conditional distribution p(ν₂|ν₁) is given by

For the simplification of the second part ∗∗ of (63), we exploit the special block structure of Σ, such that its inverse is given by

with N = (Σ₂₂ − Σ₂₁Σ₁₁⁻¹Σ₁₂)⁻¹. Thus, we compute ∗∗ as

Finally, the conditional probability is given with the conditional mean μ_2|1 ∈ ℝ^n_ν₂ and the conditional covariance matrix Σ_22|1 ∈ ℝ^{n_ν₂×n_ν₂} by

References

[ÁRL12]: Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. “Kernels for Vector-Valued Functions: A Review”. In: Foundations and Trends in Machine Learning 4.3 (2012), pp. 195–266. doi: 10.1561/2200000036.
[Aro50]: Nachman Aronszajn. “Theory of reproducing kernels”. In: Transactions of the American mathematical society 68.3 (1950), pp. 337–404. doi: 10.2307/1990404.
[Ber+16]: Felix Berkenkamp, Riccardo Moriconi, Angela P. Schoellig, and Andreas Krause. “Safe Learning of Regions of Attraction for Uncertain, Nonlinear Systems with Gaussian Processes”. In: 2016 IEEE 55th Conference on Decision and Control (CDC). 2016, pp. 4661–4666.
[Ber+17]: Felix Berkenkamp, Matteo Turchetta, Angela P. Schoellig, and Andreas Krause. “Safe Model-based Reinforcement Learning with Stability Guarantees”. In: Advances in Neural Information Processing Systems. 2017, pp. 908–918.
[BH16a]: Thomas Beckers and Sandra Hirche. “Equilibrium distributions and stability analysis of Gaussian Process State Space Models”. In: 2016 IEEE 55th Conference on Decision and Control (CDC). 2016, pp. 6355–6361. doi: 10.1109/CDC.2016.7799247.
[BH16b]: Thomas Beckers and Sandra Hirche. “Stability of Gaussian Process State Space Models”. In: 2016 European Control Conference (ECC). 2016, pp. 2275–2281. doi: 10.1109/ECC.2016.7810630.
[BH20]: Thomas Beckers and Sandra Hirche. “Prediction with Gaussian Process Dynamical Models”. In: Transaction on Automatic Control (2020). Submitted.
[Bis06]: Christopher Bishop. Pattern recognition and machine learning. Springer-Verlag New York, 2006. isbn: 9780387310732.
[BKH19]: Thomas Beckers, Dana Kulić, and Sandra Hirche. “Stable Gaussian Process based Tracking Control of Euler-Lagrange Systems”. In: Automatica 103 (2019), pp. 390–397. doi: 10.1016/j.automatica.2019.01.023.
[BLG16]: Yusuf Bhujwalla, Vincent Laurain, and Marion Gilson. “The impact of smoothness on model class selection in nonlinear system identification: An application of derivatives in the RKHS”. In: 2016 American Control Conference (ACC). 2016, pp. 1808–1813. doi: 10.1109/ACC.2016.7525181.
[BUH18]: Thomas Beckers, Jonas Umlauft, and Sandra Hirche. “Mean Square Prediction Error of Misspecified Gaussian Process Models”. In: 2018 IEEE Conference on Decision and Control (CDC). 2018, pp. 1162–1167. doi: 10.1109/CDC.2018.8619163.
[Cho+13]: Girish Chowdhary, Hassan Kingravi, Jonathan How, and Patricio A. Vela. “Bayesian nonparametric adaptive control of time-varying systems using Gaussian processes”. In: 2013 American Control Conference. 2013, pp. 2655–2661. doi: 10.1109/ACC.2013.6580235.
[DM+05]: Lokenath Debnath, Piotr Mikusinski, et al. Introduction to Hilbert spaces with applications. Academic press, 2005.
[FCR14]: Roger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian Process State-Space Models. 2014. arXiv: 1406.4905 [cs.LG].
[Fri+13]: Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. “Bayesian inference and learning in Gaussian process state-space models with particle MCMC”. In: Advances in Neural Information Processing Systems. 2013, pp. 3156–3164.
[GE79]: Seymour Geisser and William F. Eddy. “A Predictive Approach to Model Selection”. In: Journal of the American Statistical Association 74.365 (1979), pp. 153–160. doi: 10.1080/01621459.1979.10481632.
[KL99]: Robert Keviczky and Haber Laszlo. Nonlinear system identification: input-output modeling approach. Springer Netherlands, 1999. isbn: 9780792358589.
[Koc16]: Juš Kocijan. Modelling and Control of Dynamic Systems Using Gaussian Process Models. Springer International Publishing, 2016. doi: 10.1007/978-3-319-21021-6.
[Mac97]: David J. MacKay. Gaussian Processes - A Replacement for Supervised Neural Networks? 1997.
[MXZ06]: Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. “Universal kernels”. In: Journal of Machine Learning Research 7 (2006), pp. 2651–2667.
[Nel13]: Oliver Nelles. Nonlinear system identification: from classical approaches to neural networks and fuzzy models. Springer-Verlag Berlin Heidelberg, 2013. doi: 10.1007/978-3-662-04323-3.
[PL70]: M. Q. Phan and R. W. Longman. “Relationship between state-space and input-output models via observer Markov parameters”. In: WIT Transactions on The Built Environment 22 (1970). doi: 10.2495/DCSS960121.
[Rad96]: Neal M. Radford. Bayesian learning for neural networks. Springer-Verlag New York, 1996. doi: 10.1007/978-1-4612-0745-0.
[Ras06]: Carl E. Rasmussen. Gaussian processes for machine learning. The MIT Press, 2006. isbn: 9780262182539.
[SC04]: John Shawe-Taylor and Nello Cristianini. Kernel methods for pattern analysis. Cambridge university press, 2004. doi: 10.1017/CBO9780511809682.
[SC08]: Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008. doi: 10.1007/978-0-387-77242-4.
[SHS06]: Ingo Steinwart, Don Hush, and Clint Scovel. “An explicit description of the reproducing kernel Hilbert spaces of Gaussian RBF kernels”. In: IEEE Transactions on Information Theory 52.10 (2006), pp. 4635–4643. doi: 10.1109/TIT.2006.881713.
[Sjö+95]: Jonas Sjöberg, Qinghua Zhang, Lennart Ljung, Albert Benveniste, Bernard Delyon, Pierre-Yves Glorennec, Håkan Hjalmarsson, and Anatoli Juditsky. “Nonlinear black-box modeling in system identification: a unified overview”. In: Automatica 31.12 (1995), pp. 1691–1724. doi: 10.1016/0005-1098(95)00120-8.
[Sri+12]: Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias W. Seeger. “Information-theoretic regret bounds for Gaussian process optimization in the bandit setting”. In: IEEE Transactions on Information Theory 58.5 (2012), pp. 3250–3265. doi: 10.1109/TIT.2011.2182033.
[UBH18]: Jonas Umlauft, Thomas Beckers, and Sandra Hirche. “A Scenario-based Optimal Control Approach for Gaussian Process State Space Models”. In: 2018 European Control Conference (ECC). 2018, pp. 1386–1392. doi: 10.23919/ECC.2018.8550458.
[Wah90]: Grace Wahba. Spline models for observational data. SIAM, 1990. doi: 10.1137/1.9781611970128.
[WHB05]: Jack Wang, Aaron Hertzmann, and David M. Blei. “Gaussian process dynamical models”. In: Proceedings of the 18th International Conference on Neural Information Processing System. 2005, pp. 1441–1448. doi: 10.5555/2976248.2976429.
[WR96]: Christopher K. Williams and Carl E. Rasmussen. “Gaussian processes for regression”. In: Advances in neural information processing systems. 1996, pp. 514–520.

Contents