\[
\begin{aligned}
f(x) \mid f(x_{1:n}) &\sim \mathrm{Normal}\bigl(\mu_n(x), \sigma_n^2(x)\bigr) \\
\mu_n(x) &= \Sigma_0(x, x_{1:n})\,\Sigma_0(x_{1:n}, x_{1:n})^{-1}\bigl(f(x_{1:n}) - \mu_0(x_{1:n})\bigr) + \mu_0(x) \\
\sigma_n^2(x) &= \Sigma_0(x, x) - \Sigma_0(x, x_{1:n})\,\Sigma_0(x_{1:n}, x_{1:n})^{-1}\,\Sigma_0(x_{1:n}, x).
\end{aligned}
\tag{3}
\]
This conditional distribution is called the posterior probability distribution in the nomenclature of Bayesian statistics. The posterior mean µn(x) is a weighted average between the prior µ0(x) and an estimate based on the data f (x1:n), with a weight that depends on the kernel. The posterior variance σn^2(x) is equal to the prior covariance Σ0(x, x) less a term that corresponds to the variance removed by observing f (x1:n).

Figure 2: Random functions f drawn from a Gaussian process prior with a power exponential kernel. Each plot corresponds to a different value for the parameter α1, with α1 decreasing from left to right. Varying this parameter creates different beliefs about how quickly f (x) changes with x.
Rather than computing posterior means and variances directly using (3) and matrix inversion, it is typically faster and more numerically stable to use a Cholesky decomposition and then solve a linear system of equations. This more sophisticated technique is discussed as Algorithm 2.1 in Section 2.2 of Rasmussen and Williams (2006). Additionally, to improve the numerical stability of this approach or direct computation using (3), it is often useful to add a small positive number like 10^-6 to each element of the diagonal of Σ0(x1:n, x1:n), especially when x1:n contains two or more points that are close together. This prevents eigenvalues of Σ0(x1:n, x1:n) from being too close to 0, and only changes the predictions that would be made by an infinite-precision computation by a small amount.
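As a concrete illustration, the following is a minimal numpy/scipy sketch of this computation. It assumes hypothetical callables kernel(A, B) and mu0(A), not defined in the text, that evaluate Σ0 and µ0 on arrays whose rows are points, and it sketches the Cholesky-plus-jitter approach rather than reproducing Algorithm 2.1 verbatim.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_posterior(kernel, mu0, X_train, y_train, X_test, jitter=1e-6):
    """Posterior mean and variance of f at the rows of X_test, given noise-free
    observations y_train = f(X_train), following (3) but replacing the explicit
    matrix inverse with a Cholesky solve and a small diagonal jitter."""
    K = kernel(X_train, X_train) + jitter * np.eye(len(X_train))  # Sigma_0(x_{1:n}, x_{1:n}) + jitter
    K_star = kernel(X_test, X_train)                              # Sigma_0(x, x_{1:n}) for each test point x
    chol = cho_factor(K, lower=True)                              # Cholesky factorization of the jittered matrix

    # Posterior mean: Sigma_0(x, x_{1:n}) K^{-1} (f(x_{1:n}) - mu_0(x_{1:n})) + mu_0(x)
    mean = K_star @ cho_solve(chol, y_train - mu0(X_train)) + mu0(X_test)

    # Posterior variance: Sigma_0(x, x) - Sigma_0(x, x_{1:n}) K^{-1} Sigma_0(x_{1:n}, x)
    v = cho_solve(chol, K_star.T)                                 # K^{-1} Sigma_0(x_{1:n}, x)
    var = np.diag(kernel(X_test, X_test)) - np.sum(K_star * v.T, axis=1)
    return mean, var
```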
Although we have modeled f at only a finite number of points, the same approach can be used when modeling f over a continuous domain A. Formally a Gaussian process with mean function µ0 and kernel Σ0 is a probability distribution over the function f with the property that, for any given collection of points x1:k, the marginal probability distribution on f (x1:k) is given by (2). Moreover, the arguments that justified (3) still hold when our prior probability distribution on f is a Gaussian process.
In addition to calculating the conditional distribution of f (x) given f (x1:n), it is also possible to calculate the conditional distribution of f at more than one unevaluated point. The resulting distribution is multivariate normal, with a mean vector and covariance kernel that depend on the location of the unevaluated points, the locations of the measured points x1:n, and their measured values f (x1:n). The functions that give entries in this mean vector and covariance matrix have the form required for a mean function and kernel described above, and the conditional distribution of f given f (x1:n) is a Gaussian process with this mean function and covariance kernel.
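Continuing the sketch above, the joint posterior over several unevaluated points changes only the variance computation, returning a full covariance matrix instead of its diagonal (same hypothetical kernel and mu0 helpers and imports as before):

```python
def gp_joint_posterior(kernel, mu0, X_train, y_train, X_test, jitter=1e-6):
    """Mean vector and covariance matrix of the multivariate normal posterior on
    f at the rows of X_test, i.e. the conditional GP restricted to X_test."""
    K = kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    K_star = kernel(X_test, X_train)
    chol = cho_factor(K, lower=True)
    mean = K_star @ cho_solve(chol, y_train - mu0(X_train)) + mu0(X_test)
    cov = kernel(X_test, X_test) - K_star @ cho_solve(chol, K_star.T)
    return mean, cov
```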
Choosing a Mean Function and Kernel
We now discuss the choice of kernel. Kernels typically have the property that points closer in the input space are more strongly correlated, i.e., that if ||x − x′|| < ||x − x′′|| for some norm || · ||, then Σ0(x, x′) > Σ0(x, x′′). Additionally, kernels are required to be positive semi-definite functions. Here we describe two example kernels and how they are used.
One commonly used and simple kernel is the power exponential or Gaussian kernel,
\[
\Sigma_0(x, x') = \alpha_0 \exp\bigl(-\|x - x'\|^2\bigr),
\]
where $\|x - x'\|^2 = \sum_{i=1}^{d} \alpha_i (x_i - x_i')^2$, and α0:d are parameters of the kernel. Figure 2 shows random functions with a 1-dimensional input drawn from a Gaussian process prior with a power exponential kernel with different values of α1. Varying this parameter creates different beliefs about how quickly f (x) changes with x.
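The power exponential kernel, and prior draws like those in Figure 2, can be sketched as follows; the function name, grid, and parameter values are illustrative assumptions rather than choices from the text.

```python
import numpy as np

def power_exponential_kernel(A, B, alpha0=1.0, alphas=None):
    """Sigma_0(x, x') = alpha0 * exp(-sum_i alphas[i] * (x_i - x'_i)^2),
    evaluated for every pair of rows of A and B."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    alphas = np.ones(A.shape[1]) if alphas is None else np.asarray(alphas)
    diff = A[:, None, :] - B[None, :, :]           # pairwise coordinate differences
    return alpha0 * np.exp(-np.sum(alphas * diff**2, axis=-1))

# Draws from the GP prior on a 1-d grid, as in Figure 2: larger alpha1 makes the
# correlation decay faster with distance, so sampled functions change more quickly.
x = np.linspace(0.0, 1.0, 200)[:, None]
K = power_exponential_kernel(x, x, alpha0=1.0, alphas=[10.0])
draws = np.random.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3)
```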
Another commonly used kernel is the Matérn kernel,
\[
\Sigma_0(x, x') = \alpha_0 \frac{2^{1-\nu}}{\Gamma(\nu)} \bigl(\sqrt{2\nu}\,\|x - x'\|\bigr)^{\nu} K_\nu\bigl(\sqrt{2\nu}\,\|x - x'\|\bigr),
\]
where Kν is the modified Bessel function, and we have a parameter ν in addition to the parameters α0:d. We discuss choosing these parameters below, under Choosing Hyperparameters.
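A corresponding sketch of the Matérn kernel, using scipy's gamma function and modified Bessel function Kν; the helper name and the α-weighted norm (mirroring the power exponential sketch above) are assumptions.

```python
import numpy as np
from scipy.special import gamma, kv   # Gamma function and modified Bessel function K_nu

def matern_kernel(A, B, alpha0=1.0, alphas=None, nu=2.5):
    """Sigma_0(x, x') = alpha0 * 2^(1-nu) / Gamma(nu) * (sqrt(2 nu) r)^nu * K_nu(sqrt(2 nu) r),
    where r is the alpha-weighted distance between x and x'."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    alphas = np.ones(A.shape[1]) if alphas is None else np.asarray(alphas)
    diff = A[:, None, :] - B[None, :, :]
    r = np.sqrt(np.sum(alphas * diff**2, axis=-1))
    z = np.sqrt(2.0 * nu) * r
    K = alpha0 * (2.0 ** (1.0 - nu) / gamma(nu)) * z**nu * kv(nu, z)
    K[r == 0.0] = alpha0   # the r -> 0 limit is alpha0; kv(nu, 0) is infinite numerically
    return K
```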
Perhaps the most common choice for the mean function is a constant value, µ0(x) = µ. When f is believed to have a trend or some application-specific parametric structure, we may also take the mean function to be
\[
\mu_0(x) = \mu + \sum_{i=1}^{p} \beta_i \Psi_i(x), \tag{4}
\]
where each Ψi is a parametric function, and often a low-order polynomial in x.
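A small sketch of (4) with a low-order polynomial basis for 1-dimensional inputs; the particular basis functions and coefficient values are illustrative assumptions.

```python
import numpy as np

def parametric_mean(X, mu=0.0, beta=(0.5, -0.2), Psi=None):
    """mu_0(x) = mu + sum_i beta_i * Psi_i(x). By default Psi is the quadratic
    basis (x, x^2) for 1-d inputs; any parametric functions could be substituted."""
    X = np.atleast_2d(X)
    if Psi is None:
        Psi = (lambda x: x[:, 0], lambda x: x[:, 0] ** 2)   # low-order polynomial basis
    return mu + sum(b * psi(X) for b, psi in zip(beta, Psi))
```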
Choosing Hyperparameters
The mean function and kernel contain parameters. We typically call these parameters of the prior hyperparameters. We indicate them via a vector η. For example, if we use a Matérn kernel and a constant mean function, η = (α0:d, ν, µ).
To choose the hyperparameters, three approaches are typically considered. The first is to find the maximum likelihood estimate (MLE). In this approach, when given observations f (x1:n), we calculate the likelihood of these observations under the prior, P (f (x1:n)|η), where we modify our notation to indicate its dependence on η. This likelihood is a multivariate normal density. Then, in maximum likelihood estimation, we set η to the value that maximizes this likelihood,
\[
\hat{\eta} = \operatorname*{arg\,max}_{\eta} \; P\bigl(f(x_{1:n}) \mid \eta\bigr).
\]
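For instance, here is a sketch of maximum likelihood estimation for the power exponential kernel with a constant mean, reusing the power_exponential_kernel sketch above and optimizing the kernel parameters on the log scale so they remain positive; the parameterization and optimizer settings are assumptions, not the tutorial's prescription.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def negative_log_likelihood(params, X, y, jitter=1e-6):
    """-log P(f(x_{1:n}) | eta) for eta = (alpha_0, alpha_{1:d}, mu), with
    alpha_{0:d} passed as logarithms so the optimizer works on an unconstrained space."""
    n, d = X.shape
    alpha0, alphas, mu = np.exp(params[0]), np.exp(params[1:1 + d]), params[1 + d]
    K = power_exponential_kernel(X, X, alpha0, alphas) + jitter * np.eye(n)
    chol = cho_factor(K, lower=True)
    resid = y - mu
    # log N(y; mu, K) = -0.5 * resid^T K^{-1} resid - 0.5 * log|K| - (n/2) * log(2 pi)
    quad = resid @ cho_solve(chol, resid)
    logdet = 2.0 * np.sum(np.log(np.diag(chol[0])))
    return 0.5 * (quad + logdet + n * np.log(2.0 * np.pi))

# eta_hat = argmax_eta P(f(x_{1:n}) | eta), e.g. for observations X (n x d) and y (n,):
# result = minimize(negative_log_likelihood, x0=np.zeros(X.shape[1] + 2), args=(X, y))
# alpha_hat, mu_hat = np.exp(result.x[:-1]), result.x[-1]
```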
The second approach amends this first approach by imagining that the hyperparameters η were themselves chosen from a prior, P (η). We then estimate η by the maximum a posteriori (MAP) estimate (Gelman et al., 2014), which is the value of η that maximizes the posterior,
\[
\hat{\eta} = \operatorname*{arg\,max}_{\eta} \; P\bigl(\eta \mid f(x_{1:n})\bigr) = \operatorname*{arg\,max}_{\eta} \; P\bigl(f(x_{1:n}) \mid \eta\bigr)\, P(\eta).
\]
In moving from the first expression to the second we have used Bayes' rule and then dropped a normalization constant $\int P(f(x_{1:n}) \mid \eta')\, P(\eta')\, d\eta'$ that does not depend on the quantity η being optimized.
The MLE is a special case of the MAP if we take the prior on the hyperparameters P (η) to be the
(possibly degenerate) probability distribution that has constant density over the domain of η. The MAP is useful if the MLE sometimes estimates unreasonable hyperparameter values, for example, corresponding to functions that vary too quickly or too slowly (see Figure 2). By choosing a prior that puts more weight on hyperparameter values that are reasonable for a particular problem, MAP estimates can better correspond to the application. Common choices for the prior include the uniform distribution (for preventing estimates from falling outside of some pre-specified range), the normal distribution (for suggesting that the estimates fall near some nominal value without setting a hard cutoff), and the log-normal and truncated normal distributions (for providing a similar suggestion for positive parameters).
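A sketch of the corresponding MAP objective, adding a log-prior term to the log-likelihood above; the log-normal priors on α0:d (implemented as normal priors on their logarithms) and the normal prior on µ are illustrative choices, not a recommendation from the text.

```python
from scipy.stats import norm

def negative_log_posterior(params, X, y):
    """-[log P(f(x_{1:n}) | eta) + log P(eta)], dropping the normalization
    constant as in the text. Assumed priors: log-normal on alpha_{0:d}
    (normal on their logs) and normal on mu."""
    d = X.shape[1]
    log_prior = (norm.logpdf(params[:1 + d], loc=0.0, scale=2.0).sum()   # priors on log alpha_{0:d}
                 + norm.logpdf(params[1 + d], loc=0.0, scale=10.0))      # prior on mu
    return negative_log_likelihood(params, X, y) - log_prior

# eta_hat_MAP: minimize negative_log_posterior with scipy.optimize.minimize, as above.
```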
The third approach is called the fully Bayesian approach. In this approach, we wish to compute the posterior distribution on f (x) marginalizing over all possible values of the hyperparameters,
\[
P\bigl(f(x) = y \mid f(x_{1:n})\bigr) = \int P\bigl(f(x) = y \mid f(x_{1:n}), \eta\bigr)\, P\bigl(\eta \mid f(x_{1:n})\bigr)\, d\eta. \tag{5}
\]
This integral is typically intractable, but we can approximate it through sampling:
\[
P\bigl(f(x) = y \mid f(x_{1:n})\bigr) \approx \frac{1}{J} \sum_{j=1}^{J} P\bigl(f(x) = y \mid f(x_{1:n}), \eta = \hat{\eta}_j\bigr), \tag{6}
\]
where (η̂j : j = 1, ..., J) are sampled from P (η|f (x1:n)) via an MCMC method, e.g., slice sampling (Neal, 2003). MAP estimation can be seen as an approximation to fully Bayesian inference: if we approximate the posterior P (η|f (x1:n)) by a point mass at the η that maximizes the posterior density, then (5) reduces to inference with the MAP estimate.
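A sketch of the approximation (6), drawing hyperparameter samples with a simple random-walk Metropolis sampler (in place of the slice sampler mentioned above) applied to the negative_log_posterior sketch; the proposal scale, burn-in, and thinning values are arbitrary illustrations.

```python
import numpy as np

def sample_hyperparameters(X, y, n_samples=50, n_burn=500, thin=20, step=0.1, seed=0):
    """Approximate draws eta_hat_j from P(eta | f(x_{1:n})) via random-walk
    Metropolis on the unnormalized log-posterior."""
    rng = np.random.default_rng(seed)
    dim = X.shape[1] + 2                         # (log alpha_0, log alpha_{1:d}, mu)
    current = np.zeros(dim)
    current_lp = -negative_log_posterior(current, X, y)
    samples = []
    for it in range(n_burn + n_samples * thin):
        proposal = current + step * rng.standard_normal(dim)
        proposal_lp = -negative_log_posterior(proposal, X, y)
        if np.log(rng.uniform()) < proposal_lp - current_lp:   # Metropolis accept/reject step
            current, current_lp = proposal, proposal_lp
        if it >= n_burn and (it - n_burn) % thin == 0:
            samples.append(current.copy())
    return samples

# Each eta_hat_j gives a normal posterior on f(x) via (3); averaging these J normal
# densities as in (6) yields a mixture-of-normals approximation to the fully Bayesian posterior.
```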