Gaussian Process
"Let the data speak, if you bring subjective modelling then it’s dangerous." - Someone in YouTube
To find a nonlinear relationship between variables, we can use nonlinear functions such as a polynomial or an exponential function. But if we don’t have information about the nature of the data, finding an appropriate function can be challenging. In that case we can use non-parametric methods like the Gaussian process (GP).
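As a minimal sketch of the parametric approach and its limitation (using NumPy; the data and the choice of a cubic are made up for illustration), we fit a fixed polynomial form to noisy samples:

```python
import numpy as np

# Made-up noisy data whose true relationship (a sine) is unknown to us.
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 50)
y = np.sin(x) + 0.1 * rng.standard_normal(50)

# The parametric approach: guess a functional form (here a cubic
# polynomial) and fit its coefficients. If the guessed form is wrong,
# the fit stays systematically off no matter how much data we collect.
coeffs = np.polyfit(x, y, deg=3)
y_hat = np.polyval(coeffs, x)
```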
“The idea of Gaussian process modelling is, without parameterizing y(x) (the prediction function), to place a prior P(y(x)) directly on the space of functions.” - David J. C. MacKay
A GP is a supervised machine learning method used for regression and probabilistic classification. It provides a distribution over prediction functions, and it also quantifies the uncertainty present in its predictions.
To call it nonparametric is somewhat of a misnomer. Nonparametric does not mean the absence of parameters; rather, predictions are obtained without giving the unknown function y(x) an explicit parameterization. The effective number of parameters typically grows without bound as the model sees more data.
We can view a GP as a generalisation of the multivariate Gaussian distribution to infinitely many dimensions. Infinite dimensions here means that the prediction function can map infinitely many input values to outputs. Just as a Gaussian distribution is characterised by a mean vector and a covariance matrix, a GP is characterised by a mean function and a covariance function, which is given by a kernel.
We usually keep the mean function at zero, since the mean does not carry the important information; it is the covariance function that captures what is needed to model the GP. Consequently, the covariance function determines which functions, out of the space of all possible functions, are more probable.
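To make this concrete, here is a minimal sketch (using NumPy; the grid, length scale, and number of draws are illustrative choices). Evaluating a zero-mean GP with an RBF kernel on a finite grid reduces it to a multivariate Gaussian, and each draw from that Gaussian is one plausible prediction function:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D points."""
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# On a finite grid, the zero-mean GP prior is just a multivariate
# Gaussian N(0, K), with K built from the kernel.
x = np.linspace(-5, 5, 100)
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))  # jitter for numerical stability

# Each sample is one plausible prediction function evaluated on the grid.
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```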
Kernel Function
The kernel is responsible for measuring the similarity between two inputs. We can use functions like the RBF, periodic, or linear kernels, or a combination of kernels. Which prediction functions are likely to be sampled is controlled by the kernel. The kernel function receives two points as input and returns a similarity score between them in the form of a scalar.
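For instance, a minimal RBF kernel (the length-scale and variance parameters here are illustrative defaults) maps a pair of points to a single scalar:

```python
import numpy as np

def rbf(x, y, length_scale=1.0, variance=1.0):
    """Return a single scalar similarity between points x and y."""
    return variance * np.exp(-0.5 * np.sum((x - y) ** 2) / length_scale**2)

rbf(np.array([0.0]), np.array([0.0]))  # 1.0: identical points, maximal similarity
rbf(np.array([0.0]), np.array([3.0]))  # ~0.011: distant points, near-zero similarity
```

Nearby points get a score close to one and distant points a score close to zero, which is exactly what makes sampled functions smooth under this kernel.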
An important property of the Gaussian distribution, and the one that makes GPs possible, is that it is closed under marginalisation and conditioning. This means that the distributions resulting from these operations are also Gaussian, which ensures that the results we obtain remain mathematically tractable.
Marginalisation
Marginalisation sums (or integrates) out a random variable: given the joint probability distribution of X with other variables, it yields the distribution of X alone.
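For a Gaussian this is especially simple: we keep only the entries of the mean and covariance that belong to the variable of interest. A small sketch with made-up numbers:

```python
import numpy as np

# Made-up joint Gaussian over (X, Y).
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])

# Marginalising out Y leaves X ~ N(1.0, 1.0): we simply keep the
# entries of the mean and covariance that belong to X.
mu_x = mu[0]
var_x = Sigma[0, 0]
```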
Conditioning
Conditioning determines the distribution of one variable given the observed value of another. This is what allows us to perform Bayesian inference: through conditioning, we update our prior beliefs to obtain a new distribution as we observe new data points.
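As a sketch of how conditioning gives GP predictions (zero prior mean, RBF kernel, noise-free made-up observations; a practical implementation would use a Cholesky factorization rather than a matrix inverse), we condition the joint Gaussian over training and test points on the observed training values:

```python
import numpy as np

def rbf_kernel(x1, x2, length_scale=1.0):
    sq_dist = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# Made-up noise-free observations and test locations.
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 50)

# Blocks of the joint covariance over (train, test) points.
K = rbf_kernel(x_train, x_train) + 1e-8 * np.eye(len(x_train))  # jitter
K_s = rbf_kernel(x_train, x_test)
K_ss = rbf_kernel(x_test, x_test)

# Conditioning the joint Gaussian on y_train (zero prior mean):
#   posterior mean       = K_s^T K^{-1} y
#   posterior covariance = K_ss - K_s^T K^{-1} K_s
K_inv = np.linalg.inv(K)
mean_post = K_s.T @ K_inv @ y_train
cov_post = K_ss - K_s.T @ K_inv @ K_s
```

The posterior mean is the GP’s prediction at the test points, and the diagonal of the posterior covariance gives the predictive uncertainty mentioned earlier.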