Hello there,

I've noticed some major training-time differences for multi-task GPs, which led to a question I haven't been able to answer.
TL;DR
Why don't we use the preconditioned linear conjugate gradient method when computing the log likelihood for multi-task GPs, instead of a costly eigendecomposition via `torch.linalg.eigh`?
Too long, but want to read anyway to get some context
(I'm new to GPyTorch and Gaussian Processes in general, so there might be some mistakes in the following sections and corrections are very much appreciated!)

Let's say our log likelihood is defined as

$$\log p(Y \mid X) = -\frac{1}{2} Y^{T}\left(K_f \otimes K_{XX} + \Sigma_{NM}\right)^{-1} Y - \frac{1}{2}\log\left|K_f \otimes K_{XX} + \Sigma_{NM}\right| - \frac{NM}{2}\log 2\pi,$$

with the $M \times M$ task covariance matrix $K_f$, the $N \times N$ data covariance matrix $K_{XX}$, and the noise covariance matrix $\Sigma_{NM}$.
Current implementation for multi-task GPs

The current implementation of `gpytorch.kernels.MultitaskKernel` (as used in Multitask GP Regression) returns a `linear_operator.operators.KroneckerProductLinearOperator`. When computing the log probability, this leads to a call of the `inv_quad_logdet` method of `linear_operator.operators.KroneckerProductAddedDiagLinearOperator`, assuming we are using a `gpytorch.likelihoods.MultitaskGaussianLikelihood` with a strictly diagonal task noise covariance matrix (`rank=0`).
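To make that structure concrete, here is a tiny plain-PyTorch sketch (toy sizes I made up for illustration; this is not GPyTorch code, and it ignores the exact factor ordering the operator uses internally) of the Kronecker-structured covariance the kernel represents:

```python
import torch

# Hypothetical sizes: M tasks, N training points.
M, N = 3, 5

# Stand-ins for the task covariance K_f (M x M) and the data covariance K_XX (N x N).
A = torch.randn(M, M)
K_f = A @ A.T + M * torch.eye(M)    # symmetric positive definite
B = torch.randn(N, N)
K_XX = B @ B.T + N * torch.eye(N)   # symmetric positive definite

sigma2 = 0.1  # homoskedastic noise for simplicity

# Full multi-task covariance: K_f ⊗ K_XX + Sigma_NM, of size (M*N) x (M*N).
K = torch.kron(K_f, K_XX) + sigma2 * torch.eye(M * N)
print(K.shape)  # torch.Size([15, 15])
```

The `KroneckerProductLinearOperator` never materializes this $(NM)\times(NM)$ matrix; it only stores the two factors.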
The `inv_quad_logdet()` method returns the inverse quadratic term $Y^T\left(K_f\otimes K_{XX}+\Sigma_{NM}\right)^{-1}Y$ as well as the log determinant $\log\left|K_f\otimes K_{XX}+\Sigma_{NM}\right|$. For the sake of brevity, let's only take a look at the inverse quadratic term. For this, we end up in the `_solve` method of `KroneckerProductAddedDiagLinearOperator`, which effectively computes the solve via eigendecomposition as

$$\left(K_f\otimes K_{XX}+\sigma^2 I\right)^{-1}Y=\left(Q_f\otimes Q_{XX}\right)\left(\Lambda_f\otimes\Lambda_{XX}+\sigma^2 I\right)^{-1}\left(Q_f\otimes Q_{XX}\right)^{T}Y,$$

with the eigendecompositions

$$K_f=Q_f\Lambda_f Q_f^{T}\qquad\text{and}\qquad K_{XX}=Q_{XX}\Lambda_{XX}Q_{XX}^{T}$$

computed by `torch.linalg.eigh` (writing $\Sigma_{NM}=\sigma^2 I$ here for simplicity).
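As a sanity check, here is a minimal plain-PyTorch sketch of that eigendecomposition-based solve (my own variable names, homoskedastic noise, and a task-major stacking of $Y$ are assumptions; this is not the linear_operator implementation itself):

```python
import torch

def kronecker_eigh_solve(K_f, K_XX, sigma2, Y):
    """Solve (K_f ⊗ K_XX + sigma2 * I) x = Y via eigendecomposition of the two factors."""
    evals_f, Q_f = torch.linalg.eigh(K_f)     # O(M^3)
    evals_x, Q_XX = torch.linalg.eigh(K_XX)   # O(N^3)  <- the expensive part for large N

    M, N = K_f.shape[0], K_XX.shape[0]
    y_mat = Y.reshape(M, N)  # assumes Y is stacked task-major: [y_task1; y_task2; ...]

    # (Q_f ⊗ Q_XX)^T Y corresponds to Q_f^T @ y_mat @ Q_XX (Kronecker mat-vec identity).
    z = Q_f.T @ y_mat @ Q_XX
    # Divide elementwise by the eigenvalues of the Kronecker product plus noise.
    z = z / (evals_f[:, None] * evals_x[None, :] + sigma2)
    # Multiply back with (Q_f ⊗ Q_XX).
    x_mat = Q_f @ z @ Q_XX.T
    return x_mat.reshape(-1)
```

For large $N$, the `torch.linalg.eigh(K_XX)` call dominates, which is where the cubic cost below comes from.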
This results in an overall time complexity of $O(M^3+N^3)$ (as far as I can tell), or in my use case (where $N \gg M$) a time complexity of $O(N^3)$.
Preconditioned linear conjugate gradient method
For a project, I'm currently implementing an approximated RBF kernel as described by Joukov and Kulić. This already works great for single-task GPs (e.g. when using Simple GP Regression and substituting the RBF kernel with the approximated kernel), since we benefit greatly from the Woodbury matrix identity, which is already used when computing the preconditioner; it leads to a much smaller matrix that needs to be inverted and therefore to less training time (I won't go into the details, since my question is about multi-task GPs in general, but see the sketch below). In a multi-task context, however, we don't use the preconditioned linear conjugate gradient method (linear cg), but instead the method described above, where the larger matrix $K_{XX}$ is evaluated beforehand, so the approximated kernel gives no benefit.
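For context, here is a minimal sketch of why the Woodbury identity helps when the kernel is (approximately) low rank (plain PyTorch with hypothetical sizes; this is not GPyTorch's preconditioner code):

```python
import torch

# Suppose the approximated kernel is low rank: K_XX ≈ Phi @ Phi.T with Phi of shape (N, k), k << N.
N, k = 5000, 50
Phi = torch.randn(N, k)
sigma2 = 0.1
y = torch.randn(N)

# Naive solve: factor/invert the full N x N matrix -> O(N^3).
# x_naive = torch.linalg.solve(Phi @ Phi.T + sigma2 * torch.eye(N), y)

# Woodbury: (Phi Phi^T + sigma2 I)^{-1} y = (y - Phi (sigma2 I_k + Phi^T Phi)^{-1} Phi^T y) / sigma2
# Only a k x k system has to be solved -> roughly O(N k^2 + k^3).
inner = sigma2 * torch.eye(k) + Phi.T @ Phi   # k x k
x = (y - Phi @ torch.linalg.solve(inner, Phi.T @ y)) / sigma2
```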
So I naively figured: "Let's try to force the usage of linear cg for multi-task GPs":
For this, I created the class `gpytorch.kernels.MultitaskKernelLinearCG`:
```python
from linear_operator import to_linear_operator

from . import MultitaskKernel
from ..lazy import KroneckerProductLinearOperatorLinearCG


class MultitaskKernelLinearCG(MultitaskKernel):
    def forward(self, x1, x2, diag=False, last_dim_is_batch=False, **params):
        if last_dim_is_batch:
            raise RuntimeError("MultitaskKernel does not accept the last_dim_is_batch argument.")
        covar_i = self.task_covar_module.covar_matrix
        if len(x1.shape[:-2]):
            covar_i = covar_i.repeat(*x1.shape[:-2], 1, 1)
        covar_x = to_linear_operator(self.data_covar_module.forward(x1, x2, **params))
        res = KroneckerProductLinearOperatorLinearCG(covar_x, covar_i)
        return res.diagonal(dim1=-1, dim2=-2) if diag else res
```
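For completeness, this is roughly how the kernel can be dropped into a multi-task model, mirroring the Multitask GP Regression tutorial (sketch only; the model code is just the tutorial setup with the kernel swapped out):

```python
import gpytorch

from gpytorch.kernels import MultitaskKernelLinearCG  # the custom class shown above


class MultitaskGPModelLinearCG(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_tasks=2):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(gpytorch.means.ConstantMean(), num_tasks=num_tasks)
        # Same construction as in the tutorial, but using the linear cg variant of the kernel.
        self.covar_module = MultitaskKernelLinearCG(gpytorch.kernels.RBFKernel(), num_tasks=num_tasks, rank=1)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)


# Likelihood with strictly diagonal task noise (rank=0), as assumed throughout this post.
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2, rank=0)
```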
This kernel returns a `gpytorch.lazy.KroneckerProductLinearOperatorLinearCG`, a custom operator implemented such that adding the diagonal likelihood noise yields a plain `linear_operator.operators.AddedDiagLinearOperator`. Computing the log probability therefore leads to a call of the `inv_quad_logdet` method of `linear_operator.operators.AddedDiagLinearOperator` (again strictly assuming a `MultitaskGaussianLikelihood` with `rank=0`).
Inside this method, we first define a preconditioner closure $P^{-1}(\cdot)$ from a rank-$k$ pivoted Cholesky decomposition,

$$P = L_k L_k^{T} + \Sigma_{NM}, \qquad L_k \in \mathbb{R}^{NM \times k},$$

computed in $O(k^2N)$, where $k$ is the rank of the pivoted Cholesky approximation of our covariance matrix $K_f\otimes K_{XX}$, and applied cheaply via the Woodbury identity. The preconditioner closure is then used for the linear cg, which iteratively solves

$$\left(K_f\otimes K_{XX}+\Sigma_{NM}\right)u = Y$$

in $O(m\sqrt{\kappa})$, where $m$ is the number of non-zero matrix elements and $\kappa = \frac{\lambda_{max}}{\lambda_{min}}$ is the condition number (of the preconditioned system). This results in a total time complexity of $O(k^2N + m\sqrt{\kappa})$.
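To make the iteration concrete, here is a minimal, self-contained sketch of preconditioned CG in plain PyTorch (not the linear_operator implementation; the operator and preconditioner are passed in as generic callables):

```python
import torch

def preconditioned_cg(matmul, precond, y, max_iter=50, tol=1e-6):
    """Solve A x = y given only a mat-vec product A @ v (matmul) and P^{-1} v (precond)."""
    x = torch.zeros_like(y)
    r = y - matmul(x)        # residual
    z = precond(r)           # preconditioned residual
    p = z.clone()            # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = matmul(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if r.norm() < tol:
            break
        z = precond(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x
```

The key point is that only matrix-vector products with $K_f\otimes K_{XX}+\Sigma_{NM}$ are needed, so the full matrix never has to be formed or decomposed.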
Some simple tests
With this I did some basic tests with synthetic data:
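In essence, both models are trained for 50 iterations on the same synthetic data and the wall-clock time is measured, roughly like this (a simplified sketch; data generation and model setup follow the tutorial and are omitted here):

```python
import time
import torch
import gpytorch

def train_and_time(model, likelihood, train_x, train_y, n_iter=50):
    """Train an exact multi-task GP for a fixed number of iterations and return the elapsed time."""
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    start = time.time()
    for _ in range(n_iter):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
    return time.time() - start
```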
On my machine (with an NVIDIA GeForce RTX 3080), the conventional multi-task GP (using eigendecomposition) finished training after almost $58$ seconds, while training the GP using linear cg only took around $2$ seconds (for 50 iterations), and both yield essentially the same covariance matrices: the root mean square error between them is only about $10^{-3}$. (We obviously don't expect them to be exactly the same, since linear cg is an approximate, iterative solver.)
In a second test I repeated the training for different values of `nb_training_points` while keeping track of the training time. Without including the code (I basically repeat the first test, starting at `nb_training_points = 2000` and increasing it by 2000 for each run), I can plot the training time over the number of training data points. I stopped training the "Conventional GP" after 10,000 points, since it already took more than 30 minutes to train. The plot further shows the difference (note the log scale).
Finally, coming to my actual question: Why don't we use the linear cg method, which seems to be way faster than the eigendecomposition approach? Are there cases where the eigendecomposition approach is faster? Is my simple example too simple? Do we usually expect the condition number $\kappa$ to be much worse? Am I missing some other reason? I can't seem to make sense of this 😓
And a bonus question: What happens at around 24,000 and 34,000 training points? How can we explain the large jumps in training time?

Some guidance would be greatly appreciated!