Hello there,

I've noticed some major training-time differences for multi-task GPs, which led to a question I haven't been able to answer.
TL;DR
Why don't we use the preconditioned linear conjugate gradient method when computing the log likelihood for multi-task GPs, instead of a costly eigendecomposition via `torch.linalg.eigh`?
Too long, but want to read anyway to get some context
(I'm new to GPyTorch and Gaussian Processes in general, so there might be some mistakes in the following sections and corrections are very much appreciated!)

Let's say our log likelihood is defined as

$$\log p(Y \mid X) = -\frac{1}{2} Y^{T}\left(K_f \otimes K_{XX} + \Sigma_{NM}\right)^{-1} Y - \frac{1}{2}\log\left|K_f \otimes K_{XX} + \Sigma_{NM}\right| - \frac{NM}{2}\log 2\pi,$$

with the $M \times M$ task covariance matrix $K_f$, the $N \times N$ data covariance matrix $K_{XX}$, and the noise covariance matrix $\Sigma_{NM}$.
Current implementation for multi-task GPs

The current implementation of `gpytorch.kernels.MultitaskKernel` (as used in Multitask GP Regression) returns a `linear_operator.operators.KroneckerProductLinearOperator`. When computing the log probability, this leads to a call of the `inv_quad_logdet` method of `linear_operator.operators.KroneckerProductAddedDiagLinearOperator`, assuming we are using a `gpytorch.likelihoods.MultitaskGaussianLikelihood` with a strictly diagonal task noise covariance matrix (`rank=0`).
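To make that structure concrete, here is a tiny plain-PyTorch sketch (toy sizes I made up for illustration; this is not GPyTorch code, and it ignores the exact factor ordering the operator uses internally) of the Kronecker-structured covariance the kernel represents:

```python
import torch

# Hypothetical sizes: M tasks, N training points.
M, N = 3, 5

# Stand-ins for the task covariance K_f (M x M) and the data covariance K_XX (N x N).
A = torch.randn(M, M)
K_f = A @ A.T + M * torch.eye(M)    # symmetric positive definite
B = torch.randn(N, N)
K_XX = B @ B.T + N * torch.eye(N)   # symmetric positive definite

sigma2 = 0.1  # homoskedastic noise for simplicity

# Full multi-task covariance: K_f ⊗ K_XX + Sigma_NM, of size (M*N) x (M*N).
K = torch.kron(K_f, K_XX) + sigma2 * torch.eye(M * N)
print(K.shape)  # torch.Size([15, 15])
```

The `KroneckerProductLinearOperator` never materializes this $(NM)\times(NM)$ matrix; it only stores the two factors.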
The `inv_quad_logdet()` method returns the inverse quadratic term $Y^T\left(K_f\otimes K_{XX}+\Sigma_{NM}\right)^{-1}Y$ as well as the log determinant $\log\left|K_f\otimes K_{XX}+\Sigma_{NM}\right|$. For the sake of brevity, let's only take a look at the inverse quadratic term. For this, we end up in the `_solve` method of `KroneckerProductAddedDiagLinearOperator`, which effectively computes the solve via eigendecomposition as

$$\left(K_f\otimes K_{XX}+\sigma^2 I\right)^{-1}Y=\left(Q_f\otimes Q_{XX}\right)\left(\Lambda_f\otimes\Lambda_{XX}+\sigma^2 I\right)^{-1}\left(Q_f\otimes Q_{XX}\right)^{T}Y,$$

with the eigendecompositions

$$K_f=Q_f\Lambda_f Q_f^{T}\qquad\text{and}\qquad K_{XX}=Q_{XX}\Lambda_{XX}Q_{XX}^{T}$$

computed by `torch.linalg.eigh` (writing $\Sigma_{NM}=\sigma^2 I$ here for simplicity).
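As a sanity check, here is a minimal plain-PyTorch sketch of that eigendecomposition-based solve (my own variable names, homoskedastic noise, and a task-major stacking of $Y$ are assumptions; this is not the linear_operator implementation itself):

```python
import torch

def kronecker_eigh_solve(K_f, K_XX, sigma2, Y):
    """Solve (K_f ⊗ K_XX + sigma2 * I) x = Y via eigendecomposition of the two factors."""
    evals_f, Q_f = torch.linalg.eigh(K_f)     # O(M^3)
    evals_x, Q_XX = torch.linalg.eigh(K_XX)   # O(N^3)  <- the expensive part for large N

    M, N = K_f.shape[0], K_XX.shape[0]
    y_mat = Y.reshape(M, N)  # assumes Y is stacked task-major: [y_task1; y_task2; ...]

    # (Q_f ⊗ Q_XX)^T Y corresponds to Q_f^T @ y_mat @ Q_XX (Kronecker mat-vec identity).
    z = Q_f.T @ y_mat @ Q_XX
    # Divide elementwise by the eigenvalues of the Kronecker product plus noise.
    z = z / (evals_f[:, None] * evals_x[None, :] + sigma2)
    # Multiply back with (Q_f ⊗ Q_XX).
    x_mat = Q_f @ z @ Q_XX.T
    return x_mat.reshape(-1)
```

For large $N$, the `torch.linalg.eigh(K_XX)` call dominates, which is where the cubic cost below comes from.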
This results in an overall time complexity of $O(M^3+N^3)$ (as far as I can tell), or in my use case (where $N \gg M$) a time complexity of $O(N^3)$.
Preconditioned linear conjugate gradient method
For a project, I'm currently implementing an approximated RBF kernel as described by Joukov and Kulić. This already works great for single-task GPs (e.g. when using Simple GP Regression and substituting the RBF kernel with the approximated kernel), since we benefit greatly from the Woodbury matrix identity, which is already used when computing the preconditioner; it leads to a much smaller matrix that needs to be inverted and therefore to less training time (I won't go into the details, since my question is about multi-task GPs in general, but see the sketch below). In a multi-task context, however, we don't use the preconditioned linear conjugate gradient method (linear cg), but instead the method described above, where the larger matrix $K_{XX}$ is evaluated beforehand, so the approximated kernel gives no benefit.
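For context, here is a minimal sketch of why the Woodbury identity helps when the kernel is (approximately) low rank (plain PyTorch with hypothetical sizes; this is not GPyTorch's preconditioner code):

```python
import torch

# Suppose the approximated kernel is low rank: K_XX ≈ Phi @ Phi.T with Phi of shape (N, k), k << N.
N, k = 5000, 50
Phi = torch.randn(N, k)
sigma2 = 0.1
y = torch.randn(N)

# Naive solve: factor/invert the full N x N matrix -> O(N^3).
# x_naive = torch.linalg.solve(Phi @ Phi.T + sigma2 * torch.eye(N), y)

# Woodbury: (Phi Phi^T + sigma2 I)^{-1} y = (y - Phi (sigma2 I_k + Phi^T Phi)^{-1} Phi^T y) / sigma2
# Only a k x k system has to be solved -> roughly O(N k^2 + k^3).
inner = sigma2 * torch.eye(k) + Phi.T @ Phi   # k x k
x = (y - Phi @ torch.linalg.solve(inner, Phi.T @ y)) / sigma2
```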
So I naively figured: "Let's try to force the usage of linear cg for multi-task GPs":
For this, I created the class `gpytorch.kernels.MultitaskKernelLinearCG`:
```python
from linear_operator import to_linear_operator

from . import MultitaskKernel
from ..lazy import KroneckerProductLinearOperatorLinearCG


class MultitaskKernelLinearCG(MultitaskKernel):
    def forward(self, x1, x2, diag=False, last_dim_is_batch=False, **params):
        if last_dim_is_batch:
            raise RuntimeError("MultitaskKernel does not accept the last_dim_is_batch argument.")
        covar_i = self.task_covar_module.covar_matrix
        if len(x1.shape[:-2]):
            covar_i = covar_i.repeat(*x1.shape[:-2], 1, 1)
        covar_x = to_linear_operator(self.data_covar_module.forward(x1, x2, **params))
        res = KroneckerProductLinearOperatorLinearCG(covar_x, covar_i)
        return res.diagonal(dim1=-1, dim2=-2) if diag else res
```
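For completeness, this is roughly how the kernel can be dropped into a multi-task model, mirroring the Multitask GP Regression tutorial (sketch only; the model code is just the tutorial setup with the kernel swapped out):

```python
import gpytorch

from gpytorch.kernels import MultitaskKernelLinearCG  # the custom class shown above


class MultitaskGPModelLinearCG(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood, num_tasks=2):
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(gpytorch.means.ConstantMean(), num_tasks=num_tasks)
        # Same construction as in the tutorial, but using the linear cg variant of the kernel.
        self.covar_module = MultitaskKernelLinearCG(gpytorch.kernels.RBFKernel(), num_tasks=num_tasks, rank=1)

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)


# Likelihood with strictly diagonal task noise (rank=0), as assumed throughout this post.
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2, rank=0)
```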
This kernel returns a `gpytorch.lazy.KroneckerProductLinearOperatorLinearCG`, a custom operator implemented such that adding the diagonal likelihood noise yields a plain `linear_operator.operators.AddedDiagLinearOperator`. Computing the log probability therefore leads to a call of the `inv_quad_logdet` method of `linear_operator.operators.AddedDiagLinearOperator` (again strictly assuming a `MultitaskGaussianLikelihood` with `rank=0`).
Inside this method, we first define a preconditioner closure $P^{-1}(\cdot)$ from a rank-$k$ pivoted Cholesky decomposition,

$$P = L_k L_k^{T} + \Sigma_{NM}, \qquad L_k \in \mathbb{R}^{NM \times k},$$

computed in $O(k^2N)$, where $k$ is the rank of the pivoted Cholesky approximation of our covariance matrix $K_f\otimes K_{XX}$, and applied cheaply via the Woodbury identity. The preconditioner closure is then used for the linear cg, which iteratively solves

$$\left(K_f\otimes K_{XX}+\Sigma_{NM}\right)u = Y$$

in $O(m\sqrt{\kappa})$, where $m$ is the number of non-zero matrix elements and $\kappa = \frac{\lambda_{max}}{\lambda_{min}}$ is the condition number (of the preconditioned system). This results in a total time complexity of $O(k^2N + m\sqrt{\kappa})$.
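To make the iteration concrete, here is a minimal, self-contained sketch of preconditioned CG in plain PyTorch (not the linear_operator implementation; the operator and preconditioner are passed in as generic callables):

```python
import torch

def preconditioned_cg(matmul, precond, y, max_iter=50, tol=1e-6):
    """Solve A x = y given only a mat-vec product A @ v (matmul) and P^{-1} v (precond)."""
    x = torch.zeros_like(y)
    r = y - matmul(x)        # residual
    z = precond(r)           # preconditioned residual
    p = z.clone()            # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = matmul(p)
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        if r.norm() < tol:
            break
        z = precond(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x
```

The key point is that only matrix-vector products with $K_f\otimes K_{XX}+\Sigma_{NM}$ are needed, so the full matrix never has to be formed or decomposed.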
Some simple tests
With this I did some basic tests with synthetic data:
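In essence, both models are trained for 50 iterations on the same synthetic data and the wall-clock time is measured, roughly like this (a simplified sketch; data generation and model setup follow the tutorial and are omitted here):

```python
import time
import torch
import gpytorch

def train_and_time(model, likelihood, train_x, train_y, n_iter=50):
    """Train an exact multi-task GP for a fixed number of iterations and return the elapsed time."""
    model.train()
    likelihood.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
    start = time.time()
    for _ in range(n_iter):
        optimizer.zero_grad()
        output = model(train_x)
        loss = -mll(output, train_y)
        loss.backward()
        optimizer.step()
    return time.time() - start
```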
On my machine (with an NVIDIA GeForce RTX 3080), the conventional multi-task GP (using eigendecomposition) finished training after almost $58$ seconds, while training the GP using linear cg only took around $2$ seconds (for 50 iterations), and both yield essentially the same covariance matrices: the root mean square error between them is only about $10^{-3}$. (We obviously don't expect them to be exactly the same, since linear cg is an approximate, iterative solver.)
In a second test I repeated the training for different values of `nb_training_points` while keeping track of the training time. Without including the code (I basically repeat the first test, starting at `nb_training_points = 2000` and increasing it by 2000 for each run), I can plot the training time over the number of training data points. I stopped training the "Conventional GP" after 10,000 points, since it already took more than 30 minutes to train. The plot further shows the difference (note the log scale).
Finally, coming to my actual question: Why don't we use the linear cg method, which seems to be way faster than the eigendecomposition approach? Are there cases where the eigendecomposition approach is faster? Is my simple example too simple? Do we usually expect the condition number $\kappa$ to be much worse? Am I missing some other reason? I can't seem to make sense of this 😓
And a bonus question: What happens at around 24,000 and 34,000 training points? How can we explain the large jumps in training time?

Some guidance would be greatly appreciated!