How to forget in an online linear regression
Recently, I tweeted the following:
Online linear regression is a cool special case of Kalman Filtering. Does anyone know how to delete an observation? I’m interested in fitting a rolling window and it would be more efficient to do one update and one deletion rather than refitting each time…
I got some useful suggestions, and I think many of them would have led to a solution. However, the one that was easiest for me to try, and that worked, was essentially suggested by ponzu, who wrote:
use Woodbury equality and add is u while del is -u
Now, I actually didn’t completely work out what he meant, but the term “Woodbury” set me on a path that worked out, so I thought I’d share that in case it’s useful for others.
How to update
An online linear regression is basically just a sequence of multivariate normal updates. To be clear, let’s consider a single data point, made up of the column vector $\mathbf{x}$ of covariates and the observation $y$. Let’s further say that our current estimate of our parameters $\theta$ is given by the distribution
\[\theta \sim \mathcal{N}(\mu, \Sigma).\]Then we can write that
\[y = \mathbf{x}^T \theta + \epsilon,\]where
\[\epsilon \sim \mathcal{N}(0, \sigma^2),\]the observation noise.
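To make the setup concrete, here is a tiny sketch of how a single data point would be generated under this model. The specific numbers (d = 3, sigma_obs = 0.5, the identity prior) are made up purely for illustration:
import numpy as np

np.random.seed(0)
d = 3

# prior over the parameters: theta ~ N(mu, Sigma)
mu = np.zeros((d, 1))
sigma = np.eye(d)
theta = mu + np.linalg.cholesky(sigma) @ np.random.randn(d, 1)

# one observation: y = x^T theta + eps, with eps ~ N(0, sigma_obs^2)
sigma_obs = 0.5
x = np.random.randn(d, 1)
y = (x.T @ theta).item() + sigma_obs * np.random.randn()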
To me, the easiest way to derive the updating rule is to consider the joint distribution of $\theta$ and $y$. Because they are jointly normal, we can derive it using just means and covariances. I’ll spare the details, but here’s the result:
\[\left[\begin{array}{c} \theta \\ y \end{array}\right] \sim \mathcal{N}\left( \left[\begin{array}{c} \mu \\ \mathbf{x}^T \mu \end{array}\right] , \left[\begin{array}{ c | c } \Sigma & \Sigma \mathbf{x} \\ \hline \mathbf{x}^T \Sigma & \mathbf{x}^T \Sigma \mathbf{x} + \sigma^2 \end{array}\right] \right)\]To find the distribution of $\theta \mid y$, we can use the standard conditioning identity for jointly Gaussian variables.
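For reference, that identity says that if
\[\left[\begin{array}{c} a \\ b \end{array}\right] \sim \mathcal{N}\left( \left[\begin{array}{c} \mu_a \\ \mu_b \end{array}\right] , \left[\begin{array}{ c | c } \Sigma_{aa} & \Sigma_{ab} \\ \hline \Sigma_{ba} & \Sigma_{bb} \end{array}\right] \right),\]then
\[a \mid b \sim \mathcal{N}\left(\mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(b - \mu_b),\ \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right).\]Applying it with $a = \theta$ and $b = y$ gives: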
\[\bar{\mu} = \mu + \Sigma \mathbf{x}(\mathbf{x}^T\Sigma\mathbf{x} + \sigma^2)^{-1}(y - \mathbf{x}^T \mu)\]for the posterior mean, and
\[\bar{\Sigma} = \Sigma - \Sigma \mathbf{x}(\mathbf{x}^T\Sigma\mathbf{x} + \sigma^2)^{-1}\mathbf{x}^T\Sigma\]for the posterior covariance. This allows you to do online Bayesian linear regression: just update one data point at a time with these equations, updating your estimates of $\mu$ and $\Sigma$ every time. But what if we want to forget an update?
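Before getting to that, here is what the update loop looks like in code. This is just a minimal sketch of my own (the update function, the broad prior, theta_true, and the noise variance sigma_obs_sq are all made up for illustration):
import numpy as np

def update(mu, sigma, x, y, sigma_obs_sq):
    # One conjugate update using the equations above.
    # x is a column vector of shape (d, 1); y is a scalar.
    s = (x.T @ sigma @ x).item() + sigma_obs_sq   # predictive variance of y
    k = sigma @ x / s                             # gain vector
    mu_new = mu + k * (y - (x.T @ mu).item())
    sigma_new = sigma - k @ (x.T @ sigma)
    return mu_new, sigma_new

np.random.seed(0)
d = 3
theta_true = np.random.randn(d, 1)
mu, sigma = np.zeros((d, 1)), 10.0 * np.eye(d)    # broad prior
sigma_obs_sq = 0.25

for _ in range(200):
    x = np.random.randn(d, 1)
    y = (x.T @ theta_true).item() + np.random.randn() * sigma_obs_sq ** 0.5
    mu, sigma = update(mu, sigma, x, y, sigma_obs_sq)

# mu should now be close to theta_true, and the entries of sigma should have shrunk.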
How to forget
How do we recover $\Sigma$ and $\mu$ if we know the posterior – i.e. $\bar{\Sigma}$ and $\bar{\mu}$ – as well as $\mathbf{x}$ and $y$?
I first looked at the update for $\bar{\Sigma}$ and initially didn’t see how to solve for $\Sigma$. The thing I found tricky is that $\Sigma$ appears multiple times in the second term on the right-hand side, including inside the inverse. Perhaps there’s some easy way to solve it, but I couldn’t spot it.
However, the suggestion to look at the Woodbury formula gave me an idea. That identity is:
\[(A + UCV)^{-1} = A^{-1} - A^{-1}U(C^{-1} + V A^{-1}U)^{-1}VA^{-1}\]If we stare a little bit, we can match the right hand side up with the expression for $\bar{\Sigma}$. We set:
- $A = \Sigma^{-1}$
- $U = \mathbf{x}$
- $C = \sigma^{-2}$
- $V = \mathbf{x}^T$
This then gives the more compact expression:
\[\bar{\Sigma} = (\Sigma^{-1} + \mathbf{x}\sigma^{-2}\mathbf{x}^T)^{-1}\]
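If you want to convince yourself of the matching numerically, here is a quick self-contained check (my own; the dimensions and numbers are arbitrary):
import numpy as np

np.random.seed(0)
d = 4
a = np.random.randn(d, d)
sigma = a @ a.T + np.eye(d)            # an arbitrary positive-definite prior covariance
x = np.random.randn(d, 1)
sigma_obs_sq = 0.3                     # sigma^2

direct = sigma - sigma @ x @ x.T @ sigma / ((x.T @ sigma @ x).item() + sigma_obs_sq)
compact = np.linalg.inv(np.linalg.inv(sigma) + x @ x.T / sigma_obs_sq)
print(np.allclose(direct, compact))    # should print True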
Now a little algebra (invert both sides, move the rank-one term across, and invert again) gives the following neat result, which I think ponzu was alluding to:
\[\Sigma = (\bar{\Sigma}^{-1} - \mathbf{x}\sigma^{-2}\mathbf{x}^T)^{-1}\]To get to $\mu$, we can define
\[\mathbf{h} = \Sigma \mathbf{x}(\mathbf{x}^T \Sigma \mathbf{x} + \sigma^2)^{-1}\]to make the maths a bit easier – and we can compute this now because we can compute $\Sigma$, remember. After a little bit of algebra, we find
\[\mu = (I - \mathbf{h}\mathbf{x}^T)^{-1}(\bar{\mu} - \mathbf{h}y).\]I suspect there’s a neater expression, but this is good enough for me, and I hope it’s useful to you too. If you can simplify to something nicer, let me know!
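For completeness, the algebra here is just substituting $\mathbf{h}$ into the posterior-mean update and rearranging:
\[\bar{\mu} = \mu + \mathbf{h}(y - \mathbf{x}^T \mu) = (I - \mathbf{h}\mathbf{x}^T)\mu + \mathbf{h}y,\]which you can solve for $\mu$ to recover the expression above.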
Code example
Here’s some code to try this. First, generate some toy data:
import numpy as np
np.random.seed(2)
N = 5
x = np.random.randn(N, 1)                    # covariate (column) vector
mu = np.random.randn(N, 1)                   # prior mean
y = np.random.randn()                        # observation
sigma = np.random.randn(N, N)
sigma = (sigma @ sigma.T + np.eye(N)) * 0.1  # positive-definite prior covariance
sigma_obs = np.random.randn()**2             # observation noise standard deviation (squaring keeps it positive)
Next, make the update:
sigma_y_sq = x.T @ sigma @ x + sigma_obs**2                         # predictive variance of y
mu_bar = mu + sigma @ x * sigma_y_sq**(-1) * (y - x.T @ mu)         # posterior mean
sigma_bar = sigma - (sigma @ x) * sigma_y_sq**(-1) * (x.T @ sigma)  # posterior covariance
Now, recover the original parameters:
# I normally would rearrange this to avoid the explicit inverses
# but won't here
sigma_rec = np.linalg.inv(np.linalg.inv(sigma_bar) - x @ x.T * sigma_obs**(-2))  # Sigma = (Sigma_bar^-1 - x sigma^-2 x^T)^-1
h = sigma_rec @ x * (1 / (x.T @ sigma_rec @ x + sigma_obs**2))                   # h as defined above
mu_rec = np.linalg.solve(np.eye(N) - h @ x.T, mu_bar - h * y)                    # mu = (I - h x^T)^-1 (mu_bar - h y)
mu_rec and sigma_rec should now match mu and sigma.
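A quick way to confirm that numerically:
print(np.allclose(mu_rec, mu))        # should print True
print(np.allclose(sigma_rec, sigma))  # should print True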
Conclusion
I hope you’ve found this little post interesting. As I mentioned before, let me know if you know of a way to simplify the expressions, especially for $\mu$!