Hide code cell content

try:
    import smolgp
except ImportError:
    %pip install -q smolgp

import jax
jax.config.update("jax_enable_x64", True)

An Introduction to State Space Gaussian Processes#

This guide is to serve as a theoretical introduction to the mathematical framework behind state space Gaussian Processes. We will assume the reader is already familiar with Gaussian Processes (GPs) from their usual definitions as either the infinite-dimensional version of linear regression with Gaussian weights (the “weight-space view”) or (equivalently) as the distribution over functions such that any finite set of points from this family of functions is jointly Gaussian (the “function-space view”). For those unfamiliar, we recommend Chapter 2 of Rasmussen & Williams 2006 as well as the excellent review by Aigrain and Foreman-Mackey 2023, which additionally includes a review of GP applications in astronomy. For a more practical/visual introduction to how GPs behave in response to observed data, we highly recommend this blog post and playing around with this interactive visualization.

The following is essentially an interactive version of Section 2 in Rubenzahl and Hattori et al. 2026.

1. Motivation#

The state space framework, in one sentence, reinterprets certain GPs as the family of solutions to a linear system of stochastic[1] differential equations.

Why would we want to restate the problem in such a way?

By doing so, we can instead think of the process as evolving from state to state like a Markov process whose predictive state (mean and variance) only depends on the previous state; it is thus solved via an iterative algorithm in \(\mathcal{O}(N)\), whereas traditional GP regression is \(\mathcal{O}(N^3)\). Namely, the Kalman filter (Kalman 1960) and Rauch-Tung-Striebel (RTS; Rauch, Tung, & Striebel 1965) smoothing algorithm yield precisely the conditioned GP mean and variance. Beyond the performance boost, reinterpreting our model in such a way can reveal new insights about how the process itself evolves and behaves.

What do we mean by “certain” GPs?

Any valid GP kernel whose power spectral density (PSD) is a rational function permits a state space representation (see Chapter 12.3 of Särkkä & Solin 2019). Common examples include:

  1. the Matérn half-integer family (1/2[2], 3/2, 5/2, etc.)

  2. the :py:func:Cosine <smolgp.kernels.base.Cosine> kernel

  3. the stochastic harmonic oscillator

In other words, complex exponential kernels. These (and others) are implemented in :mod:smolgp.kernels. This makes state space GPs closely related to celerite kernels (Foreman-Mackey et al. 2017).

Also related are the quasiseparable family of kernel functions, which power the scalable \(\mathcal{O}(N)\) solvers in tinygp and represent a subset of state space GPs. In fact, tinygp uses the state space definition under the hood to construct the quasiseparable matrices, and then utilizes quasiseparable linear algebra for fast solving of the conditioned/predictive means and covariances (see tinygp.solvers.quasisep). However, because the solution happens in matrix space, there remain limitations to model flexibility that are relaxed when solved in the state space (e.g. the general case of exposure integration cannot be framed in quasiseparable matrix form).

General GP kernels as state space models

For GP kernel functions which do not have rational PSDs, such as the periodic (aka “exponential sine squared”) or RBF (aka “squared-exponential”) kernels, an approximate state space model can be constructed (to arbitrary precision) by a series expansion of the PSD into rational basis functions. This is implemented for the periodic kernel in :py:func:smolgp.kernels.ExpSineSquared using the method of Solin & Särkkä 2014. A similar method can be done for the RBF kernel (Hartikainen & Särkkä 2010) but is not yet implemented in smolgp.

2. The GP problem statement#

A temporal GP defines a probability distribution over functions of time \(t\). For a GP with zero mean[3] and covariance defined by the kernel function \(k(t, t')\), these functions \(f(t)\) are

(1)#\[\begin{align} f(t) \sim \text{GP}(0,\, k(t,t')). \end{align}\]

We often work with stationary GP kernels, meaning they are functions only of the time difference \(\Delta \equiv |t - t'|\) between two observations. That is, we have \(k(\Delta)\) instead of \(k(t,t')\). We model our measurements \(\boldsymbol{y}\) as being noisy samples of the process

(2)#\[\begin{align} y_k = f(t_k) + \epsilon_k, \end{align}\]

where \(\epsilon_k \sim N(0,\sigma_k^2)\) is the measurement noise. Conditioning the GP on the observed data then yields the predictive mean and variance of the resulting process:

(3)#\[\begin{split}\boldsymbol{\mu}_{GP} &= \boldsymbol{K}_\ast^T (\boldsymbol{K} + \boldsymbol{N})^{-1} \boldsymbol{y} \nonumber \\ \boldsymbol{\Sigma}_{GP} &= \boldsymbol{K}_{\ast\ast} - \boldsymbol{K}_\ast^T (\boldsymbol{K} + \boldsymbol{N})^{-1} \boldsymbol{K}_\ast \end{split}\]

where \(\boldsymbol{K}\) is the covariance matrix computed from the kernel function for all pairs of observations. \(K_\ast\) denotes the kernel function evaluated between the data and an arbitrary set of “test” points, i.e. \(K_{ij} = k(t_{i}, t_{\ast j})\). Likewise, \(K_{\ast\ast}\) is the same for all pairs of test points. \(\boldsymbol{N}\) is the (usually diagonal) matrix of measurement uncertainties.

Eq. (3) suffers from the curse of dimensionality, scaling computationally as \(O(N^3)\). Even holding these matrices in memory scales as \(O(N^2)\) which can be hundreds of GB for datasets with several tens of thousands of points. Quasiseparable methods can convert these computations down to \(O(N\log N)\), or even \(O(N)\) for certain kernels, but come with restrictions on model flexibility.

3. Reframing GPs as a stochastic differential equation#

Note

See the hidden code cells at the end of each section for a worked example.

In the state space formalism, we can think of these functions \(f(t)\) as instead solving a \(d\)-th order linear stochastic differential equation (SDE) driven by a white noise process \(w(t)\):

(4)#\[\begin{split}\frac{d\boldsymbol{x}(t)}{dt} &= \boldsymbol{F} \boldsymbol{x}(t) + \boldsymbol{L} \boldsymbol{w}(t) \nonumber \\ f(t_k) &= \boldsymbol{H_k} \boldsymbol{x}(t_k) \end{split}\]

where we now have a length-\(d\) vector \(\boldsymbol{x}(t) = (x, \dot{x}, \ddot{x}, ...)^T\) of the latent process \(x(t)\) and its first \(d-1\) time derivatives. The matrix \(\boldsymbol{F}\), called the feedback matrix, holds the coefficients of the SDE. The vector \(\boldsymbol{L}\), called the noise effect matrix, injects the white noise into one (or more) of the components of \(\boldsymbol{x}\). The white noise itself is defined by its spectral density \(\boldsymbol{Q_c}\):

(5)#\[\begin{align} \text{E}[w(t)w(t')^T] = Q_c\delta_\text{dirac}(t-t') \end{align}\]

Lastly, the observation matrix \(\boldsymbol{H}_k\) projects the latent state vector at given time \(\boldsymbol{x}(t_k)\) to the observed space \(f(t_k)\). When dealing with instantaneous GP kernels and if only the state (i.e. none of its derivatives) is measured, then \(\boldsymbol{H}_k\) is simply a constant row vector \([1, 0,...,0]\) that picks out the latent state component of \(\boldsymbol{x}\). In general, though, the observation matrix may vary from data point to data point, may include an amplitude term (which could be different depending on which instrument made the measurement), etc. We can also apply linear operators to \(\boldsymbol{H}\) and preserve the linear Gaussian behavior of the state space system; This will become relevant later when we consider exposure-integrated measurements.

Like before, our measurements are noisy samples of the observed latent process

(6)#\[\begin{align} y_k = f(t_k) + \epsilon_k = \boldsymbol{H_k} \boldsymbol{x}(t_k) + \epsilon_k, \end{align}\]

where the measurement noise \(\epsilon_k \sim N(0, \boldsymbol{R})\) has mean zero and covariance matrix \(\boldsymbol{R}\) (e.g. diagonal matrix of measurement uncertainties).

The matrices \(\boldsymbol{F}\), \(\boldsymbol{L}\), and \(\boldsymbol{Q_c}\) can all be derived for a given GP kernel function \(k(\Delta)\) from its power spectral density \(S(\omega)\) (which is the Fourier dual of the kernel function). See Section 4 of Hartikainen and Särkkä (2010) for examples of doing so for the Matérn family. The corresponding SDE for the system is derived from \(S(\omega)\), and then \(\boldsymbol{F}\) and \(\boldsymbol{L}\) are determined by the putting the SDE in “companion form” (see details here) and reading off the various coeffients. To get \(\boldsymbol{Q_c}\), we need one extra ingredient, the stationary covariance \(P_\infty\). \(P_\infty\) can be thought of as the covariance after the system has settled (\(t \rightarrow \infty\)), or equivalently as the prior GP covariance before seeing any data. It has the following definition in the state space formalism as the solution to the continuous-time Lyapunov equation

(7)#\[\begin{align} \frac{d\boldsymbol{P}}{dt} = \boldsymbol{F} \boldsymbol{P}_\infty + \boldsymbol{P}_\infty \boldsymbol{F}^T + \boldsymbol{L}\boldsymbol{Q}_c\boldsymbol{L}^T = 0. \end{align}\]

Using \(\boldsymbol{F}\) and \(\boldsymbol{L}\), one can solve this equation for the elements of \(P_\infty\) given some unknown \(Q_c\) (a 1x1 scalar). The result should be a diagonal matrix. The first element corresponds to the stationary covariance of the latent process, i.e. the kernel function evaluated at zero time-lag, \(k(0)\). This equivalence allows one to define \(Q_c\) in terms of the kernel parameters, and from there the values of \(P_\infty\) are readily computed.

The key feature of Eq. (4) is that it can be solved for discrete points (e.g. observations) by recursive iteration over the data points via filtering/smoothing algorithms in O(N) time. To start the iteration, we initialize the state to the kernel mean function (usually zero) and covariance to the stationary covariance: \(x_0 \sim N(0,P_\infty)\).

Summary of continuous-time matrices#

Matrix

  Size  

Name

Description

Analogy to GP

\(\boldsymbol{F}\)

\( d\times d\)

Feedback matrix

Governs the instantaneous deterministic dynamics of the latent state vector (in the absence of driving noise). The eigenvalues describe how perturbations to that state evolve (e.g. decay).

Encodes the characteristic timescales of the process and the global correlation structure (decay, oscillations, periodicitiy, etc.).

\(\boldsymbol{L}\)

\( d\times D\)

Noise effect matrix

Maps the driving white noise to the latent state vector and its derivatives. Is simply \((0,\dots,1)\) for the cases we consider here (i.e. only the highest derivative term is driven by noise).

How smooth the GP kernel is. If \(L\) has multiple nonzero entries, it is like a sum of GPs with varying smoothness.

\(\boldsymbol{Q}_c\)

\( D\times D\)

Spectral density

Defines the amplitude of the white noise driving the process.

Represents the power of the GP kernel’s driving noise (related to the kernel’s PSD amplitude).

\(\boldsymbol{P}_\infty\)

\( d\times d\)

Stationary covariance

Prior covariance of the latent state (i.e. before observing any data).

Related to the GP kernel’s (and its derivative’s) amplitude.

Note: Recall \(d\) is the state (\(\boldsymbol{x}\)) dimension and \(D\) is the data (\(\boldsymbol{y}_n\)) dimension; here we consider 1-D timeseries, so \(D=1\), but multivariate data are also compatible (see Multivariate Data).

Summary of measurement matrices#

Matrix

    Size    

Name

Description

Analogy to GP

\(\boldsymbol{H}_k\)

\( D\times d\)

Observation model

Projects the latent \(k^{th}\) state to the observed space.

Linear transformation from the latent process to the observed \(f(t)\).

\(\boldsymbol{R}_k\)

\( D\times D\)

Observation noise

The measurement variances and covariances (if any).

Identical to the noise matrix (usually called \(\boldsymbol{N}\)).

The Solution to the SDE#

Once the model matrices \(\boldsymbol{F}\), \(\boldsymbol{L}\), \(\boldsymbol{Q_c}\), \(\boldsymbol{P_\infty}\), and \(\boldsymbol{H}\) are known, Eq. (4) can be solved recursively as

(8)#\[\begin{align} \boldsymbol{x}_{k+1} = \boldsymbol{A_k} \boldsymbol{x_k} + \boldsymbol{q_k}, \quad \boldsymbol{q_k} \sim N(\boldsymbol{0},\boldsymbol{Q_k}) \end{align}\]

where \(\boldsymbol{x}_k \equiv \boldsymbol{x}(t_k)\). The matrix \(\boldsymbol{A_k}\), called the transition matrix, is defined by the matrix exponential of the feedback matrix times the time difference between steps \(\Delta_k = t_{k+1}-t_k\):

(9)#\[\begin{align} \boldsymbol{A_k} = \boldsymbol{\Phi}(\Delta_k) \equiv \exp(\boldsymbol{F} \Delta_k). \end{align}\]

The matrix \(Q_k\) is called the process noise covariance and is given by

(10)#\[\begin{align} Q_k &= \int_0^{\Delta_k} \Phi(\Delta_k - \tau) L Q_c L^T \Phi (\Delta_k - \tau)^T d\tau \nonumber \\ &= \boldsymbol{P_\infty} - \boldsymbol{A_k} \boldsymbol{P_\infty} \boldsymbol{A}_k^T. \end{align}\]

The last equality was found in Solin and Särkkä 2014b and is a much more efficient means of computing \(\boldsymbol{Q}_k\) than via the full Lyapunov integral, as \(P_\infty\) is more readily calculated; this is the default behavior of :py:func:smolgp.kernels.StateSpaceModel.process_noise if no analytic \(\boldsymbol{Q}_k\) is given. Another approach involves computing the matrix exponential of a block upper-triangular matrix involving \(\boldsymbol{F}\) and \(\boldsymbol{L}Q_c\boldsymbol{L}^T\), and then multiplying two submatrices of the result gives \(Q\) (Van Loan 1978); see :py:func:smolgp.helpers.Q_from_VanLoan.

Because everything is Gaussian, we need only track the means and variances of these \(x_k\) to generate predictions or compute likelihoods. A vast literature anchored in control theory exists for doing just that, as countless real world problems can be modeled as state space systems. The Kalman filter and RTS smoothing algorithms, discussed more below, yield the optimal predictions for such linear Gaussian systems; these optimal predictions are identical to the GP conditioned mean and variance. See Särkkä and Svensson 2023 for the full derivations. Formally, these algorithms scale as \(O(N d^3)\) as they involve matrix operations (multiplications, inversions) on \(d \times d\) matrices (at largest) per iteration through the \(N\) data points. For the state spaces we consider, \(d = 2\) or 3, so for realistic datasets \(N \gg d\) and so \(O(N d^3) \simeq O(N)\).

Summary of discrete-time matrices#

Matrix

  Size  

Name

Description

Analogy to GP

\(\boldsymbol{A}_k\)

\( d\times d\)

Transition matrix

Evolves the state vector forward by one time step \(\Delta_k\).

How much the latent GP function evolves over a time-lag \(\Delta_k\).

\(\boldsymbol{Q}_k\)

\( d\times d\)

Process noise

Accumulated white noise injected into the state over one time step \(\Delta_k\).

How the GP predictive variances grows more uncertain over a time-lag \(\Delta_k\) due to the accumulated stochastic variability.

4. GP regression in the state space#

The Kalman filter#

The Kalman filter (Kalman 1960) is an iterative algorithm that steps through the data array, predicting the state at the current iteration based on the previous. The iteration starts at the prior mean \(m_0\) (in many GP cases we have zero-mean) and prior covariance \(P_0 = P_\infty\).

The algorithm is (Theorem 6.6 in Särkkä and Svensson 2023):

(11)#\[\begin{align} \text{Prediction:}& \nonumber \\ \boldsymbol{m}_{k}^{-} &= \boldsymbol{A}_{k-1} \boldsymbol{m}_{k-1}, \nonumber \\ \boldsymbol{P}_{k}^{-} &= \boldsymbol{A}_{k-1} \boldsymbol{P}_{k-1} \boldsymbol{A}_{k-1}^T + \boldsymbol{Q}_{k-1}. \\ \text{Update:}& \nonumber \\ v_k &= y_k - \boldsymbol{H}_k \boldsymbol{m}_k^{-}, \nonumber \\ \boldsymbol{S}_k &= \boldsymbol{H}_k \boldsymbol{P}_k^{-} \boldsymbol{H}_k^T + \boldsymbol{R}_k, \nonumber \\ \boldsymbol{K}_k &= \boldsymbol{P}_k^{-} \boldsymbol{H}_k^T \boldsymbol{S}_k^{-1}, \nonumber \\ \boldsymbol{m}_k &= \boldsymbol{m}_k^{-} + \boldsymbol{K}_k v_k, \nonumber \\ \boldsymbol{P}_k &= \boldsymbol{P}_k^{-} - \boldsymbol{K}_k \boldsymbol{S}_k \boldsymbol{K}_k^T. \end{align}\]

The prediction step can be thought of as simply transitioning from the previous (filtered) state to the current timestep.

The update step first calculates the “surprise term” \(v_k\) (also called the innovation), which is simply the difference between our prediction (when projected into the observed space via the observation model \(\boldsymbol{H}\)) and the actual measured value. The uncertainty in the prediction (\(\boldsymbol{S}_k\)), also called the innovation covariance, defines the Kalman Gain (\(\boldsymbol{K}_k\)), which tells us how much to trust the measurement; it weights the surprise term for the purpose of updating our predicted mean \(\boldsymbol{m}_k\) and variance \(\boldsymbol{P}_k\).

The Rauch–Tung–Striebel (RTS) smoother#

The Rauch-Tung Striebel (RTS) smoothing algorithm (Rauch, Tung, Striebel 1965) applies the transition matrix to the Kalman filtered mean and covariance estimates to update those predictions in reverse-chronological order. Note this is not the same as applying Kalman filtering in reverse as there are causality assumptions built into the noise model that need to be treated carefully. In other words, conditioning a GP on two halves of a dataset and stitching them together violates the joint covariance structure. While the Kalman filter walks the state forward in time according to the noise model, the RTS smoother corrects those forward predictions (Bayesian refinement) using information about the future.

The final result of Kalman filtering followed by RTS smoothing is the optimal predictive distribution that incorporates all available data. This is mathematically equivalent to full GP conditioning when the state space SDE is linear and the driving noise is Gaussian (See Särkkä and Hartikainen 2012 and Ch. 12.4 of Särkkä and Solin 2019, and references therein in). I.e. \(\hat{\boldsymbol{m}} \equiv \boldsymbol{\mu}_{GP}\) and \(\hat{\boldsymbol{P}} \equiv \boldsymbol{\Sigma}_{GP}\).

The RTS algorithm is Theorem 12.2 in (Särkkä and Svensson 2023):

(12)#\[\begin{align} \boldsymbol{m}_{k+1}^{-} &= \boldsymbol{A}_k \boldsymbol{m}_k, \nonumber \\ \boldsymbol{P}_{k+1}^{-} &= \boldsymbol{A}_k \boldsymbol{P}_k \boldsymbol{A}_k^T + \boldsymbol{Q}_k, \nonumber \\ \boldsymbol{G}_k &= \boldsymbol{P}_k \boldsymbol{A}_k^T \left[ \boldsymbol{P}_{k+1}^{-} \right]^{-1}, \nonumber \\ \hat{\boldsymbol{m}}_k &= \boldsymbol{m}_k + \boldsymbol{G}_k \left[ \hat{\boldsymbol{m}}_{k+1} - \boldsymbol{m}_{k+1}^{-} \right], \nonumber \\ \hat{\boldsymbol{P}}_k &= \boldsymbol{P}_k + \boldsymbol{G}_k \left[ \hat{\boldsymbol{P}}_{k+1} - \boldsymbol{P}_{k+1}^{-} \right] \boldsymbol{G}_k^T. \end{align}\]

Where \(\boldsymbol{m}_k\) and \(\boldsymbol{P}_k\) are the outputs of the Kalman filter (see above). Note that the first two steps (\(\boldsymbol{m}_{k+1}^{-}\) and \(\boldsymbol{P}_{k+1}^{-}\)) are also determined during the Kalman algorithm (i.e. they are the predicted mean and covariance in latent space). \(\boldsymbol{G}_k\) is the “smoothing gain.”

The log-likelihood#

A byproduct of the above algorithms are the ingredients to compute the log-likelihood (Eq. 16.5 in Särkkä and Svensson 2023)

(13)#\[\begin{align} p(\boldsymbol{y}_{1:N} | \boldsymbol{\theta}) = \prod_{n=1}^{N} p(y_n|\boldsymbol{y}_{1:n-1}, \boldsymbol{\theta}). \end{align}\]

where the individual \(p(y_n|\boldsymbol{y}_{1:n-1},\boldsymbol{\theta})\) are given by the first two pieces of the “update” step in the Kalman filter

(14)#\[\begin{align} p(y_n|\boldsymbol{y}_{1:n-1},\boldsymbol{\theta}) = N(\boldsymbol{H}_n \boldsymbol{m}_n^{-}, \boldsymbol{S}_n). \end{align}\]

The total log-likelihood \(\mathcal{L}(\boldsymbol{y}|\boldsymbol{\theta}) = \log p(\boldsymbol{y}_{1:N}|\boldsymbol{\theta})\) is thus

(15)#\[\begin{align} \mathcal{L}(\boldsymbol{y}|\boldsymbol{\theta}) = -\frac{1}{2}\sum_{n=1}^N \left( \log\det(2\pi \boldsymbol{S}_n) + \boldsymbol{v}_n^T \boldsymbol{S}_n^{-1} \boldsymbol{v}_n \right). \end{align}\]

which can be input to any of the usual methods for hyperparameter optimization.

Making predictions#

It is straightforward to extend the Kalman/RTS algorithms to predict the mean and covariance at an arbitrary time \(t_\ast\). It is related to the idea of “fast sampling” described in Section 4.6 of Särkkä and Svensson 2023. The full algorithm is presented in Algorithm 1 of Rubenzahl and Hattori et al. 2026.

There are three cases:

  • Retrodiction (\(t_\ast < t_1\)): If the test point is before the first data point, we backwards-predict using the RTS method from the first measurement’s smoothed state, taking the predicted mean at the test point to be zero and the covariance to be the stationary covariance.

  • Interpolation (\(t_1 < t_\ast < t_N\)): If the test point is during the data, we use the usual Kalman prediction step from the most recent data point’s Kalman-filtered state, and then refine the prediction with a RTS smoothing step from the nearest future data point’s smoothed state.

  • Extrapolation (\(t_\ast > t_N\)): If the test point is after the data, we simply use the Kalman prediction from the final datapoint.

If the various quantities (except for \(\boldsymbol{A}\) and \(\boldsymbol{Q}\) which depend on \(\Delta\)) are saved from the Kalman/RTS steps, they can be reused here for some speedup.

An alternative algorithm was described in the appendix of Kelly et al. 2014 using a linearized form for the predicted state at the test point to derive their smoothing equations. While equivalent, it is significantly more computationally expensive as it effectively recomputes the smoothing gain at every test point, which in each case involves a loop through all the future data to that test point.

6. Advanced topics#

Integral measurements#

The primary benefit of smolgp over existing GP packages is its ability to handle integrated measurements in a scalable way. We’ll explore how in the next tutorial: Integrated Measurements.

Derivative measurements#

Similarly, derivative observations essentially come for free since our GP is now a SDE. See Derivative Observations for model-building when your observations are a derivative or linear combination of the state with its derivatives.

Parallelization on GPU#

smolgp has dedicated solvers that run on a GPU in \(\mathcal{O}(N/T + \log T)\) time over \(T\) parallel workers. Check out Parallelized GP solvers on GPU to see how to set that up.

Multicomponent models#

As sums and products of GPs are also valid GPs, the same is true for state space GPs. Conveniently for sums, SSMs independently model the dynamics of each component, making extraction of individual component posteriors free. See Multicomponent Kernels to see how to assemble multicomponent models and extract component-by-component predictions.

Multivariate data#

tutorial coming soon