Jasper Slingsby
Variables are latent if they are unobserved or estimated with uncertainty.
Dietze 2017 outlines four common types of latent variable, which we work through below.
I’ve already mentioned that a big challenge to modelling is error in the observation of the state variable of interest.
Observation errors are typically either:
random, due to imprecision of the data collection process or other extraneous factors, or
systematic, implying there is some bias
Random error is created by imprecision in measurement (“scatter”) around the true state of the variable of interest, but can also arise from other extraneous processes.
In this case we may want to model the true state as a latent variable, and model the random observation error as a probability distribution (typically Gaussian) around the mean.
E.g. specifying the data model to be a normal distribution around the mean (\(\mu\)), as we did in the post-fire recovery model: \(NDVI_{i,t}\sim \mathit{N}(\mu_{i,t},\frac{1}{\sqrt{\tau}})\)
In this case \(\mu\) is the latent variable (estimate of the “true state”).
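Here’s a minimal sketch in Python of what this data model implies (the values for \(\mu\) and \(\tau\) are invented for illustration; the point is just that the precision \(\tau\) translates to a standard deviation of \(1/\sqrt{\tau}\)):

```python
import numpy as np

rng = np.random.default_rng(42)

mu = 0.45    # hypothetical "true" mean NDVI at one site and time step
tau = 400.0  # hypothetical precision; sd = 1/sqrt(tau) = 0.05

# The data model: observed NDVI scatters around the latent mean mu
# with Gaussian noise whose sd is 1/sqrt(tau)
ndvi_obs = rng.normal(loc=mu, scale=1.0 / np.sqrt(tau), size=10)
print(ndvi_obs.round(3))
```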
Systematic error is where there is a bias, such as created by differences among observers or poor instrument calibration.
Constant bias can be corrected with an offset, but something like sensor drift may need to be approximated as a random walk or similar (to account for temporal autocorrelation).
If we have more information about the causes of error, we can apply more complex observation models (e.g. differences among field staff, temperature dependence of readings, etc).
Often there is both random and systematic error, requiring a model that accounts for both.
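To make the distinction concrete, here’s a small simulation (a sketch, with invented parameter values and a made-up signal) of a “true” state observed with random scatter, a constant bias, and random-walk sensor drift. Only the drift requires a time-series treatment, because it is autocorrelated:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
true_state = np.sin(np.linspace(0, 4 * np.pi, n))  # hypothetical true signal

random_error = rng.normal(0, 0.1, n)       # imprecision: independent scatter
constant_bias = 0.3                        # e.g. a miscalibrated instrument
drift = np.cumsum(rng.normal(0, 0.02, n))  # sensor drift as a random walk

observed = true_state + random_error + constant_bias + drift

# The constant bias could be corrected with a single offset, but the drift
# is autocorrelated and needs a time-series model (e.g. a random walk term).
```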
Another type of latent variable arises when we observe a proxy for the state variable of interest, e.g. dung counts as a proxy for animal abundance.
There are many ways to relate the observed proxy(ies) to the latent state variable of interest, such as empirical calibration curves, probabilities of identifying dung correctly, etc.
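An empirical calibration curve can be as simple as a regression fit to paired measurements of the proxy and the state, which is then inverted to map new proxy readings back to state estimates. A hypothetical sketch (all data here are invented):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical paired calibration data: a proxy measured alongside
# direct measurements of the state variable itself
state = rng.uniform(0, 10, 30)
proxy = 2.0 + 0.8 * state + rng.normal(0, 0.5, 30)

# Fit a linear calibration curve (np.polyfit returns slope first),
# then invert it to estimate the latent state from new proxy readings
slope, intercept = np.polyfit(state, proxy, 1)
new_proxy = np.array([4.5, 7.2])
state_estimate = (new_proxy - intercept) / slope
print(state_estimate.round(2))
```

In practice the calibration uncertainty should be propagated too, rather than treating the fitted curve as exact.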
Missing data are common in time series or in space (e.g. sensor failure, logistical difficulties, etc.). Where observations are missing, they can be estimated, with uncertainty, in various ways.
Some variables may never be observed (e.g. they are too difficult to measure), but can be inferred from the process model.
Estimating these latent variables can be tricky, but having multiple independent measures to constrain the estimates or high confidence in the model structure (i.e. mechanistic understanding) can help.
Forecasting involves predicting key variables further ahead in time, and often farther afield in space.
An issue with time-series or spatial modelling is dependence (“autocorrelation”) among observations in time and space.
One also usually has to deal with a number of latent variables due to missing or sparse data, observation error, etc.
State-space models are a useful framework for dealing with these kinds of problems and for forecasting in general.
The name comes from the focus on estimating the state as a latent variable.
This explicitly separates observation error from process error, allowing attractive flexibility, including a natural way to handle autocorrelation.
Illustration of a simple univariate state-space model from Auger-Méthé et al. 2021.
The dependence among the true (latent) states \(z_{t-1}\), \(z_t\), \(z_{t+1}\), … can be modeled explicitly in the process model. Because each observation \(y_t\) depends only on the corresponding state \(z_t\), the observations can be treated as conditionally independent given the states.
State estimates can end up closer to the true states than the observations are.
Here’s an SSM where the process model is a random walk, i.e. the change at each time step is just process error: a random draw from a normal distribution with precision \(\tau_{add}\). We’ve also specified a data model with observation error drawn from a normal distribution with precision \(\tau_{obs}\).
The process model: \[z_{t+1}\sim\mathit{N}(z_{t},\tau_{add})\]
The data model: \[y_{t}\sim\mathit{N}(z_{t},\tau_{obs})\]
For a Bayesian model this would also require priors on the process error (\(\tau_{add}\)), observation error (\(\tau_{obs}\)) and the initial condition of the state variable (\(z_0\)).
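A minimal simulation of this random-walk SSM (a sketch with arbitrary precisions; the \(\tau\)’s are treated as precisions, consistent with the \(1/\sqrt{\tau}\) parameterization used earlier):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
tau_add = 25.0  # process precision     -> process sd     = 1/sqrt(25) = 0.2
tau_obs = 4.0   # observation precision -> observation sd = 0.5

# Process model: z[t] ~ N(z[t-1], tau_add), a random walk
z = np.zeros(n)
for t in range(1, n):
    z[t] = rng.normal(z[t - 1], 1.0 / np.sqrt(tau_add))

# Data model: y[t] ~ N(z[t], tau_obs), noisy observations of the states
y = rng.normal(z, 1.0 / np.sqrt(tau_obs))
```

Fitting the model would then mean recovering \(z\), \(\tau_{add}\) and \(\tau_{obs}\) from \(y\) alone (e.g. by MCMC, or a Kalman filter in this linear-Gaussian case).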
The probability distribution for the state variable, \(z_{t}\), conditional on the model parameters would be:
\[ \underbrace{z_{t}|...}_\text{current state} \; \propto \; \underbrace{{\mathit{N}(z_{t}|z_{t-1},\tau_{add})}}_\text{previous time} \; \underbrace{{\mathit{N}(y_{t}|z_{t},\tau_{obs})}}_\text{current observation} \; \underbrace{{\mathit{N}(z_{t+1}|z_{t},\tau_{add})}}_\text{next time} \; \]
This says that the current state (\(z_{t}\)) depends on the states both before and after it, as well as on the current observation (\(y_{t}\)).
In fact, the posterior of the current state (\(z_{t}\)) is proportional to the product of the three normal distributions.
Where the three terms are similar (i.e. they agree on \(z_{t}\)), their product is sharply peaked, indicating less uncertainty in the state estimate.
Where the terms differ, the product flattens out, indicating greater uncertainty in the state estimate.
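To see why, note that all three terms are Gaussian in \(z_{t}\), so by the standard result for products of Gaussians their product is itself proportional to a Gaussian whose precision is the sum of the three precisions and whose mean is the precision-weighted average (treating the \(\tau\)’s as precisions, consistent with the parameterization above):

\[ z_{t}|... \sim \mathit{N}\left(\frac{\tau_{add}\,z_{t-1} + \tau_{obs}\,y_{t} + \tau_{add}\,z_{t+1}}{2\tau_{add} + \tau_{obs}},\; 2\tau_{add} + \tau_{obs}\right) \]

The weighted mean is a compromise between the previous state, the current observation, and the next state; when the three agree, the unnormalized product is tall (the peaked case above), and when they disagree it is much lower (the flattened case).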
Often there’s no “next time” term yet (e.g. at the latest time point in the series), and the model reduces to:
\[ \underbrace{z_{t}|...}_\text{current state} \; \propto \; \underbrace{{\mathit{N}(z_{t}|z_{t-1},\tau_{add})}}_\text{previous time} \; \underbrace{{\mathit{N}(y_{t}|z_{t},\tau_{obs})}}_\text{current observation} \; \]
There’s no “current observation” for forecasts, so:
\[ \underbrace{z_{t}|...}_\text{forecast state} \; \propto \; \underbrace{{\mathit{N}(z_{t}|z_{t-1},\tau_{add})}}_\text{previous time} \; \]
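A forecast thus just iterates the process model forward. For the random walk, the h-step-ahead predictive distribution stays centred on the last state estimate while its variance grows by \(1/\tau_{add}\) per step. A sketch (with an assumed last state and precision, and ignoring uncertainty in \(z\) and \(\tau_{add}\) themselves):

```python
import numpy as np

tau_add = 25.0  # assumed process precision (as in the simulation above)
z_last = 0.7    # hypothetical last state estimate
horizon = np.arange(1, 11)

# h-step-ahead random-walk forecast: the mean stays at z_last,
# while the variance grows as h / tau_add
forecast_mean = np.full(horizon.size, z_last)
forecast_sd = np.sqrt(horizon / tau_add)

for h, m, s in zip(horizon, forecast_mean, forecast_sd):
    print(f"t+{h}: {m:.2f} +/- {1.96 * s:.2f} (95% interval)")
```

This is why forecast uncertainty fans out with lead time, even for this simplest of process models.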