Going Bayesian

Jasper Slingsby

Going Bayesian

The iterative ecological forecasting cycle in the context of the scientific method, demonstrating how we stand to learn from making iterative forecasts.

  • Iterative ecological forecasting presents new challenges not common in the world of null hypothesis testing that has dominated ecology to date…


  • It is much easier to deal with these in a Bayesian framework.


  • Here I provide a brief and (relatively) soft introduction to Bayesian statistical theory.


Advantages of Bayesian approaches

  1. They focus on estimating what properties are, i.e. the actual values of parameters, rather than just establishing that they differ from a null expectation as in null hypothesis testing.
  2. They are highly flexible, allowing complex models that draw on varied data sources (especially Hierarchical Bayesian models). This also allows the development of models that incorporate mechanistic components.
  3. They can treat all terms as probability distributions, making it easier to quantify, propagate and partition uncertainties throughout the analysis. This allows you to present uncertainty in the forecast to the decision maker and to analyze the sources of uncertainty, guiding improvements.
  4. They provide an iterative probabilistic framework that allows us to learn from new evidence (data) in the context of existing (prior) knowledge. This makes it easier to update predictions as new data become available, completing the forecasting cycle.

But first…

Before I can introduce Bayes, there are a few basic building blocks we need to establish.

  1. How the method of Least Squares works, and its limitations, especially its inflexible, implicit data model.
  2. The concept of likelihood and the method of maximum likelihood estimation
    • this is a major component of Bayes’ Theorem
    • it’s a frequentist method, but it gives you advantage 1 and most of advantage 2 above
      • in fact, there are frequentist approaches for achieving most of the advantages listed on the previous slide, but they are typically cumbersome. With Bayes you can do it all without much extra work.

Least Squares

Traditional parametric statistics like regression analysis and analysis of variance (ANOVA) rely on Ordinary Least Squares (OLS).


There are other flavours of least squares that allow more flexibility (e.g. nonlinear least squares (NLS), which we used in the practical; partial least squares; etc.), but I’m not going to go into these distinctions.

Least Squares

In general, least squares approaches “fit” (i.e. estimate the parameters of) models by minimizing the sums of the squared residuals.

The model (blue line) is drawn through the points to minimize the sum of the squared vertical (y-axis) differences between each point and the regression line (i.e. residuals).
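To make this concrete, here is a minimal sketch in R, using simulated Effort and Reward data (made-up values, not the figure’s data), showing that minimizing the sum of squared residuals numerically recovers essentially the same intercept and slope as lm():

```r
# A minimal sketch with simulated Effort/Reward data (not the figure's data):
# estimate an intercept and slope by minimizing the sum of squared residuals,
# and compare with the analytical OLS fit from lm().
set.seed(1)
Effort <- runif(50, 0, 10)                        # simulated predictor
Reward <- 2 + 0.5 * Effort + rnorm(50, sd = 1)    # simulated response

# Sum of squared residuals for a candidate intercept (par[1]) and slope (par[2])
ss <- function(par) sum((Reward - (par[1] + par[2] * Effort))^2)

fit_ss <- optim(c(0, 0), ss)   # numerical minimization of the sums of squares
fit_lm <- lm(Reward ~ Effort)  # ordinary least squares

fit_ss$par                     # intercept and slope from the minimization
coef(fit_lm)                   # essentially identical estimates from lm()
```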

Redrawn highlighting the residuals:

  • The residuals are the grey lines (lollypop sticks) linking each observed data point to the values predicted by the model (open circles along the regression line)
  • They are vertical and not orthogonal to the regression line, because they represent the variance in Y (Reward) that is not explained by X (Effort)
  • There is no scatter in the predicted values (open circles), because the scatter is residual variance that the model cannot account for and predict.

A histogram of the residuals:

The residuals approximate a normal distribution.

This is a key assumption when using Least Squares

  • the residuals should be normally distributed with constant variance (the constant-variance part is termed “homoscedasticity” or “homogeneity of variance”)
  • For minimizing the sums of squares to work, a unit change in the residuals should be scale invariant
    • i.e. the difference between 1 and 2 must count the same as the difference between 101 and 102
    • This is only the case when the residuals are normally distributed on an additive scale (versus log-scale data, for example)
    • If not, minimizing the sums of squares does not work well (and people often log-transform their data to get them onto an invariant scale).
  • If violated, consider a different process model (e.g. nonlinear), or another statistical approach (MLE or Bayes). A quick residual check is sketched below.
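A minimal sketch of such a residual check in R, reusing the simulated Effort/Reward data from the earlier least-squares example:

```r
# A minimal sketch: inspect the residuals of a least-squares fit.
set.seed(1)
Effort <- runif(50, 0, 10)
Reward <- 2 + 0.5 * Effort + rnorm(50, sd = 1)
fit <- lm(Reward ~ Effort)

hist(resid(fit))               # should look roughly normal
plot(fitted(fit), resid(fit))  # should show constant scatter (no fanning) around zero

# If the residuals are skewed or fan out, people often transform the response
# (e.g. log) or switch to a different data model via MLE or a Bayesian approach.
```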

The shortcomings of Least Squares:


1. Least Squares doesn’t explicitly include a data model

It’s useful at this stage to make a distinction between data models and process models.

  • The process model is the bit you’ll be used to, where we describe how the model creates a prediction for a particular set of inputs or covariates (e.g. a linear model)

  • The data model describes the residuals (i.e. mismatch between the process model and the data)

    • also often called the data observation process

Least Squares analyses don’t explicitly include a data model, because minimizing the sums of squares means that the data model in a Least Squares analysis can only ever be a normal distribution (i.e. there must be homogeneity of variance)

Why is this a limitation?

  1. in reality, the data model can take many forms
    • e.g. Binomial coin flips, Poisson counts of individuals, Exponential waiting times, etc
    • this is where Maximum Likelihood comes into its own
      • in the practical, we specify two functions for the MLE analyses (a simplified sketch follows this list):
        • the process model (pred.negexp/S)
        • the likelihood function (fit.negexp/S.MLE), which includes the data model
  2. sometimes one would like to include additional information in the data model
    • e.g. sampling variability (e.g. different observers or instruments), measurement errors, instrument calibration, proxy data, unequal variances, missing data, etc
    • this is where Bayesian models come into their own
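The sketch below is a simplified, hypothetical analogue of that structure (not the actual practical code): a negative exponential process model plus a normal data model written as a likelihood function, fitted by maximum likelihood.

```r
# A minimal sketch (hypothetical analogue of the practical's pred.negexp and
# fit.negexp.MLE pair, not the actual course code), separating the process
# model from the data model.

# Process model: negative exponential decline of y with x
pred_negexp <- function(theta, x) theta[1] * exp(-theta[2] * x)

# Likelihood function: the data model assumes the observations are normally
# distributed around the process-model prediction
negloglik <- function(theta, x, y) {
  mu <- pred_negexp(theta, x)   # process-model prediction
  sdev <- exp(theta[3])         # log standard deviation keeps sd positive
  -sum(dnorm(y, mean = mu, sd = sdev, log = TRUE))
}

# Simulated data, then maximum likelihood estimation by minimizing the
# negative log-likelihood
set.seed(42)
x <- seq(0, 10, length.out = 60)
y <- 5 * exp(-0.4 * x) + rnorm(length(x), sd = 0.3)

fit <- optim(c(1, 0.1, 0), negloglik, x = x, y = y)
fit$par    # estimates of the two process parameters and the log sd
```

Note how swapping dnorm for, say, dpois or dbinom in the likelihood function changes the data model without touching the process model.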

The shortcomings of Least Squares:


2. Least squares focuses on what the parameters are not, rather than what they are

  • Least Squares focuses on null hypothesis testing - the ability to reject (or to fail to reject) the null hypothesis at some threshold of significance (alpha; usually P < 0.05)

  • For a linear model, the null hypothesis is that the slope = 0 (i.e. no effect of X on Y)

  • A linear model is only considered useful when you can reject the null hypothesis.

    • This just tells you that data like yours would be improbable if the parameter of interest (the slope in this case) were 0…
  • It tells you nothing about the probability of the parameters actually being the estimates you arrive at by minimizing the sums of squares!

Maximum likelihood

The likelihood principle = a parameter value is more likely than another if it is the one for which the data are more probable.

Maximum Likelihood Estimation (MLE) = a method for estimating model parameters by applying the likelihood principle. It optimizes parameter values to maximize the likelihood that the process described by the model produced the data observed.

The probability of a data point (dotted line) being generated under two alternative hypotheses (sets of parameter values). Here H1 is more likely, because the probability density at \(x\) = 1 is higher for H1 than H2 (roughly 0.25 vs 0.025 = 10 times more likely). Usually there’d be a continuum of hypotheses (parameter values) to select from, and there’d be far more data points.
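A minimal numerical sketch of the same idea; the means and standard deviation below are made up, chosen only so the densities differ roughly tenfold as in the figure:

```r
# Probability density of a single observation x = 1 under two hypothetical
# parameter sets (illustrative values, not those behind the figure)
x <- 1
dnorm(x, mean = 1.0, sd = 1.5)   # "H1": about 0.27
dnorm(x, mean = 4.3, sd = 1.5)   # "H2": about 0.02
# H1 assigns roughly ten times more density to the observed point, so its
# parameter values are more likely given this datum.
```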

Maximum likelihood

Viewed differently…

With MLE we assume we have the correct model and use MLE to choose parameter values that maximize the conditional probability of the data given those parameter values, \(P(Data|Parameters)\).

  • This leaves us knowing the likelihood of the best set of parameters and the conditional probability of the data given those parameters \(P(Data|Parameters)\)

BUT!

What we really want to know is the conditional probability of the parameters given the data, \(P(Parameters|Data)\), because this allows us to express uncertainty in the parameter estimates as probabilities.

  • To get there, we apply a little probability theory, with a surprisingly useful byproduct!

First, let’s brush up on joint, conditional and marginal probabilities

The interactions between two random variables are illustrated below…

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

Joint probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

  • The joint probability, \(P(x,y)\), is the probability of both \(x\) and \(y\) occurring simultaneously

  • This is the probability of occurring within the overlap of the circles = 3/10

  • Needless to say, the joint probability \(P(y,x)\) is identical to \(P(x,y)\) (= 3/10)

Conditional probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

We can also define two conditional probabilities:

  • the probability of \(x\) given \(y\), \(P(x|y)\)
  • the probability of \(y\) given \(x\), \(P(y|x)\)

In the diagram these would be:

  • the probability of being in set \(x\) given we are only considering points in set \(y\) (= 3/6)
  • the probability of being in set \(y\) given we are only considering points in set \(x\) (= 3/7)

Marginal probability

Venn diagram illustrating ten events (points) across two sets \(x\) and \(y\).

Last, we can define two marginal probabilities for \(x\) and \(y\).

These are just the separate probabilities of being in set \(x\) or being in set \(y\) given the full set of events, i.e.

  • \(P(x)\) = 7/10 and
  • \(P(y)\) = 6/10

Joint, conditional and marginal probabilities

We can also show that the joint probabilities are the product of the conditional and marginal probabilities:

\[P(x,y) = P(x|y) P(y) = 3/6 * 6/10 = 0.3\]

and:

\[P(y,x) = P(y|x) P(x) = 3/7 * 7/10 = 0.3\]
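These relationships are easy to verify by brute force with the ten points from the Venn diagram (a minimal sketch; the membership pattern below matches the counts above):

```r
# The ten events as a membership table: 3 points in both sets, 4 only in x,
# 3 only in y. Verify that P(x,y) = P(x|y) * P(y) = P(y|x) * P(x).
in_x <- c(rep(TRUE, 7), rep(FALSE, 3))
in_y <- c(rep(TRUE, 3), rep(FALSE, 4), rep(TRUE, 3))

p_xy   <- mean(in_x & in_y)            # joint:        3/10
p_x    <- mean(in_x)                   # marginal:     7/10
p_y    <- mean(in_y)                   # marginal:     6/10
p_x_gy <- mean(in_x[in_y])             # conditional:  3/6
p_y_gx <- mean(in_y[in_x])             # conditional:  3/7

c(p_xy, p_x_gy * p_y, p_y_gx * p_x)    # all equal 0.3
```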

Tadaa!

This means we can derive what we want to know, \(P(Parameters|Data)\), as a function of the information maximum likelihood estimation provides, \(P(Data|Parameters)\).

…Let’s do the derivation…

Bayes’ Theorem

Now let:

  • parameters = \(\theta\)
  • the data = \(D\)


We’re interested in the conditional probability of the parameters given the data, \(p(\theta|D)\).


To get this, we need to take the previous equations and solve for \(p(\theta|D)\).

Since we know the joint probabilities are identical:

\[p(\theta,D) = p(\theta|D)p(D)\]

\[p(D,\theta) = p(D|\theta)p(\theta)\]

We can take the right hand side of the two:

\[p(\theta|D)p(D) = p(D|\theta)p(\theta)\]

and solve for \(p(\theta|D)\) as:

\[p(\theta|D) = \frac{p(D|\theta) \; p(\theta)}{p(D)} \;\;\]

which is known as Bayes’ Theorem!
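As a quick sanity check, plugging the Venn-diagram probabilities from earlier into Bayes’ Theorem recovers the conditional probability we counted directly:

\[ p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} = \frac{3/7 \times 7/10}{6/10} = \frac{3/10}{6/10} = \frac{3}{6} \]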

Bayes’ Theorem

Rewriting the terms on one line allows us to label them with their names:

\[ \underbrace{p(\theta|D)}_\text{posterior} \; = \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; / \; \underbrace{p(D)}_\text{evidence} \]

The evidence

The marginal probability of the data, \(p(D)\), is now called the evidence for the model, and represents the overall probability of the data according to the model.

  • having \(p(D)\) in the denominator means it normalizes the numerator, ensuring that the updated probabilities sum to 1 over all possible parameter values - i.e. it ensures that the posterior, \(p(\theta|D)\), is expressed as a probability distribution
  • \(p(D)\) can otherwise mostly be ignored and is often left out, resulting in…

Bayes’ Rule

Getting rid of the evidence allows us to focus on the important bits:

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

Which reads “The posterior is proportional to the likelihood times the prior”.

This leaves us with three terms:


The likelihood

\(p(D|\theta)\) still represents the probability of the data given the model with parameter values \(\theta\), and is used in analyses to find the likelihood profiles of the parameters.

Bayes’ Rule

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

The posterior

  • What we have been calling the conditional probability of the parameters given the data, \(p(\theta|D)\), is now called the posterior
  • It gives us a probability distribution for the values any parameter can take, allowing us to represent uncertainty in the model and forecasts as probabilities, which is the holy grail we were after!!!
    • i.e. the posteriors describe the parameters of interest that we forecast, and we can now indicate the probability of our forecast being correct (a minimal grid-based sketch follows)
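Here is that sketch: a grid approximation with made-up data (not course code), showing how multiplying a likelihood by a prior and normalizing by the evidence yields a posterior that is a proper probability distribution.

```r
# A minimal grid-based sketch of Bayes' Rule: estimating the probability of
# success theta from 7 successes in 10 trials (made-up data).
theta <- seq(0, 1, by = 0.01)                    # grid of candidate parameter values

prior      <- dbeta(theta, 2, 2)                 # prior belief about theta
likelihood <- dbinom(7, size = 10, prob = theta) # p(D | theta) for the observed data
unnorm     <- likelihood * prior                 # likelihood x prior

posterior  <- unnorm / sum(unnorm)               # dividing by the "evidence"
sum(posterior)                                   # = 1: a proper probability distribution

theta[which.max(posterior)]                      # most credible value of theta
```

Because the grid posterior is a proper distribution, credible intervals or the probability that \(\theta\) exceeds some threshold can be read off it directly.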

Bayes’ Rule

\[ \underbrace{p(\theta|D)}_\text{posterior} \; \propto \; \underbrace{p(D|\theta)}_\text{likelihood} \;\; \underbrace{p(\theta)}_\text{prior} \; \]

The prior

  • The marginal probability of the parameters, \(p(\theta)\), is now called the prior.
  • While \(p(\theta)\) is almost an unexpected consequence of having solved the equation for \(p(\theta|D)\), the prior is where a lot of the magic of Bayes lies!
  • Firstly, we know that \(p(\theta)\) is a necessary requirement for us to be able to represent the posterior, \(p(\theta|D)\), as a probability distribution (what we were looking for).
  • But, it also represents the credibility of the parameter values, \(\theta\), without the data, \(D\).
    • What does this mean?

The prior

The prior represents the credibility of the parameter values, \(\theta\), without the data, \(D\).

“But how can we know anything about the parameter values without the data?”


By applying the scientific method, whereby we interrogate new evidence (the data) in the context of previous knowledge or information to update our understanding.

  • \(p(\theta)\) is the prior belief of what the parameters should be, before interrogating the data
    • i.e. our “context of previous knowledge or information”
  • if we don’t have much previous knowledge, we can specify the prior to represent this
  • the prior is incredibly powerful! But the Peter Parker Principle applies - “With great power comes great responsibility!”

The power of the prior

  1. It allows Bayesian analyses to be iterative. The posterior from one analysis can become the prior for the next!
    • This provides a formal probabilistic framework for the scientific method!
      • New evidence must be considered in the context of previous knowledge, providing the opportunity to update our beliefs.
    • This is ideal for iterative forecasting, because the previous forecast can be the prior for the next (see the sketch below).
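A minimal sketch of this idea (made-up numbers), using a conjugate Beta-Binomial model where the updating can be done by hand: each year’s posterior becomes the next year’s prior.

```r
# Iterative Bayesian updating with a conjugate Beta-Binomial model:
# Beta(a, b) prior + k successes in n trials -> Beta(a + k, b + n - k) posterior.
a <- 1; b <- 1                      # start with a flat Beta(1, 1) prior

# Year 1: observe 3 successes in 10 trials
a <- a + 3; b <- b + (10 - 3)       # posterior is Beta(4, 8)

# Year 2: the Year-1 posterior is the prior; observe 6 successes in 12 trials
a <- a + 6; b <- b + (12 - 6)       # posterior is Beta(10, 14)

a / (a + b)                         # updated posterior mean for the success probability
```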

The power of the prior

  2. It makes Bayesian models incredibly flexible, allowing us to specify complex hierarchical models in a completely probabilistic framework relatively intuitively.
    • This is fantastic for “fusing” disparate data sources!
      • E.g. we may not have “prior beliefs” about a parameter’s values, but we may know something about another parameter or process that influences it. If so, we can specify a prior on our prior (called a hyperprior).
    • We can also specify separate or linked priors (or hyperpriors) for different parameters (or priors).

The prior comes with great responsibility…

It is very easy to specify an inappropriate prior and bias the outcome of your analysis!

  • First and foremost, as tempting as it may be, the one place the prior CANNOT come from is the data used in the analysis!!!
  • Second, while many people favour specifying “uninformative” priors, it is often incredibly difficult to know what “uninformative” is for different models or parameters.

We’ll get stuck into examples showing the value of Bayes in ecology in the next lecture.
