% mixed model lecture 1
% Ben Bolker
% Thu Nov 1 12:42:46 2012

Licensed under the
Creative Commons attribution-noncommercial license.
Please share & remix noncommercially, mentioning its origin.
Examples
Glycera
Responses (survival, cell area) of Glycera (marine worm) cells to four-way factorial combinations of stressors (anoxia, copper, salinity, hydrogen sulfide)
Arabidopsis
Response of Arabidopsis fruiting success to nutrients and simulated herbivory (clipping) across genotypes/populations/regions
Singing mouse
Response of singing mouse to conspecific and heterospecific playback experiments
others
- Culcita: protection of corals from sea-star predation by invertebrate symbionts (crabs and shrimp)
- coral/farmerfish: mortality and growth of coral in response to removal of farmerfish (before-after/control-impact design)
- owls: variation in owlet vocalization (“sibling negotiation”) as a function of satiation, sex of parent, arrival time, brood size
etc. etc. etc.
“classical” vs “modern” approaches to ANOVA
Classical: “method of moments” estimation
- Decompose total sums of squares into components attributable to different effects, with associated degrees of freedom
- figure out which numerator and denominator correspond to testing a particular hypothesis
- or, figure out what class of design you have (nested vs. random effects vs. split-plot vs. …) and look it up in a book (e.g. Ellison and Gotelli)
- straightforward: always gives an answer
- quick
- in more complicated designs (e.g. classical genetics estimate of additive + dominance variance), estimates may be negative
- hard to apply with experimental complications such as: lack of balance, crossed designs, autocorrelation
- users may become obsessed with “appropriate df” and significance testing rather than parameter estimation, confidence intervals, biological meaning
Modern: computational
- more flexible (see “complications” above)
- can be (much!) slower
- need to worry about computational, numerical problems such as failure to converge to the correct answer etc.
- allows computation of confidence intervals etc.
- mostly handles allocation of effects to levels automatically
What is a random effect?
(See @gelman_analysis_2005)
Philosophical:
- we are not interested in the estimates of the random effects
- technically, this should be that we are not interested in making inferences about the estimates of the random effects
- in frequentist settings, we don't even call these “estimates”, but rather “predictions”, as in BLUP (best linear unbiased predictor) (this term does not technically apply to GLMMs)
- the levels are drawn from a larger population
- the levels are “exchangeable” (i.e. we could relabel/swap around the levels without any change in meaning)
- the estimates are themselves random variables
Pragmatic/computational:
- we have lots of levels, with not much information about each individual level, and possibly unbalanced amounts of information
- we don't want to use up the degrees of freedom needed to fit one parameter for each level; random effects automatically adjust between the “completely pooled” (no block effect) and “completely separate” (fixed-effect) extremes (see the sketch below)
- we want to estimate with shrinkage
- we have enough levels that it is practical to estimate a variance (i.e. at least 5-6, preferably more than that)
To Bayesians, the difference between fixed and random is much simpler and more pragmatic: do we add a hyperparameter to the model to characterize the estimates, or estimate them separately (= infinite variance)?
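As a concrete illustration of pooling and shrinkage, here is a minimal sketch using the lme4 package; the simulated data, seed, and variable names are invented for illustration and are not part of the examples above.

```r
## a minimal sketch, assuming lme4 and made-up data: compare
## "completely separate" (one fixed-effect parameter per block) with
## partially pooled (random-effect) estimates of the block effects
library(lme4)
set.seed(101)
d <- data.frame(block = factor(rep(1:10, each = 5)))
d$y <- rnorm(50, mean = rep(rnorm(10, sd = 1), each = 5), sd = 2)

m_fixed <- lm(y ~ block - 1, data = d)            ## one parameter per block
m_mixed <- lmer(y ~ 1 + (1 | block), data = d)    ## block as a random effect

## the random-effect "predictions" (BLUPs) are shrunk toward the grand mean
cbind(separate = coef(m_fixed),
      pooled   = fixef(m_mixed) + ranef(m_mixed)$block[, 1])
```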
dimensions of data size
- information per sample: binary vs. binomial/count vs. Gaussian. Would repeated samples of a particular unit be approximately normal (e.g. mean > 5, or min(successes,failures)>5)? Determines whether we can get away with certain GLMM approximations
- samples per block: if small, then we need random effects; if sufficiently large, then fixed- and random-effects will converge
- number of blocks: do we have enough to estimate a variance? do we have so many that we can estimate it very accurately (i.e. we don't need to worry about denominator df?)
Examples
- genomics, proteomics, telemetry studies: few blocks, many samples per block (individual)
- cancer epidemiology: many blocks, varying samples per block, very little information per sample (i.e., cancer is rare)
- small-scale field experiments: moderate number of samples and blocks
- large-scale field experiments (or observations): often very few blocks
Definition of model and (marginal) likelihood
A useful (if somewhat technical) way to define a plain-vanilla linear model (normally distributed residuals, no random effects) is
\[
Y_i \sim \mbox{Normal}(\mu_i,\sigma^2); \mu = X \beta
\]
where \( X \) is the design matrix and \( \beta \) are the coefficients (all fixed effects). This is equivalent to, but more useful than, the other standard notation \( Y = X \beta + \epsilon \), \( \epsilon_i \sim \mbox{Normal}(0,\sigma^2) \), because some of the distributions we want to work with can't simply have an error term added to the expected response …
When we go to random-effects models, we add another level to the model:
\[
Y_i \sim \mbox{Normal}(\mu_i,\sigma^2)
\]
\[
\mu = X \beta + Z u
\]
\[
u \sim \mbox{MVN}(0,\Sigma(\theta))
\]
where \( Z \) is the random-effects design matrix; \( u \) is the vector of random effects; and \( \theta \) is a vector of parameters that determines the variance-covariance matrix \( \Sigma \) of the random effects. (For example, for a simple intercept-only model, \( \theta \) might just be the among-block variance.)
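A short simulation sketch may make the pieces concrete; it assumes a random-intercept model with a single continuous predictor, and all names and parameter values below are invented for illustration.

```r
## simulate from the LMM defined above (hypothetical names/values)
set.seed(101)
nblock <- 8; nper <- 6
block  <- factor(rep(seq(nblock), each = nper))
x      <- runif(nblock * nper)

X     <- model.matrix(~ x)              ## fixed-effects design matrix
Z     <- model.matrix(~ block - 1)      ## random-effects design matrix (intercepts only)
beta  <- c(1, 2)                        ## fixed-effect coefficients
theta <- 0.5                            ## among-block standard deviation
u     <- rnorm(nblock, mean = 0, sd = theta)    ## u ~ MVN(0, Sigma(theta)), diagonal here
mu    <- drop(X %*% beta + Z %*% u)
y     <- rnorm(length(mu), mean = mu, sd = 1)   ## Y_i ~ Normal(mu_i, sigma^2)
```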
GLMs look like this:
\[
Y_i \sim \mbox{Distrib}(\mu_i,\phi); \mu = g^{-1}(X \beta)
\]
where \( \mbox{Distrib} \) is some specified distribution (e.g. binomial or Poisson); \( \phi \) is an optional scale parameter; and \( g^{-1} \) is the inverse link function (e.g. logistic or exponential, discussed further below).
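For example, a Poisson GLM with a log link can be fitted with glm(); the data below are simulated and purely illustrative.

```r
## a Poisson GLM sketch: mu = g^{-1}(X beta) with g = log (hypothetical data)
set.seed(101)
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)    ## estimates live on the linear-predictor (log) scale
```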
GLMMs combine the inverse-link and arbitrary-distribution stuff with random effects: essentially,
\[
\mu = g^{-1}(X \beta + Z u)
\]
with \( Y \) and \( u \) defined as above.
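In lme4, for instance, the corresponding model can be fitted with glmer(); again the simulated data and names below are invented for illustration.

```r
## a Poisson GLMM sketch: mu = g^{-1}(X beta + Z u)  (assumes lme4; hypothetical data)
library(lme4)
set.seed(101)
block <- factor(rep(1:10, each = 10))
x <- runif(100)
u <- rnorm(10, sd = 0.5)                        ## block-level random intercepts
y <- rpois(100, lambda = exp(0.5 + 1.2 * x + u[as.integer(block)]))
fit <- glmer(y ~ x + (1 | block), family = poisson)
```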
Because random effects are random variables, the likelihood (or more precisely the marginal likelihood) of a set of parameters balances the probability of the data (\( x \)) given the fixed parameters \( \beta \) and a particular value of the random effect (\( u \)) with the probability of that value of the random effect given the variance of the random effects (\( \sigma \)), integrated over all possible values of \( u \):
\[
L(x|\beta,\sigma) = \int L(x|\beta,u) \cdot L(u|\sigma) \, du
\]
The Bayesian posterior distribution is basically the same, except that we include a prior distribution (and we rarely have to worry about doing the integral explicitly)
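For a single block the integral can be done numerically; the sketch below uses made-up binomial data, a logit link, and a hypothetical helper function just to spell out the product of the two likelihood terms.

```r
## numerical marginal likelihood for one block of binomial data
## (hypothetical function and values)
marg_lik <- function(beta0, sd_u, k = 7, n = 10) {
  integrand <- function(u) {
    dbinom(k, size = n, prob = plogis(beta0 + u)) *  ## L(x | beta, u)
      dnorm(u, mean = 0, sd = sd_u)                  ## L(u | sigma)
  }
  integrate(integrand, -Inf, Inf)$value              ## integrate over u
}
marg_lik(beta0 = 0, sd_u = 1)
```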
Restricted maximum likelihood (REML)
While maximum likelihood estimation (finding the \( \beta \), \( \theta \) (and possibly \( u \)) that maximize the expression above) is a very powerful and coherent approach, it turns out to give downwardly biased estimates of the variances.
A heuristic explanation is that it estimates the variances as though the maximum likelihood estimates of the fixed-effect parameters were correct, which underestimates the variability in the data.
Two useful special cases of REML are (1) dividing sample sums of squares by \( n-1 \) rather than \( n \) to get an unbiased estimate of the sample variance and (2) doing a paired \( t \) test by explicitly computing the means and variances of the differences in the pairs.
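In R these two special cases look like this (the numbers are made up):

```r
## (1) sample variance: divide by n - 1 (REML-like) rather than n (ML-like)
x <- c(2.1, 3.4, 2.8, 4.0, 3.3)
var(x)                             ## sum of squares / (n - 1)
sum((x - mean(x))^2) / length(x)   ## ML analogue, biased low

## (2) paired t test = one-sample t test on the within-pair differences
before <- c(5, 6, 7, 8); after <- c(6, 8, 7, 10)
t.test(after - before)             ## same as t.test(after, before, paired = TRUE)
```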
Technically, REML consists of finding a linear combination of the data that exactly reduces the fixed-effect estimates to zero, then estimating the random-effects part of the model on those data.
Important things to know about REML:
- it's most important for small data sets, and when the variance estimates themselves are of interest
- you cannot sensibly compare/test fixed effect estimates between models estimated with REML; to do this, you have to fall back to ML estimates (this is done automatically by some packages)
- REML has good properties (unbiased estimates of variances) in simple LMMs, but we're not completely sure whether they hold in more complex models. There is some controversy as to whether extending REML to GLMMs makes sense.
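A sketch of how this looks in lme4, assuming a hypothetical data frame d with response y, predictor x, and grouping factor block:

```r
## assumes lme4 and a hypothetical data frame d (y, x, block)
library(lme4)
m_reml <- lmer(y ~ x + (1 | block), data = d, REML = TRUE)   ## the default
m_ml   <- lmer(y ~ x + (1 | block), data = d, REML = FALSE)
VarCorr(m_reml)                  ## variance components (REML)

## compare fixed effects via ML fits (likelihood ratio test)
m_ml0 <- update(m_ml, . ~ . - x)
anova(m_ml0, m_ml)
```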
Avoiding mixed models
- average across samples within blocks in nested designs (when variance within lower levels is not of interest; see the sketch below) [@murtaugh_simplicity_2007]
- use fixed effects instead of random effects when (1) samples per block are large (there will be little shrinkage) or (2) the number of blocks is small (you will save few degrees of freedom, and the variance estimate is likely to be poor anyway)
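A minimal sketch of the averaging strategy, assuming a hypothetical data frame d with response y, treatment trt, and grouping factor block:

```r
## average samples within blocks, then analyze one value per block
d_means <- aggregate(y ~ trt + block, data = d, FUN = mean)
summary(lm(y ~ trt, data = d_means))
```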
LM reminders
- Wilkinson-Rogers formulae:
y ~ f ## anova
y ~ x ## regression (implicit intercept term)
y ~ x - 1 ## regression through origin (or y ~ x+0)
y ~ f1:f2 ## interaction only
y ~ f1/f2 ## f2 nested in f1
y ~ f * x ## interaction between factor and predictor (ANCOVA)
y ~ poly(x, 3) ## cubic
y ~ ns(x, 3) ## natural spline with 3 df: library(splines)
For most random-effects packages in R, | separates an effect from a grouping factor: 1|grp means variation in the intercept among groups; x|grp means variation in the slope with respect to x (and implicitly also in the intercept) among groups; 1|grp1/grp2 means variation in the intercept among levels of grp1 and among levels of grp2 nested within each level of grp1.
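In lme4-style model formulas these become, for example (a sketch; grp, grp1, and grp2 are hypothetical grouping factors):

```r
y ~ x + (1 | grp)           ## intercept varies among groups
y ~ x + (x | grp)           ## slope (and intercept) vary among groups
y ~ x + (1 | grp1/grp2)     ## intercept varies among grp1 and grp2-within-grp1
```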
- parameterization and contrasts
Parameters represent differences from the baseline level of a factor (the alphabetically first level by default), with interactions parameterized accordingly: e.g. in y~f1*f2, the f1B parameter represents the expected difference between f1=B and f1=A when f2 is at its baseline level (f2=A). Use the contrasts argument or options(contrasts=c("contr.sum","contr.poly")) to change to sum-to-zero contrasts (usually necessary for interpreting “type III”, marginality-violating tests); see the sketch below.
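A quick way to see the parameterization is to inspect the design matrix directly; the two-factor data below are hypothetical.

```r
## treatment (baseline) contrasts vs. sum-to-zero contrasts
d <- expand.grid(f1 = c("A", "B"), f2 = c("A", "B"))
model.matrix(~ f1 * f2, data = d)     ## default: differences from the baseline level
model.matrix(~ f1 * f2, data = d,
             contrasts.arg = list(f1 = "contr.sum", f2 = "contr.sum"))
```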
GLM reminders
- families: binomial, poisson, etc. (a family specifies the mean-variance relationship and a default link function)
- link functions: specifies linearizing transformation, usually (not always) prevents impossible predictions (such as prob<0 or prob>1). Defaults: Poisson=log, binomial=logit. Can be tweaked in special cases (e.g. binomial with probit or log link).
- The inverse link function is usually easier to understand than the link itself (e.g. the exponential for the log link, the logistic for the logit link).
- The model is linear on the link-function (“linear predictor”) scale, so interpret parameters as changes in the expected value on the link scale (e.g. the slope of a logistic regression is the change in log-odds per unit change in the predictor). We often back-transform for interpretability (see the sketch after this list).
- offsets: terms added directly to the linear predictor without an estimated parameter. Most often used to adjust for differences in sampling effort (time or area) or exposure: e.g. in a Poisson model with a log link, adding log(effort) as an offset models counts per unit effort.
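A short sketch tying these pieces together, with simulated and purely illustrative data: interpreting a logistic-regression slope on the link scale, and using an offset for sampling effort in a Poisson model.

```r
## hypothetical data: logistic-regression slope and Poisson offset
set.seed(101)
x <- runif(200); effort <- runif(200, 1, 10)
y_bin   <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))
y_count <- rpois(200, lambda = effort * exp(0.5 + x))

fit_bin <- glm(y_bin ~ x, family = binomial)
coef(fit_bin)["x"]        ## change in log-odds per unit change in x
exp(coef(fit_bin)["x"])   ## back-transformed: multiplicative change in odds

fit_count <- glm(y_count ~ x + offset(log(effort)), family = poisson)
coef(fit_count)           ## interpreted on the log(counts per unit effort) scale
```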
References