These are worked examples for a book chapter on mixed models in Ecological Statistics: Contemporary Theory and Application editors Negrete, Sosa, and Fox (available from the Oxford University Press catalog or from Amazon.com or Powell’s Books or …).

It may move or be renamed eventually, but for right now the source (.rmd) file and data files are available from http://www.math.mcmaster.ca/bolker/R/misc/foxchapter:

There’s a lot of material here. I have erred on the side of including things, and on the side of compact rather than elementary code. Try not to be overwhelmed, just skim it the first time and thereafter focus on the parts that are most relevant to your analyses.

I use a very large set of packages in this example, because I am trying out a wide range of approaches. You should whittle these down to just the ones you need for any given analysis … if you want to install all of them at once, the following code should be sufficient (remember that you only have to install packages one time on each computer you are using, but you have to use library() to load them in every new R session). The packages noted below as “not on CRAN” (glmmADMB, coefplot2) are hosted on R-forge (http://r-forge.r-project.org), but in order to install them you may have to install from http://www.math.mcmaster.ca/bolker/R, as below.

pkgs_CRAN <- c("lme4","MCMCglmm","blme",
               "pbkrtest","coda","aods3","bbmle","ggplot2",
               "reshape2","plyr","numDeriv","Hmisc",
               "plotMCMC","gridExtra","R2admb")
install.packages(pkgs_CRAN)
rr <- "http://www.math.mcmaster.ca/bolker/R"
install.packages(c("glmmADMB","coefplot2"),type="source",
                 repos=rr)
## primary GLMM-fitting packages:
library("lme4")
library("glmmADMB")      ## (not on CRAN: see below)
library("MCMCglmm")
library("blme")
library("MASS")          ## for glmmPQL (base R)
library("nlme")          ## for intervals(), tundra example (base R)
## auxiliary
library("ggplot2")       ## for pretty plots generally
## ggplot customization:
theme_set(theme_bw())
library("grid")          ## for unit() (base R)
zmargin <- theme(panel.margin=unit(0,"lines")) ## to squash facets together ...
library("scales")        ## for squish()
library("gridExtra")     ## for grid.arrange()
if (packageVersion("ggplot2")<="1.0.1")
    library("proto")         ## for horizontal line range plot
library("coefplot2") ## coefficient plots (not on CRAN)
library("coda")      ## MCMC diagnostics
library("aods3")     ## overdispersion diagnostics
library("plotMCMC") ## pretty plots from MCMC fits
library("bbmle")     ## AICtab
library("pbkrtest")  ## parametric bootstrap
library("Hmisc")
## for general-purpose reshaping and data manipulation:
library("reshape2")
library("plyr")
## for illustrating effects of observation-level variance in binary data:
library("numDeriv")

You’ll need to source() one bit of external code, which you can download from the same place you got the data files or from my web site:

source("geom-linerangeh.R")  ## for horizontal line ranges

Package versions used (core packages only):

##      lme4  glmmADMB  MCMCglmm      blme  pbkrtest coefplot2      coda 
##     1.1.9     0.8.0      2.21     1.0.4     0.4.2   0.1.3.2    0.17.1 
##     aods3     bbmle 
##     0.4.1    1.0.18

As of December 2014, the released (CRAN) version of lme4 is 1.1-7; that should be sufficient (version 1.1-9 does slightly better on some of the confidence interval calculations below, providing finite instead of NA values).

If you are using Windows and have problems using glmmADMB on some problems, there is a (slightly tedious) workaround:

library("glmmADMB")
bb <- glmmADMB:::get_bin_loc()[["bin_loc"]]
bpath <- gsub("glmmadmb$","",bb)
file.copy(bb,paste0(bpath,"glmmadmb.bak"))
bburl <- "http://admb-project.org/buildbot/glmmadmb/"
download.file(paste0(bburl,
   "glmmadmb-mingw64-r2885-windows8-mingw64.exe"), dest=bb)

Tundra carbon

These data were originally analyzed in Belshe et al. 2013 “Tundra ecosystems observed to be CO\(_2\) sources due to differential amplification of the carbon cycle” Ecology Letters 16 (10), 1307-1315 (doi: 10.1111/ele.12164).

Data exploration

Read the data:

mc1 <- read.csv("tundra.csv",na.strings=c("-","NA"))

A first look at the data, plotting net ecosystem exchange during the growing season (GS.NEE) against year, using colour to distinguish sites, and superimposing a separate linear regresssion (with confidence bands) per site:

ggplot(mc1,aes(x=Year,y=GS.NEE,colour=Site))+geom_point()+
    geom_smooth(method="lm",alpha=0.3)+
    ## oob=squish retains the (truncated) confidence bands;
    ## otherwise bands that went outside the limits would disappear
    ## (requires the 'scales' package be loaded)
    scale_y_continuous(limits=c(-150,400),oob=squish)+
    scale_colour_discrete(guide="none") ## suppress legend

We have suppressed the legend for the colours, because there are a lot of sites and their names will not be particularly meaningful except to tundra biologists. There is lots of variability among sites, both in the trend and in the uncertainty of the trend. The uncertainty is caused in part by noisiness of the data, and part by sparsity/shortness of the time series for individual sites.

In some cases there were multiple observations per site in a single year. We could have handled this by adding a year-within-site random variable, but it proved to be easier to aggregate these cases by calculating the mean, and account for the different amount of sampling by weighting site-year combinations according to the number of samples.

The following (rather complicated) function aggregates the data; in the course of our analysis we ended up needing to aggregate several different response variables according to several different sets of grouping variables. We re-centred the year to have year=0 at the beginning of the observation period.

aggfun <- function(dat,agg=c("Year","Site"),response="Winter.adj",
                   baseYear=min(mc1$Year)) {
    ## select only site, year, and response
    sub1 <- na.omit(dat[,c("Site","Year",response)])
    ## compute means of non-aggregation variables
    agg1 <- aggregate(sub1[!names(sub1) %in% agg],by=sub1[agg],FUN=mean)
    ## compute sample size of non-aggregation variables
    aggn <- aggregate(sub1[response],by=sub1[agg],FUN=length)
    names(aggn)[ncol(aggn)] <- "n"   ## assumes response is last column
    ## put mean and sample size together
    agg2 <- merge(agg1,aggn)
    ## recompute centred year
    agg2$cYear <- agg2$Year - baseYear
    agg2
}
mc2 <- aggfun(mc1,response="GS.NEE")
## center year at the mean rather than the date of
## the first observation:
mc2B <- aggfun(mc1,response="GS.NEE",baseYear=mean(mc1$Year))

The aggregated data show a similar picture:

ggplot(mc2,aes(x=cYear,y=GS.NEE,colour=Site))+
    geom_point(aes(size=n),alpha=0.7)+
    geom_smooth(method="lm",alpha=0.3,aes(weight=n))+
    scale_y_continuous(limits=c(-150,400),oob=squish)+
    scale_colour_discrete(guide="none")+
    scale_size_continuous(range=c(2,5),breaks=c(1,3,6))

Fitting

We use nlme::lme because at present it is the only easy way to allow for temporal autocorrelation in a LMM in R.

  • we use corCAR1, which implements a continuous-time first-order autocorrelation model (i.e. autocorrelation declines exponentially with time), because we have missing values in the data. The more standard discrete-time autocorrelation models (lme offers corAR1 for a first-order model and corARMA for a more general model) don’t work with missing data.
  • The weights=varFixed(~I(1/n)) specifies that the residual variance for each (aggregated) data point is inversely proportional to the number of samples.
  • The control argument lets the model try more iterations (otherwise we get an error).
cmod_lme <- lme(GS.NEE ~ cYear,
                data=mc2, method="REML",
                random = ~ 1 + cYear | Site, 
                correlation=corCAR1(form=~cYear|Site),
                weights=varFixed(~I(1/n)),
                control=list(maxIter=10000, niterEM=10000))

Model summary:

summary(cmod_lme)
## Linear mixed-effects model fit by REML
##  Data: mc2 
##       AIC      BIC   logLik
##   879.754 896.4282 -432.877
## 
## Random effects:
##  Formula: ~1 + cYear | Site
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev     Corr  
## (Intercept) 147.904084 (Intr)
## cYear         4.231723 -1    
## Residual     60.062783       
## 
## Correlation Structure: Continuous AR(1)
##  Formula: ~cYear | Site 
##  Parameter estimate(s):
##       Phi 
## 0.4039311 
## Variance function:
##  Structure: fixed weights
##  Formula: ~I(1/n) 
## Fixed effects: GS.NEE ~ cYear 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 108.88882  48.02267 57  2.267446  0.0272
## cYear        -3.85221   1.37468 57 -2.802256  0.0069
##  Correlation: 
##       (Intr)
## cYear -0.987
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -2.7595981 -0.4217636 -0.1201307  0.2685378  3.0854580 
## 
## Number of Observations: 82
## Number of Groups: 24

The results generally look sensible: the only warning sign is that the among-site variation in baseline NEE ((Intercept)) and the among-site variation in slope are perfectly correlated (i.e., the -1 term in the Corr column under Random effects). We weren’t happy with this, but we kept the full random effects model anyway. The alternative, dropping the random effect of year, seemed unacceptable in our case. Recentring the data at the mean year (\(\approx\) 1997) improves things slightly:

cmod2_lme <- update(cmod_lme,data=mc2B)

VarCorr() or summary() show that the intercept and slope random effects are still very strongly, but not perfectly, correlated (\(\rho=0.973\)); the fixed-effect intercept is very different (of course), and the year effect is almost identical, but its standard error is larger (so its \(p\)-value doubles).

##              Value Std.Error DF t-value p-value  
## (Intercept) -17.42      7.77 57   -2.24   0.029 *
## cYear        -3.84      1.51 57   -2.55   0.014 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also fit the model with lmer from the lme4 package: it’s faster and allows for crossed random effects (neither of which really matters here), but unfortunately it can’t incorporate temporal autocorrelation in the model:

cmod_lmer <- lmer(GS.NEE ~ cYear + (1+cYear|Site),
                data=mc2B, REML=TRUE,
                weights=n)
summary(cmod_lmer)
## Linear mixed model fit by REML ['lmerMod']
## Formula: GS.NEE ~ cYear + (1 + cYear | Site)
##    Data: mc2B
## Weights: n
## 
## REML criterion at convergence: 874.2
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -2.90210 -0.35039 -0.07972  0.30152  2.93146 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  Site     (Intercept)  116.02  10.771        
##           cYear         19.95   4.466   -1.00
##  Residual             3355.31  57.925        
## Number of obs: 82, groups:  Site, 24
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -16.296      7.338  -2.221
## cYear         -3.745      1.341  -2.792
## 
## Correlation of Fixed Effects:
##       (Intr)
## cYear -0.417

Note that weights here are specified as n rather than 1/n (varFixed() in the lme call specifies the variance, rather than the actual weights of different observations)

The results are not too different – the cYear fixed-effect slope is slightly smaller than for the lme fit (-3.75 vs. -3.84 (g C m^2 /year/season)/year), but the standard error is smaller, so the \(t\)-statistic is similar (-2.79 vs. -2.55).

We have to admit that the model we are using is just a little bit more complex than the data can easily handle, and especially that Toolik is quite different from the other sites … making the estimates somewhat unstable.

Diagnostics

Here are the default residuals-vs.-fitted plot; the scale-location plot (\(\sqrt{|\textrm{residual}|}\) vs. fitted); a plot of the residuals as a function of year; and the Q-Q plot.

colvec <- c("#ff1111","#007eff") ## second colour matches lattice default
grid.arrange(plot(cmod2_lme,type=c("p","smooth")),
             plot(cmod2_lme,sqrt(abs(resid(.)))~fitted(.),
                  col=ifelse(mc2$Site=="Toolik, AK",colvec[1],colvec[2]),
                  type=c("p","smooth"),ylab=expression(sqrt(abs(resid)))),
                                       ## "sqrt(abs(resid(x)))"),
             plot(cmod2_lme,resid(.,type="pearson")~cYear,
                  type=c("p","smooth")),
             qqnorm(cmod2_lme,abline=c(0,1),
                    col=ifelse(mc2$Site=="Toolik, AK",colvec[1],colvec[2])))