\documentclass{article} \title{Generalized linear models} \usepackage{hyperref} \usepackage{sober} \usepackage{url} \usepackage[english]{babel} %% to work around texi2dvi ~ bug \usepackage[utf8]{inputenc} %% to make fancy quotes etc. work \usepackage{fancyvrb} %% for \verb, chunks in footnotes \VerbatimFootnotes \usepackage{amsmath} %% boldmath etc. \usepackage{natbib} %% references \bibliographystyle{chicago} \title{Generalized linear models for disease ecologists} \author{Ben Bolker} \date{\today} \newcommand{\code}[1]{{\tt #1}} \newcommand{\fixme}[1]{{\textbf{FIXME: #1}}} \newcommand{\ex}[1]{{\textbf{Exercise: }}} \newcommand{\pkglink}[1]{\href{http://cran.r-project.org/web/packages/#1}{\nolinkurl{#1}}} \newcommand{\rflink}[1]{\href{https://r-forge.r-project.org/projects/#1/}{\nolinkurl{#1}}} \begin{document} \maketitle \includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png} \begin{minipage}[b]{3in} {\tiny Licensed under the Creative Commons attribution-noncommercial license (\url{http://creativecommons.org/licenses/by-nc/3.0/}). Please share \& remix noncommercially, mentioning its origin.} \end{minipage} %% Sweave/knitr options %% default size as created by R, and output size, center alignment %% leave code alignment/breaks as they are \SweaveOpts{ fig.width=4,fig.height=4,out.width="\\textwidth", fig.align="center",tidy=FALSE,echo=TRUE,message=FALSE,warning=FALSE} <>= ## stuff for making pretty PDFs -- **ignore** if running ## interactively ... if (require(knitr)) { ## require(highlight) knit_hooks$set( fig=function(before, options, envir) { if (before) { ## set base-graphics options par(bty="l",las=1) } else { } }, ## from Yihui Xie: https://gist.github.com/1805862 ## allow printing of external code chunks, highlighted, with line numbers printfun <- function(before, options, envir) { if (before) return() txt <- capture.output(dump(options$printfun, '')) ## reformat if tidy=TRUE if (options$tidy) txt = formatR::tidy.source(text = txt, output = FALSE)$text.tidy con = textConnection(txt) on.exit(close(con)) out = capture.output({ highlight::highlight(con, renderer = highlight::renderer_latex(document= FALSE), showPrompts = options$prompt, size = options$size,show_line_numbers=TRUE) }) paste(out, collapse = '\n') }) ## set null device so graphics device doesn't keep popping up on the ## screen when knitting olddev <- options(device = function(...) { .Call("R_GD_nullDevice", PACKAGE = "grDevices") }) } library(ggplot2) theme_update(theme_bw()) library(grid) zmargin <- opts(panel.margin=unit(0,"lines")) @ \section{Introduction} You will need the \code{bbmle}, \code{ggplot2}, and \code{coefplot2} packages installed, and optionally \code{glmmML}, \code{lme4}, and \code{glmmADMB}. \code{coefplot2} and \code{glmmADMB} need to be installed via <>= install.packages("coefplot2",repos="http://r-forge.r-project.org") install.packages("glmmADMB",repos="http://r-forge.r-project.org") @ \paragraph{Why linear models?} Stats 101: everything is normally distributed (and the residuals have constant variance). If the response variable depends on a \emph{continuous} (as opposed to \emph{categorical}) predictor variable, then the relationship is linear. Linear models are: \begin{itemize} \item Fast and numerically stable (works for huge and/or wonky data sets) \item No need for starting values \item Easy to account for \emph{random effects} (experimental blocks etc.) \item Lots of data (especially in economics, business, etc.) is reasonably normal \item Transformations can often fix problems with heteroscedasticity (non-constant variance)/non-normality/non-linearity \item Assuming normality (equivalently \emph{least-squares} solutions) is \\ usually OK \emph{asymptotically} (large data sets) \end{itemize} \paragraph{Why not?} \begin{itemize} \item More specific models are more efficient/powerful \item Would like a model that reflects the data better \item Some data (discrete, zero-rich) are resistant to transformation \item Understanding \emph{effect sizes} on transformed scales is hard \item Linear relationships often force out-of-bounds predictions (proportions outside $\{0,1\}$; negative counts) \item If you need arguments in favor of GLMs: \cite{ohara_not_2010,warton_arcsine_2011} \end{itemize} Generalized linear models (GLMs) allow (some) non-normal data (e.g. Poisson, binomial) and (some) nonlinear relationships (e.g. exponential, logistic curves). Retain advantages of linear models (fast, stable, usually don't need starting values \ldots) \paragraph{Basic definition} Need to specify \begin{itemize} \item distribution (\emph{family}) (e.g. Poisson=count data, binomial=proportion [count!] data) \item \emph{link function} (e.g. log, logit): linearizing transformation (don't actually transform data) \item response variable and \emph{continuous} and \emph{categorical} predictors (R formulas: \verb+r~x+, \verb!r~x+y! (additive model), \verb!r~x*y! (interaction)) \end{itemize} Model is linear on scale of link function. Almost all GLMs are logistic regressions. <>= library(stringr) sscrape <- function(string="logistic+regression") { sstring0 <- "http://scholar.google.ca/scholar?as_q=&num=10&btnG=Search+Scholar&as_epq=STRING&as_oq=&as_eq=&as_occt=any&as_sauthors=&as_publication=&as_ylo=&as_yhi=&as_sdt=1.&as_sdtp=on&as_sdts=5&hl=en" sstring <- sub("STRING",string,sstring0) rr <- readLines(url(sstring)) ## rr2 <- rr[grep("[Rr]esults",rr)[1]] rr2 <- rr rstr <- gsub(",","", gsub(".+$","", gsub("^.+[Rr]esults.+of about ","",rr2))) rstr <- str_extract(rr2,"About [0-9,]+ results") as.numeric())) } @ <>= fn <- "gscrape.RData" ## could use a caching solution for Sweave (cacheSweave, weaver package, ## pgfSweave ... but they're all slightly wonky with keep.source at ## the moment if (!file.exists(fn)) { gscrape <- sapply(c("generalized+linear+model", "logistic+regression","Poisson+regression","binomial+regression"),sscrape) save("gscrape",file=fn) } else load(fn) @ These data were scraped from Google Scholar hits on the relevant search terms. <>= d <- data.frame(n=names(gscrape),v=gscrape) d$n <- reorder(d$n,d$v) ggplot(d,aes(x=v,y=n))+geom_point(size=5)+ xlim(0.5e4,2e6)+ scale_x_log10(limits=c(1e4,2e6))+ geom_text(aes(label=v),colour="red",vjust=2)+ labs(y="",x="Google Scholar hits") @ \begin{itemize} \item logistic regression: \emph{logit} link (logistic inverse-link) \item Poisson regression: \emph{log} link (exponential inverse-link) \end{itemize} \section{Logistic regression} \subsection{Preliminaries} Binary (or maybe binomial) data. Logit link (others are possible but usually impossible to distinguish based on data). Read in the data and look at it: <>= dat <- read.table("gophertortoise.txt",header=TRUE) dat2 <- droplevels(subset(dat,Sex %in% c("M","F"))) (gplot1 <- ggplot(dat2,aes(size,status,colour=Sex))+ stat_sum(alpha=0.5)) @ Visualizing binary data is tricky --- here's one way: <>= dat2$sizecat <- cut(dat2$size,breaks=c(seq(200,280,by=20),320)) proptab <- ddply(dat2,c("sizecat","Sex"), function(x) { data.frame(n=nrow(x), size=mean(x$size), status=mean(x$status)) }) gplot1 + geom_point(data=proptab,aes(size=n),shape=2) @ \subsection{Picture and basic fit} We can quickly get \code{ggplot} to add a GLM fit to the \code{gplot1} graph by adding \verb+geom_smooth(method="glm",family=binomial)+ (try it!). However, this is only convenient for looking at pictures (not for testing hypotheses, finding good predictive models, etc.). Try a basic \code{glm} fit: <<>>= (mod1 <- glm(status~size*Sex,family=binomial,data=dat2)) @ \subsection{Single-model methods} What does R know how to do with \code{mod1}? Try this: <>= class(mod1) methods(class="glm") @ <>= coef(mod1) ## coefficients summary(mod1) ## various summary info coef(summary(mod1)) ## just the coefficient table with p-values etc. confint(mod1) ## profile confidence intervals fitted(mod1) ## fitted values predict(mod1) ## predictions ON LINEAR PREDICTOR scale ## ... on response scale (probabilities) predict(mod1,type="response") pframe <- data.frame(size=250,Sex="M") predict(mod1,newdata=pframe) ## predictions for new data ## simulated data from the fitted model simulate(mod1) residuals(mod1) ## residuals(deviance) plot(mod1) ## these are terrible! AIC(mod1) deviance(mod1) ## -2 log L logLik(mod1) ## SEQUENTIAL analysis of deviance anova(mod1,test="Chisq") drop1(mod1,test="Chisq") ## drop single terms and test ## marginal tests (danger Will Robinson!) drop1(mod1,test="Chisq",scope=.~.) @ Notes and pitfalls: \begin{itemize} \item the $p$-values given by \code{summary} etc. are approximate (Wald tests), those from \code{drop1}/\code{anova} (likelihood ratio tests) are better (although still approximate) \item \code{predict} gives predictions on the linear predictor (logit) scale by default (not probabilities): use \code{type="response"} for probabilities \item diagnosing lack of fit etc. is hard for binary data. One possibility is to compare the GLM fit with a non-parametric fit, as follows: <>= library(scales) gplot1+geom_smooth(method="glm",family="binomial")+ geom_smooth(linetype=2,fill="purple",alpha=0.1)+ scale_y_continuous(limits=c(-0.05,1.05),oob=squish) @ (The standard goodness-of-fit test for logistic regression, Hosmer-Lemeshow, has some problems \citep{hosmer_comparison_1997}; an improved version is available by using \code{lrm} in the \pkglink{rms} package and using \code{resid(f, 'gof')}) \item \code{anova} actually gives an analysis of \emph{deviance}, and it is \textbf{sequential} (``type I'' in SAS language) \item \verb+drop1(glm_fit,scope=.~.)+ does a marginal (``type III'') analysis, which has its own issues when there are interactions in the model (you need to be \emph{very} careful interpreting the meaning of the main effects) \end{itemize} \subsection{Interpreting parameters} What do the parameters mean??? People like me are always complaining that researchers should consider \emph{effect size}, the biological significance of the parameters, not just the statistical significance. In order to do this, you need to understand what the parameters mean. There are two hard parts of interpreting GLM parameters: (1) contrasts (how R parameterizes the differences between groups) (this issue is general to all modeling in R) and (2) the log-odds scale (this is specific to logistic regression). <<>>= coef(mod1) @ \begin{description} \item[intercept] ($\approx -10$): the logit probability (log-odds) of seropositivity for a female with size$=0$ (i.e., log-odds in the \emph{baseline} condition) \item[size] ($\approx 0.05$): increase in log-odds per unit (mm) of size, \emph{for females} (baseline) \item[SexM] ($\approx 6.6$): difference between females and males at size 0 \item[size:sexM] ($\approx -0.02$): difference between male and female slope, on the log-odds scale \end{description} <<>>= ## probability of infection of a female, size=0 plogis(-10.7) ## probability of infection of a female, size=100 plogis(-10.7+0.048*100) ## probability of infection of a male, size=0 plogis(-10.7+6.6) ## probability of infection of a male, size=100 plogis(-10.7+6.6+(0.048-0.023)*100) @ <>= dat2$csize <- dat2$size-250 mod1c <- update(mod1,.~csize*Sex) @ \ex{} Confirm for yourself that the \code{Intercept} and \code{SexM} terms have changed, but not the slopes. Sometimes it's more useful to scale the continuous predictors as well; one easy procedure is to center (subtract the mean) and divide by the standard deviation of the predictor, which puts all of the predictors on a similar scale \citep{schielzeth_simple_2010}. (The \code{scale} function in R also does this, but with some side effects we don't want.) <<>>= dat2$scsize <- with(dat2,(size-mean(size))/sd(size)) mod1sc <- update(mod1,.~scsize*Sex) @ Which of the following plots is more useful? <>= library(coefplot2) par(mfrow=c(1,2)) coefplot2(mod1) coefplot2(mod1sc) @ \ex{1}. Change the baseline level from females to males (the default is alphabetical) by using <>= dat2$Sex <- relevel(dat2$Sex,"M") @ Refit either the scaled or the unscaled model and convince yourself that you understand how the parameters have changed (and that the overall meaning of the model has not changed). \subsection{Comparing models} Most GLM inference etc. is based on comparing \emph{multiple} models rather than a single model (the likelihood ratio test. Allow for the interaction of age $\times$ sex: <<>>= mod2sc <- update(mod1sc,.~.-scsize:Sex) mod3sc <- update(mod1sc,.~scsize) mod4sc <- update(mod1sc,.~Sex) mod5sc <- update(mod1sc,.~1) anova(mod1sc,mod2sc,mod3sc,mod5sc,test="Chisq") @ (the $p$-values are the tests of each model against the next, i.e. significance tests of the parameters that are being dropped) Or: <<>>= library(bbmle) AICtab(mod1sc,mod2sc,mod3sc,mod4sc,weights=TRUE) @ Or: <>= coefplot2(list(mod1sc,mod2sc,mod3sc,mod4sc), legend=TRUE,legend.x="topright", col=c(1,2,4,5)) @ (I should probably have named these models more informatively!) Model comparison: <>= pframe <- expand.grid(scsize=seq(min(dat2$scsize),max(dat2$scsize),length=101), Sex=levels(dat2$Sex)) pframe$size <- pframe$scsize*sd(dat2$size)+mean(dat2$size) pframe$mod1sc <- predict(mod1sc,newdata=pframe,type="response") pframe$mod2sc <- predict(mod2sc,newdata=pframe,type="response") pframe$mod3sc <- predict(mod3sc,newdata=pframe,type="response") pframe$mod4sc <- predict(mod4sc,newdata=pframe,type="response") pframe$mod5sc <- predict(mod5sc,newdata=pframe,type="response") library(reshape2) pframe2 <- melt(pframe,id.var=1:3,value.name="status") ggplot(dat2,aes(x=size,y=status,colour=Sex))+stat_sum(alpha=0.5,aes(size=..n..), show_guide=FALSE)+ geom_line(data=pframe2)+facet_grid(.~variable)+zmargin+ scale_x_continuous(breaks=c(220,280)) @ \ex{}: convince yourself that the overall fit of the model (based on \code{logLik}) and the likelihood ratio test of the full model (comparing the full model with the null/constant model) do not change when you change the parameterization (either by scaling size, or by changing the baseline level). The logit/log-odds scale is initially hard to understand, but it has really useful properties for estimating changes in risk. The hardest part is that the effects of changes in a predictor on probability depending on the baseline risk. If you need to, you can use the \code{dredge} function from the \pkglink{MuMIn} package to fit all subsets of a model \ldots Some rules of thumb: \begin{itemize} \item When the starting probability is very low, the logistic curve is approximately exponential, so parameters approximately describe proportional changes (e.g. parameter of 0.1 $\approx$ 10\% (\emph{relative}) increase in probability per unit change) \item when the starting probability is intermediate (say 0.3-0.7), the \emph{absolute} change in probability per unit change is $r/4$ ($\to$ parameter of 0.1 implies 0.025 increase in probability per unit change) \item when the starting probability is high the change in complementary risk (probability of the event \emph{not} happening) changes proportionally \end{itemize} \subsection{Binomial regression} If you have $N>1$ per category, you can either specify the results \begin{itemize} \item as $(k,N-k)$ (e.g. \verb!cbind(num_dead,num_alive)~x+y+z!); you would often compute this on the fly, as \verb!cbind(num_dead,num_total-num_dead)~x+y+z!). \item with the \code{weights} specification, e.g. \verb!prop~x+y+z,weights=num_total! or \verb!num_dead/num_tot~x+y+z,weights=num_total! \end{itemize} \ex{}: aggregate the data by size class. <<>>= library(plyr) sizetab <- ddply(dat2,c("size","Sex"), function(x) c(tot=nrow(x),pos=sum(x$status))) sizetab$prop <- sizetab$pos/sizetab$tot @ use \code{sizetab} to run the analysis above as a binomial model. The parameters should stay the same, and the \emph{differences} in log-likelihood, deviance, and AIC among models should be the same, but the baseline values of log-likelihood etc. will differ. \section{Poisson/negative binomial regression} Gopher tortoise shell data. <<>>= load("gopherdat2.RData") head(Gdat) @ <>= ggplot(Gdat,aes(x=prev,y=shells/(Area*density), colour=factor(year),group=Site))+ geom_text(aes(label=Site))+geom_line(colour="gray") @ Fit a Poisson model (log link), with an \emph{offset} of area times density. An offset adds a term to the linear predictor, without fitting the parameters: if $P$=prevalence, $a$=area, $d$=density \begin{equation*} \begin{split} \log(\lambda) & = \beta_0+\beta_1 P + \log(a d) \\ \log(\lambda)- \log(ad) & = \beta_0+\beta_1 P \\ \log\left(\frac{\lambda}{ad}\right) & = \beta_0+\beta_1 P \end{split} \end{equation*} <<>>= m_pois_prev <- glm(shells~prev+offset(log(Area*density)),data=Gdat) @ A Poisson model assumes variance=mean, which is usually wrong. There is often \emph{overdispersion}: we can test for it by looking at the sum of squared Pearson residuals ($\sum (\mbox{expected}_i-\mbox{obs}_i)^2/\mbox{var}_i$, where $\mbox{var}_i$ is the expected variance of point $i$ based on the model), which should be (\emph{approximately}) $\sim \chi^2_{n-p}$ if the data are really Poisson: <>= devsq <- sum(residuals(m_pois_prev,type="pearson")^2) devsq/m_pois_prev$df.resid ## ratio should be approx. 1 pchisq(devsq,df=m_pois_prev$df.resid,lower.tail=FALSE) @ There are (at least) two ways to account for the overdispersion, by fitting a quasi-likelihood model <<>>= library(MASS) m_quasi_prev <- update(m_pois_prev,family="quasipoisson") m_nb_prev <- glm.nb(shells~prev+offset(log(Area*density)),data=Gdat) @ \ex{} compare these three models (Poisson, quasi-Poisson, negative binomial) looking at \code{coef(summary(fit))} and at \code{coefplot2(list(model1,model2,model3), xlim=c(-0.01,0.08))}. What do you conclude about the coefficient estimates? The \pkglink{aod} package incorporates a wider spectrum of methods and tests for dealing with overdispersion (which, oddly, is more widely recognized in Poisson than in binomial models). \section{Intermediate topics} \subsection{Offsets} Offsets add known components to the model (as shown above). They're most commonly used to account for unequal sampling areas/times, in cases where we expect results to be \emph{strictly proportional} to the offset (time, area sampled). This is often a way to handle ratios (e.g. sibling negotiations per chick in a study of begging behavior by owlets) without losing the discrete nature of the response. Can also be used in tricky ways, e.g. to fit the Ricker model. Suppose we think $N(t+1)=a N(t) e^{-b N(t)}$. On the log scale this is $$ \log N(t+1) = \log a + \log N(t) - b N(t) $$ This is a linear equation in $N(t)$ plus an offset of $\log N(t)$. In other words, if we use a log link then we can fit \verb!N ~ Nprev + offset(log(Nprev))! (The intercept term $\log a$ and the slope $-b$ are implicit in the R formula.) \ex{} What model would we fitting if we included $\log N$ as a variable rather than an offset? \subsection{Alternative link functions} Don't have to use the standard link functions. We could try to use them to get a slightly better fit to the shape of the nonlinearity, but I use them more often as a slightly tricky way to fit more \emph{mechanistic} models (see e.g. \cite{strong_model_1999}). Warning: you're more likely to run into convergence problems, be asked to specify starting values, etc. when using non-standard link functions. \begin{itemize} \item Holling type~II functional response via the \emph{inverse} link: if number eaten is $N_e=aN/(1+ahN)$, then risk of being eaten is $p=N_e/N= a/(1+ahN)$. If we use the inverse link then we are fitting $(1/p)=1/a+(1/h)\cdot N$ --- this linear in $N$ \ldots (we can even fit Holling type~III responses by doing a bit more algebra and including $1/N$ as a predictor) %% p = (aN^2/(1+ahN^2))/N = aN/(1+ahN^2) %% 1/p =(1/h)N + (1/a)(1/N) \item Chain binomial: suppose the probability of infection if $p=1-\exp(-\beta I_t)$. Then $1-p=\exp(-\beta I_t)$, or in other words the log of the probability of \emph{not} being infected is $-\beta I_t$. If we have chain-binomial data, then, we can fit <>= glm(prob_not_inf ~ previous_inf-1, weights=previous_susc, family=binomial(link="log")) @ <<>>= load("cbsim.RData") cbsim$Sprev <- c(NA,cbsim$S[-nrow(cbsim)]) cbsim$Iprev <- c(NA,cbsim$I[-nrow(cbsim)]) cbsim$pnotinf <- cbsim$S/cbsim$Sprev cbplot <- ggplot(cbsim,aes(pnotinf,Iprev))+geom_point() fit1 <- glm(pnotinf ~ Iprev -1 , family=binomial(link="log"), weights=Sprev,data=cbsim) confint(fit1) ## cbplot+geom_smooth(method="glm",family=binomial(link="log"), ## formula=~x-1) pp <- profile(fit1) likdat <- data.frame(likdev=pp$Iprev$z^2,Iprev=pp$Iprev$par.vals[,"Iprev"]) ggplot(likdat,aes(x=Iprev,y=likdev))+geom_line() @ <<>>= ## chain-binomial data from Becker: cbdat <- data.frame(chains=c(1,11,111,12,1111,112,121,13,11111, 1112,1121,113,1211,122,131,14), n=c(423,131,36,24,14,8,11,3,4,2,2,2,3,1,0,0)) ## total 664 ## Greenwood, no slope; Reed-Frost, no intercept ## log(m_{ij} q_{ij})= log(m_{ij})+alpha_i+beta_i j @ This is discussed by \cite{becker_analysis_1989}; you can see some of it on Google books at \url{http://tinyurl.com/chainbinom} \item Another useful case is the \emph{complementary log-log} (\code{cloglog}) link with binomial data. Suppose we measure the amount of mortality over differing exposure periods of length $\Delta t$. If we use the \code{cloglog} link with an offset of $\log(\Delta t)$ we get the right behavior. The \code{cloglog} function is $C(x)=\log(-\log(1-\mu))$, its inverse is $C^{-1}(x) = 1-\exp(-\exp(x))$. Thus if we expect mortality $\mu$ over a period $\Delta t=1$ and the linear predictor $\eta=C^{-1}(\mu)$ then $C^{-1}(\eta+\log \Delta t)=(1-\exp(-\exp(\eta) \cdot \Delta t))$, which is what we want. \end{itemize} \subsection{Bias-reduced GLM} \emph{Separation of variables} refers to the situation where some threshold value of predictor(s) can perfectly separate the 0 and 1 responses --- in this case the maximum likelihood solution doesn't exist and \code{glm} breaks. The \pkglink{brglm} (bias-reduced GLM) and \pkglink{logistf} (Firth logistic) packages implement a specific solution to this problem; more generally, it tends to provide more reliable results for very small data sets (people tend to forget that GLM relies, more heavily than regular linear models, on asymptotic assumptions) --- but it currently works only for binomial models. Another option is to use Bayesian GLMs (e.g. \code{bayesglm} from the \pkglink{arm} package); adding even a weak Bayesian prior can stabilize the fit. \subsection{Generalized linear mixed models} This can be a fairly hairy topic in general (see \cite{bolker_generalized_2009}, and \url{http://glmm.wikidot.com/faq}, but at root the concepts are fairly simple: here we add a \emph{random effect} of tortoise ID to the model. What is a random effect? There are actually several possible, overlapping answers: \begin{itemize} \item Effects that are drawn (randomly?) from a larger population of possible effects \item Effects where we are interested in the distribution of the levels (variance among levels), rather than the values of specific levels \item Effects that are ``nuisance'' aspects of the experimental design \item Effects that we want to estimate with \emph{shrinkage}, i.e. pulling poorly estimated values toward the population average \end{itemize} In general fitting as random effects works best when we have many levels with small and uneven amounts of data per level; it works very poorly when there are fewer than 4--6 levels. Typical examples: experimental blocks (spatial or temporal); genotypes or individuals within populations; taxa (species, genera) within higher-level taxa (genera, families). The interactions of fixed and random effects get treated as random; for example, among-pond variation in trends over time (around the population-level average trend). <<>>= library(lme4) (mod1mix <- glmer(status~scsize*Sex+(1|TortID),family=binomial,data=dat2)) library(glmmML) (mod1mixB <- glmmML(status~scsize*Sex,cluster=TortID,family=binomial,data=dat2, method="ghq")) @ <>= coefplot2(list(GLM=mod1sc,GLMM=mod1mix,GLMM_AGQ=mod1mixB), legend=TRUE,legend.x="topright", col=c(1,2,4)) @ Things to keep in mind: \begin{itemize} \item getting reliable $p$ values etc. is hard, unless you have so many groups in your random effect ($>40$) that the finite-size correction doesn't matter \item it's easy to put make a model that's too complex to fit reliably. In the example above we have to use the more \item there are a variety of packages that can fit GLMMs, with overlapping capabilities (\pkglink{lme4}, \pkglink{glmmML}, \rflink{glmmADMB}, \pkglink{MCMCglmm}) --- if possible, it's good to fit difficult models with more than one package, to cross-check \end{itemize} \section{Top 10 GLM mistakes} \begin{itemize} \item applying discrete models (Poisson, binomial) to non-discrete data \item ignoring overdispersion \item equating negative binomial with binomial rather than Poisson \item using GLMs where linear models will do (i.e. \code{glm} instead of \code{lm}) (harmless but annoying) \item ignoring blocking factors (failing to use GLMMs where necessary) \item confusion in interpreting effects \item worrying about marginal rather than conditional distributions of data$^*$ \item applying $\pm$ standard errors \item using $(k,N)$ rather than $(k,N-k)$ in binomial models \item getting confused by predictions on the linear predictor scale \end{itemize} \section{Topics left out} \begin{itemize} \item Zero-inflated/hurdle models (\pkglink{pscl}, \pkglink{tweedie}, \pkglink{glmmADMB}); \item generalized additive models (\pkglink{mgcv}); \item penalized regressions and shrinkage methods (\pkglink{glmnet}, \pkglink{penalized}); \item polytomous/ordinal data; \item spatial/temporal/phylogenetic correlations; \item GLMs on \textbf{big} data (\pkglink{biglm}) \ldots \end{itemize} See also slides at \url{http://www.slideshare.net/bbolker/}, in particular \url{http://www.slideshare.net/bbolker/glms-and-extensions-in-r}, \url{http://www.slideshare.net/bbolker/trondheim-glmm}, \url{http://www.slideshare.net/bbolker/opensource-glmm-tools-7562082} \bibliography{../eeid} <>= options(device=olddev$device) @ \end{document}