--- title: "Model diagnostics example" author: Ben Bolker date: "`r format(Sys.time(), '%H:%M %d %B %Y')`" output: html_document bibliography: analysis.bib --- In this exercise we'll fit a simple model and apply a variety of model diagnostics to it. Some may be very familiar ... ```{r opts,echo=FALSE} library("knitr") ## ignore: knitr options opts_chunk$set(tidy=FALSE,fig.width=5,fig.height=5) knit_hooks$set(basefig=function(before, options, envir) { if (before) { par(bty="l",las=1) } else { } }) ``` ```{r pkgs,message=FALSE,warning=FALSE} library(arm) ## for sim() library(descr) ## for LogRegR2 require("reshape2") ## graphics prettiness library("ggplot2") theme_set(theme_bw()) library("grid") zmargin <- theme(panel.margin=unit(0,"lines")) ``` Data on lizard perching behaviour, from the `brglm` package (and before that from McCullagh and Nelder [@mccullagh_generalized_1989], ultimately from Schoener [-@schoener_nonsynchronous_1970]). ```{r liz0,echo=FALSE,message=FALSE} if (!file.exists("data/lizards.csv")) { require("brglm") data(lizards) ## compute total samples and fraction grahami for each observation lizards <- transform(lizards,N=grahami+opalinus, gfrac=grahami/(grahami+opalinus)) write.csv(lizards,file="data/lizards.csv") } ``` ```{r fakelizards} lizards <- read.csv("data/lizards.csv") ## adjust factor levels to a sensible order lizards$time <- factor(lizards$time, levels=c("early","midday","late")) ``` A quick look at the data: response is fraction of *Anolis grahami* lizards found on perches in particular conditions. Plot univariate responses: ```{r firstlook,echo=FALSE,message=FALSE,warning=FALSE,fig.height=4} mvars <- c("height","diameter","light","time") ## reshape data: 4 copies of each observation, one for each predictor variable mliz <- suppressWarnings(melt(lizards,id.vars="gfrac", measure.vars=mvars)) ggplot(mliz,aes(x=value,y=gfrac))+ geom_boxplot(fill="lightgray")+ facet_wrap(~variable,scale="free_x",nrow=1)+ zmargin ``` A more conventional plot: ```{r lizfig2} (g1 <- ggplot(lizards, aes(x=time,y=gfrac,colour=height))+ geom_point(aes(size=N))+ geom_line(aes(group=height))+ facet_grid(diameter~light,labeller=label_both)+zmargin) ``` Fit a basic (additive: no interactions) binomial GLM: ```{r glmfit1} m1 <- glm(gfrac~time+height+light+diameter, weights=N, family="binomial", data=lizards) ``` Standard R diagnostic plots (fitted vs. residual, scale-location, Q-Q, influence): ```{r glmdiag1,fig.height=6,fig.width=6} op <- par(mfrow=c(2,2)) ## 2x2 subplot grid plot(m1) par(op) ## restore original parameters ``` An improved Q-Q plot, from Augustin et al. [-@augustin_quantil_2012] by way of the `mgcv` package: ```{r glmdiag2,message=FALSE,basefig=TRUE} library(mgcv) qq.gam(m1,pch=16) ``` Check for overdispersion: ```{r overdisp} resid.ssq <- sum(residuals(m1,type="pearson")^2) ## sum of squares of Pearson resids resid.df <- nrow(lizards)-length(coef(m1)) ## estimated resid df (N-p) resid.ssq/resid.df ## ratio should be approx 1 ``` Not overdispersed, apparently. Compute various pseudo-$R^2$ measures: ```{r sumstats} library(descr) LogRegR2(m1) ``` Use `fortify(model_fit)` to add the standard diagnostics (fitted values, residuals, standardized residuals, ...) to the data from a model ```{r fortify} m1F <- fortify(m1) ggplot(m1F,aes(x=time,y=.resid))+geom_boxplot()+ geom_point(size=3,alpha=0.5)+ facet_grid(diameter~light)+zmargin ``` Uh-oh ... Other tests of distribution are a bit harder. Or we can plot predicted values. 
Or we can plot predicted values:

```{r pred}
lPred <- predict(m1,se.fit=TRUE)  ## gives predictions on the logit scale
## compute back-transformed predictions and confidence intervals
lizardsX <- transform(lizards,
                      pred=plogis(lPred$fit),
                      lwr=plogis(lPred$fit-2*lPred$se.fit),
                      upr=plogis(lPred$fit+2*lPred$se.fit))
## add predictions/CIs to the previous plot
g1 + geom_pointrange(data=lizardsX,shape=2,
                     aes(y=pred,ymin=lwr,ymax=upr))
```

When you have continuous predictors or more complicated/unbalanced situations you will often want to construct your own data frame for predictions. For example, to get predictions for all light:time combinations for a specified diameter and height category:

```{r predframe,eval=FALSE}
predframe <- with(lizards,
                  expand.grid(light=levels(light),
                              time=levels(time),
                              diameter="<=2in",
                              height="<5ft"))
predict(m1,newdata=predframe)
```

Watch for warning signs of two problems:

* *complete separation*: all-zero or all-one responses in some categories
  (remedies include bias-reduced regression via `logistf` or `brglm`,
  regularization via Bayesian priors (`arm::bayesglm`), or other approaches)
* failure of the Wald approximation (*Hauck-Donner effect*, @hauck_walds_1977)

## Posterior predictive simulation

* Pick some value (a *probe*) of interest

```{r postpred}
betasim <- sim(m1,n.sims=500)  ## 500 draws from the approximate distribution of the coefficients
X <- model.matrix(m1)
simpreds <- plogis(X %*% t(coef(betasim)))  ## predicted probabilities: n.obs x n.sims
subset(lizards,gfrac==1.0)                  ## which observations have gfrac exactly 1?
dim(simpreds)
hist(simpreds[4,],breaks=50,col="gray",     ## simulated predictions for observation 4
     main="Posterior pred sim")
```

## Bootstrapping

```{r loadboot}
bootfun <- function() {
    ## resample rows *with* replacement
    bsamp <- sample(nrow(lizards),size=nrow(lizards),replace=TRUE)
    bmodel <- update(m1,data=lizards[bsamp,])
    ## predict back onto the original data so that rows line up across replicates
    predict(bmodel,newdata=lizards,type="response")
}
bootpred <- replicate(500,bootfun())
hist(bootpred[4,],breaks=20,col="gray")
```

## Cross-validation

```{r boot,message=FALSE}
library(boot)
```

We need to define a *cost function* `cost(observed, fitted)` for `cv.glm()`; the default is the average squared error.

```{r cv1}
cost <- function(r, pi=0) mean(abs(r-pi))  ## use mean absolute deviation instead
cv1 <- cv.glm(lizards,m1,cost=cost)
str(cv1)
cv1$delta
```

Or do it by hand:

```{r cv2}
cverr <- numeric(nrow(lizards))
for (i in 1:nrow(lizards)) {
    cvdata <- lizards[-i,]             ## drop observation i
    cvmodel <- update(m1,data=cvdata)  ## refit without it
    predval <- predict(cvmodel,newdata=lizards[i,],
                       type="response")
    cverr[i] <- cost(lizards$gfrac[i],predval)  ## error for the held-out observation
}
hist(cverr,breaks=10,col="gray")
mean(cverr)
```

This is *leave-one-out* cross-validation; $K$-fold is usually better (and for $K$-fold it is probably worth using `cv.glm(...,K=...)` rather than coding it by hand).

## Notes

* Problems in one area (outliers, non-linearity, etc.) can show up in other diagnostics as well, so don't over-interpret any single plot
* Transforming the data is generally the best solution for continuous data; it isn't really possible for non-continuous data, and is difficult for non-positive data
* Transforming the data helps with skew, not with fat tails -- fat tails will still give unbiased estimates, but poor confidence intervals; for that you need *robust* statistics (`library("robust")`)
* You can't always fix everything, so you need to prioritize
* See also: [@wickham_graphical_2010]

## Exercises

* Change the model to incorporate two-way interactions (`m2 <- update(m1,.~.^2)`) and see whether that seems to fix any of the problems we found in the model. Compare this with the statistical significance of the added terms (`summary(m2)` or `drop1(m2,test="Chisq")`).

## References