GLM basics

2011 Ben Bolker

February 10, 2011

Licensed under the Creative Commons attribution-noncommercial license (http://creativecommons.org/licenses/by-nc/3.0/). Please share & remix noncommercially, mentioning its origin.

1 Preliminaries

  > library(faraway)
  > data(esoph)
  > ## compute probability and add it to data set
  > esoph <- transform(esoph,prob=ncases/(ncases+ncontrols))
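
It's worth a quick look at the structure of the data before going further; in particular, the three predictors are ordered factors (this check is my addition, not part of the original code):

  > str(esoph)   ## agegp, alcgp, tobgp are ordered factors; ncases, ncontrols are counts
  > head(esoph)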

Other packages we’ll use (might want to install them now if you haven’t already):

  > library(ggplot2)
  > library(Hmisc)
  > library(rms)
  > library(glmnet)

  Loaded glmnet 1.5.2

  > ## library(gridExtra) ## maybe not

2 Visualization

It is almost always a good idea to try to visualize the data before launching into formal analysis/model-building (although you may want to write down any specific hypotheses you have, first, as a way to protect against data snooping):

Basic visualization of the data:

  > library(reshape2)  ## for melt() (2011-era ggplot2 versions may already load the reshape package)
  > mm <- melt(esoph,id.var=4:6)
  > g1 <- ggplot(mm,aes(y=prob,x=value))+
     stat_sum(aes(size=..n..))+
     geom_smooth(aes(group=1))+
     facet_grid(.~variable,scale="free_x")
  > print(g1)

[Figure: disease probability vs. age, alcohol, and tobacco group; point size shows the number of observations, with a nonparametric smooth in each panel]

Here's what we're doing here: melt reshapes the data so that all three predictors can be shown in a single faceted figure; stat_sum plots points with size proportional to the number of overlapping observations; geom_smooth(aes(group=1)) adds a nonparametric smooth across the categories; and facet_grid(.~variable) gives each predictor its own panel.

Looks like there is an effect of age (possibly saturating below 1?), a relatively linear effect of alcohol, and little effect of tobacco, at least at the marginal level. One point to notice is that there appears to be a bit of bimodality (e.g. in the second panel, alcohol use: at the top level there are zeroes and then probabilities from 0.5 up, with a lot of values at 1.0); however, I believe this will turn out to be a result of looking at the marginal relationships rather than something fundamental to the data.

The fits here are nonparametric smooths; I believe that ggplot is internally converting the x-axis variables to numeric from their original status as ordered factors (which for our current purposes behave like regular factors, i.e. they will be treated as categorical predictor variables, so there is no assumed relationship between the predicted values at neighboring levels). To check what the smooth is doing, I'll use stat_summary to add points (and normal-based confidence intervals) to the existing plot. The mean_cl_normal function computes normal-based confidence intervals (you need the Hmisc package installed); lwd sets the line width, and alpha=0.5 makes the points slightly transparent:

  > print(g1 + stat_summary(fun.data=mean_cl_normal,col="red",lwd=1.5,alpha=0.5))

[Figure: the same marginal plots with normal-based means and confidence intervals (red) added via stat_summary]

The smoothing that ggplot is doing here apparently isn’t having much effect: the only place that the red dots differ from the blue line/gray ribbon is between the second and third points in the age plot, where the smooth partially removes the apparent jump between age categories 2 (35–44) and 3 (45–54).

The other potential issues with the nonparametric smooth used by ggplot are that (1) it doesn't enforce 0-1 bounds on the probabilities, and (2) it doesn't weight responses in proportion to their expected binomial variance. Let's see what difference it would make if we used glm instead (separately on each variable):

  > print(g1+geom_smooth(aes(group=1,weight=ncases+ncontrols),
               colour="red",fill="red",
               method="glm",family=binomial))

[Figure: the marginal plots with weighted binomial GLM fits (red) overlaid on the nonparametric smooths]

The GLM fit (using the default logit link associated with binomial data) differs quite a bit for the age relationship, and a little bit for low levels of tobacco use; it is right on for alcohol, although the confidence intervals are a bit narrower.

So far we have only been looking at the marginal effects of these three variables. It would also be nice to look for interactions; the natural thing to do is to take a single variable and look at its relationship with disease within each level of another predictor. I'll start with tobacco use, to see if there are interactions hidden underneath the apparent lack of pattern in the marginal data:

  > print(ggplot(esoph,aes(y=prob,x=tobgp))+
     stat_sum(aes(size=..n..))+
     geom_smooth(aes(group=1))+
     facet_grid(.~alcgp,scale="free_x"))

[Figure: disease probability vs. tobacco group, faceted by alcohol group]

There’s not too much going on here — at each level of alcohol use, the relationship with respect to tobacco use is more or less flat (perhaps there is a slight increase in the last category in panel 2 (40–79 g/day), but we are now slicing the data pretty finely).

We get similar results if we plot the tobacco-disease relationship for different ages (I use coord_cartesian(ylim=...) to set the y-axis limits):

  > print(ggplot(esoph,aes(y=prob,x=tobgp))+
     stat_sum(aes(size=..n..))+
     geom_smooth(aes(group=1))+
     facet_grid(.~agegp,scale="free_x")+
     coord_cartesian(ylim=c(-0.1,1.1)))

[Figure: disease probability vs. tobacco group, faceted by age group]

3 Model fitting

How many parameters would we need if we wanted to fit the saturated model, with one value for every age/alcohol/tobacco combination?

  > ncol(model.matrix(~agegp*alcgp*tobgp,data=esoph))

  [1] 96

Oops: this is more than our total number of observations [nrow(esoph)=88]!

Since there’s no evidence of interactions, what if we instead fitted the additive model?

  > (np <- ncol(model.matrix(~agegp+alcgp+tobgp,data=esoph)))

  [1] 12

This is pushing it a bit in terms of the rule of thumb that the ratio of observations to estimated parameters should be 10–20 (we have 88/12 ≈ 7.3), but for now we'll go for it …

  > m1 <- glm(cbind(ncases,ncontrols)~agegp+alcgp+tobgp,
       data=esoph,family=binomial)

Looking at the coefficients:

  > summary(m1)

  Call:
  glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp,
      family = binomial, data = esoph)
  
  Deviance Residuals:
      Min       1Q   Median       3Q      Max
  -1.6891  -0.5618  -0.2168   0.2314   2.0642
  
  Coefficients:
              Estimate Std. Error z value Pr(>|z|)
  (Intercept) -1.77997    0.19796  -8.992  < 2e-16 ***
  agegp.L      3.00534    0.65215   4.608 4.06e-06 ***
  agegp.Q     -1.33787    0.59111  -2.263  0.02362 *
  agegp.C      0.15307    0.44854   0.341  0.73291
  agegp^4      0.06410    0.30881   0.208  0.83556
  agegp^5     -0.19363    0.19537  -0.991  0.32164
  alcgp.L      1.49185    0.19935   7.484 7.23e-14 ***
  alcgp.Q     -0.22663    0.17952  -1.262  0.20680
  alcgp.C      0.25463    0.15906   1.601  0.10942
  tobgp.L      0.59448    0.19422   3.061  0.00221 **
  tobgp.Q      0.06537    0.18811   0.347  0.72823
  tobgp.C      0.15679    0.18658   0.840  0.40071
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  
  (Dispersion parameter for binomial family taken to be 1)
  
      Null deviance: 227.241  on 87  degrees of freedom
  Residual deviance:  53.973  on 76  degrees of freedom
  AIC: 225.45
  
  Number of Fisher Scoring iterations: 6

As mentioned in the book, for ordered factors the parameterization uses polynomial contrasts (linear, quadratic, cubic, plus 4th- and 5th-order terms for age); note that these are all on the scale of the linear predictor, in this case the logit scale. It looks like there's something going on at the linear and quadratic levels for age, and there are linear trends for alcohol and tobacco. However, the tobacco trend is the weakest, whether we judge its strength by the p-value (probably not a good idea), 0.0022, or (better) by the size of the coefficient, 0.59, which, starting from a low probability, is approximately the proportional increase in the probability of disease per tobacco-use category. Starting near 50% probability, this coefficient implies a change of about 0.59/4 ≈ 0.15 in probability per tobacco-use category (look back at the figures to convince yourself that this makes sense). Compare this with the (logit-)linear effect of alcohol, 1.5, which is more than twice as large.
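
(A small numerical aside, not in the original notes: for a logit-link model, the "divide-by-4" rule comes from the fact that the logistic curve has its maximum slope, 0.25, at p = 0.5. The snippet below just spells out the arithmetic for the tobacco coefficient.)

  > beta_tob <- coef(m1)["tobgp.L"]   ## logit-scale linear trend for tobacco
  > exp(beta_tob)                     ## odds ratio per tobacco-use category
  > dlogis(0)                         ## maximum slope of the logistic curve (= 0.25)
  > dlogis(0)*beta_tob                ## approximate change in probability near p = 0.5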

If we want to look at the overall effect of each of the main effects:

  > drop1(m1,test="Chisq")

  Single term deletions
  
  Model:
  cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp
         Df Deviance    AIC    LRT   Pr(Chi)
  <none>      53.973 225.45
  agegp   5  131.484 292.96 77.511 2.782e-15 ***
  alcgp   3  120.028 285.51 66.054 2.984e-14 ***
  tobgp   3   64.572 230.05 10.599   0.01411 *
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

3.1 Penalized fitting/lasso/elastic net (optional)

Here we look (very briefly) at penalized-likelihood (“elastic net”) fitting, which I admit I don’t understand tremendously well myself, but which is a sophisticated way of avoiding overfitting; it penalizes parameters in such a way that small parameters tend to be set to zero. The choice of penalty parameter (λ) is done by cross-validation (10-fold by default in cv.glmnet).

  > ## note: with a two-column response matrix, glmnet treats the *second*
  > ##  column as the target class, so cbind(ncontrols,ncases) here matches
  > ##  glm's cbind(ncases,ncontrols) above
  > m2 <- glmnet(x=model.matrix(~agegp+alcgp+tobgp,data=esoph),
          y=as.matrix(esoph[c("ncontrols","ncases")]),
          family="binomial")
  > c2 <- cv.glmnet(x=model.matrix(~agegp+alcgp+tobgp,data=esoph),
          y=as.matrix(esoph[c("ncontrols","ncases")]),
          family="binomial")
  > m2_coef <- coef(m2,s=c2$lambda.min)

  > ## m2_coef[-2] drops the coefficient for the redundant "(Intercept)" column
  > ##  of the model matrix (glmnet fits its own intercept), so that the
  > ##  remaining coefficients line up with coef(m1)
  > cm <- data.frame(model=names(coef(m1)),glm=coef(m1),glmnet=m2_coef[-2])
  > print(qplot(value,model,data=melt(cm),colour=variable,shape=variable)+
         geom_vline(xintercept=0,lty=2))

Interestingly (to me), this approach didn’t shrink the variables as much as I thought it would.
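
(If you want to see how λ was chosen, cv.glmnet objects have a plot method showing the cross-validated deviance as a function of the penalty; this snippet is an addition of mine, not part of the original notes.)

  > plot(c2)              ## cross-validated deviance vs. log(lambda)
  > log(c2$lambda.min)    ## penalty that minimizes the cross-validation estimate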

4 Diagnostics

Basic plot:

  > op <- par(mfrow=c(2,2),las=1,
             mgp=c(2.5,0.75,0),
             mar=c(4,3.5,1,0.5))
  > plot(m1)
  > par(op)

[Figure: default glm diagnostic plots for m1 (residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage)]

Nothing horrible appears here. (We don’t really expect the residuals to be normal; the smoothed lines in the residuals vs. fitted and scale-location plots look like something might be going on, but these lines are generally quite wiggly.)

Here are three points that appear to be fairly influential/unusual:

  > esoph[c(16,20,67),]

     agegp     alcgp    tobgp ncases ncontrols      prob
  16 35-44 0-39g/day 0-9g/day      0        60 0.0000000
  20 35-44     40-79 0-9g/day      0        35 0.0000000
  67 65-74     40-79 0-9g/day     17        34 0.3333333

All are at the lowest level of tobacco use; two are at intermediate alcohol levels, two at the second-lowest level of age.
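
(One way to flag such points programmatically, rather than reading labels off the diagnostic plots, is Cook's distance; this snippet is mine, and the 4/n cutoff is just a rough rule of thumb.)

  > cd <- cooks.distance(m1)
  > head(sort(cd,decreasing=TRUE),3)   ## the three largest Cook's distances
  > which(cd > 4/nrow(esoph))          ## rough rule-of-thumb cutoff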

No evidence of overdispersion (checking two different approximations):

  > deviance(m1)/df.residual(m1)

  [1] 0.7101747

  > sum(residuals(m1,type="pearson")^2)/df.residual(m1)

  [1] 0.7863115
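
(A slightly more formal check, added by me: compare the sum of squared Pearson residuals to a χ² distribution with the residual degrees of freedom; a small p-value would suggest overdispersion.)

  > X2 <- sum(residuals(m1,type="pearson")^2)
  > pchisq(X2,df=df.residual(m1),lower.tail=FALSE)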

termplot plots the partial effects of each term in the model:

  > op <- par(mfrow=c(1,3))
  > termplot(m1,se=TRUE,col.se="gray")
  > par(op)

[Figure: termplot partial-effect plots (with standard errors) for agegp, alcgp, and tobgp]

Here's how we can generate predictions, by calling predict with se.fit=TRUE. By default we get the predictions on the linear-predictor scale, so we have to use the inverse-link function (here the logistic, available as plogis in base R or ilogit in the faraway package) to transform back to the original scale. (We must compute the lower and upper confidence limits on the linear-predictor scale and then back-transform; directly back-transforming the standard errors and then adding them makes no sense.) We can generate predictions directly on the response scale via type="response", but the results we get from se.fit will then represent a cruder approximation.

  > gp <- with(esoph,expand.grid(alcgp=levels(alcgp),
                                agegp=levels(agegp),
                                tobgp=levels(tobgp)))
  > pp <- predict(m1,newdata=gp,se.fit=TRUE)
  > gp$prob <- plogis(pp$fit)
  > gp$p_lo <- with(pp,plogis(fit-1.96*se.fit))
  > gp$p_hi <- with(pp,plogis(fit+1.96*se.fit))
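
(To see the point about the cruder approximation concretely, we could also ask predict for response-scale standard errors and compare the resulting symmetric intervals with the back-transformed ones above; this snippet, and the p_lo2/p_hi2 names, are mine.)

  > pr <- predict(m1,newdata=gp,se.fit=TRUE,type="response")
  > gp$p_lo2 <- pr$fit-1.96*pr$se.fit   ## symmetric intervals on the probability scale
  > gp$p_hi2 <- pr$fit+1.96*pr$se.fit
  > range(gp$p_lo2)                     ## symmetric limits can escape [0,1], unlike the back-transformed ones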

Plotting these predictions doesn’t work particularly well, because we end up plotting the predictions at all levels:

  > gpm <- melt(gp,id.var=4:6)
  > g3 <- ggplot(mm,aes(y=prob,x=value))+
     stat_sum(aes(size=..n..))+
     geom_smooth(aes(group=1))+
     facet_grid(.~variable,scale="free_x")+
     geom_point(data=gpm,colour="red")
  > print(g3)

[Figure: the marginal plots with model predictions (red points) at all predictor combinations superimposed]

One way to do this is with the Predict function from the rms package. To use Predict reliably we have to change the contrasts used for ordered factors (the second element of options("contrasts")) from polynomial to treatment contrasts; we also have to fit the model with Glm, an extended version of glm from the rms package:

  > options(contrasts=c("contr.treatment","contr.treatment"))
  > M1 <- Glm(cbind(ncases,ncontrols)~agegp+alcgp+tobgp,
       data=esoph,family=binomial)

  > pp1 <- with(esoph,plot(Predict(M1,agegp=levels(agegp),alcgp=levels(alcgp),
                                  tobgp=levels(tobgp)),data=esoph))
  > print(pp1)

[Figure: rms Predict effect plots for M1]

  > ## set options back to default
  > options(contrasts=c("contr.treatment","contr.poly"))

(Haven’t figured out how to get the original data on this plot …)

Exercise: figure out how to generate the equivalent of this plot with ggplot, with the predictions superimposed on the data. Use the original esoph data set, instead of the melted version. You might as well give the plots their usual orientation (i.e., response variable on the y axis). (Once you have subdivided the data this finely, you won't need geom_smooth or stat_sum any more; just geom_point()+geom_line(aes(group=1)) should work fine.) Use agegp as your x-axis variable and prob as your y-axis variable. Other useful ingredients (e.g. geom_ribbon for the confidence bands) can be seen in the example below.

Another alternative is to facet (taking care of one extra dimension) and restrict the data to a single age category: in this case we have so little data left that the picture doesn't look great (I am following the advice from the above exercise and dropping geom_smooth and stat_sum):

  > gps <- subset(gp,agegp=="35-44")
  > print(ggplot(subset(esoph,agegp=="35-44"),
                aes(y=prob,x=tobgp))+
         geom_point()+geom_line(aes(group=1))+
         facet_grid(.~alcgp,scale="free_x")+
         geom_line(data=gps,aes(group=1),colour="red")+
         geom_ribbon(data=gps,
                     aes(group=1,ymin=p_lo,ymax=p_hi),
                     fill="red",alpha=0.2,colour=NA))

[Figure: observed values (black) and model predictions with confidence ribbons (red) vs. tobacco group for the 35-44 age group, faceted by alcohol group]

Exercise: by treating the predictors as categorical, we have allowed the fit to be completely flexible. As we can see post hoc, most of this flexibility appears to be unnecessary. “Backwards elimination” of all this complexity is very close to data snooping, but let’s suppose we want to try fitting a straightforward linear model to the data, getting rid of all that polynomial stuff.

The easiest way to do this is simply to convert all of the predictor variables from factors to numeric values (based on their underlying codes): you can use e.g. esoph$agegp <- as.numeric(esoph$agegp) (or unclass() instead of as.numeric()); you can also use transform to do this a bit more elegantly.
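
(A minimal sketch of the transform approach; the names esoph_num and m_lin are mine, chosen to avoid overwriting the original data.)

  > esoph_num <- transform(esoph,
                           agegp=as.numeric(agegp),
                           alcgp=as.numeric(alcgp),
                           tobgp=as.numeric(tobgp))
  > m_lin <- glm(cbind(ncases,ncontrols)~agegp+alcgp+tobgp,
          data=esoph_num,family=binomial)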

Rerun the model; compare the models. Because polynomial contrasts were used, the linear (on the logit scale) model is strictly nested within the more complicated model, and you can use a likelihood ratio test via anova to compare the models. If they were not strictly nested, you could still use AIC() for comparisons.

You can also allow disease to be a quadratic function of age, e.g. by using poly(agegp,2) (or agegp+I(agegp^2)). Compare all three models.
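
(A sketch of the comparison, continuing from the hypothetical esoph_num and m_lin objects above; since the numeric-linear and numeric-quadratic models are nested within the full factor model m1, anova with a likelihood-ratio test applies.)

  > m_quad <- update(m_lin, . ~ . - agegp + poly(agegp,2))   ## quadratic in age only
  > anova(m_lin,m_quad,m1,test="Chisq")                      ## sequential likelihood-ratio tests
  > AIC(m_lin,m_quad,m1)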

Exercise: work with one of the data sets from the faraway package. For each one, go through sensible steps (check the help page for the data set to find the description of the predictors; plot the data to make sure they are sensible and try to detect any issues with the data; fit a preliminary model, typically the largest model that seems reasonable to fit given the size of the data set; check diagnostics; interpret the results; plot sensible predictions of the model, with confidence intervals if possible).

We haven’t dealt with overdispersion very much, so you might want to avoid the turtle data set below unless you’re feeling ambitious.

I haven’t been through these data sets in detail myself, so if you run into trouble with one, please e-mail me immediately for guidance …

You may also use a data set of your own.