Other packages we’ll use (might want to install them now if you haven’t already):
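A minimal sketch for installing whatever is missing (the package list here is just the ones mentioned in these notes):

```r
## install any of the packages used below that are not already present
pkgs <- c("ggplot2", "Hmisc", "faraway", "rms", "glmnet")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if (length(to_install) > 0) install.packages(to_install)
```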
It is almost always a good idea to try to visualize the data before launching into formal analysis/model-building (although you may want to write down any specific hypotheses you have, first, as a way to protect against data snooping):
Basic visualization of the data:
Here’s what we’re doing here: computing the observed proportion of cases in each cell, ncases/(ncases+ncontrols), plotting it against each of the three predictors in turn (with point size indicating how many observations fall in the same spot), and adding a nonparametric smooth in each panel.
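A minimal sketch of one way to build such a plot (not necessarily the exact code used for the figure; `esoph2`, `prob`, and `ntot` are just illustrative names, and the actual figure shows all three predictors at once in a faceted display built from a “melted”, long-format version of the data):

```r
library(ggplot2)
## observed proportion of cases and total group size in each cell
esoph2 <- transform(esoph,
                    prob = ncases / (ncases + ncontrols),
                    ntot = ncases + ncontrols)
## one predictor at a time (repeat with agegp and tobgp for the other panels)
ggplot(esoph2, aes(x = alcgp, y = prob)) +
    stat_sum(alpha = 0.5) +        ## point size reflects the number of overplotted points
    geom_smooth(aes(group = 1))    ## nonparametric smooth across the categories
```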
Looks like there is an effect of age (possibly saturating below 1?), a relatively linear effect of alcohol, and little effect of tobacco, at least at the marginal level. One point to notice is that there appears to be a bit of bimodality (e.g. in the second panel, alcohol use, at the top level there are zeroes and probabilities from 0.5 up, with a lot of values at 1.0); however, I believe this will turn out to be a result of looking at the marginal relationships rather than something fundamental to the data.
The fits here are nonparametric smooths. I believe that ggplot is internally converting the x-axis variables to numeric from their original status as ordered factors (which for our current purposes behave like ordinary factors, i.e. they are treated as categorical predictor variables, so there is no assumed relationship between the predicted values at neighboring levels). To check this, I’ll use stat_summary to add points (and normal-based confidence intervals) to the existing plot. The mean_cl_normal function computes normal-based confidence intervals (you need the Hmisc package installed); lwd sets the line width and alpha=0.5 makes the points slightly transparent:
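Something along these lines (a sketch; `p0` is a placeholder for whatever the previous ggplot object was called):

```r
library(ggplot2)
library(Hmisc)   ## needed behind the scenes by mean_cl_normal
## add the category means and normal-theory 95% CIs on top of the smooth
p0 + stat_summary(fun.data = mean_cl_normal, geom = "pointrange",
                  colour = "red", lwd = 1, alpha = 0.5)
```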
The smoothing that ggplot is doing here apparently isn’t having much effect: the only place that the red dots differ from the blue line/gray ribbon is between the second and third points in the age plot, where the smooth partially removes the apparent jump between age categories 2 (35–44) and 3 (45–54).
The other potential issues with the nonparametric smooth used by ggplot are that (1) it doesn’t enforce 0–1 bounds on the probabilities and (2) it doesn’t weight the responses according to their expected binomial variances. Let’s see what difference it makes if we use glm instead (separately on each variable):
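A sketch of the GLM version, again one predictor at a time (`esoph2` and `ntot` as defined above; the weight aesthetic passes the group sizes to glm as binomial weights):

```r
ggplot(esoph2, aes(x = alcgp, y = prob)) +
    stat_sum(alpha = 0.5) +
    geom_smooth(aes(group = 1, weight = ntot),
                method = "glm",
                method.args = list(family = binomial))
```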
The GLM fit (using the default logit link associated with binomial data) differs quite a bit for the age relationship, and a little bit for low levels of tobacco use; it is right on for alcohol, although the confidence intervals are a bit narrower.
So far we have only been looking at the marginal effects of these three variables. It would also be nice to look for interactions; the natural thing to do is to take a single variable at a time and look at its relationship with disease within each level of another variable. I’ll start with tobacco use, to see if there are interactions hidden underneath the apparent lack of pattern in the marginal data:
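For example (a sketch, using the `esoph2` object from above): disease probability against tobacco use, with one facet per alcohol-consumption category:

```r
ggplot(esoph2, aes(x = tobgp, y = prob)) +
    stat_sum(alpha = 0.5) +
    geom_smooth(aes(group = 1), method = "glm",
                method.args = list(family = binomial)) +
    facet_wrap(~ alcgp)
```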
There’s not too much going on here — at each level of alcohol use, the relationship with respect to tobacco use is more or less flat (perhaps there is a slight increase in the last category in panel 2 (40–79 g/day), but we are now slicing the data pretty finely).
Similar results if we plot the tobacco-disease relationship for different ages (I use coord_cartesian(ylim=...) to set the y-axis limits):
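The corresponding sketch for age (object names as above):

```r
ggplot(esoph2, aes(x = tobgp, y = prob)) +
    stat_sum(alpha = 0.5) +
    geom_smooth(aes(group = 1), method = "glm",
                method.args = list(family = binomial)) +
    facet_wrap(~ agegp) +
    coord_cartesian(ylim = c(0, 1))   ## zoom the y axis without dropping any data
```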
How many parameters would we need if we wanted to fit the saturated model, with one value for every age/alcohol/tobacco combination?
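We can just multiply the numbers of levels:

```r
## saturated model: one parameter per age x alcohol x tobacco combination
with(esoph, nlevels(agegp) * nlevels(alcgp) * nlevels(tobgp))   ## 6 * 4 * 4 = 96
```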
Oops: this is more than our total number of observations [nrow(esoph)=88]!
Since there’s no evidence of interactions, what if we instead fitted the additive model?
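For example (a sketch; `mod_add` is just an illustrative name):

```r
## main effects only: 1 (intercept) + 5 (age) + 3 (alcohol) + 3 (tobacco) = 12 parameters
mod_add <- glm(cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp,
               family = binomial, data = esoph)
length(coef(mod_add))   ## 12
```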
This is pushing it a bit in terms of the rule of thumb that the ratio of observations to estimated parameters should be 10–20 (we have 88/12 ≈ 7.3), but for now we’ll go for it …
Looking at the coefficients:
As mentioned in the book, the parameterization for ordered factors uses polynomial contrasts (linear, quadratic, cubic, and 4th- and 5th-order terms for age); note that these are all on the scale of the linear predictor, in this case the logit scale. It looks like there’s something going on at the linear and quadratic levels for age, and linear trends for alcohol and tobacco. However, the tobacco trend is the weakest, whether we judge its strength by the p-value (probably not a good idea), 0.0022, or (better) by the size of the coefficient, 0.59, which is approximately the proportional increase in the probability of disease per tobacco-use category when the baseline probability is low. Starting near 50% probability, this coefficient implies a change of ≈ 0.59/4 = 0.15 in probability per tobacco-use category (look back at the figures to convince yourself that this makes sense). Compare this with the (logit-)linear effect of alcohol, 1.5, which is more than twice as large.
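A quick numerical check of that interpretation (the 0.59 and the “divide by 4” rule are taken from the discussion above):

```r
beta_tob <- 0.59
exp(beta_tob)                      ## ~1.8: multiplicative change in the odds per category
                                   ## (and, for small baseline probabilities, roughly the
                                   ##  multiplicative change in probability)
beta_tob / 4                       ## ~0.15: approximate change in probability per category
                                   ##  when the baseline probability is near 0.5
plogis(0 + beta_tob) - plogis(0)   ## exact change starting from p = 0.5: ~0.14
```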
If we want to look at the overall effect of each of the main effects:
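One way to do this (assuming the additive fit is stored as `mod_add`) is single-term deletion with likelihood-ratio tests:

```r
## likelihood-ratio (chi-squared) test for dropping each main effect in turn
drop1(mod_add, test = "Chisq")
```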
Looking (very briefly) at penalized likelihood (“elastic net”) fitting, which I admit I don’t understand tremendously well myself, but is a sophisticated way of avoiding overfitting; it penalizes parameters in such a way that it tends to set small parameters to zero. The choice of penalty parameter (λ) is done by cross-validation (10-fold, by default in cv.glmnet).
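A sketch of what that might look like: glmnet wants a numeric model matrix rather than a formula, and for binomial responses it accepts a two-column matrix of counts (here ncontrols/ncases, with the second column treated as the “success” class):

```r
library(glmnet)
X <- model.matrix(~ agegp + alcgp + tobgp, data = esoph)[, -1]  ## drop the intercept column
Y <- with(esoph, cbind(ncontrols, ncases))
set.seed(101)                      ## cross-validation involves random fold assignment
cvfit <- cv.glmnet(X, Y, family = "binomial")
plot(cvfit)                        ## CV curve as a function of log(lambda)
coef(cvfit, s = "lambda.1se")      ## coefficients at the more conservative penalty choice
```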
Interestingly (to me), this approach didn’t shrink the coefficients as much as I thought it would.
Basic plot:
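That is, the standard set of diagnostic plots for the fitted model (a sketch, with `mod_add` as above):

```r
op <- par(mfrow = c(2, 2))   ## 2 x 2 layout for the four default diagnostic plots
plot(mod_add)
par(op)
```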
Nothing horrible appears here. (We don’t really expect the residuals to be normal; the smoothed lines in the residuals vs. fitted and scale-location plots look like something might be going on, but these lines are generally quite wiggly.)
Here are three points that appear to be fairly influential/unusual:
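One way to pick them out (a sketch, using `mod_add` from above; whether these are exactly the same three points flagged here depends on the criterion used):

```r
## rows of the data with the three largest Cook's distances
cd <- cooks.distance(mod_add)
esoph[order(cd, decreasing = TRUE)[1:3], ]
```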
All are at the lowest level of tobacco use; two are at intermediate alcohol levels, two at the second-lowest level of age.
No evidence of overdispersion (checking two different approximations):
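Both checks compare a goodness-of-fit statistic to the residual degrees of freedom (again assuming the fit is stored as `mod_add`); ratios much larger than 1 would suggest overdispersion:

```r
## residual deviance / residual df
deviance(mod_add) / df.residual(mod_add)
## Pearson chi-squared / residual df
sum(residuals(mod_add, type = "pearson")^2) / df.residual(mod_add)
```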
termplot plots the partial effects of each term in the model:
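For example (again assuming the fit is stored as `mod_add`):

```r
op <- par(mfrow = c(1, 3))    ## one panel per term
termplot(mod_add, se = TRUE)  ## partial effects on the logit scale, with standard errors
par(op)
```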
Here’s how we can generate predictions, by calling predict with se.fit=TRUE. By default we get the predictions on the linear-predictor scale; we then have to use the inverse-link function (the logistic, available as plogis in base R or ilogit in the faraway package) to transform them to the original (probability) scale. We must compute the lower and upper confidence limits on the linear-predictor scale and then back-transform them; directly back-transforming the standard errors and adding them to the back-transformed predictions makes no sense. We can generate predictions directly on the response scale via type="response", but the standard errors we get from se.fit in that case are a cruder approximation.
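A sketch of the whole procedure (`pred` is an illustrative name; 1.96 gives approximate 95% intervals):

```r
pp <- predict(mod_add, se.fit = TRUE)   ## fitted values and SEs on the logit scale
pred <- data.frame(esoph,
                   prob = plogis(pp$fit),
                   lwr  = plogis(pp$fit - 1.96 * pp$se.fit),  ## back-transform the limits,
                   upr  = plogis(pp$fit + 1.96 * pp$se.fit))  ##  not the standard errors
head(pred)
```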
Plotting these predictions doesn’t work particularly well, because for any one predictor we end up plotting the predictions for all levels of the other predictors on top of each other:
One way to do this is with the Predict function from the rms package. To use Predict reliably we have to change the contrasts for ordered factors (the second element of options("contrasts")) from polynomial to treatment contrasts; we also have to fit the model with Glm, an extended version of glm from the rms package:
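A sketch of the kind of code involved (rms also wants a datadist object describing the predictors; fun = plogis back-transforms the predictions to the probability scale):

```r
library(rms)
## switch the ordered-factor contrasts (second element) from polynomial to treatment
oldopts <- options(contrasts = c("contr.treatment", "contr.treatment"))
dd <- datadist(esoph)
options(datadist = "dd")
mod_rms <- Glm(cbind(ncases, ncontrols) ~ agegp + alcgp + tobgp,
               family = binomial, data = esoph)
plot(Predict(mod_rms, fun = plogis))   ## predicted probabilities for each predictor
options(oldopts)                       ## restore the original contrasts
```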
(Haven’t figured out how to get the original data on this plot …)
exercise: figure out how to generate the equivalent of this plot with ggplot, with the predictions superimposed on the data. Use the original esoph data set, instead of the melted version. You might as well give the plots their usual orientation (i.e., response variable on the y axis). (Once you have subdivided the data this finely, you won’t need to use geom_smooth or stat_sum any more; just geom_point()+geom_line(aes(group=1)) should work fine.) Use agegp as your x-axis variable and prob as your y-axis variable. Other useful ingredients:
Another alternative is to facet (taking care of one extra dimension) and restrict the data to a single age category: in this case we have so little data left that the picture doesn’t look great (I am following the advice from the above exercise and dropping geom_smooth and stat_sum):
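A sketch of what this might look like, using the `pred` data frame from above and (arbitrarily) the 55–64 age category; the particular choices of x-axis and faceting variable are just one possibility:

```r
pred1 <- subset(pred, agegp == "55-64")   ## keep a single age category
ggplot(pred1, aes(x = alcgp, y = prob)) +
    geom_ribbon(aes(ymin = lwr, ymax = upr, group = 1), alpha = 0.3) +
    geom_line(aes(group = 1)) +
    geom_point(aes(y = ncases / (ncases + ncontrols))) +   ## observed proportions
    facet_wrap(~ tobgp)
```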
exercise: by treating the predictors as categorical, we have allowed the fit to be completely flexible. As we can see post hoc, most of this flexibility appears to be unnecessary. “Backwards elimination” of all this complexity is very close to data snooping, but let’s suppose we want to try fitting a straightforward linear model to the data, getting rid of all that polynomial stuff.
The easiest way to do this is simply to convert all of the predictor variables from factors to numeric values (based on their underlying codes): you can use, e.g., esoph$agegp <- as.numeric(esoph$agegp) (or unclass() instead of as.numeric()); you can also use transform() to do this a bit more elegantly.
Rerun the model; compare the models. Because polynomial contrasts were used, the linear (on the logit scale) model is strictly nested within the more complicated model, and you can use a likelihood ratio test via anova to compare the models. If they were not strictly nested, you could still use AIC() for comparisons.
You can also allow disease risk to be a quadratic function of age, e.g. by using poly(agegp, 2) (or agegp + I(agegp^2)). Compare all three models.
Exercise: work with one of the data sets from the faraway package. For each one, go through sensible steps (check the help page for the data set to find the description of the predictors; plot the data to make sure they are sensible and try to detect any issues with the data; fit a preliminary model, typically the largest model that seems reasonable to fit given the size of the data set; check diagnostics; interpret the results; plot sensible predictions of the model, with confidence intervals if possible).
We haven’t dealt with overdispersion very much, so you might want to avoid the turtle data set below unless you’re feeling ambitious.
I haven’t been through these data sets in detail myself, so if you run into trouble with one, please e-mail me immediately for guidance …
You may also use a data set of your own.