\documentclass{article}
\usepackage{sober}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{fancyvrb}
\VerbatimFootnotes
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{color}
\usepackage{times}
\bibliographystyle{chicago}
\SweaveOpts{echo=T,results=verbatim,print=F,eps=T,pdf=T}
\title{Likelihood and all that, for disease ecologists}
\author{Ben Bolker (\url{bolker@mcmaster.ca}) and Steve Ellner and Matthew Ferrari?}
\date{\today}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\R}{R} %% nothing fancy
\newcommand{\exercise}{\textbf{Exercise}}
\newcommand{\bbnote}[1]{\color{red} \small #1 \color{black}}
\usepackage{epsfig}
\begin{document}
\maketitle
\textbf{Last compiled:} \Sexpr{date()}

\includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.eps}
\begin{minipage}[b]{3in}
{\tiny Licensed under the Creative Commons attribution-noncommercial license
  (\url{http://creativecommons.org/licenses/by-nc/3.0/}). Please share \&
  remix noncommercially, mentioning its origin.}
\end{minipage}

\tableofcontents

%% required packages: ggplot2, plyr, emdbook, reshape, mgcv, bbmle ...
\SweaveOpts{height=5,width=8,keep.source=TRUE}
\setkeys{Gin}{width=\textwidth}
<>=
x11(width=10,height=6)  ## hack for "figure margins too small" error
options(continue=" ")
@
<>=
options(SweaveHooks=list(
  fig=function() par(mfrow=c(1,2), bty="l", las=1)
))
@
<>=
options(SweaveHooks=list(
  fig=function() { par(mfrow=c(2,2), bty="l", las=1)}
))
@
<>=
## DM original: 5,13,4,11
options(SweaveHooks=list(
  fig=function() { par(mar=c(5,10,4,10)+0.1, bty="l", las=1) }
))
@

\section{Why likelihood?}
%\bbnote{Where should this go?}
Why go to the trouble of defining likelihood and so forth, rather than using
good old-fashioned ANOVA, $t$ tests, nonparametric tests, etc.?
\begin{itemize}
\item \textbf{Likelihood-based approaches are flexible and can be based on
    mechanistic models}. Parameters are meaningful, and statistical hypotheses
  often correspond closely with biological hypotheses. One can often adapt
  likelihood models to estimate the parameters of specific, theoretically
  based biological models.
\item \textbf{Many (most) common statistical approaches represent special
    cases of likelihood-based approaches}. They provide a unifying framework
  into which most statistical techniques fit. For example, the MLE is
  equivalent to the least-squares fit for normal, homoscedastic, independent
  data.
\item \textbf{Likelihood-based approaches focus on parameter and confidence
    interval estimation (rather than $p$ values and null hypothesis
    testing)}. We will see later how to estimate parameters and compute
  confidence intervals based on likelihood --- but we can also compare models
  and get $p$ values for specified null hypotheses.
\end{itemize}

\section{Definition and simple example}
\newcommand{\lik}{{\cal L}}
Definition: the likelihood is the \emph{probability of a given set of data
  having occurred, given a particular hypothesis}:
$\mbox{Prob}(D|H)=\lik(D,H)$\footnote{Because it can be reasonably said to be
  the ``likelihood of the hypothesis'', you will sometimes see the notation
  $\mbox{Prob}(D|H)=\lik(H|D)$. While this is technically consistent, I don't
  like it because it makes it easier to slip across the line to thinking that
  you are calculating $\mbox{Prob}(H|D)$.}.

Read in a data set (collected by Caro Perez-Heydrich) that we will use for
running examples.
For 250 gopher tortoises captured in Florida, it documents various individual
characteristics (sex/reproductive status, size, site and year) and the
serological status (determined via ELISA test) with respect to mycoplasmal
infection. Download the gopher tortoise data from
\url{http://www.math.mcmaster.ca/bolker/eeid/private/gophertortoise.txt}
and save it in your working directory.
<>=
dat <- read.table("gophertortoise.txt",header=TRUE)
@
<>=
tt <- with(dat,table(size,status))
tt0 <- with(subset(dat,size>=230 & size<240),table(status))
neg0 <- tt0[1]
pos0 <- tt0[2]
tot0 <- sum(tt0)
with(subset(dat,size>=230 & size<240),table(Sex,status))
@

We'll start with a basic binomial model --- say our observations are $x$, the
number of seropositive individuals out of a total of $N$ individuals. We'll
suppose that (1) all individuals have the same \emph{per-individual}
probability of infection $p$, and (2) all individuals are seropositive (or
seronegative) independently. Then the distribution of the number seropositive
will be binomial:
\begin{equation}
x \sim \mbox{Binomial}(p,N)
\end{equation}
or
\begin{equation}
\mbox{Prob}(x|p,N) = \binom{N}{x} p^x (1-p)^{N-x}.
\end{equation}
Since the joint probability of independent events equals the product of their
probabilities, you can think of this as (probability of $x$ independent
``successes'' (positive tests), each with probability $p$) $\times$
(probability of $N-x$ independent ``failures'' (negative tests), each with
probability $1-p$) --- times a complicated bit at the beginning which accounts
for the order of events and makes the probabilities of all possible $x$ add up
to 1.0 as they should.\footnote{In practice, we can often ignore these
  \emph{normalization constants} when doing likelihood analysis, for reasons
  to be explained later.}\footnote{One possible source of confusion here is
  that we have two probabilities --- one ``per-trial'' or ``per-capita''
  probability, the probability that any \emph{individual} is seropositive,
  and the other the probability of the overall data set (i.e., the
  likelihood).}

Of the $N=\Sexpr{tot0}$ individuals between 230 and 240 mm carapace length,
there are $x=\Sexpr{pos0}$ seropositive and $N-x=\Sexpr{neg0}$ seronegative
individuals\footnote{\code{with(subset(dat,size>=230 \& size<240),table(status))}}.
In R, I will define these values as $N=$\code{tot0} (total) and
$x=$\code{pos0} (seropositive). To calculate the probability of this outcome
for a particular infection probability $p$, use
\code{dbinom(pos0,size=tot0,prob=p)}; for example,
\code{dbinom(pos0,size=tot0,prob=0)} is 0 (there is no chance of getting
\Sexpr{pos0} successes if the per-trial probability is zero), while
\code{dbinom(pos0,size=tot0,prob=0.84)} is
\Sexpr{round(dbinom(pos0,size=tot0,prob=0.84),3)}.

\section{Probability distributions and likelihood curves}
\subsection{Probability distributions}
There are two ways to examine the binomial probability model: in R, both use
\code{dbinom()}.
In the first, we can generate the \emph{probability distribution} of possible
outcomes (from 0 to 31 in this case), for a fixed $p$\footnote{We use R's
  default point-based plot in this case to emphasize that only integer
  outcomes are possible, although we could use \code{type="b"} to guide our
  eye.}:
<>=
xvec <- 0:tot0
plot(xvec,dbinom(xvec,size=tot0,prob=0.84),
     xlab="Number seropositive (x)",ylab="Probability")
abline(v=pos0)
@
Note that while $x=\Sexpr{pos0}$ is the most likely outcome (with a
probability of \Sexpr{round(dbinom(pos0,size=tot0,prob=0.84),3)}), getting 27
seropositive individuals is almost equally likely
(\Sexpr{round(dbinom(27,size=tot0,prob=0.84),3)}).
<>=
extremes <- qbinom(c(0.01,0.99),prob=0.84,size=tot0)
@
Extreme outcomes (say, $x<21$ or $x>30$) are very unlikely indeed. We can see
just how extreme by plotting the log-probability:\footnote{This plot is almost
  the same as the one we would get for
  \code{plot(xvec,log(dbinom(xvec,size=tot0,prob=0.84)))}, but slightly more
  accurate for small probabilities.}
<>=
plot(xvec,dbinom(xvec,size=tot0,prob=0.84,log=TRUE),
     xlab="Number seropositive (x)",ylab="Log probability")
abline(v=pos0)
@
(Note that R uses natural logarithms for \code{log(x)} and for
\code{log=TRUE}: use \code{log10(x)} to get base-10 logarithms.)
Alternatively, we could use
<>=
plot(xvec,dbinom(xvec,size=tot0,prob=0.84),log="y")
@
which plots the probabilities on a logarithmic $y$ scale rather than
calculating the log-probabilities. There is at least \emph{some} probability
of getting no seropositive individuals with these parameters, even if it is on
the order of $10^{\Sexpr{round(log10(dbinom(0,size=tot0,prob=0.84)))}}$ \ldots

\subsection{Probability distributions: Examples}
An important question at the start of any analysis is ``which probability
distribution/likelihood should I use?''. In some cases, the process itself
suggests a distribution; e.g., the binomial distribution describes processes
that can be reasonably caricatured as coin-flipping trials, such as the death
or infection of discrete individuals. Some special distributions arise as the
consequence of aggregating many arbitrary random variables. Notably, the
Normal and Lognormal distributions result from the sum and product,
respectively, of many random events; the more events are aggregated, the
better these distributions describe the result. In general, we choose
distributions/likelihoods because their properties (range, skewness) reflect
the pattern of variation seen in the data.
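As a rough illustration of this point (the simulation settings below are
arbitrary, chosen only for display, and are not part of the tortoise
analysis), sums of many independent random variables pile up into an
approximately Normal shape, while products of many positive random variables
are right-skewed and approximately Lognormal:
<<eval=FALSE>>=
set.seed(101)
## sum of 20 independent uniform variables, repeated 1000 times
sums <- replicate(1000, sum(runif(20)))
## product of 20 independent positive variables, repeated 1000 times
prods <- replicate(1000, prod(runif(20, 0.5, 1.5)))
par(mfrow=c(1,2))
hist(sums, main="sums: roughly Normal")
hist(prods, main="products: right-skewed (roughly Lognormal)")
@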
%\caption{A bestiary of probability distributions}
\begin{tabular}{lllll}
\hline
\textbf{Distribution} & \textbf{Type} & \textbf{Range} & \textbf{Skew} & \textbf{Examples} \\
\hline
Binomial & Discrete & 0, $N$ & Any & coin flipping (\# of heads),\\
 & & & & number surviving, cases\\
 & & & & reported\\
Geometric & Discrete & 0, $\infty$ & Right & coin flipping (flips \\
 & & & & until a head), discrete lifetimes\\
Poisson & Discrete & 0, $\infty$ & Right & Counts of rare events, cases\\
 & & & & per day \\
Negative Binomial & Discrete & 0, $\infty$ & Right & Counts of aggregated\\
 & & & & events, cases per day\\
Uniform & Continuous & any finite $(a,b)$ & None & \\
Normal & Continuous & $-\infty$, $\infty$ & None & Sum of arbitrary random \\
 & & & & events, size, mass\\
Exponential & Continuous & 0, $\infty$ & Right & Survival time, infectious \\
 & & & & duration\\
Gamma & Continuous & 0, $\infty$ & Right & Survival time, infectious duration\\
Beta & Continuous & 0, 1 & Any & Random probabilities\\
Lognormal & Continuous & 0, $\infty$ & Right & Product of arbitrary random \\
 & & & & events, size, mass\\
\hline
\end{tabular}

\subsection{Likelihood curves}
In the second approach to examining the binomial probability model, rather
than using $x$ as the dependent variable, assume that $x$ is known and plot
the probability as a function of $p$:
<>=
<>
xlabstr <- "Per capita probability seropositive (p)"
curve(dbinom(pos0,prob=x,size=tot0),
      from=0,to=1,
      xlab=xlabstr,ylab="Likelihood")
abline(v=pos0/tot0,lty=2)
mtext(side=3,at=pos0/tot0,expression(hat(p)))
abline(h=dbinom(pos0,size=tot0,prob=pos0/tot0),lty=2)
curve(-dbinom(pos0,prob=x,size=tot0,log=TRUE),
      from=0,to=1,ylim=c(0,30),
      xlab=xlabstr,ylab="Negative log-likelihood")
abline(v=pos0/tot0,lty=2)
mtext(side=3,at=pos0/tot0,expression(hat(p)))
abline(h=-dbinom(pos0,size=tot0,prob=pos0/tot0,log=TRUE),lty=2)
@
\begin{itemize}
\item{the \emph{maximum likelihood estimate} (MLE) is the value of
    %% FIXME: add p-hat to graph
    the parameter that makes the data most likely to have occurred. In the
    particular case of binomial data, the MLE is (number
    seropositive)/(total); this can be derived either by common sense or by a
    little bit of calculus (i.e., calculate $d \log \lik/dp$ and solve for
    $\hat p$ such that $(d \log \lik/dp)(\hat p)=0$).}
\item{the \emph{maximum likelihood} or \emph{maximum log-likelihood} is the
    (log) probability of the data given that the parameter is set to the MLE.
    Perhaps surprisingly, this is usually not a number we care about much (but
    the ratios or differences between likelihoods are interesting).}
\item{the \emph{likelihood curve} shows the likelihood for a range of possible
    parameter values. We often draw the negative log-likelihood curve (which
    has its \emph{minimum} in the same place, at $\hat p$), or the curve of
    $-2 \log \lik$ (called the \emph{deviance}), for various computational and
    historical reasons.}
\end{itemize}
\exercise. Use \code{dbinom} to verify that the maximum likelihood is
\Sexpr{round(dbinom(pos0,size=tot0,prob=pos0/tot0),3)}.

\section{From data points to data sets}
If we have more than one data point (we hope so!), and \emph{if} the data
points are independent (which we usually hope, or assume \ldots), then
according to basic probability theory the likelihood of the full data set is
the product of the likelihoods for the individual points, and the (negative)
log-likelihood is the sum of the (negative) log-likelihoods.
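For example, here is a minimal sketch of this point. The first count is the
230--240~mm group from above; the second count (10 seropositive out of 15) is
invented purely for illustration and is not taken from the tortoise data:
<<eval=FALSE>>=
p <- 0.8                                 ## an arbitrary trial value of p
lik1 <- dbinom(pos0, size=tot0, prob=p)  ## likelihood of the 230--240 mm group
lik2 <- dbinom(10, size=15, prob=p)      ## likelihood of a second, made-up group
lik1*lik2                                ## joint likelihood: product of the two
log(lik1) + log(lik2)                    ## joint log-likelihood: sum of the logs
sum(dbinom(c(pos0,10), size=c(tot0,15), prob=p, log=TRUE))  ## same value
@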
Condense the gopher tortoise data set to numbers and numbers seropositive at
each size (the \code{ddply} function in the \code{plyr} package takes a data
frame, splits it into chunks according to a specified variable, applies a
specified function to condense each chunk, and sticks the chunks back
together; in this case we are just computing the total number of individuals
and the total number seropositive in each size category):
<>=
library(plyr)
sizetab <- ddply(dat,"size",
                 function(x) c(tot=nrow(x),pos=sum(x$status)))
@
(When you do this yourself you should sanity-check the results with some
combination of \code{summary()}, \code{head()}, \code{tail()}, and
\code{str()} --- and even plot some summaries if you have time.)

Calculate the overall fraction seropositive:
<<>>=
(prob0 <- with(sizetab,sum(pos)/sum(tot)))
@
Calculate $-\log\lik$ for the whole dataset, if every individual had the same
probability $p_0$ of seropositivity:
$-\sum_{a=1}^A \log \mbox{Binom}(x_a|p_0,N_a)$, where $x_a$ and $N_a$ are the
number seropositive and the total number in size category $a$:
<>=
with(sizetab,-sum(dbinom(pos,size=tot,prob=prob0,log=TRUE)))
@
(this corresponds to a tiny probability: more on this in a moment).

Define an \R\ function to compute this value for any specified value of
\code{prob}:
<>=
NLLfun_binom0 <- function(prob,dat=sizetab) {
  with(dat,-sum(dbinom(pos,size=tot,prob=prob,log=TRUE)))
}
@
Define a vector of probabilities from 0 to 1 and compute $-\log \lik$ for each
(using \code{sapply} for compactness; we could also use a \code{for} loop).
<<>>=
pvec <- seq(0,1,length=201)
NLLvec <- sapply(pvec,NLLfun_binom0)
@
<>=
## slick vectorization stuff to hurt your brain if you look in here ...
vNLLfun_binom0 <- Vectorize(NLLfun_binom0,vectorize.args="prob")
curve(vNLLfun_binom0(x),from=0,to=1,ylim=c(10,400))
tdat <- data.frame(tot=31,pos=26)
abline(v=prob0,lty=2)
curve(vNLLfun_binom0(x,dat=tdat),add=TRUE,col=2)
abline(v=pos0/tot0,lty=2,col=2)
minNLL0 <- min(NLLvec)
minNLL1 <- NLLfun_binom0(pos0/tot0,tdat)
@
Plot the likelihood curve:
<>=
<>
@
<>=
plot(pvec,NLLvec,type="l",
     xlab=xlabstr,ylab="Negative log likelihood")
@
If we compare this plot with the previous curve (based only on the individuals
between 230 and 240 mm), we see that:
\begin{itemize}
\item the minimum is much higher. Since we are looking at negative
  log-likelihood this means that the probability of the data is much
  \emph{lower}, by many orders of magnitude (113 log-likelihood units
  $\approx$ a factor of $10^{-49}$ (!)). The difference in height is not
  particularly important, however: with more data, the probability of any
  \emph{particular} outcome will decrease (getting \Sexpr{pos0} out of
  \Sexpr{tot0} is more probable than getting \Sexpr{sum(sizetab$pos)} out of
  \Sexpr{sum(sizetab$tot)}).
\item the curve has its minimum (MLE) at a slightly different place. This is
  not surprising; the overall fraction seropositive (and hence the MLE)
  differs between the 230--240 mm individuals and the whole population (later
  on, the inferential tools that we develop will apply only to comparing
  different models to the same data set, not the same (or different) models
  to different data sets).
\item the curve is much steeper. This is most important, because (as we will
  see) the width of the curve (inversely related to the steepness) determines
  the confidence interval.
\end{itemize}
In general the effect of increasing the size of a data set by a factor of $C$
(say doubling it) will be to (approximately) \emph{multiply} the entire
negative log-likelihood curve by $C$, which increases both the minimum
negative log-likelihood and the steepness of the curve by a factor of $C$.

\exercise. (1) use \verb+NLLfun_binom0+ to compute the likelihood curve for
the 230--240 mm data (\emph{hint \#1}: construct a miniature data frame by
taking the subset of \code{sizetab} that includes sizes between 230 mm and
240 mm and use it in \verb+NLLfun_binom0+. \emph{hint \#2}: use the
\code{ylim=} argument to \code{plot} to make sure the $y$-axis range is large
enough to accommodate both curves.) (2) subtract the minimum (\code{min()})
from each curve and plot the adjusted curves on the same plot to emphasize the
difference in steepness rather than the difference in overall height.
<>=
pvec <- seq(0,1,length=201)
NLLvec <- sapply(pvec,NLLfun_binom0)
plot(pvec,NLLvec,type="l",
     xlab=xlabstr,ylab="Negative log likelihood",ylim=c(0,800))
## likelihood curve for the 230--240 mm subset only
sizetab0 <- subset(sizetab,size>=230 & size<240)
NLLvec2 <- sapply(pvec,NLLfun_binom0,dat=sizetab0)
lines(pvec,NLLvec2,col=2)
plot(pvec,NLLvec-min(NLLvec),type="l",
     xlab=xlabstr,ylab="Adjusted Negative log likelihood",ylim=c(0,200))
lines(pvec,NLLvec2-min(NLLvec2),col=2)
@

\section{MLE fitting}
In this case we know that the MLE $\hat p$ is equal to the overall fraction
seropositive, but we can also use R to minimize the negative log-likelihood
for us. We have to specify a reasonable starting guess for the probability:
for complex problems this can be tricky.\footnote{The best strategies for
  guessing reasonable starting values are (1) \textbf{know what the parameters
    of your model mean, and what units they are measured in} so that you can
  guess at reasonable orders of magnitude
  %% fixme epi example?
  and (2) if possible \textbf{do some initial graphical exploration} to
  confirm that your starting guesses are OK. At the very least, you should
  input your starting guesses into your negative log-likelihood function
  (e.g. \code{NLLfun\_binom0(startguess)}) to make sure that you get finite
  values \ldots}

R has a function, \code{optim()}, that accesses a range of algorithms for
finding the minimum of a given function. Here, we are looking to find the
minimum of a function that returns the negative log-likelihood. In lay terms,
a call to \code{optim} will make many repeated evaluations of the function to
minimize until it finds the values that give the minimum. There are many
possible arguments to \code{optim}, but the three most crucial are the
starting value \code{par}, the function to minimize \code{fn}, and the
algorithm to use \code{method}; here we use \code{method="Brent"}, which is
specifically designed for optimizing a function with only one parameter (here
$p$). \code{method="Brent"} additionally requires that we give lower and upper
ranges over which to search for the minimum. The key outputs of \code{optim}
are the parameter value that yields the minimum, \code{\$par}, and the value
of the function at the minimum, \code{\$value}.
<>=
m0 <- optim(par=0.5,fn=NLLfun_binom0,method="Brent",lower=0,upper=1)
m0
@
While \code{optim} is generic, the \code{bbmle} package finds the MLE if we
specify the negative log-likelihood function and a starting value.
It does this by making a call to \code{optim}\footnote{Importantly, this means
  that you may need to look at the help files for \code{optim} for some issues
  to do with the implementation of \code{mle2}.} and formatting the outputs in
useful ways (as we'll see):
<<>>=
library(bbmle)
(m1 <- mle2(NLLfun_binom0,start=list(prob=0.5)))
@
This command produces a bunch of warning messages. \textbf{It is OK to ignore
  warning messages and proceed if, AND ONLY IF, you know what they mean and
  you are confident that the condition that is triggering them is not
  affecting your answers in an important way. Always stop and figure out what
  warning messages mean before proceeding. Ask for help if necessary!} In this
case they are caused by the minimization function trying values of \code{prob}
that don't make sense ($\leq 0$ or $\geq 1$). It's generally OK for the
function to visit bad values on the way to its final answer, so you can
usually ignore this particular type of warning provided that the final fit is
reasonable. To be 100\% sure, you should re-do the fit in a way that
eliminates the warnings (see exercise below).

For simple problems like this one, you can also use a formula interface that
constructs the negative log-likelihood function automatically. This is quicker
than writing a new negative log-likelihood function every time you want to
change the model a little bit, and enables some additional functionality (such
as calculating predicted values of the responses or simulating from the fitted
model, and making parameters vary across groups in a convenient way), but is
sometimes impractical for more complex models. For example:
<>=
m1F <- mle2(pos~dbinom(prob=prob,size=tot),
            start=list(prob=0.5),data=sizetab)
@
\begin{itemize}
\item the left-hand side of the formula specifies the response variable
  (\code{pos} in this case)
\item the right-hand side of the formula consists of a probability
  distribution or density function that R knows (see \code{?Distributions}
  for a long list), \emph{not} including the first argument (corresponding to
  the response) but including all the other parameters required to define the
  distribution. These can be arbitrary expressions, say \code{prob=a+b*size}
  (see examples below)
\item an optional (but highly recommended) \code{data} argument, usually a
  data frame, contains the variables that R will look for in computing
  $-\log \lik$
\item the \code{start} argument must contain values for all the parameters ---
  any variable in the formula that is not built into R or specified in
  \code{data}
\end{itemize}
This works easily, but it is actually a terrible model for the data:
<>=
with(sizetab,plot(size,pos/tot,cex=tot,
                  xlab="size",ylab="status/fraction seropositive"))
abline(h=prob0,col=2)
@
It does not capture the obvious increase in the probability of seropositivity
with size. A general point here: \textbf{Always look at pictures of your
  results! (1) Likelihood surface, or profiles (if feasible); (2) Best fit of
  model to data (with confidence intervals on predictions, if feasible)}

\exercise. Redo the \code{mle2} fit in at least one of the following ways:
\begin{itemize}
\item{specify a minimization algorithm that allows minimization with
    \emph{box constraints} (i.e., set lower and/or upper bounds on
    parameters); one such example is the ``L-BFGS-B'' option in \code{mle2}.
    Specify \code{method="L-BFGS-B"} and use \code{lower=}, \code{upper=}
    (all as arguments --- i.e. within the \code{mle2} call).
It is often useful to set the constraints slightly within the feasible
    bounds --- say \code{lower=0.002} and \code{upper=0.998} --- rather than
    setting them exactly on the boundaries.\footnote{Some other options are
      (a) using \code{method="Brent"} with \code{lower=0.002} and
      \code{upper=0.998}, as in the call to \code{optim}; (b) using
      \code{optimizer="optimize"} (a specialized, derivative-free minimizer
      for one-dimensional (single-parameter) problems); or (c) using
      \code{optimizer="optimx", method="bobyqa"} (you will need to
      \code{library("optimx")} first; this is a derivative-free method that
      allows constraints).}}
\item{alternatively, you can change the scale on which the parameters are
    estimated. For example, suppose that rather than using the probability of
    seropositivity as a parameter (which must be between 0 and 1), we use the
    \emph{logit} of the probability --- the logit, $\log(p/(1-p))$
    (\code{qlogis} in \R), can range from $-\infty$ to $\infty$. The logistic,
    $1/(1+\exp(-x))$ (\code{plogis} in \R), is the inverse transformation.
    Thus, we can use \code{logitprob} as our parameter and write our formula
    as \verb+pos~dbinom(prob=plogis(logitprob),size=tot)+. (This approach is
    often easier than introducing bounds, and may have other nice statistical
    properties\footnote{The negative log-likelihood surface may be closer to
      quadratic (see below for why this matters) for the transformed than for
      the original parameters.}, but will behave badly in cases where the MLE
    is actually on the boundary.) The trickiest part of this is remembering
    that when you specify the starting value it must be sensible on the
    \emph{logit} scale.}
\end{itemize}
<>=
## does bobyqa work?
#library(optimx)
#mle2(pos~dbinom(prob=prob,size=tot),start=list(prob=0.5),data=sizetab,
#     optimizer="optimx",method="bobyqa",lower=0,upper=1)
#mle2(pos~dbinom(prob=plogis(logitprob),size=tot),start=list(logitprob=qlogis(0.5)),data=sizetab)
@

\section{Inference I: confidence intervals}
Coming back to the likelihood curve: since height on the negative
log-likelihood curve corresponds to decreasing goodness of fit, we can define
confidence intervals by picking a reference level of ``badness of fit'',
drawing a horizontal line at this level corresponding to fits that are a
particular amount worse than the best fit, and finding the intersections of
the curve with that reference level. One might pick a cutoff like 2
log-likelihood units (corresponding to fits that make the likelihood $e^2$
($\approx 7.4$) times worse than the maximum likelihood); one can also derive
a frequentist confidence interval based on half of the 95\% upper critical
value of the $\chi^2_1$ distribution, which turns out to be $\approx 1.92$.
Confidence intervals derived in this way are called \emph{profile confidence
  intervals}, and you can get them via \code{confint(m1)}.\footnote{Profile
  confidence intervals can be difficult to compute for complex models. The
  \emph{Wald confidence intervals} are an easier-to-compute approximation to
  profile intervals. They assume the likelihood surface is quadratic, which
  can sometimes be a bad assumption.
You can compute them via \code{confint(m1,method="quad")} --- they are also
  the basis of the $p$ values you get with \code{summary()}.}
<>=
plot(pvec,NLLvec,type="l",
     xlab=xlabstr,ylab="Negative log likelihood",ylim=c(114,120),
     xlim=c(0.6,0.9))
abline(h=c(min(NLLvec),min(NLLvec)+1.92),col=2,lty=2)
cc <- confint(m1,quietly=TRUE)
## ccq <- confint(m1,method="quad")  ## need quadratic approx
## curve(min(NLLvec)+((x-coef(m1))/(stdEr(m1)))^2/2,add=TRUE,col=4)
abline(v=cc)
## abline(v=ccq,col=4,lty=2)
## legend("topright",c("profile","quad. approx"),
##        lty=1,col=c(1,4),xpd=NA)
@
\exercise. Add horizontal lines to the (fraction seropositive vs. size) figure
above corresponding to the lower and upper profile confidence intervals for
$p$ (\code{abline} may be helpful).

In addition to computing confidence intervals, you can also use the
\emph{Likelihood Ratio Test} (LRT) to test a particular null hypothesis. Null
hypotheses correspond to null values of parameters; for example, to test the
null hypothesis that $p_0=0.5$ we can calculate the (log-)likelihood ratio
statistic
$$ 2 \left(\log \lik(\hat p)-\log \lik(p_0)\right): $$
<>=
(s <- 2*(c(logLik(m1))-(-NLLfun_binom0(0.5))))  ## test statistic
@
(the \code{c} in the command above is used to throw out some extra junk (the
attributes) associated with the \code{logLik} function; note also that we use
\verb+-NLLfun_binom0+ to get the log-likelihood [the \emph{negative} of the
negative log-likelihood] \ldots) Next we compare it to the upper tail of the
$\chi^2_1$ distribution:
<>=
pchisq(s,df=1,lower.tail=FALSE)
@
We can reject this null hypothesis pretty conclusively \ldots this is
equivalent to seeing that $p_0=0.5$ is well outside the 95\% confidence
interval for $p$.

Remember that the LRT is based only on \emph{differences} between
log-likelihoods (i.e. likelihood ratios), not on the absolute value of the
log-likelihood. This means that (1) the actual value of the minimum negative
log-likelihood, which increases with data set size, is irrelevant; (2) when
computing the log-likelihood, you can leave out the normalization constants
(which generally depend on the data and not on the parameters), \emph{as long
  as you consistently exclude them for all the models you consider}.

\section{Multi-parameter models: likelihood surfaces}
Now let's make the model more realistic (i.e., allowing seropositivity to
increase with size). We'll make the increase linear, but we have to do
something to prevent unrealistic ($<0$ or $>1$) probabilities, so we will
simply cut off the probabilities at 0.001 and 0.999. We start by defining a
utility function that chops values at 0.001 and 0.999:
<>=
cfun <- function(x,min=0.001,max=0.999) {
  ifelse(x<min,min,ifelse(x>max,max,x))
}
@
The other trick that is generally useful is to center the \code{size} variable
at some reasonable value, so that our intercept is meaningful --- we don't
need to be predicting seropositivity for individuals of carapace length 0!
Centering predictors generally improves both estimation and interpretability.
<>=
(m1L <- mle2(pos~dbinom(prob=cfun(a+b*(size-200)),
                        size=tot),start=list(a=0.5,b=0.001),data=sizetab))
@
We can see that these answers are reasonable: 58\% seropositivity at size=200,
and an increase of about 0.4\% in seropositivity per 1 mm increase (4\% per 10
mm increase may be easier to think about). We should plot the predicted graph.
We could compute the predictions by hand fairly easily, but \code{mle2} also
offers a \code{predict} function (but it only works with problems specified
via the formula interface).
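For instance, a minimal sketch of the by-hand calculation (using the fitted
coefficients of \code{m1L} and the \code{cfun} defined above; the size range
matches the one used for predictions below):
<<eval=FALSE>>=
sizevec <- 70:310
## apply the fitted intercept and slope, then chop to the (0.001, 0.999) range
p_hand <- cfun(coef(m1L)["a"] + coef(m1L)["b"]*(sizevec-200))
head(p_hand)
@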
We construct a new data frame specifying the variable ranges for which we want
to compute predictions (we have to specify values for all of the variables
used in the model, which includes \code{tot} in this case: we specify
\code{tot}=1 here to get predictions for the probabilities):
<>=
pp1 <- data.frame(size=70:310,tot=1)
pp1$pred <- predict(m1L,pp1)
@
Plotting these predictions:
<>=
with(sizetab,plot(size,pos/tot,cex=tot,
                  xlab="size",ylab="status/fraction seropositive"))
with(pp1,lines(size,pred,col=2))
@
Now that we have two parameters (slope and intercept), the likelihood curve
becomes a likelihood surface:
<>=
## we could define the negative log-likelihood function:
NLLfun_binom_lin <- function(a,b,dat=sizetab) {
  with(dat, {
    prob <- cfun(a+b*(size-200))
    -sum(dbinom(pos,size=tot,prob=prob,log=TRUE))
  })
}
@
We'll use the \code{curve3d} function in the \code{emdbook} package as an easy
way to construct a plot of the likelihood surface. (The less-magic way to do
this would be to construct a matrix; use nested \code{for} loops to fill in
the values; and use \code{image} and \code{contour} to draw the plots.) The
one other bit of magic we need, because we never defined the negative
log-likelihood function explicitly, is to use \code{m1L@minuslogl} to pull out
the function from the fitted \code{mle2} object. We then use \code{contour}
(with the object returned from \code{curve3d}, which is a list with elements
\code{x}, \code{y}, and \code{z}) and \code{points} to overlay contours and
the MLE on the plot.
<>=
## hack to get useRaster=TRUE without having to explain it
library(emdbook)
cc <- curve3d(m1L@minuslogl(x,y),
              xlim=c(0.45,0.65),ylim=c(0.002,0.006),
              sys3d="image",
              xlab="intercept",ylab="slope",
              useRaster=TRUE)
contour(cc$x,cc$y,cc$z,add=TRUE)
points(coef(m1L)[1],coef(m1L)[2],pch=16)
@
<>=
library(emdbook)
cc <- curve3d(m1L@minuslogl(x,y),
              xlim=c(0.45,0.65),ylim=c(0.002,0.006),
              sys3d="image",
              xlab="intercept",ylab="slope")
contour(cc$x,cc$y,cc$z,add=TRUE)
points(coef(m1L)[1],coef(m1L)[2],pch=16)
@
Remember that every point on this surface represents a separate fit to the
data: here $x$ is the intercept, $y$ is the slope, and $z$ (the height of the
surface) is the negative log-likelihood (badness of fit).

To get confidence intervals on individual parameters, we can ask R to compute
(and plot) the \emph{likelihood profile} (a sort of cross-section of the
likelihood surface, too complicated to explain here). R plots the square root
of the change in log-likelihood, which will be symmetrical and {\sf V}-shaped
if the surface is quadratic (not absolutely necessary but generally
convenient):
<>=
<>
@
<>=
plot(profile(m1L))
@
% The quadratic approximation to the confidence interval is about
% 5\% off for \code{b}:
Compute confidence intervals based on these profiles:
<<>>=
confint(m1L)
@
(you may get a warning here: it's OK to ignore it, especially since you've
already looked at the profile to see if the answer was reasonable). The
likelihood surface above is a bit wonky because of the sharp cutoffs in our
likelihood.
A more standard way to model changes in probability is as a \emph{logistic}
function, rather than a linear function --- we have already seen the
\code{plogis} function, and it is easy to switch to this alternative:
<<>>=
m1L2 <- mle2(pos~dbinom(prob=plogis(a+b*(size-200)),
                        size=tot),start=list(a=0.5,b=0.001),data=sizetab)
@
The likelihood surface is much smoother, with elliptical contours indicating a
quadratic surface:
<>=
<>
NLLfun_binom_logist <- function(a,b,dat=sizetab) {
  with(dat, {
    prob <- plogis(a+b*(size-200))
    -sum(dbinom(pos,size=tot,prob=prob,log=TRUE))
  })
}
cc2 <- curve3d(m1L2@minuslogl(x,y),xlim=c(-1,1),ylim=c(0.01,0.05),
               sys3d="image",useRaster=TRUE,ann=FALSE)
mtext(side=1,line=2.5,"intercept",cex=1.5)
mtext(side=2,line=3.5,"slope",las=0,cex=1.5)
contour(cc2$x,cc2$y,cc2$z,add=TRUE)
points(coef(m1L2)[1],coef(m1L2)[2],pch=16)
@
The profile looks better too (more {\sf V}-shaped, more [although not
entirely] symmetrical):
<>=
<>
@
<>=
plot(profile(m1L2))
@
Prediction:
<<>>=
pp2 <- pp1
pp2$pred <- predict(m1L2,pp2)
@
A plot of the results:
<>=
<>
with(sizetab,plot(size,pos/tot,cex=tot,
                  xlab="size",ylab="status/fraction seropositive"))
with(pp1,lines(size,pred,col=2))
with(pp2,lines(size,pred,col=4))
legend(200,0.3,c("linear","logistic"),col=c(2,4),lty=1)
@
<>=
## this is not used
m1L3 <- mle2(pos~dbinom(prob=cfun(1-exp(b*(size-size0))),
                        size=tot),start=list(size0=90,b=0.001),data=sizetab)
pp3 <- pp1
pp3$pred <- predict(m1L3,pp3)
@

\section{Inference II: multi-variable models}
We can use the LRT to do typical sorts of hypothesis testing. In general we
can do hypothesis testing for \emph{nested} models, i.e. those where one model
is a special case of the other, in which a smaller number of parameters is
fitted --- often called the ``reduced'' (simpler) and ``full'' (more complex)
models. For example, consider the logistic model above. A sensible question
(although one we really don't need statistics for) is whether seropositivity
changes significantly with size. This is equivalent to asking whether $b$ is
significantly different from 0, \emph{or} whether the confidence interval of
$b$ includes 0, \emph{or} whether the logistic model fits significantly better
than the constant model we fitted above. A particularly simple way to do this
comparison is using the \code{anova} function in R, using the two models as
input\footnote{Why \code{anova}? For classical linear models, the analogue of
  this sort of model comparison is ANOVA. Thus the \code{anova} function has
  been co-opted for model comparison in general: analysis of variance for
  linear models (\code{lm}), analysis of deviance for generalized linear
  models (\code{glm}), and likelihood ratio tests for general maximum
  likelihood fitting.}.
<<>>=
anova(m1L2,m1)
@
If we compare nested models with more than one parameter held constant, then
inference is based on $\chi^2_{\Delta p}$, where $\Delta p$ is the difference
in the number of parameters. For example, if we were going to compare the
two-parameter logistic model with a fixed seropositivity probability, e.g.\
$p_0=0.5$, we would use the $\chi^2_2$ distribution (the \code{anova} function
does this automatically).

Another way to compare models, especially non-nested models, is the AIC
(Akaike Information Criterion): $-2 \log \lik + 2 k$, where $k$ is the number
of parameters.
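As a quick sanity check of this formula (a sketch using the
constant-probability fit \code{m1} from above, which has a single parameter,
so $k=1$):
<<eval=FALSE>>=
-2*c(logLik(m1)) + 2*1   ## AIC computed by hand from the log-likelihood
AIC(m1)                  ## built-in calculation; should match
@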
We don't have time to cover this in detail, except to say that
\begin{itemize}
\item like negative log-likelihoods, smaller is better (even when the AIC
  values are negative, smaller still means ``more negative'')
\item some general rules of thumb are that $\Delta \mbox{AIC}<2$ means
  ``nearly equivalent'' and $\Delta \mbox{AIC}>10$ means ``extremely
  different''
\item rather than the standard null-hypothesis testing framework of the LRT
  (i.e. identifying deviations of a reference statistic that would be unlikely
  under some null hypothesis), AIC is based on estimating the model with the
  best expected predictive accuracy for future data sets
\item like the LRT, AIC-based inference depends only on differences in AIC and
  not on the overall value of AIC
\item R has the \code{AIC} function available for most models (including
  \code{mle2} fits), and the \code{AICtab} function in the \code{bbmle}
  package provides a convenient way to compare AIC values. By default,
  \code{AICtab} reports only $\Delta$ AIC and the number of parameters
  (\code{df}).
\end{itemize}
<<>>=
AICtab(m1,m1L,m1L2)
@
\exercise. Re-do the fit with the function
$p=1-e^{b(\mbox{\small size}-s_0)}$; wrap the whole thing in the \code{cfun}
function defined above to prevent the probability from going negative. What
are reasonable starting values? Compare the fit to the other models fitted so
far.
<>=
m1L3 <- mle2(pos~dbinom(prob=cfun(1-exp(b*(size-a))),
                        size=tot),start=list(a=50,b=-0.001),data=sizetab)
with(sizetab,plot(size,pos/tot,cex=tot,
                  xlab="size",ylab="status/fraction seropositive"))
abline(h=coef(m1),lty=2)
lines(pp1$size,predict(m1L,pp1))
lines(pp1$size,predict(m1L2,pp1),col=2)
lines(pp1$size,predict(m1L3,pp1),col=4)
AICtab(m1,m1L,m1L2,m1L3)
@

\section{Summary of methods for \code{mle2} fits}
\begin{tabular}{ll}
\hline
\textbf{Function} & \textbf{Purpose} \\
\hline
\code{coef(m1)} & extract coefficients (MLEs) \\
\code{summary(m1)} & summary information: coefficients, standard errors, Wald tests \\
\code{logLik(m1)} & log-likelihood \\
\code{deviance(m1)} & $-2 \log \lik$ \\
\code{AIC(m1)} & AIC \\
\code{AICtab(m1,m2,...)} & AIC table \\
\code{stdEr(m1)} & standard errors of coefficients \\
\code{vcov(m1)} & variance-covariance matrix of coefficients \\
\code{anova(m1,m2)} & likelihood ratio test \\
\code{profile(m1)} & calculate likelihood profiles \\
\code{confint(m1)} & profile confidence intervals \\
\code{confint(m1,method="quad")} & Wald confidence intervals \\
* \code{residuals(m1)} & residuals \\
* \code{predict(m1)} & predicted (fitted) values \\
* \code{predict(m1,newdata)} & predicted values for new set of predictors \\
* \code{simulate(m1)} & simulated values from the fitted model \\
\hline
\end{tabular}

* only available for models fitted with the formula interface
<>=
dev.off()
@
% \bibliography{bookbib}
\end{document}