\documentclass{article}
\usepackage[american]{babel}
\usepackage{graphicx}
\usepackage{alltt}
\usepackage{url}
\usepackage{amsmath}
\newcommand{\codef}[1]{{\tt #1}}
\newcommand{\code}[1]{\texttt{#1}}
\usepackage{sober}
\newcounter{exercise}
%\numberwithin{exercise}{section}
\newcommand{\exnumber}{\addtocounter{exercise}{1} \theexercise \thinspace}
\newcommand{\R}{{\sf R}}
\title{Lab 6: estimation}
\author{Ben Bolker}
\begin{document}
\bibliographystyle{plain}
\maketitle
\copyright 2005 Ben Bolker

\section{Made-up data: negative binomial}

The simplest thing to do to convince yourself that your attempts to
estimate parameters are working is to simulate the ``data'' yourself
and see if you get close to the right answers back.  Start by making
up some negative binomial ``data'': first, set the random-number seed
so we get consistent results across different \R\ sessions:
<<>>=
set.seed(1001)
@

Generate 50 values with $\mu=1$, $k=0.4$ (save the values in variables
in case we want to use them again later, or change the parameters and
run the code again):
<<>>=
mu.true=1
k.true=0.4
x = rnbinom(50,mu=mu.true,size=k.true)
@

Take a quick look at what we got:
<<fig=TRUE>>=
plot(table(factor(x,levels=0:max(x))),
     ylab="Frequency",xlab="x")
@
(reminder: I won't always draw the pictures, but it's good to make a
habit of examining your variables (with \code{summary()} etc. and
graphically) as you go along to make sure you know what's going on!)

Negative log-likelihood function for a simple draw from a negative
binomial distribution: the first parameter, \code{p}, will be the
vector of parameters, and the second parameter, \code{dat}, will be
the data vector (in case we want to try this again later with a
different set of data; by default we'll set it to the \code{x} vector
we just drew randomly).
<<>>=
NLLfun1 = function(p,dat=x) {
  mu = p[1]
  k = p[2]
  -sum(dnbinom(dat,mu=mu,size=k,log=TRUE))
}
@

Calculate the negative log-likelihood for the true values.  I have to
combine these values into a vector with \code{c()} to pass them to the
negative log-likelihood function.  Naming the elements in the vector
is optional but will help keep things clear as we go along:
<<>>=
nll.true=NLLfun1(c(mu=mu.true,k=k.true)); nll.true
@

The NLL for other parameter values that I know are way off ($\mu=10$, $k=10$):
<<>>=
NLLfun1(c(mu=10,k=10))
@
Much higher negative log-likelihood, as it should be.

Find the method-of-moments estimates for $\mu$ and $k$:
<<>>=
m = mean(x)
v = var(x)
mu.mom = m
k.mom = m/(v/m-1)
@

Negative log-likelihood for the method-of-moments parameters:
<<>>=
nll.mom=NLLfun1(c(mu=mu.mom,k=k.mom)); nll.mom
@

Despite the known bias, this estimate is better (lower negative
log-likelihood) than the ``true'' parameter values.  The Likelihood
Ratio Test would say, however, that the difference in negative
log-likelihoods would have to be greater than $\chi^2_2(0.95)/2$ (two
degrees of freedom because we are allowing both $\mu$ and $k$ to
change):
<<>>=
ldiff=nll.true-nll.mom;ldiff
qchisq(0.95,df=2)/2
@

So --- better, but not significantly better at $p=0.05$.
(\code{pchisq(2*ldiff,df=2,lower.tail=FALSE)} would tell us the exact
$p$-value if we wanted to know.)
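For the record, here is that calculation, using the \code{ldiff} value
computed above (the factor of 2 converts the difference in negative
log-likelihoods into the usual likelihood-ratio-test statistic):
<<>>=
pchisq(2*ldiff,df=2,lower.tail=FALSE)
@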
But what is the MLE?
Use \code{optim} with the default options (Nelder-Mead simplex method)
and the method-of-moments estimates as the starting estimates
(\code{par}):
<<>>=
O1 = optim(fn=NLLfun1,par=c(mu=mu.mom,k=k.mom),hessian=TRUE); O1
@

The optimization result is a list with elements:
\begin{itemize}
\item{the best-fit parameters (\verb+O1$par+, with parameter names
    because we named the elements of the starting vector---see how
    useful this is?);}
\item{the minimum negative log-likelihood (\verb+O1$value+);}
\item{information on the number of function evaluations
    (\verb+O1$counts+; the \code{gradient} part is \code{NA} because
    we didn't specify a function to calculate the derivatives (and the
    Nelder-Mead algorithm wouldn't have used them anyway));}
\item{information on whether the algorithm thinks it found a good
    answer (\verb+O1$convergence+, which is zero if \R\ thinks
    everything worked and uses various numeric codes (see
    \code{?optim} for details) if something goes wrong);}
\item{\verb+O1$message+, which may give further information about when
    the fit converged or how it failed to converge;}
\item{because we set \code{hessian=TRUE}, we also get
    \verb+O1$hessian+, which gives the (finite difference
    approximation of) the second derivatives evaluated at the MLE.}
\end{itemize}

The minimum negative log-likelihood (\Sexpr{round(O1$value,2)}) is
better than either the negative log-likelihood corresponding to the
method-of-moments parameters (\Sexpr{round(nll.mom,2)}) or the true
parameters (\Sexpr{round(nll.true,2)}), but all of these are within
the LRT cutoff.

Now let's find the likelihood surface, the profiles, and the
confidence intervals.  The likelihood surface is straightforward: set
up vectors of $\mu$ and $k$ values, set up a matrix to hold the
results, and run \code{for} loops to calculate and store the values (a
loop-free alternative is sketched below).  Let's try $\mu$ from 0.4 to
3 in steps of 0.05 and $k$ from 0.01 to 0.7 in steps of 0.01.  (I
initially had the $\mu$ vector from 0.1 to 2.0 but revised it after
seeing the contour plot below.)
<<>>=
muvec = seq(0.4,3,by=0.05)
kvec = seq(0.01,0.7,by=0.01)
@

The matrix for the results will have rows corresponding to $\mu$ and
columns corresponding to $k$:
<<>>=
resmat = matrix(nrow=length(muvec),ncol=length(kvec))
@

Run the \code{for} loops:
<<>>=
for (i in 1:length(muvec)) {
  for (j in 1:length(kvec)) {
    resmat[i,j] = NLLfun1(c(muvec[i],kvec[j]))
  }
}
@

Drawing a contour plot: the initial default choice of contours doesn't
give us fine enough resolution (it picks contours spaced 5 apart to
cover the range of the values in the matrix), so I added levels spaced
1 log-likelihood unit apart from 70 to 80 by doing a second
\code{contour} plot with \code{add=TRUE}.
<<fig=TRUE>>=
contour(muvec,kvec,resmat,xlab=expression(mu),ylab="k")
contour(muvec,kvec,resmat,levels=70:80,lty=2,add=TRUE)
@

Alternately, we could set the levels of the contour plot corresponding
to different $1-\alpha$ levels for the likelihood ratio test: if the
minimum negative log-likelihood is $m$, then these levels are
$m+\chi^2_2(1-\alpha)/2$:
<<fig=TRUE>>=
alevels=c(0.5,0.9,0.95,0.99,0.999)
minval = O1$value
nll.levels = qchisq(alevels,df=2)/2+minval
contour(muvec,kvec,resmat,levels=nll.levels,labels=alevels,
        xlab=expression(mu),ylab="k")
@

So far, so good.  Finding the profiles and confidence limits is a bit
harder.
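(An optional aside before moving on: the nested \code{for} loops above
can also be written without explicit loops.  This is just a sketch
using base \R's \code{Vectorize} and \code{outer}; the helper
\code{NLLfun1.grid} is not used anywhere else, and the result should
reproduce \code{resmat} exactly.)
<<>>=
## wrap NLLfun1 so that it takes mu and k as separate arguments,
## then let outer() evaluate it over the whole (mu,k) grid
NLLfun1.grid = Vectorize(function(mu,k) NLLfun1(c(mu,k)))
resmat2 = outer(muvec,kvec,NLLfun1.grid)
all.equal(resmat,resmat2)
@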
To calculate the $\mu$ profile, define a new function that takes $\mu$
as a \emph{separate} parameter (which \code{optim} will not try to
adjust as it goes along) and optimizes with respect to $k$:
<<>>=
NLLfun.mu = function(p,mu) {
  k = p[1]
  -sum(dnbinom(x,mu=mu,size=k,log=TRUE))
}
@

Set up a matrix with two columns (one for the best-fit $k$ value, one
for the minimum negative log-likelihood achieved):
<<>>=
mu.profile = matrix(ncol=2,nrow=length(muvec))
@

Run a \code{for} loop, starting the optimization from the
maximum-likelihood value of $k$ every time.  Also include the value
for $\mu$ (\code{mu=muvec[i]}), which \R\ will pass on to the function
that computes the negative log-likelihood.  The default Nelder-Mead
method doesn't work well on 1-dimensional problems, and will give a
warning.  I tried \code{method="BFGS"} instead but got warnings about
\code{NaNs produced in \ldots}, because \code{optim} tries some
negative values for $k$ on its way to the correct (positive) answer.
I then switched to \code{L-BFGS-B} and set \code{lower=0.002}, far
enough above zero that \code{optim} wouldn't run into any negative
numbers when it calculated the derivatives by finite differences.

Another option would have been to change the function around so that
it minimized with respect to $\log(k)$ instead of $k$.  This function
would look something like:
<<>>=
NLLfun.mu2 = function(p,mu) {
  logk = p[1]
  k = exp(logk)
  -sum(dnbinom(x,mu=mu,size=k,log=TRUE))
}
@
and of course we would have to translate the answers back from the log
scale to compare them to the results so far.  (The other option would
be to use \code{optimize}, a function specially designed for 1D
optimization, but this way we have to do less rearranging of the
code.)

A general comment about warnings: it's OK to ignore warnings
\textbf{if you understand exactly where they come from and have
  satisfied yourself that whatever problem is causing the warnings
  does not affect your answer in any significant way}.  Ignoring
warnings at any other time is a good way to overlook bugs in your code
or serious problems with numerical methods that will make your answers
nonsensical.

So, anyway --- run that optimization for each value of $\mu$ in the
$\mu$ vector.  At each step, save the parameter estimate and the
minimum negative log-likelihood:
<<>>=
for (i in 1:length(muvec)) {
  Oval = optim(fn=NLLfun.mu,par=O1$par["k"],method="L-BFGS-B",
               lower=0.002,mu=muvec[i])
  mu.profile[i,] = c(Oval$par,Oval$value)
}
colnames(mu.profile) = c("k","NLL")
@

Do the same process for $k$:
<<>>=
NLLfun.k = function(p,k) {
  mu = p[1]
  -sum(dnbinom(x,mu=mu,size=k,log=TRUE))
}
k.profile = matrix(ncol=2,nrow=length(kvec))
for (i in 1:length(kvec)) {
  Oval = optim(fn=NLLfun.k,par=O1$par["mu"],method="L-BFGS-B",
               lower=0.002,k=kvec[i])
  k.profile[i,] = c(Oval$par,Oval$value)
}
colnames(k.profile) = c("mu","NLL")
@

Redraw the contour plot with profiles added:
<<fig=TRUE>>=
contour(muvec,kvec,resmat,xlab=expression(mu),ylab="k")
contour(muvec,kvec,resmat,levels=70:80,lty=2,add=TRUE)
lines(muvec,mu.profile[,"k"],lwd=2)
lines(k.profile[,"mu"],kvec,lwd=2,lty=2)
@

The best-fit value of $\mu$ is completely independent of $k$: no
matter what value you choose for $k$, the best estimate of $\mu$ is
still the same (and equal to the mean of the observed data).  In
contrast, values of $\mu$ either above or below the best value lead to
estimates of $k$ that are lower than the MLE.
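A quick numerical check of that claim, using the profile results
computed above: the best-fit $\mu$ values along the $k$ profile should
all be (essentially) equal to the sample mean.
<<>>=
range(k.profile[,"mu"])
mean(x)
@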
\textbf{Exercise \exnumber}: Redraw the contour plot of the likelihood
surface for this data set with the contours corresponding to $\alpha$
levels, as above.  Add points corresponding to the location of the
MLE, the method-of-moments estimate, and the true values.  State your
conclusions about the differences among these 3 sets of parameters and
their statistical significance.

\section{Univariate profiles}

Now we'd like to find the univariate confidence limits on $\mu$ and
$k$.  It's easy enough to get an approximate idea of this graphically.
For example, for $\mu$, plotting the profile and superimposing
horizontal lines for the minimum NLL and the 95\% and 99\% LRT
cutoffs:
<<fig=TRUE>>=
plot(muvec,mu.profile[,"NLL"],type="l",
     xlab=expression(mu),ylab="Negative log-likelihood")
cutoffs = c(0,qchisq(c(0.95,0.99),1)/2)
nll.levels = O1$value+cutoffs
abline(h=nll.levels,lty=1:3)
text(rep(0.5,3),nll.levels+0.2,c("min","95%","99%"))
@

But how do we find the $x$-intercepts ($\mu$ values) associated with
the points where the likelihood profile crosses the cutoff lines?
Five possibilities:
\begin{enumerate}
\item{If we have sampled the profile at a fairly fine scale, we can
    just look for the point(s) that are closest to the cutoff value:
<<>>=
cutoff = O1$value+qchisq(0.95,1)/2
@
The \code{which.min} function gives the index of the smallest element
in the vector: if we find the index corresponding to the smallest
value of the absolute value of the profile negative log-likelihood
minus the cutoff, that should give us the $\mu$ value closest to the
confidence limit.  We actually need to do this for each half of the
curve separately.  First the lower half, selecting values from
\code{muvec} and \code{mu.profile} corresponding to values less than
1.2 (based on looking at the plot).
<<>>=
lowerhalf = mu.profile[muvec<1.2,"NLL"]
lowerhalf.mu = muvec[muvec<1.2]
w.lower = which.min(abs(lowerhalf-cutoff))
@
The same thing for the upper half of the curve:
<<>>=
upperhalf = mu.profile[muvec>1.2,"NLL"]
upperhalf.mu = muvec[muvec>1.2]
w.upper = which.min(abs(upperhalf-cutoff))
ci.crude = c(lowerhalf.mu[w.lower],upperhalf.mu[w.upper])
@
Plot it:
<<fig=TRUE>>=
plot(muvec,mu.profile[,"NLL"],type="l",
     xlab=expression(mu),ylab="Negative log-likelihood")
cutoffs = c(0,qchisq(c(0.95),1)/2)
nll.levels = O1$value+cutoffs
abline(h=nll.levels,lty=1:2)
abline(v=ci.crude,lty=3)
@
You can see that it's not \emph{exactly} on target, but very close.
If you wanted to proceed in this way and needed a more precise answer
you could ``zoom in'' and evaluate the profile on a finer grid around
the lower and upper confidence limits.
}
\item{
You can set up an automatic search routine in \R\ to try to find the
confidence limits.  First, define a function that takes a particular
value of $\mu$, optimizes with respect to $k$, and returns the value
of the negative log-likelihood \emph{minus the cutoff value}, which
tells us how far above or below the cutoff we are.
<<>>=
cutoff = O1$value+qchisq(c(0.95),1)/2
relheight = function(mu) {
  O2 = optim(fn=NLLfun.mu,par=O1$par["k"],method="L-BFGS-B",
             lower=0.002,mu=mu)
  O2$value-cutoff
}
@
We know the lower limit is somewhere around 0.7, so evaluating the
function a little below and a little above that value should give
negative and positive results, respectively.
<<>>=
relheight(mu=0.6)
relheight(mu=0.8)
@
Using \R's \code{uniroot} function, which takes a single-argument
function and searches a specified interval for a root (a value of the
argument that makes the function zero):
<<>>=
lower = uniroot(relheight,interval=c(0.5,1.0))$root
upper = uniroot(relheight,interval=c(1.2,5))$root
ci.uniroot = c(lower,upper)
@
<<fig=TRUE>>=
plot(muvec,mu.profile[,"NLL"],type="l",
     xlab=expression(mu),ylab="Negative log-likelihood")
cutoffs = c(0,qchisq(c(0.95),1)/2)
nll.levels = O1$value+cutoffs
abline(h=nll.levels,lty=1:2)
abline(v=ci.crude,lty=3)
abline(v=ci.uniroot,lty=4)
@
Slightly more precise than the previous solution.
}
\item{
Using the information-based approach.  Here is the information matrix:
<<>>=
O1$hessian
@
Inverting the information matrix:
<<>>=
s1 = solve(O1$hessian); s1
@
You can see that the off-diagonal elements are very small ($-6 \times
10^{-6}$ as opposed to 0.0102 and 0.135 on the diagonal), correctly
suggesting that the parameter estimates are (nearly) uncorrelated.

Suppose we want to approximate the likelihood profile by the quadratic
$L=a+c(\mu-b)^2$.  The parameter $b$ governs the $\mu$ value at which
the minimum occurs (so $b$ corresponds to the MLE of $\mu$) and
parameter $a$ governs the height of the minimum (so $a$ is the minimum
negative log-likelihood).  The second derivative of the quadratic is
$2c$; the second derivative of the likelihood surface is $\partial^2
L/\partial \mu^2$, so $c=(\partial^2 L/\partial \mu^2)/2$.
<<>>=
a=O1$value
b=O1$par["mu"]
c=O1$hessian["mu","mu"]/2
@
We get the variances of the parameters ($\sigma^2_\mu$,
$\sigma^2_k$) by inverting the information matrix.  The confidence
limits are then the MLE $\pm 1.96 \sigma_\mu$ for 95\% confidence
limits, or more generally $\pm$ \code{qnorm(1-alpha/2)} $\sigma_\mu$
(\code{qnorm} gives the quantile of the standard normal, with mean 0
and standard deviation 1, by default).
<<>>=
se.mu = sqrt(s1["mu","mu"])
ci.info = O1$par["mu"]+c(-1,1)*qnorm(0.975)*se.mu
@
Double plot, showing a close-up of the negative log-likelihood minimum
(to convince you that the quadratic approximation really is a good fit
if you go close enough to the minimum --- this is essentially a Taylor
expansion, provided that the likelihood surface is reasonably
well-behaved) and a wider view comparing the quadratic approximation,
and the confidence limits based on it, with the profile and the
\code{uniroot}-based limits:
<<fig=TRUE>>=
op=par(mfrow=c(1,2))
plot(muvec,mu.profile[,"NLL"],type="l",
     xlab=expression(mu),ylab="Negative log-likelihood",
     ylim=c(71.8,72.2),xlim=c(0.7,1.7))
curve(a+c*(x-b)^2,add=TRUE,lty=2)
##
plot(muvec,mu.profile[,"NLL"],type="l",
     xlab=expression(mu),ylab="Negative log-likelihood")
cutoffs = c(0,qchisq(c(0.95),1)/2)+O1$value
curve(a+c*(x-b)^2,add=TRUE,lty=2)
abline(h=cutoffs)
abline(v=ci.info,lty=3)
abline(v=ci.uniroot,lty=1)
par(op)
@
}
\item{
One way to cheat: use the \code{fitdistr} function from the
\code{MASS} package instead (this only works for simple draws from
distributions).
<<>>=
library(MASS)
f=fitdistr(x,"negative binomial"); f
@
\code{fitdistr} gives the parameters in the other order ---
\code{size} and \code{mu} rather than \code{mu} and \code{k} as I have
been naming them.  It gives standard errors \emph{based on the
  quadratic approximation}, so they are the same as those from the
previous method.  (I had to use \code{str(f)} to look inside \code{f}
and figure out how to extract the numbers I wanted.)
<<>>=
ci.fitdistr = f$estimate["mu"]+c(-1,1)*f$sd["mu"]*qnorm(0.975)
@
}
\item{The last option is to use the \code{mle} function from the
    \code{stats4} package to find the confidence intervals.
To do this, we have to rewrite the NLL function slightly differently:
(1) specify the parameters separately, rather than packing them into a
parameter vector (and then unpacking them inside the function --- so
this is actually slightly more convenient), and (2) you can't pass the
data as an additional argument: if you want to run the function for
another set of data you have to replace \code{x} with your new data,
or write another function (there are other kinds of black magic you
can do to achieve the same goal, but they are too complicated to lay
out here (P2C2E \cite{Rushdie1999})).
<<>>=
NLLfun2 = function(mu,k) {
  -sum(dnbinom(x,mu=mu,size=k,log=TRUE))
}
@
\code{mle} has slightly different argument names from \code{optim},
but you still have to specify the function and the starting values,
and you can include other options (which get passed straight to
\code{optim}):
<<>>=
library(stats4)
m1 = mle(minuslogl=NLLfun2,start=list(mu=mu.mom,k=k.mom),
         method="L-BFGS-B",lower=0.002); m1
@
\code{summary(m1)} gives the estimates and the standard errors based
on the quadratic approximation:
<<>>=
summary(m1)
@
\code{confint(m1)} gives the profile confidence limits, for all
parameters (the second line below takes just the row corresponding to
\code{mu}).
<<>>=
ci.mle.all = confint(m1); ci.mle.all
ci.mle = ci.mle.all["mu",]
@
The \code{confint} code is based on first calculating the profile (as
we did above, but at a smaller number of points), then using
spline-based interpolation to find the intersections with the cutoff
height.
}
\end{enumerate}

Comparing all of these methods:
<<>>=
citab=rbind(ci.crude,ci.uniroot,ci.mle,ci.info,ci.fitdistr); citab
@

\section{Reef fish: settler distribution}

Let's simulate the reef fish data again (this uses \code{rzinbinom},
the zero-inflated negative binomial generator used previously, which
is also available in the \code{emdbook} package):
<<>>=
a = 0.696
b = 9.79
recrprob = function(x,a=0.696,b=9.79) a/(1+(a/b)*x)
scoefs = c(mu=25.32,k=0.932,zprob=0.123)
settlers = rzinbinom(603,mu=scoefs["mu"],size=scoefs["k"],zprob=scoefs["zprob"])
recr = rbinom(603,prob=recrprob(settlers),size=settlers)
@

Set up likelihood functions --- let's say we know the numbers of
settlers and are trying to estimate the recruitment probability
function.  I'll use \code{mle}.  First, a Shepherd function:
<<>>=
NLLfun3 = function(a,b,d) {
  recrprob = a/(1+(a/b)*settlers^d)
  -sum(dbinom(recr,prob=recrprob,size=settlers,log=TRUE),na.rm=TRUE)
}
@
then a Beverton-Holt function:
<<>>=
NLLfun4 = function(a,b) {
  recrprob = a/(1+(a/b)*settlers)
  -sum(dbinom(recr,prob=recrprob,size=settlers,log=TRUE),na.rm=TRUE)
}
@
and a density-independent model (constant recruitment probability):
<<>>=
NLLfun5 = function(a) {
  recrprob = a
  -sum(dbinom(recr,prob=recrprob,size=settlers,log=TRUE),na.rm=TRUE)
}
@

I ran into a problem with these log-likelihood functions: when
settlers=0, the recruitment probability is $a$, which may be greater
than 0.  This doesn't matter in reality because when there are zero
settlers there aren't any recruits.  From a statistical point of view,
the probability of zero recruits is 1 when there are zero settlers,
but \code{dbinom(x,prob=a,size=0)} still comes out with \code{NaN}.
However, \textbf{since I know what the problem is} and I know that the
log-likelihood in this case is $\log(1)=0$ and so contributes nothing
to the log-likelihood, I am safe using the \code{na.rm=TRUE} option
for \code{sum}.  Even so, this still gives me warnings about
\code{NaNs produced in: dbinom \ldots}.
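If the warnings bother you, one option (just a sketch, not required
for anything below) is to handle the zero-settler observations
explicitly inside the likelihood function; for example, for the
Beverton-Holt version, subsetting to the positive-settler cases should
give the same negative log-likelihood as \code{NLLfun4} without
generating \code{NaN}s (the name \code{NLLfun4b} is mine and is not
used again):
<<>>=
NLLfun4b = function(a,b) {
  pos = settlers>0  ## zero-settler cases contribute log(1)=0, so drop them
  recrprob = a/(1+(a/b)*settlers[pos])
  -sum(dbinom(recr[pos],prob=recrprob,size=settlers[pos],log=TRUE))
}
@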
A better solution might be to drop the zero-settler cases from the
data entirely:
<<>>=
recr = recr[settlers>0]
settlers = settlers[settlers>0]
@

<<fig=TRUE>>=
plot(settlers,recr)
abline(h=10)
abline(a=0,b=0.5)
@
Looking at a plot of the data, I can eyeball the asymptote ($b$) at
about 10 recruits, the initial slope ($a$) at about 0.5, and I'll
start with $d=1$.  I had to mess around a bit --- I fitted the simpler
(Beverton-Holt) model first, used its coefficients (plus $d=1$) as
starting values for the Shepherd, and found I had to use
\code{L-BFGS-B} to keep \code{mle} from choking while computing the
information matrix.
<<>>=
m4 = mle(minuslogl=NLLfun4,start=list(a=0.5,b=10),
         method="L-BFGS-B",lower=0.003)
s3 = list(a=0.684,b=10.161,d=1)
m3 = mle(minuslogl=NLLfun3,start=s3,
         method="L-BFGS-B",lower=0.003)
m5 = mle(minuslogl=NLLfun5,start=list(a=0.5),
         method="L-BFGS-B",lower=0.003)
@

Plot all fits against the data:
<<fig=TRUE>>=
plot(settlers,recr)
a=coef(m5)["a"]
curve(a*x,add=TRUE,lty=3)
a=coef(m4)["a"]; b=coef(m4)["b"]
curve(a*x/(1+(a/b)*x),add=TRUE,lty=2)
a=coef(m3)["a"]; b=coef(m3)["b"]; d=coef(m3)["d"]
curve(a*x/(1+(a/b)*x^d),add=TRUE,lty=1)
@

Compare the negative log-likelihoods:
<<>>=
nll = c(shep=-logLik(m3),BH=-logLik(m4),densind=-logLik(m5)); nll
@
As it must be (the Beverton-Holt is nested within the Shepherd), the
Shepherd has a better likelihood than the Beverton-Holt, but only by a
tiny amount --- certainly not greater than the LRT cutoff.  On the
other hand, there is \emph{extremely} strong support for
density-dependence, from the \Sexpr{round(nll[3]-nll[2])}
log-likelihood-unit difference between the density-independent and
Beverton-Holt model likelihoods.  It's silly, but you can calculate
the logarithm of the $p$-value as:
<<>>=
logp=pchisq(2*(nll[3]-nll[2]),1,lower.tail=FALSE,log.p=TRUE)
logp
@
That's equivalent to $p \approx 10^{\Sexpr{round(logp/log(10))}}$
(there's a \emph{much} higher probability that the CIA broke into your
computer and tampered with your data).

We would get the same answer by calculating AIC values:
<<>>=
npar = c(3,2,1)
aic = 2*nll+2*npar; aic
@
Or BIC values:
<<>>=
ndata = length(recr)
bic = 2*nll+log(ndata)*npar; bic
@
(except that BIC says more strongly that the Shepherd is not worth
considering in this case).

The confidence intervals tell the same story:
<<>>=
confint(m3)
@
The 95\% confidence interval for $d$ includes 1 (just).
<<>>=
confint(m4)
@
We get tighter bounds on $a$ with the Beverton-Holt model (since we
are not wasting data trying to estimate $d$), and we get reasonable
bounds on $b$.  (When I used \code{BFGS} instead of \code{L-BFGS-B},
even though I got reasonable answers for the MLE, I ran into trouble
when I tried to get profile confidence limits.)  In particular, the
upper confidence limit for $b$ is finite (which it would not be if the
density-independent model were a better fit).

\textbf{Exercise \exnumber:} redo the fitting exercise, but instead of
using \code{L-BFGS-B}, fit $\log a$, $\log b$, and $\log d$ to avoid
having to use constrained optimization.  Back-transform your answers
for the point estimates and the confidence limits by exponentiating;
make sure they agree closely with the answers above.

\textbf{Exercise \exnumber:} fit zero-inflated negative binomial,
negative binomial, and Poisson distributions to the settler data.
State how the three models are nested; perform model selection by AIC
and LRT, and compare the confidence intervals of the parameters.  What
happens if you simulate half as much ``data''?

\bibliography{bookbib}
\end{document}